Documentation ¶
Overview ¶
Language models have a token limit, and a text splitter makes it easier to keep text or prompts within that limit. A text splitter breaks text into chunks whose maximum size is set by maxChunkSize.
Two text splitters are available:
1. Word splitter: splits the text word by word and ensures that no chunk exceeds maxChunkSize, measured in characters.
2. Tiktoken splitter (arguably better named "tiktoken word splitter"): splits the text word by word and ensures that no chunk exceeds maxChunkSize, measured in tokens as defined by tiktoken.
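The difference between the two splitters comes down to how chunk length is measured. The sketch below is a minimal illustration of that idea, not the package's implementation: splitWords, charLen, and wordLen are hypothetical names, overlap is left out, and wordLen is only a stand-in for a real tiktoken-based counter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitWords is a hypothetical helper, not part of the package.
// It packs whitespace-separated words into chunks whose measured length
// stays within maxChunkSize. A single word that is longer than the limit
// is emitted as its own oversized chunk.
func splitWords(text string, maxChunkSize int, length func(string) int) []string {
	var chunks []string
	current := ""
	for _, word := range strings.Fields(text) {
		candidate := word
		if current != "" {
			candidate = current + " " + word
		}
		// Close the current chunk when adding the word would exceed the limit.
		if current != "" && length(candidate) > maxChunkSize {
			chunks = append(chunks, current)
			current = word
			continue
		}
		current = candidate
	}
	if current != "" {
		chunks = append(chunks, current)
	}
	return chunks
}

// charLen measures length in characters, as the word splitter is described to do.
func charLen(s string) int { return utf8.RuneCountInString(s) }

// wordLen is an assumed stand-in for a token counter: it counts words,
// which is NOT how tiktoken defines tokens.
func wordLen(s string) int { return len(strings.Fields(s)) }

func main() {
	text := "the quick brown fox jumps over the lazy dog"
	fmt.Printf("%q\n", splitWords(text, 15, charLen)) // ["the quick brown" "fox jumps over" "the lazy dog"]
	fmt.Printf("%q\n", splitWords(text, 4, wordLen))  // ["the quick brown fox" "jumps over the lazy" "dog"]
}
```

Passing the length function as a parameter keeps the packing loop identical for both measures, which mirrors the Len method on the TextSplitter interface.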
Index ¶
- type TextSplitter
- type TextSplitterMock
- func (mock *TextSplitterMock) Len(input string) int
- func (mock *TextSplitterMock) LenCalls() []struct{ ... }
- func (mock *TextSplitterMock) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
- func (mock *TextSplitterMock) SplitDocumentCalls() []struct{ ... }
- func (mock *TextSplitterMock) SplitText(input string, maxChunkSize int, overlap int) []string
- func (mock *TextSplitterMock) SplitTextCalls() []struct{ ... }
- type TikTokenSplitter
- type WordSplitter
Types ¶
type TextSplitter ¶
type TextSplitter interface {
	SplitText(input string, maxChunkSize int, overlap int) []string
	SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
	Len(input string) int
}
TextSplitter splits text into chunks.
type TextSplitterMock ¶
type TextSplitterMock struct {
	// LenFunc mocks the Len method.
	LenFunc func(input string) int

	// SplitDocumentFunc mocks the SplitDocument method.
	SplitDocumentFunc func(input document.Document, maxChunkSize int, overlap int) []document.Document

	// SplitTextFunc mocks the SplitText method.
	SplitTextFunc func(input string, maxChunkSize int, overlap int) []string
	// contains filtered or unexported fields
}
TextSplitterMock is a mock implementation of TextSplitter.
func TestSomethingThatUsesTextSplitter(t *testing.T) {

	// make and configure a mocked TextSplitter
	mockedTextSplitter := &TextSplitterMock{
		LenFunc: func(input string) int {
			panic("mock out the Len method")
		},
		SplitDocumentFunc: func(input document.Document, maxChunkSize int, overlap int) []document.Document {
			panic("mock out the SplitDocument method")
		},
		SplitTextFunc: func(input string, maxChunkSize int, overlap int) []string {
			panic("mock out the SplitText method")
		},
	}

	// use mockedTextSplitter in code that requires TextSplitter
	// and then make assertions.

}
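To show how a function-field mock with call recording is typically used, here is a self-contained sketch. It does not import the package: splitter is a trimmed-down local stand-in for TextSplitter (SplitDocument is omitted so the example does not depend on the document package), splitterMock is a hand-written stand-in shaped like TextSplitterMock, and firstChunk is hypothetical code under test.

```go
package main

import (
	"fmt"
	"sync"
)

// splitter is a local stand-in for TextSplitter, without SplitDocument.
type splitter interface {
	SplitText(input string, maxChunkSize int, overlap int) []string
	Len(input string) int
}

// splitterMock is a hand-written stand-in shaped like TextSplitterMock.
type splitterMock struct {
	LenFunc       func(input string) int
	SplitTextFunc func(input string, maxChunkSize int, overlap int) []string

	mu       sync.Mutex
	lenCalls []struct{ Input string }
}

// Len records the call, then delegates to LenFunc.
func (m *splitterMock) Len(input string) int {
	m.mu.Lock()
	m.lenCalls = append(m.lenCalls, struct{ Input string }{input})
	m.mu.Unlock()
	return m.LenFunc(input)
}

// SplitText delegates to SplitTextFunc (call recording omitted for brevity).
func (m *splitterMock) SplitText(input string, maxChunkSize int, overlap int) []string {
	return m.SplitTextFunc(input, maxChunkSize, overlap)
}

// LenCalls returns a copy of the recorded calls to Len.
func (m *splitterMock) LenCalls() []struct{ Input string } {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]struct{ Input string }(nil), m.lenCalls...)
}

// firstChunk is hypothetical code under test: it returns the text unchanged
// when it fits within limit, otherwise the first chunk from the splitter.
func firstChunk(s splitter, text string, limit int) string {
	if s.Len(text) <= limit {
		return text
	}
	chunks := s.SplitText(text, limit, 0)
	if len(chunks) == 0 {
		return ""
	}
	return chunks[0]
}

func main() {
	mock := &splitterMock{
		LenFunc: func(input string) int { return len(input) },
		SplitTextFunc: func(input string, maxChunkSize int, overlap int) []string {
			return []string{input[:maxChunkSize]}
		},
	}
	fmt.Println(firstChunk(mock, "hello world", 5)) // hello
	fmt.Println(len(mock.LenCalls()))               // 1
}
```

In a real test, the same assertions would be made with TextSplitterMock itself and its LenCalls, SplitTextCalls, and SplitDocumentCalls methods.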
func (*TextSplitterMock) Len ¶
func (mock *TextSplitterMock) Len(input string) int
Len calls LenFunc.
func (*TextSplitterMock) LenCalls ¶
func (mock *TextSplitterMock) LenCalls() []struct {
	Input string
}
LenCalls gets all the calls that were made to Len. Check the length with:
len(mockedTextSplitter.LenCalls())
func (*TextSplitterMock) SplitDocument ¶
func (mock *TextSplitterMock) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
SplitDocument calls SplitDocumentFunc.
func (*TextSplitterMock) SplitDocumentCalls ¶
func (mock *TextSplitterMock) SplitDocumentCalls() []struct {
	Input        document.Document
	MaxChunkSize int
	Overlap      int
}
SplitDocumentCalls gets all the calls that were made to SplitDocument. Check the length with:
len(mockedTextSplitter.SplitDocumentCalls())
func (*TextSplitterMock) SplitText ¶
func (mock *TextSplitterMock) SplitText(input string, maxChunkSize int, overlap int) []string
SplitText calls SplitTextFunc.
func (*TextSplitterMock) SplitTextCalls ¶
func (mock *TextSplitterMock) SplitTextCalls() []struct {
	Input        string
	MaxChunkSize int
	Overlap      int
}
SplitTextCalls gets all the calls that were made to SplitText. Check the length with:
len(mockedTextSplitter.SplitTextCalls())
type TikTokenSplitter ¶
type TikTokenSplitter struct {
// contains filtered or unexported fields
}
func NewTikTokenSplitter ¶
func NewTikTokenSplitter(modelName string) (*TikTokenSplitter, error)
NewTikTokenSplitter creates a new TikTokenSplitter instance. If modelName is empty, it defaults to gpt-3.5-turbo-0301.
func (*TikTokenSplitter) Len ¶
func (T *TikTokenSplitter) Len(input string) int
func (*TikTokenSplitter) SplitDocument ¶
func (T *TikTokenSplitter) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
SplitDocument creates chunks whose length does not exceed maxChunkSize. The document metadata is copied to each chunk.
type WordSplitter ¶
type WordSplitter struct{}
func (*WordSplitter) Len ¶
func (W *WordSplitter) Len(input string) int
func (*WordSplitter) SplitDocument ¶
func (W *WordSplitter) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
SplitDocument creates chunks whose length does not exceed maxChunkSize. The document metadata is copied to each chunk.