textsplitter

package
v0.0.0-...-1ea5cf0
Published: Jun 28, 2023 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Language models have token limits, and a text splitter makes it easier to keep text and prompts within them. A text splitter splits text into chunks whose size does not exceed maxChunkSize.

Two types of text splitter are available:

1. Word splitter: splits the text word by word and makes sure the chunk size does not exceed maxChunkSize. Here maxChunkSize is measured in characters.
2. Tiktoken splitter (arguably better named tiktoken word splitter): splits the text word by word and makes sure the chunk size does not exceed maxChunkSize. Here maxChunkSize is measured in tokens, as defined by tiktoken.
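The word-by-word strategy described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: splitWords is a hypothetical helper, it ignores the overlap parameter, and it assumes that a single word longer than maxChunkSize is kept whole in its own chunk.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitWords is a hypothetical helper, not part of this package. It packs
// whitespace-separated words into chunks of at most maxChunkSize characters.
// Assumption: a word longer than maxChunkSize is emitted alone, unsplit.
func splitWords(input string, maxChunkSize int) []string {
	var chunks []string
	var current []string
	size := 0 // characters in the current chunk, including joining spaces

	for _, word := range strings.Fields(input) {
		n := utf8.RuneCountInString(word)
		extra := n
		if len(current) > 0 {
			extra++ // one space to join with the previous word
		}
		// Flush the current chunk when adding this word would exceed the limit.
		if len(current) > 0 && size+extra > maxChunkSize {
			chunks = append(chunks, strings.Join(current, " "))
			current, size = nil, 0
			extra = n
		}
		current = append(current, word)
		size += extra
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, " "))
	}
	return chunks
}

func main() {
	// Prints "the quick", "brown fox" and "jumps", one per line.
	for _, c := range splitWords("the quick brown fox jumps", 10) {
		fmt.Printf("%q\n", c)
	}
}
```

A token-based variant would presumably keep the same loop and replace the character count with a token count from a tiktoken encoder.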

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type TextSplitter

type TextSplitter interface {
	SplitText(input string, maxChunkSize int, overlap int) []string
	SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document
	Len(input string) int
}
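As a rough illustration of how calling code might depend on this interface, the sketch below uses a trimmed-down local copy of it. Everything here is an assumption made for the example: chunker omits SplitDocument because document.Document is not imported, runeSplitter and fitPrompt are hypothetical, and overlap is taken to mean the number of runes repeated from the end of the previous chunk, which the documentation does not confirm.

```go
package main

import "fmt"

// chunker is a trimmed-down stand-in for TextSplitter. SplitDocument is
// omitted because this sketch does not import document.Document.
type chunker interface {
	SplitText(input string, maxChunkSize int, overlap int) []string
	Len(input string) int
}

// runeSplitter is a hypothetical implementation that measures size in runes
// and cuts at fixed offsets, ignoring word boundaries.
type runeSplitter struct{}

func (runeSplitter) Len(input string) int { return len([]rune(input)) }

// SplitText assumes overlap is the number of runes shared by adjacent chunks.
func (runeSplitter) SplitText(input string, maxChunkSize int, overlap int) []string {
	runes := []rune(input)
	step := maxChunkSize - overlap
	if maxChunkSize <= 0 || step <= 0 {
		return nil // invalid sizes; a real implementation may behave differently
	}
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + maxChunkSize
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break
		}
	}
	return chunks
}

// fitPrompt returns input unchanged when it fits within limit,
// otherwise only its first chunk.
func fitPrompt(s chunker, input string, limit int) string {
	if s.Len(input) <= limit {
		return input
	}
	chunks := s.SplitText(input, limit, 0)
	if len(chunks) == 0 {
		return ""
	}
	return chunks[0]
}

func main() {
	s := runeSplitter{}
	fmt.Println(s.SplitText("abcdefghij", 4, 1)) // [abcd defg ghij]
	fmt.Println(fitPrompt(s, "hello world", 5))  // hello
}
```

Accepting the interface rather than a concrete splitter lets the same caller work with a character-based or a token-based splitter.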

TextSplitter splits text into chunks.

type TextSplitterMock

type TextSplitterMock struct {
	// LenFunc mocks the Len method.
	LenFunc func(input string) int

	// SplitDocumentFunc mocks the SplitDocument method.
	SplitDocumentFunc func(input document.Document, maxChunkSize int, overlap int) []document.Document

	// SplitTextFunc mocks the SplitText method.
	SplitTextFunc func(input string, maxChunkSize int, overlap int) []string
	// contains filtered or unexported fields
}

TextSplitterMock is a mock implementation of TextSplitter.

func TestSomethingThatUsesTextSplitter(t *testing.T) {

	// make and configure a mocked TextSplitter
	mockedTextSplitter := &TextSplitterMock{
		LenFunc: func(input string) int {
			panic("mock out the Len method")
		},
		SplitDocumentFunc: func(input document.Document, maxChunkSize int, overlap int) []document.Document {
			panic("mock out the SplitDocument method")
		},
		SplitTextFunc: func(input string, maxChunkSize int, overlap int) []string {
			panic("mock out the SplitText method")
		},
	}

	// use mockedTextSplitter in code that requires TextSplitter
	// and then make assertions.

}

func (*TextSplitterMock) Len

func (mock *TextSplitterMock) Len(input string) int

Len calls LenFunc.

func (*TextSplitterMock) LenCalls

func (mock *TextSplitterMock) LenCalls() []struct {
	Input string
}

LenCalls gets all the calls that were made to Len. Check the length with:

len(mockedTextSplitter.LenCalls())

func (*TextSplitterMock) SplitDocument

func (mock *TextSplitterMock) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document

SplitDocument calls SplitDocumentFunc.

func (*TextSplitterMock) SplitDocumentCalls

func (mock *TextSplitterMock) SplitDocumentCalls() []struct {
	Input        document.Document
	MaxChunkSize int
	Overlap      int
}

SplitDocumentCalls gets all the calls that were made to SplitDocument. Check the length with:

len(mockedTextSplitter.SplitDocumentCalls())

func (*TextSplitterMock) SplitText

func (mock *TextSplitterMock) SplitText(input string, maxChunkSize int, overlap int) []string

SplitText calls SplitTextFunc.

func (*TextSplitterMock) SplitTextCalls

func (mock *TextSplitterMock) SplitTextCalls() []struct {
	Input        string
	MaxChunkSize int
	Overlap      int
}

SplitTextCalls gets all the calls that were made to SplitText. Check the length with:

len(mockedTextSplitter.SplitTextCalls())
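The call-recording pattern behind the Func fields and Calls methods can be sketched with a hand-written, simplified mock. miniMock and splitTextCall are hypothetical names for this illustration, and the generated TextSplitterMock may differ in its details (for example in how it locks or in what it does when a Func field is nil).

```go
package main

import (
	"fmt"
	"sync"
)

// splitTextCall records the arguments of one SplitText invocation.
type splitTextCall struct {
	Input        string
	MaxChunkSize int
	Overlap      int
}

// miniMock is a simplified, hand-written stand-in for TextSplitterMock
// that covers only SplitText.
type miniMock struct {
	// SplitTextFunc supplies the behaviour of SplitText for a given test.
	SplitTextFunc func(input string, maxChunkSize int, overlap int) []string

	mu    sync.Mutex
	calls []splitTextCall
}

// SplitText records its arguments, then delegates to SplitTextFunc.
func (m *miniMock) SplitText(input string, maxChunkSize int, overlap int) []string {
	m.mu.Lock()
	m.calls = append(m.calls, splitTextCall{input, maxChunkSize, overlap})
	m.mu.Unlock()
	return m.SplitTextFunc(input, maxChunkSize, overlap)
}

// SplitTextCalls returns a copy of every call recorded so far.
func (m *miniMock) SplitTextCalls() []splitTextCall {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]splitTextCall(nil), m.calls...)
}

func main() {
	mock := &miniMock{
		SplitTextFunc: func(input string, _ int, _ int) []string {
			return []string{input} // canned result: one chunk, unchanged
		},
	}
	mock.SplitText("hello", 100, 0)

	calls := mock.SplitTextCalls()
	fmt.Println(len(calls), calls[0].Input, calls[0].MaxChunkSize) // 1 hello 100
}
```

In a test, the recorded calls let you assert both how often the splitter was invoked and with which arguments.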

type TikTokenSplitter

type TikTokenSplitter struct {
	// contains filtered or unexported fields
}

func NewTikTokenSplitter

func NewTikTokenSplitter(modelName string) (*TikTokenSplitter, error)

NewTikTokenSplitter creates a new TikTokenSplitter instance. If modelName is empty, the default model gpt-3.5-turbo-0301 is used.

func (*TikTokenSplitter) Len

func (T *TikTokenSplitter) Len(input string) int

func (*TikTokenSplitter) SplitDocument

func (T *TikTokenSplitter) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document

SplitDocument creates chunks whose length does not exceed maxChunkSize. The document metadata is copied to each chunk.

func (*TikTokenSplitter) SplitText

func (T *TikTokenSplitter) SplitText(input string, maxChunkSize int, overlap int) []string

SplitText creates chunks whose length does not exceed maxChunkSize.

type WordSplitter

type WordSplitter struct {
}

func (*WordSplitter) Len

func (W *WordSplitter) Len(input string) int

func (*WordSplitter) SplitDocument

func (W *WordSplitter) SplitDocument(input document.Document, maxChunkSize int, overlap int) []document.Document

SplitDocument creates chunks whose length does not exceed maxChunkSize. The document metadata is copied to each chunk.

func (*WordSplitter) SplitText

func (W *WordSplitter) SplitText(input string, maxChunkSize int, overlap int) []string

SplitText creates word batches whose length does not exceed maxChunkSize.
