vectorizer

package

Version: v0.0.14 (latest) | Published: Mar 4, 2026 | License: MIT

Repository

github.com/happyhackingspace/dit

Documentation ¶

Overview ¶

Package vectorizer provides text vectorization utilities matching sklearn behavior.

Index ¶

func EnglishStopWords() map[string]bool
type CountVectorizer
- func NewCountVectorizer(ngramRange [2]int, binary bool, analyzer string, minDF int) *CountVectorizer
type DictVectorizer
- func NewDictVectorizer() *DictVectorizer
type SparseVector
- func ConcatSparse(vectors []SparseVector) SparseVector
- func NewSparseVector(dim int) SparseVector
type TfidfVectorizer
- func NewTfidfVectorizer(ngramRange [2]int, minDF int, binary bool, analyzer string, ...) *TfidfVectorizer

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func EnglishStopWords ¶

func EnglishStopWords() map[string]bool

EnglishStopWords returns sklearn's default English stop words set.
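A set typed as map[string]bool is normally used by testing each token for membership. The sketch below shows that pattern with a standalone helper; filterStopWords and the three-word set are hypothetical stand-ins for illustration, not part of this package, and the lowercasing step is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// filterStopWords drops every token whose lowercase form is marked true
// in the stop set. Hypothetical helper, not part of the package.
func filterStopWords(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if !stop[strings.ToLower(tok)] {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	// Tiny made-up set standing in for the map returned by EnglishStopWords.
	stop := map[string]bool{"the": true, "a": true, "of": true}
	fmt.Println(filterStopWords([]string{"The", "name", "of", "a", "field"}, stop)) // [name field]
}
```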

Types ¶

type CountVectorizer ¶

type CountVectorizer struct {
	Vocabulary map[string]int `json:"vocabulary"`
	NgramRange [2]int         `json:"ngram_range"`
	Binary     bool           `json:"binary"`
	Analyzer   string         `json:"analyzer"` // "word" or "char_wb"
	MinDF      int            `json:"min_df"`
}

CountVectorizer converts text to token count vectors.
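For the "char_wb" analyzer, sklearn pads each whitespace-separated word with one space on either side and emits character n-grams that never cross a word boundary. The following is a minimal standalone sketch of that rule. charWBNgrams is a hypothetical name, and whether this package reproduces the rule exactly is assumed from its stated goal of matching sklearn, not verified against its source.

```go
package main

import (
	"fmt"
	"strings"
)

// charWBNgrams returns character n-grams for n in [minN, maxN], built
// separately inside each space-padded word (sklearn's char_wb rule).
// It assumes minN >= 1.
func charWBNgrams(text string, minN, maxN int) []string {
	var ngrams []string
	for _, word := range strings.Fields(text) {
		w := []rune(" " + word + " ")
		for n := minN; n <= maxN; n++ {
			end := n
			if end > len(w) {
				end = len(w)
			}
			ngrams = append(ngrams, string(w[:end]))
			offset := 0
			for offset+n < len(w) {
				offset++
				ngrams = append(ngrams, string(w[offset:offset+n]))
			}
			if offset == 0 {
				// The padded word is no longer than n: count it only once.
				break
			}
		}
	}
	return ngrams
}

func main() {
	fmt.Printf("%q\n", charWBNgrams("hi", 2, 3)) // [" h" "hi" "i " " hi" "hi "]
}
```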

func NewCountVectorizer ¶

func NewCountVectorizer(ngramRange [2]int, binary bool, analyzer string, minDF int) *CountVectorizer

NewCountVectorizer creates a CountVectorizer with the given n-gram range, binary flag, analyzer, and minimum document frequency.

func (*CountVectorizer) Fit ¶

func (cv *CountVectorizer) Fit(corpus []string)

Fit builds the vocabulary from a corpus.
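As a rough illustration of vocabulary building with a minimum document frequency, the sketch below counts how many documents contain each term, keeps terms that reach minDF, and assigns column indices in sorted term order as sklearn does. buildVocabulary is a hypothetical function, the whitespace tokenizer is a simplification, and the index ordering used by this package is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildVocabulary maps each term that occurs in at least minDF documents
// to a column index, assigning indices in sorted term order.
func buildVocabulary(corpus []string, minDF int) map[string]int {
	docFreq := map[string]int{}
	for _, doc := range corpus {
		seen := map[string]bool{}
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if !seen[tok] {
				seen[tok] = true
				docFreq[tok]++ // count each term once per document
			}
		}
	}
	terms := make([]string, 0, len(docFreq))
	for term, df := range docFreq {
		if df >= minDF {
			terms = append(terms, term)
		}
	}
	sort.Strings(terms)
	vocab := make(map[string]int, len(terms))
	for i, term := range terms {
		vocab[term] = i
	}
	return vocab
}

func main() {
	corpus := []string{"search query", "search box", "login box"}
	fmt.Println(buildVocabulary(corpus, 2)) // map[box:0 search:1]
}
```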

func (*CountVectorizer) FitTransform ¶

func (cv *CountVectorizer) FitTransform(corpus []string) []SparseVector

FitTransform fits the vocabulary and transforms the corpus.

func (*CountVectorizer) MarshalJSON ¶

func (cv *CountVectorizer) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler.

func (*CountVectorizer) Transform ¶

func (cv *CountVectorizer) Transform(text string) SparseVector

Transform converts a single document to a sparse vector.
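The effect of NgramRange and Binary on a count vector can be sketched as follows for the "word" analyzer. wordNgrams and countTerms are hypothetical helpers written for this illustration, and the result is returned as a map from column index to value for brevity; the package itself returns a SparseVector.

```go
package main

import (
	"fmt"
	"strings"
)

// wordNgrams joins every run of n consecutive tokens with a space,
// for each n in [minN, maxN].
func wordNgrams(tokens []string, minN, maxN int) []string {
	var out []string
	for n := minN; n <= maxN; n++ {
		for i := 0; i+n <= len(tokens); i++ {
			out = append(out, strings.Join(tokens[i:i+n], " "))
		}
	}
	return out
}

// countTerms returns column index -> count for terms found in vocab.
// Terms outside the vocabulary are ignored. With binary set, any
// non-zero count is recorded as 1.
func countTerms(terms []string, vocab map[string]int, binary bool) map[int]float64 {
	counts := map[int]float64{}
	for _, t := range terms {
		if i, ok := vocab[t]; ok {
			if binary {
				counts[i] = 1
			} else {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	vocab := map[string]int{"log": 0, "log in": 1, "in": 2}
	terms := wordNgrams(strings.Fields("log in log out"), 1, 2)
	fmt.Println(countTerms(terms, vocab, false)) // map[0:2 1:1 2:1]
	fmt.Println(countTerms(terms, vocab, true))  // map[0:1 1:1 2:1]
}
```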

func (*CountVectorizer) UnmarshalJSON ¶

func (cv *CountVectorizer) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler.

func (*CountVectorizer) VocabSize ¶

func (cv *CountVectorizer) VocabSize() int

VocabSize returns the vocabulary size.

type DictVectorizer ¶

type DictVectorizer struct {
	FeatureNames []string       `json:"feature_names"`
	FeatureIndex map[string]int `json:"feature_index"`
}

DictVectorizer converts feature dicts to sparse vectors.

func NewDictVectorizer ¶

func NewDictVectorizer() *DictVectorizer

NewDictVectorizer creates an empty DictVectorizer.

func (*DictVectorizer) Fit ¶

func (dv *DictVectorizer) Fit(data []map[string]any)

Fit builds the feature mapping from a list of feature dicts.

func (*DictVectorizer) FitTransform ¶

func (dv *DictVectorizer) FitTransform(data []map[string]any) []SparseVector

FitTransform fits and transforms the data.

func (*DictVectorizer) Transform ¶

func (dv *DictVectorizer) Transform(d map[string]any) SparseVector

Transform converts a feature dict to a sparse vector.
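In sklearn's DictVectorizer, a string value becomes a one-hot feature named "key=value", while a numeric value is stored under the key itself. The sketch below illustrates that convention with standalone helpers (expand, fitFeatures, transform are hypothetical names). It returns a dense row for readability, and it is an assumption that this package handles value types the same way.

```go
package main

import (
	"fmt"
	"sort"
)

// expand turns one dict entry into a feature name and a numeric value.
// Strings become "key=value" with value 1; numbers and bools keep the key.
func expand(key string, val any) (string, float64, bool) {
	switch v := val.(type) {
	case string:
		return key + "=" + v, 1, true
	case bool:
		if v {
			return key, 1, true
		}
		return key, 0, true
	case int:
		return key, float64(v), true
	case float64:
		return key, v, true
	}
	return "", 0, false // unsupported type: skipped in this sketch
}

// fitFeatures collects all feature names seen in data, sorted.
func fitFeatures(data []map[string]any) []string {
	seen := map[string]bool{}
	for _, d := range data {
		for k, v := range d {
			if name, _, ok := expand(k, v); ok {
				seen[name] = true
			}
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// transform maps one dict onto the fitted feature order.
func transform(d map[string]any, names []string) []float64 {
	index := make(map[string]int, len(names))
	for i, n := range names {
		index[n] = i
	}
	row := make([]float64, len(names))
	for k, v := range d {
		if name, x, ok := expand(k, v); ok {
			if i, found := index[name]; found {
				row[i] = x
			}
		}
	}
	return row
}

func main() {
	data := []map[string]any{
		{"tag": "input", "length": 3},
		{"tag": "select", "length": 7},
	}
	names := fitFeatures(data)
	fmt.Println(names)                     // [length tag=input tag=select]
	fmt.Println(transform(data[1], names)) // [7 0 1]
}
```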

func (*DictVectorizer) VocabSize ¶

func (dv *DictVectorizer) VocabSize() int

VocabSize returns the number of features.

type SparseVector ¶

type SparseVector struct {
	Indices []int
	Values  []float64
	Dim     int
}

SparseVector represents a sparse float64 vector.

func ConcatSparse ¶

func ConcatSparse(vectors []SparseVector) SparseVector

ConcatSparse concatenates multiple sparse vectors with offsets into a single vector.
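Concatenating with offsets means that each input vector's indices are shifted by the total dimension of the vectors before it. A minimal sketch, using a local sparseVec type that mirrors the documented SparseVector fields; concatSparse here is an illustrative re-implementation, not the package's code.

```go
package main

import "fmt"

// sparseVec is a local stand-in with the same fields as SparseVector.
type sparseVec struct {
	Indices []int
	Values  []float64
	Dim     int
}

// concatSparse appends the vectors end to end, shifting each vector's
// indices by the combined dimension of all vectors before it.
func concatSparse(vectors []sparseVec) sparseVec {
	out := sparseVec{}
	offset := 0
	for _, v := range vectors {
		for i, idx := range v.Indices {
			out.Indices = append(out.Indices, idx+offset)
			out.Values = append(out.Values, v.Values[i])
		}
		offset += v.Dim
	}
	out.Dim = offset
	return out
}

func main() {
	a := sparseVec{Indices: []int{0, 2}, Values: []float64{1, 3}, Dim: 3}
	b := sparseVec{Indices: []int{1}, Values: []float64{5}, Dim: 2}
	c := concatSparse([]sparseVec{a, b})
	fmt.Println(c.Indices, c.Values, c.Dim) // [0 2 4] [1 3 5] 5
}
```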

func NewSparseVector ¶

func NewSparseVector(dim int) SparseVector

NewSparseVector creates an empty sparse vector with the given dimension.

func (SparseVector) Dot ¶

func (sv SparseVector) Dot(dense []float64) float64

Dot computes the dot product with a dense vector.

func (SparseVector) L2Norm ¶

func (sv SparseVector) L2Norm() float64

L2Norm returns the L2 norm of the sparse vector.
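Both Dot and L2Norm only need to visit the stored entries. The sketch below shows the arithmetic on a local sparseVec stand-in; skipping indices that fall outside the dense slice is an assumption made here for safety, not documented behavior.

```go
package main

import (
	"fmt"
	"math"
)

// sparseVec is a local stand-in with the same fields as SparseVector.
type sparseVec struct {
	Indices []int
	Values  []float64
	Dim     int
}

// dot sums Values[i] * dense[Indices[i]] over the stored entries.
// Indices beyond len(dense) are skipped (assumption of this sketch).
func (sv sparseVec) dot(dense []float64) float64 {
	sum := 0.0
	for i, idx := range sv.Indices {
		if idx >= 0 && idx < len(dense) {
			sum += sv.Values[i] * dense[idx]
		}
	}
	return sum
}

// l2norm is the square root of the sum of squared stored values.
func (sv sparseVec) l2norm() float64 {
	sum := 0.0
	for _, v := range sv.Values {
		sum += v * v
	}
	return math.Sqrt(sum)
}

func main() {
	sv := sparseVec{Indices: []int{0, 3}, Values: []float64{3, 4}, Dim: 4}
	fmt.Println(sv.dot([]float64{1, 0, 0, 2})) // 11
	fmt.Println(sv.l2norm())                   // 5
}
```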

func (SparseVector) Nnz ¶

func (sv SparseVector) Nnz() int

Nnz returns the number of non-zero entries.

func (*SparseVector) Set ¶

func (sv *SparseVector) Set(idx int, val float64)

Set adds or updates a value at the given index.

func (SparseVector) ToDense ¶

func (sv SparseVector) ToDense() []float64

ToDense converts to a dense float64 slice.
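A short sketch of how Set and ToDense might interact, again on a local sparseVec stand-in. It assumes that setting an index that is already stored overwrites the earlier value, and that entries with an index outside Dim are ignored when densifying; neither detail is stated in the documentation.

```go
package main

import "fmt"

// sparseVec is a local stand-in with the same fields as SparseVector.
type sparseVec struct {
	Indices []int
	Values  []float64
	Dim     int
}

// set overwrites the value at idx if it is already stored,
// otherwise appends a new entry.
func (sv *sparseVec) set(idx int, val float64) {
	for i, existing := range sv.Indices {
		if existing == idx {
			sv.Values[i] = val
			return
		}
	}
	sv.Indices = append(sv.Indices, idx)
	sv.Values = append(sv.Values, val)
}

// toDense expands the stored entries into a slice of length Dim.
func (sv sparseVec) toDense() []float64 {
	dense := make([]float64, sv.Dim)
	for i, idx := range sv.Indices {
		if idx >= 0 && idx < sv.Dim {
			dense[idx] = sv.Values[i]
		}
	}
	return dense
}

func main() {
	sv := sparseVec{Dim: 4}
	sv.set(1, 2)
	sv.set(3, 1)
	sv.set(1, 5) // updates the existing entry at index 1
	fmt.Println(sv.Indices, sv.Values) // [1 3] [5 1]
	fmt.Println(sv.toDense())          // [0 5 0 1]
}
```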

type TfidfVectorizer ¶

type TfidfVectorizer struct {
	CountVec  *CountVectorizer `json:"count_vec"`
	IDF       []float64        `json:"idf"`
	StopWords map[string]bool  `json:"stop_words,omitempty"`
}

TfidfVectorizer converts text to TF-IDF weighted vectors. It uses binary term frequency (binary=true, matching Formasaurus): a term's value is IDF[term] if the term is present in the document and 0 otherwise.
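The binary weighting rule can be sketched with sklearn's default smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. That this package uses the smoothed formula is an assumption based on its stated goal of matching sklearn. smoothIDF and binaryTfidf are hypothetical helpers, the tokenizer is a plain whitespace split, and any later normalization step is left out.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// smoothIDF computes ln((1+n)/(1+df)) + 1 for every vocabulary column.
func smoothIDF(corpus []string, vocab map[string]int) []float64 {
	df := make([]float64, len(vocab))
	for _, doc := range corpus {
		seen := map[int]bool{}
		for _, tok := range strings.Fields(doc) {
			if i, ok := vocab[tok]; ok && !seen[i] {
				seen[i] = true
				df[i]++ // one count per document, however often the term repeats
			}
		}
	}
	n := float64(len(corpus))
	idf := make([]float64, len(vocab))
	for i := range idf {
		idf[i] = math.Log((1+n)/(1+df[i])) + 1
	}
	return idf
}

// binaryTfidf returns a dense row holding idf[i] where term i occurs
// in doc and 0 elsewhere. Repeated terms do not increase the value.
func binaryTfidf(doc string, vocab map[string]int, idf []float64) []float64 {
	row := make([]float64, len(vocab))
	for _, tok := range strings.Fields(doc) {
		if i, ok := vocab[tok]; ok {
			row[i] = idf[i]
		}
	}
	return row
}

func main() {
	corpus := []string{"email address", "email email password", "password"}
	vocab := map[string]int{"address": 0, "email": 1, "password": 2}
	idf := smoothIDF(corpus, vocab)
	// "email" appears twice in the query but still contributes idf[1] once.
	fmt.Println(binaryTfidf("email email address", vocab, idf))
}
```

With three documents, "address" (df = 1) receives ln(4/2) + 1 and "email" (df = 2) receives ln(4/3) + 1, so the rarer term carries the larger weight.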

func NewTfidfVectorizer ¶

func NewTfidfVectorizer(ngramRange [2]int, minDF int, binary bool, analyzer string, stopWords map[string]bool) *TfidfVectorizer

NewTfidfVectorizer creates a TfidfVectorizer.

func (*TfidfVectorizer) Fit ¶

func (tv *TfidfVectorizer) Fit(corpus []string)

Fit computes IDF values from a corpus.

func (*TfidfVectorizer) FitTransform ¶

func (tv *TfidfVectorizer) FitTransform(corpus []string) []SparseVector

FitTransform fits and transforms the corpus.

func (*TfidfVectorizer) Transform ¶

func (tv *TfidfVectorizer) Transform(text string) SparseVector

Transform converts a single document to a TF-IDF sparse vector.

func (*TfidfVectorizer) VocabSize ¶

func (tv *TfidfVectorizer) VocabSize() int

VocabSize returns the vocabulary size.
