fasttext

Version: v0.0.0-...-7a691b4
Published: Oct 31, 2020 License: MIT Imports: 9 Imported by: 4

README

go-fasttext

Usage Guide and API Documentation

This package provides a Go API for Facebook's fastText word embedding dataset, with the data stored in a persistent SQLite database.

Documentation

Overview

Package fasttext provides a simple wrapper for the Facebook fastText dataset (https://fasttext.cc/docs/en/crawl-vectors.html). It allows fast lookup of word embeddings from a persistent data store (SQLite3).

Installation

go get -u github.com/ekzhu/go-fasttext

After downloading a .vec data file from the fastText project, you can initialize the SQLite3 database (in your code):

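import (
	"log"
	"os"
	"github.com/ekzhu/go-fasttext"
	_ "github.com/mattn/go-sqlite3" // SQLite3 driver
)
...
ft := fasttext.NewFastText("/path/to/sqlite3/file")
vecFile, err := os.Open("/path/to/word/embedding/.vec/file")
if err != nil {
	log.Fatal(err)
}
defer vecFile.Close()
if err := ft.BuildDB(vecFile); err != nil {
	log.Fatal(err)
}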

This will create a new file on your disk for the SQLite3 database. Once the above step is finished, you can start looking up word embeddings (in your code):

emb, err := ft.GetEmb("king")
if err != nil {
	// Stop here: emb is not usable when the lookup fails.
	fmt.Println(err)
	return
}
fmt.Println(emb)

Each word embedding vector is a slice of float32 with a length of 300 (the Dim constant).
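
A common next step with two such vectors is to compare them. The following is a minimal sketch of cosine similarity over float32 slices; the `cosine` helper is hypothetical and not part of go-fasttext, and the short vectors in `main` merely stand in for the 300-entry slices that GetEmb returns.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two embedding vectors.
// Hypothetical helper for illustration; it is not part of go-fasttext.
func cosine(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("vectors have different lengths")
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("an all-zero vector has no direction")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

func main() {
	// Stand-ins for vectors returned by GetEmb; real ones have 300 entries.
	king := []float32{1, 2, 3}
	queen := []float32{2, 4, 6}
	sim, err := cosine(king, queen)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%.3f\n", sim)
}
```

Accumulating in float64 keeps rounding error small when summing 300 products of float32 values.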

Note that you only need to initialize the SQLite3 database once. The next time you use it you can skip the call to BuildDB.

For faster querying during runtime, you can use an in-memory database.

ft := fasttext.NewFastTextInMem("/path/to/sqlite3/file")

This creates an in-memory SQLite3 database that is a copy of the on-disk one. Queries against the in-memory version are much faster, but loading the database takes a few minutes.

Constants

const (
	// TableName used in SQLite3
	TableName = "fasttext"
	// Dim is the number of dimensions in FastText word embedding vectors
	Dim = 300
)

Variables

var (
	// ErrNoEmbFound ...
	ErrNoEmbFound = errors.New("No embedding found for the given word")
	// ByteOrder is for the serialization of the embedding vector in
	// SQLite3 database.
	ByteOrder = binary.BigEndian
)
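
To illustrate what a byte order means for a stored embedding, here is a minimal sketch that packs a float32 slice into big-endian bytes and reads it back using the standard encoding/binary package. The `encodeVec` and `decodeVec` helpers are hypothetical; the source only states that ByteOrder is used for serialization, so the exact layout the package writes to SQLite3 is an assumption here.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeVec packs a vector into bytes, 4 bytes per float32, big-endian.
// Hypothetical helper; the package's real storage layout may differ.
func encodeVec(vec []float32) ([]byte, error) {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.BigEndian, vec); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeVec reverses encodeVec for a vector with dim entries.
func decodeVec(data []byte, dim int) ([]float32, error) {
	vec := make([]float32, dim)
	if err := binary.Read(bytes.NewReader(data), binary.BigEndian, vec); err != nil {
		return nil, err
	}
	return vec, nil
}

func main() {
	blob, err := encodeVec([]float32{1.5, -2.25, 0})
	if err != nil {
		fmt.Println(err)
		return
	}
	vec, err := decodeVec(blob, 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(blob), vec)
}
```

Under this layout a 300-dimensional vector would occupy 1200 bytes, and a reader must use the same byte order as the writer to recover the original values.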

Functions

This section is empty.

Types

type FastText

type FastText struct {
	// contains filtered or unexported fields
}

FastText is a session. A single FastText session cannot be shared among multiple threads: in a multi-threaded setting (for example, several goroutines), each thread must have its own FastText session.
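
One way to respect this rule is to let every worker goroutine open its own session. The sketch below is an illustration under stated assumptions: the `session` interface, `lookupAll`, and `fakeSession` are hypothetical names introduced here, and the fake stands in for a real session so the example runs without a database.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// session lists the two methods this sketch needs (hypothetical interface).
type session interface {
	GetEmb(word string) ([]float32, error)
	Close() error
}

// lookupAll fans words out to n workers; each worker opens its own session.
func lookupAll(words []string, n int, open func() session) map[string][]float32 {
	if n < 1 {
		n = 1
	}
	jobs := make(chan string)
	out := make(map[string][]float32)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := open() // one session per goroutine, never shared
			defer s.Close()
			for w := range jobs {
				emb, err := s.GetEmb(w)
				if err != nil {
					continue // skip words without an embedding
				}
				mu.Lock()
				out[w] = emb
				mu.Unlock()
			}
		}()
	}
	for _, w := range words {
		jobs <- w
	}
	close(jobs)
	wg.Wait()
	return out
}

// fakeSession stands in for a real session so the sketch runs anywhere.
type fakeSession struct{}

func (fakeSession) GetEmb(word string) ([]float32, error) {
	if word == "" {
		return nil, errors.New("no embedding")
	}
	return []float32{float32(len(word))}, nil
}

func (fakeSession) Close() error { return nil }

func main() {
	res := lookupAll([]string{"king", "queen", ""}, 2, func() session { return fakeSession{} })
	fmt.Println(len(res))
}
```

With the real package, `open` could return `fasttext.NewFastText(path)`, since the GetEmb and Close signatures documented for *FastText match this interface.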

func NewFastText

func NewFastText(dbFilename string) *FastText

NewFastText starts a new FastText session given the location of the SQLite3 database file.

func NewFastTextInMem

func NewFastTextInMem(dbFilename string) *FastText

NewFastTextInMem creates a new FastText session that uses an in-memory database for faster query time. This function loads the on-disk SQLite3 database (given by dbFilename) into an in-memory SQLite3 database, which takes a few minutes to finish.

func (*FastText) BuildDB

func (ft *FastText) BuildDB(wordEmbFile io.Reader) error

BuildDB initializes the SQLite3 database by importing the word embeddings from the .vec file downloaded from https://fasttext.cc/docs/en/crawl-vectors.html

func (*FastText) Close

func (ft *FastText) Close() error

Close must be called when you have finished using this FastText session.

func (*FastText) GetEmb

func (ft *FastText) GetEmb(word string) ([]float32, error)

GetEmb returns the word embedding of the given word.
