imdb

package v0.23.0 (not the latest version of its module)

Published: Sep 21, 2025 License: Apache-2.0 Imports: 38 Imported by: 0


Overview

Package imdb contains code to download and prepare datasets with IMDB Dataset of 50k Movie Reviews.

This can be used to train models, but the package includes no models per se. See a demo of model training in the sub-package `demo`.

Index

Constants

const (
	DownloadURL  = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
	LocalTarFile = "aclImdb_v1.tar.gz"
	TarHash      = `c40f74a18d3b61f90feba1e17730e0d38e8b97c05fde7008942e91923d1658fe`
	LocalDir     = "aclImdb"
	BinaryFile   = "aclImdb.bin"
)

Variables

var (
	// IncludeSeparators indicates whether, when parsing files, tokens should be
	// created for the separators (commas, dots, etc.).
	IncludeSeparators = false

	// CaseSensitive indicates whether token collection should be case-sensitive.
	CaseSensitive = false

	// LoadedVocab is materialized after calling Download.
	LoadedVocab *Vocab

	// LoadedExamples is materialized after calling Download. It is based on LoadedVocab.
	// Once loaded it should remain immutable.
	LoadedExamples []*Example
)
var (
	// ValidModels is the list of model types supported.
	ValidModels = map[string]train.ModelFn{
		"bow":         BagOfWordsModelGraph,
		"cnn":         Conv1DModelGraph,
		"transformer": TransformerModelGraph,
	}

	// ParamsExcludedFromLoading is the list of parameters (see CreateDefaultContext) that shouldn't be saved
	// along with the model's checkpoints, and may be overwritten in further training sessions.
	ParamsExcludedFromLoading = []string{
		"data_dir", "train_steps", "num_checkpoints", "plots",
	}
)
var DType = dtypes.Float32

DType used in the model.

Functions

func BagOfWordsModelGraph added in v0.11.0

func BagOfWordsModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

BagOfWordsModelGraph builds the computation graph for the "bag of words" model: simply the sum of the embeddings for each token included.

func Conv1DModelGraph added in v0.11.0

func Conv1DModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

Conv1DModelGraph implements a convolution (1D) based model for the IMDB dataset.

func CreateDefaultContext added in v0.11.0

func CreateDefaultContext() *context.Context

CreateDefaultContext sets the context with default hyperparameters to use with TrainModel.

func Download

func Download(baseDir string) error

Download downloads the IMDB reviews dataset into baseDir, un-tars it, parses all the individual files, and saves a binary version.

The vocabulary and examples loaded are set in LoadedVocab and LoadedExamples.

If the dataset has already been downloaded, it simply loads the binary version.

func EmbedTokensGraph added in v0.11.0

func EmbedTokensGraph(ctx *context.Context, tokens *Node) (embed, mask *Node)

EmbedTokensGraph creates embeddings for tokens and returns them along with the mask of used tokens -- set to false where padding was used.

- tokens: padded (at the start) tokens shaped (int32)[batch_size, content_len].

Outputs:

  • embed: shaped [batch_size, content_len, <imdb_token_embedding_size>].
  • mask: shaped (bool)[batch_size, content_len], indicates where tokens were padded.

func InputToString

func InputToString(input *tensors.Tensor, batchIdx int) string

InputToString returns a string rendered content of one row (pointed to by batchIdx) of an input. The input is assumed to be a batch created by a Dataset object.

func LoadIndividualFiles

func LoadIndividualFiles(baseDir string) (vocab *Vocab, examples []*Example, err error)

func NormalizeSequence added in v0.11.0

func NormalizeSequence(ctx *context.Context, x *Node) *Node

NormalizeSequence normalizes `x` according to the "normalization" hyperparameter. Works on sequence nodes (rank-3).

func PrintSample added in v0.11.0

func PrintSample(n int)

PrintSample prints a sample of n examples.

func TrainModel added in v0.11.0

func TrainModel(ctx *context.Context, dataDir, checkpointPath string, paramsSet []string, evaluateOnEnd bool, verbosity int)

TrainModel with hyperparameters given in ctx.

func TransformerLayers added in v0.11.0

func TransformerLayers(ctx *context.Context, embed, mask *Node) *Node

TransformerLayers builds the stacked transformer layers for the model.

func TransformerModelGraph added in v0.11.0

func TransformerModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

TransformerModelGraph is the part of the model that takes the word/token embeddings to a transformed embedding through attention ready to be pooled and read out.

Types

type Dataset

type Dataset struct {
	DatasetType      DatasetType
	MaxLen, MaxVocab int
	BatchSize        int

	ExamplesIndices []int32
	Pos             int
	Infinite        bool
	Shuffler        *rand.Rand
	// contains filtered or unexported fields
}

Dataset implements train.Dataset. It allows for concurrent Yield calls, so one can feed it to ParallelizedDataset.

Yield:

  • inputs[0] (TokenId)[batch_size, ds.MaxLen] will hold the first ds.MaxLen tokens of the example. If the example is shorter than that, the rest is filled with 0s at the start: meaning, the empty space comes first. If there is enough space, the first token will be 1 (the "<START>" token).
  • labels[0] (int8)[batch_size] labels are 0, 1 or 2 for negative/positive/unlabeled examples.

func NewDataset

func NewDataset(name string, dsType DatasetType, maxLen, batchSize int, infinite bool) *Dataset

NewDataset creates a labeled Dataset. See Dataset for details.

  • name: passed along for debugging and metrics naming.
  • dsType: dataset of TypeTrain or TypeTest.
  • maxLen: max length of the content kept per example. Since GoMLX/XLA only works with fixed-size tensors, the memory used grows linearly with this.
  • infinite: data loops forever (if a shuffler is set, it reshuffles at the end of every epoch).

func NewUnsupervisedDataset

func NewUnsupervisedDataset(name string, maxLen, batchSize int, infinite bool) *Dataset

NewUnsupervisedDataset creates a Dataset with the DatasetType assumed to be TypeTrain.

func (*Dataset) Name

func (ds *Dataset) Name() string

Name implements train.Dataset interface.

func (*Dataset) Reset

func (ds *Dataset) Reset()

Reset restarts the dataset from the beginning. Can be called after io.EOF is reached, for instance when running another evaluation on a test dataset.

func (*Dataset) Shuffle

func (ds *Dataset) Shuffle() *Dataset

Shuffle marks dataset to yield shuffled results.

func (*Dataset) Yield

func (ds *Dataset) Yield() (spec any, inputs, labels []*tensors.Tensor, err error)

Yield implements train.Dataset interface. If not infinite, return io.EOF at the end of the dataset.

It trims the examples to ds.MaxLen tokens, taken from the end.

It always returns `spec == nil`, since `inputs` and `labels` always have the same type of content.

It can be called concurrently.

type DatasetType added in v0.11.0

type DatasetType int

DatasetType indicates whether examples belong to the train or the test set.

const (
	TypeTrain DatasetType = iota
	TypeTest
)

type Example

type Example struct {
	Set           DatasetType
	Label, Rating int8
	Length        int
	Content       []TokenId
}

Example encapsulates all the information of one example in the IMDB 50k dataset. The fields are:

  • Set can be 0 or 1 for "train"/"test" (TypeTrain/TypeTest).
  • Label is 0, 1 or 2 for negative/positive/unlabeled examples.
  • Rating is a value from 1 to 10 in IMDB. Unlabeled examples are all marked with 0.
  • Length is the length (in # of tokens) of the content.
  • Content holds the tokens of the IMDB entry -- there should be a vocabulary associated with the dataset.

func NewExample

func NewExample(contents []byte, vocab *Vocab) *Example

NewExample parses an IMDB content file, tokenizes it using the given Vocab, and returns the parsed example.

It doesn't fill the Set, Label and Rating fields.

func (*Example) String

func (e *Example) String(vocab *Vocab) string

type TokenId added in v0.11.0

type TokenId = int32

type Vocab

type Vocab struct {
	ListEntries []VocabEntry
	MapTokens   map[string]TokenId
	TotalCount  int64
}

Vocab stores vocabulary information for the whole corpus.

func NewVocab

func NewVocab() *Vocab

NewVocab creates a new vocabulary, with the first token set to "<INVALID>", usually a placeholder for padding, and the second token set to "<START>" to indicate start of sentence.

func (*Vocab) RegisterToken

func (v *Vocab) RegisterToken(token string) (idx TokenId)

RegisterToken returns the index for the token, and increments the count for the token.

func (*Vocab) SortByFrequency

func (v *Vocab) SortByFrequency() (oldIDtoNewID map[TokenId]TokenId)

SortByFrequency sorts the vocabulary entries by their frequency and returns a map converting the token ids from before the sort to their new values.

Special tokens "<INVALID>" and "<START>" remain unchanged.

type VocabEntry

type VocabEntry struct {
	Token string
	Count TokenId
}

VocabEntry includes the Token and its Count.

Directories

Path	Synopsis
demo	IMDB Movie Review library (imdb) demo: you can run this program in 4 different ways:
