documentio

package
v0.0.0-...-804d1b8
Warning

This package is not in the latest version of its module.

Published: Nov 19, 2023 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation

Overview

Package documentio provides IO functions that read text files and keep track of the source of each text.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Read

func Read(ctx context.Context, s beam.Scope, input string, corpusLen int, filter string, entries beam.PCollection) beam.PCollection

Read takes a PCollection of CorpusEntry and returns a PCollection of DocEntry, with the contents of each file stored in the Text field.

Types

type CorpusEntry

type CorpusEntry struct {
	RawFile    string `beam:"rawFile"`
	DocumentId string `beam:"glossFile"`
	ColFile    string `beam:"colFile"`
}

CorpusEntry contains metadata for a document from which text will be read.

type DocEntry

type DocEntry struct {
	Text       string `beam:"text"`
	DocumentId string `beam:"glossFile"`
	ColFile    string `beam:"colFile"`
	CorpusLen  int    `beam:"corpusLen"`
}

DocEntry contains the full text of a document together with the metadata of the source it was extracted from.
