dictionary

package
v1.3.0
Warning

This package is not in the latest version of its module.

Published: Mar 23, 2019 License: MIT Imports: 12 Imported by: 0

Documentation

Overview

Package dictionary contains code for looking up and storing words in dictionaries. The parser currently supports Project Gutenberg's edition of Webster's Unabridged Dictionary (1913).

Constants

const FileVer = "DICT5\x00"

FileVer is the current compatibility level of saved Files.

Variables

This section is empty.

Functions

func CreateFile

func CreateFile(wm WordMap, dictfile string) error

CreateFile exports a WordMap to a file. The file specified will be overwritten if it exists.

Types

type File

type File struct {
	// contains filtered or unexported fields
}

File implements an efficient Store which is faster to initialize and uses a lot less memory (~15 MB total) than WordMap.

There needs to be enough memory to store the whole index. Reading a dict is also completely thread-safe. Corrupt files will be detected during the read of the corrupted word (or the initialization in the case of index corruption) or during Verify.

The dict file is stored in the following format:

+-----------+--------------+-----------------------------------------------+------------+-------------------------------------------------+
|           |              |  +------+------------------------------+      |            |                                                 |
|  FileVer  |  idx offset  |  | size | zlib compressed Word msgpack | ...  |  idx size  |  zlib compressed idx map[string]offset msgpack  |
|           |              |  +------+------------------------------+      |            |                                                 |
+-----------+--------------+-----------------------------------------------+------------+-------------------------------------------------+

All sizes and offsets are little-endian int64.

The file is opened using the following steps:

  1. The FileVer is read and checked. It must match exactly.
  2. The idx offset is read.
  3. The file position is set to the beginning plus the idx offset.
  4. The idx size is read.
  5. The bytes for the idx are decompressed using zlib, and the resulting msgpack is decoded into an in-memory map[string]int64 of words to offsets.

To read a word:

  1. The offset is retrieved from the in-memory idx.
  2. The file position is set to the beginning plus the offset.
  3. The size of the compressed word is read.
  4. The bytes for the word are decompressed using zlib, and the resulting msgpack is decoded into an in-memory *Word.

To create the file:

  1. The FileVer is written.
  2. A placeholder for the idx offset is written.
  3. The dictionary map is looped over:
     a. If the referenced Word has already been written, it is skipped.
     b. The Word is encoded with msgpack, compressed, and written to a temporary buf.
     c. The size is written.
     d. The buf is written and reset.
  4. The current offset is stored for the idx offset.
  5. A placeholder for the idx size is written.
  6. The idx is encoded with msgpack, compressed, and written.
  7. The idx size is calculated by subtracting the idx offset and the size of the idx size from the current offset.
  8. The file position is set to the idx offset and the idx size is written.
  9. The file position is set to the beginning plus the length of FileVer.
  10. The idx offset is written.

func OpenFile

func OpenFile(dictfile string) (*File, error)

OpenFile opens a dictionary file. It returns an error if the file cannot be read or if its structure is critically invalid.

func (*File) Close

func (d *File) Close() error

Close closes the files associated with the dictionary file and clears the in-memory index. Usage of the File afterwards may result in a panic.

func (*File) GetWord

func (d *File) GetWord(word string) (*Word, bool, error)

GetWord implements Store, and will return an error if the data structure is invalid or the underlying files are inaccessible.

func (*File) HasWord

func (d *File) HasWord(word string) bool

HasWord implements Store.

func (*File) Lookup

func (d *File) Lookup(word string) (*Word, bool, error)

Lookup is a shortcut for the package-level Lookup function, using d as the store.

func (*File) NumWords

func (d *File) NumWords() int

NumWords implements Store.

func (*File) Verify

func (d *File) Verify() error

Verify verifies the consistency of the data structures in the dict file. WARNING: Verify takes a few seconds to run.

type Store

type Store interface {
	// HasWord checks if the Store contains a word as-is (i.e. do not do any additional processing or trimming).
	HasWord(word string) bool
	// GetWord gets a word from the Store. If it does not exist, exists will be false, and w and err will be nil.
	GetWord(word string) (w *Word, exists bool, err error)
	// NumWords returns the number of words in the Store.
	NumWords() int
	// Lookup should call the package-level Lookup with the Store itself.
	Lookup(word string) (*Word, bool, error)
}

Store is a backend for storing dictionary entries.
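
To illustrate how a backend might satisfy this interface, here is a self-contained sketch. The Word type is reduced to one field, sliceStore is a hypothetical backend, and lookup is a simplified stand-in for the package-level Lookup that performs only the direct match, without stemming or reference resolution.

```go
package main

import "fmt"

// Word is a reduced, hypothetical stand-in for the package's Word type.
type Word struct{ Word string }

// Store mirrors the interface documented above.
type Store interface {
	HasWord(word string) bool
	GetWord(word string) (w *Word, exists bool, err error)
	NumWords() int
	Lookup(word string) (*Word, bool, error)
}

// lookup is a simplified stand-in for the package-level Lookup: it only
// performs the direct match and omits stemming and reference resolution.
func lookup(s Store, word string) (*Word, bool, error) {
	if !s.HasWord(word) {
		return nil, false, nil
	}
	return s.GetWord(word)
}

// sliceStore is a hypothetical backend that keeps words in a slice.
type sliceStore []*Word

func (s sliceStore) HasWord(word string) bool {
	_, ok, _ := s.GetWord(word)
	return ok
}

func (s sliceStore) GetWord(word string) (*Word, bool, error) {
	for _, w := range s {
		if w.Word == word {
			return w, true, nil
		}
	}
	return nil, false, nil
}

func (s sliceStore) NumWords() int { return len(s) }

// Lookup delegates to the shared lookup function, passing the store itself.
func (s sliceStore) Lookup(word string) (*Word, bool, error) { return lookup(s, word) }

var _ Store = sliceStore(nil) // compile-time interface check

func main() {
	s := sliceStore{{Word: "cat"}, {Word: "dog"}}
	w, ok, err := s.Lookup("dog")
	fmt.Println(w.Word, ok, err, s.NumWords()) // dog true <nil> 2

	_, ok, _ = s.Lookup("emu")
	fmt.Println(ok) // false
}
```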

type Word

type Word struct {
	Word       string   `json:"word,omitempty" msgpack:"w"`
	Alternates []string `json:"alternates,omitempty" msgpack:"a"`
	Info       string   `json:"info,omitempty" msgpack:"i"`
	Etymology  string   `json:"etymology,omitempty" msgpack:"e"`
	Meanings   []struct {
		Text    string `json:"text,omitempty" msgpack:"t"`
		Example string `json:"example,omitempty" msgpack:"e"`
	} `json:"meanings,omitempty" msgpack:"m"`
	Extra  string `json:"extra,omitempty" msgpack:"x"`
	Credit string `json:"credit,omitempty" msgpack:"c"`
}

Word represents a word.

func Lookup

func Lookup(store Store, word string) (*Word, bool, error)

Lookup looks up a word in the dictionary. It applies stemming to the word if no direct match is found. If the entry is a reference to another, it will insert the referenced meanings into the result.

type WordMap

type WordMap map[string]*Word

WordMap is an in-memory word Store used and returned by Parse. Although fast, it consumes huge amounts of memory (~500 MB) and should be avoided where possible.

func Parse

func Parse(rd io.Reader) (WordMap, error)

Parse parses Webster's Unabridged Dictionary of 1913 into a WordMap. It returns an error only if the reader returns one. If the data is corrupt, the results are undefined, although Parse still attempts to parse as much as possible. WARNING: Parse uses huge amounts of memory (~600 MB) and CPU time (~30s).

func (WordMap) GetWord

func (wm WordMap) GetWord(word string) (*Word, bool, error)

GetWord implements Store, but will never return an error.

func (WordMap) HasWord

func (wm WordMap) HasWord(word string) bool

HasWord implements Store.

func (WordMap) Lookup

func (wm WordMap) Lookup(word string) (*Word, bool, error)

Lookup is a shortcut for the package-level Lookup function, using wm as the store.

func (WordMap) NumWords

func (wm WordMap) NumWords() int

NumWords implements Store.
