basically

package module
v0.3.0 Latest
Published: Apr 11, 2021 License: MIT Imports: 0 Imported by: 0

README

basically

basically is a Go implementation of the TextRank and Biased TextRank algorithms, built upon prose. It provides fully unsupervised methods for keyword extraction and focused text summarization, along with additional quality-of-life features over the original implementations.

Methods

First, the document is parsed into its constituent sentences and words using a sentence segmenter and tokenizer. Sentiment values are assigned to individual sentences, and tokens are annotated with part of speech tags.
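As a rough sketch of this step in client code (using the Parser interface documented below; the quote argument is passed as false here purely for illustration):

// Parse the text into sentences and tokens.
p := &parser.Parser{}
sents, tokens, err := p.ParseDocument(text, false)
if err != nil {
	log.Fatal(err)
}
// Each parsed sentence carries a sentiment score, and each token a POS tag.
fmt.Println(sents[0].Sentiment, tokens[0].Tag)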

For keyword extraction, all words that pass the syntactic filter are added to an undirected, weighted graph, and an edge is added between words that co-occur within a window of $N$ words. The edge weight is set to be inversely proportional to the distance between the words. Each vertex is assigned an initial score of 1, and the following ranking algorithm is run on the graph:
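This is the weighted, PageRank-style update from the original TextRank paper (Mihalcea & Tarau, 2004); the damping factor $d$ (conventionally around 0.85) is an assumption here, as the library's exact value isn't shown:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$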

During post-processing, adjacent keywords are collapsed into a multi-word keyword, and the top keywords are then extracted.

For sentence extraction, every sentence is added to an undirected, weighted graph, with an edge between sentences that share common content. The edge weight is set simply as the number of common tokens between the lexical representations of the two sentences. Each vertex is also assigned an initial score of 1, along with a bias score based on the focus text, before the following ranking algorithm is run on the graph:
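Biased TextRank (Kazemi et al., 2020) modifies the same update by weighting the random-jump term with each sentence's bias score, so sentences similar to the focus text rank higher; as a sketch:

$$R(V_i) = (1 - d) \, Bias(V_i) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, R(V_j)$$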

The top weighted sentences are then selected and sorted in chronological order to form a summary.

Further information on the two algorithms can be found in the original TextRank (Mihalcea & Tarau, 2004) and Biased TextRank (Kazemi et al., 2020) papers.

Installation

go get github.com/algao1/basically

Usage

// Instantiate a document for every text.
doc, err := document.Create(text, &btrank.BiasedTextRank{}, &trank.KWTextRank{}, &parser.Parser{})
if err != nil {
	log.Fatal(err)
}

// Summarize the document into 7 sentences, with no threshold value, and with respect to a focus sentence.
sents, err := doc.Summarize(7, 0, focus)
if err != nil {
	log.Fatal(err)
}

for _, sent := range sents {
	fmt.Printf("[%.2f, %.2f] %s\n", sent.Score, sent.Sentiment, sent.Raw)
}

// Highlight the top 7 keywords in the document, with multi-word keywords enabled.
words, err := doc.Highlight(7, true)
if err != nil {
	log.Fatal(err)
}

for _, word := range words {
	fmt.Println(word.Weight, word.Word)
}

Optionally, we can also specify configurations, such as retaining conjunctions at the beginning of sentences in our summary:

doc, err := document.Create(text, &btrank.BiasedTextRank{}, &trank.KWTextRank{}, &parser.Parser{}, document.WithConjunctions())

Things I Learned

This project was started to better familiarize myself with Go and some of its best practices.

Next Steps

Currently the project is more or less complete, with no major foreseeable updates. However, I'll be periodically updating the library as things come to mind.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Document added in v0.1.1

type Document interface {
	Summarize(length int, threshold float64, focus string) ([]*Sentence, error)
	Highlight(length int, merge bool) ([]*Keyword, error)
	Characters() (int, int)
}

A Document represents a given text, and is responsible for handling the summarization and keyword extraction process.

type Highlighter added in v0.1.1

type Highlighter interface {
	Initialize(tokens []*Token, filter TokenFilter, window int)
	Rank(iters int)
	Highlight(length int, merge bool) ([]*Keyword, error)
}

A Highlighter is responsible for extracting keywords from a document.

type Keyword added in v0.2.0

type Keyword struct {
	Word   string  // Raw keyword.
	Weight float64 // Weight of the keyword.
}

A Keyword is a keyword extracted from a highlighted document; it contains the raw word and its associated weight.

type Parser added in v0.1.1

type Parser interface {
	ParseDocument(doc string, quote bool) ([]*Sentence, []*Token, error)
}

A Parser is responsible for parsing and tokenizing a document into sentences and tokens. A Parser also performs additional tasks such as POS-tagging and sentiment analysis.

type Sentence added in v0.1.1

type Sentence struct {
	Raw       string   // Raw sentence string.
	Tokens    []*Token // Tokenized sentence.
	Sentiment float64  // Sentiment score.
	Score     float64  // Score (weight) of the sentence.
	Bias      float64  // Bias assigned to the sentence for ranking.
	Order     int      // The sentence's order in the text.
}

A Sentence represents an individual sentence within the text.

type Similarity added in v0.1.1

type Similarity func(n1, n2 []*Token, filter TokenFilter) float64

A Similarity computes the similarity of two sentences after applying the token filter.
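For illustration, a minimal Similarity written in client code might mirror the common-token weighting described under Methods; this is a sketch, not the library's built-in implementation:

// commonTokens counts tokens (by text) that appear in both sentences,
// considering only tokens that pass the filter.
func commonTokens(n1, n2 []*basically.Token, filter basically.TokenFilter) float64 {
	seen := make(map[string]bool)
	for _, t := range n1 {
		if filter(t) {
			seen[t.Text] = true
		}
	}
	var shared float64
	for _, t := range n2 {
		if filter(t) && seen[t.Text] {
			shared++
		}
	}
	return shared
}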

type Summarizer added in v0.1.1

type Summarizer interface {
	Initialize(sents []*Sentence, similar Similarity, filter TokenFilter,
		focusString *Sentence, threshold float64)
	Rank(iters int)
}

A Summarizer is responsible for extracting key sentences from a document.

type Token added in v0.1.1

type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Order int    // The token's order in the text.
}

A Token represents an individual token of text such as a word or punctuation symbol.

type TokenFilter added in v0.1.1

type TokenFilter func(*Token) bool

A TokenFilter represents a (black/white) filter applied to tokens before similarity calculations.
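As an example, a whitelist-style filter that keeps only nouns and adjectives (the kind of syntactic filter mentioned under Methods) could be written as below; this is a sketch, and it assumes the Penn Treebank-style tags used by prose:

// nounAdjFilter keeps tokens whose POS tag marks a noun (NN*) or adjective (JJ*).
func nounAdjFilter(t *basically.Token) bool {
	return strings.HasPrefix(t.Tag, "NN") || strings.HasPrefix(t.Tag, "JJ")
}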

Directories

Path Synopsis
internal
prose
Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
