assocentity

package module
v14.0.1
Published: May 27, 2023 License: MIT Imports: 7 Imported by: 0

README

assocentity


Package assocentity is a social science tool to analyze the relative distance from tokens to entities. The motivation is to draw conclusions based on the distance from interesting tokens to a certain entity and its synonyms. Visit this website to see a usage example.

Features

  • Provide your own tokenizer
  • Ships with a default NLP tokenizer (powered by Google)
  • Define aliases for entities
  • Includes a multi-OS, language-agnostic CLI version

Installation

$ go get github.com/ndabAP/assocentity/v14

Prerequisites

If you want to analyze human-readable texts, you can use the provided Natural Language tokenizer (powered by Google). To do so, sign up for a Cloud Natural Language API service account key and download the generated JSON file. This corresponds to credentialsFile in the example below. You should never commit that file.

A possible offline tokenizer would be a whitespace tokenizer (a sketch is given at the end of the Tokenization section below). Depending on your purposes, you might also use a parser.

Example

We would like to find out which adjectives are, on average, closest to a certain public person. Let's take George W. Bush and 1,000 NBC news articles as an example. "George Bush" is the entity, and "George Walker Bush", "Bush" and so on are the synonyms. The text is each of the 1,000 NBC news articles.

The first step is to define a text source and set the entity. Next, we need to instantiate our tokenizer; in this case, we use the provided Google NLP tokenizer. Then we can calculate our distances with assocentity.Distances, which accepts multiple texts. Notice how we pass tokenize.ADJ to only include adjectives as parts of speech. Finally, we take the mean by passing the result to assocentity.Mean.

// Define texts source and entity
texts := []string{
	"Former Presidents Barack Obama, Bill Clinton and ...", // Truncated
	"At the pentagon on the afternoon of 9/11, ...",
	"Tony Blair moved swiftly to place his relationship with ...",
}
entities := []string{
	"Goerge Walker Bush",
	"Goerge Bush",
	"Bush",
}
source := assocentity.NewSource(entities, texts)

// Instantiate the NLP tokenizer (powered by Google)
nlpTok := nlp.NewNLPTokenizer(credentialsFile, nlp.AutoLang)

// Get the distances to adjectives
ctx := context.TODO()
dists, err := assocentity.Distances(ctx, nlpTok, tokenize.ADJ, source)
if err != nil {
	// Handle error
}
// Get the mean from the distances
mean := assocentity.Mean(dists)
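
The result maps each token to its mean distance and can be inspected, for example, like this (a sketch; fmt is assumed to be imported):

// Print each token with its mean distance
for tok, dist := range mean {
	fmt.Printf("%s: %.2f\n", tok.Text, dist)
}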

The NLPTokenizer has a built-in retryer with a strategy that works well with the Google Natural Language API limitations. It can't be disabled or configured.

Tokenization

A Tokenizer produces tokens from a given text, where a Token is the smallest possible unit of a text. The Tokenizer interface, with its method Tokenize, has the following signature:

type Tokenizer interface {
	Tokenize(ctx context.Context, text string) ([]Token, error)
}

A Token has the following properties:

type Token struct {
	PoS  PoS    // Part of speech
	Text string // Text
}

// Part of speech
type PoS int

For example, given the text:

text := "Punchinello was burning to get me"

The result from Tokenize would be a slice of tokens:

[]Token{
	{
		Text: "Punchinello",
		PoS:  tokenize.NOUN,
	},
	{
		Text: "was",
		PoS:  tokenize.VERB,
	},
	{
		Text: "burning",
		PoS:  tokenize.VERB,
	},
	{
		Text: "to",
		PoS:  tokenize.PRT,
	},
	{
		Text: "get",
		PoS:  tokenize.VERB,
	},
	{
		Text: "me",
		PoS:  tokenize.PRON,
	},
}
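
As mentioned under Prerequisites, a whitespace tokenizer is a possible offline implementation of this interface. Here is a minimal sketch; the tokenize import path and the tokenize.ANY catch-all tag are assumptions based on the identifiers used above and may differ from the actual package:

package whitespace

import (
	"context"
	"strings"

	"github.com/ndabAP/assocentity/v14/tokenize"
)

// Tokenizer splits text on whitespace. It performs no linguistic
// analysis, so every token gets the same part-of-speech tag.
type Tokenizer struct{}

func (Tokenizer) Tokenize(ctx context.Context, text string) ([]tokenize.Token, error) {
	var tokens []tokenize.Token
	for _, field := range strings.Fields(text) {
		tokens = append(tokens, tokenize.Token{
			PoS:  tokenize.ANY, // assumed catch-all tag
			Text: field,
		})
	}
	return tokens, nil
}

Such a tokenizer can be passed to assocentity.Distances in place of the Google NLP tokenizer, at the cost of meaningful part-of-speech filtering.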

CLI

There is also a language-agnostic terminal version available for Windows, Mac (Darwin) and Linux (64-bit only), in case you don't have Go available. The application expects the text from "stdin" and accepts the following flags:

Flag               | Description                                                                                         | Type   | Default
entities           | List of comma-separated entities, example: -entities="Max Payne,Payne"                             | string |
google-svc-acc-key | Google Cloud NLP JSON service account file, example: -google-svc-acc-key=~/google-svc-acc-key.json | string |
op                 | Operation to execute                                                                                | string | mean
pos                | List of comma-separated parts of speech, example: -pos=noun,verb,pron                               | string | any

Example:

echo "Relax, Max. You're a nice guy." | ./bin/assocentity_linux_amd64_v14.0.0-0-g948274a-dirty -gog-svc-loc=/home/max/.config/assocentity/google-service.json -entities="Max Payne,Payne,Max"

The output is written to "stdout" in appropriate formats.

Projects using assocentity

  • entityscrape - Distances between word types (default: adjectives) and persons in news articles

Author

Julian Claus and contributors.

License

MIT

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Distances

func Distances(
	ctx context.Context,
	tokenizer tokenize.Tokenizer,
	poS tokenize.PoS,
	source source,
) (map[tokenize.Token][]float64, error)

Distances returns the distances from entities to a list of texts

func Mean

func Mean(dists map[tokenize.Token][]float64) map[tokenize.Token]float64

Mean returns the mean of the provided distances

func NewSource

func NewSource(entities, texts []string) source

NewSource returns a new source consisting of entities and texts

func Normalize

func Normalize(dists map[tokenize.Token][]float64, norm Normalizer)

Normalize normalizes tokens with provided normalizer

func Threshold

func Threshold(dists map[tokenize.Token][]float64, threshold float64)

Threshold excludes results that fall below the given threshold. The threshold is defined as the number of distances per token relative to the total number of tokens

Types

type Normalizer

type Normalizer func(tokenize.Token) tokenize.Token

Normalizer normalizes tokens, for example by lowercasing them, to increase the overall token quality

var HumanReadableNormalizer Normalizer = func(tok tokenize.Token) tokenize.Token {
	t := tokenize.Token{
		PoS:  tok.PoS,
		Text: strings.ToLower(tok.Text),
	}

	switch tok.Text {
	case "&":
		t.Text = "and"
	}

	return t
}

HumanReadableNormalizer normalizes tokens by lowercasing them and replacing them with their synonyms. Note: It assumes English as the input language
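
A possible way to combine these helpers with the functions above (a sketch; the 0.5 threshold is an arbitrary value for illustration, and both Normalize and Threshold modify the distances map in place):

// Normalize tokens, drop rare ones, then take the mean
assocentity.Normalize(dists, assocentity.HumanReadableNormalizer)
assocentity.Threshold(dists, 0.5)
mean := assocentity.Mean(dists)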

Directories

Path Synopsis
internal
pos
