classifier

Version: v2.0.0-alpha.2
Warning: This package is not in the latest version of its module.
Published: Sep 29, 2020 | License: Apache-2.0 | Imports: 15 | Imported by: 0

README

License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

Glossary

  • document - an internal-only data type that holds the token sequence information of source or target content for matching.

  • source content - a body of text that can be matched by the scanner.

  • target content - the argument to Match that is scanned for matches with source content.

  • indexed document - an internal-only data type that maps a document to the corpus dictionary, resulting in a compressed representation suitable for fast text searching and mapping operations. An indexed document is necessarily tightly coupled to its corpus.

  • frequency table - a lookup table holding, for each token, the number of times that token appears in content. Used for fast filtering of target content against different source contents.

  • q-gram - a substring of content of length q tokens used to efficiently match ranges of text. For background on the q-gram algorithms used, see "Indexing Methods for Approximate String Matching".

  • searchset - a data structure that uses q-grams to identify ranges of text in the target that correspond to a range of text in the source. Because matches are allowed to be inexact, the searchset algorithms tolerate additional or missing tokens.
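The frequency table and q-gram entries above can be illustrated with a small standalone sketch. It does not use the package's internal types, which are not exported. The names tokenize, frequencyTable, and qGrams are hypothetical, and the tokenizer (lowercased, whitespace-separated words) is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a simplified stand-in for the classifier's tokenizer.
// Assumption: tokens are lowercased, whitespace-separated words.
func tokenize(content string) []string {
	return strings.Fields(strings.ToLower(content))
}

// frequencyTable counts how many times each token appears in tokens.
func frequencyTable(tokens []string) map[string]int {
	freq := make(map[string]int, len(tokens))
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

// qGrams returns every run of q consecutive tokens, joined by a space.
// It returns nil when q is not positive or there are fewer than q tokens.
func qGrams(tokens []string, q int) []string {
	if q <= 0 || len(tokens) < q {
		return nil
	}
	grams := make([]string, 0, len(tokens)-q+1)
	for i := 0; i+q <= len(tokens); i++ {
		grams = append(grams, strings.Join(tokens[i:i+q], " "))
	}
	return grams
}

func main() {
	tokens := tokenize("Permission is hereby granted free of charge to any person")
	fmt.Println(frequencyTable(tokens)["permission"])
	for _, g := range qGrams(tokens, 3) {
		fmt.Println(g)
	}
}
```

A frequency table like this allows a cheap first pass: if the target lacks enough of the tokens a source needs, the more expensive q-gram comparison can be skipped for that source.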

Documentation

Overview

Package classifier provides the implementation of the v2 Google license classifier.

Constants

This section is empty.

Variables

This section is empty.

Functions

func LicenseName

func LicenseName(in string) string

LicenseName produces the output name for a license, removing the internal structure of the filename in use.

Types

type Classifier

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier provides methods for identifying open source licenses in text content.

func NewClassifier

func NewClassifier(threshold float64) *Classifier

NewClassifier creates a classifier with an empty corpus.

func (*Classifier) AddContent

func (c *Classifier) AddContent(name, content string)

AddContent incorporates the provided textual content into the classifier for matching.

func (*Classifier) LoadLicenses

func (c *Classifier) LoadLicenses(dir string) error

LoadLicenses adds the contents of the supplied directory to the corpus of the classifier.

func (*Classifier) Match

func (c *Classifier) Match(in string) Matches

Match finds matches within an unknown text.

func (*Classifier) SetTraceConfiguration

func (c *Classifier) SetTraceConfiguration(in *TraceConfiguration)

SetTraceConfiguration installs a tracing configuration for the classifier.

type Match

type Match struct {
	Name            string
	Confidence      float64
	MatchType       string
	StartLine       int
	EndLine         int
	StartTokenIndex int
	EndTokenIndex   int
}

Match is the information about a single instance of a detected match.
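As an illustration of how the line fields might be used, the sketch below extracts the lines of the scanned text that a match covers. The Match type here is a local copy of the documented fields so the example compiles on its own, and matchedLines is a hypothetical helper. It assumes StartLine and EndLine are 1-based and inclusive, which this documentation does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// Match is a local copy of the documented fields, declared here only
// so that this sketch is self-contained.
type Match struct {
	Name            string
	Confidence      float64
	MatchType       string
	StartLine       int
	EndLine         int
	StartTokenIndex int
	EndTokenIndex   int
}

// matchedLines returns the lines of text covered by m.
// Assumption: StartLine and EndLine are 1-based and inclusive.
// Out-of-range values are clamped to the bounds of the text.
func matchedLines(text string, m *Match) []string {
	lines := strings.Split(text, "\n")
	start, end := m.StartLine, m.EndLine
	if start < 1 {
		start = 1
	}
	if end > len(lines) {
		end = len(lines)
	}
	if start > end {
		return nil
	}
	return lines[start-1 : end]
}

func main() {
	text := "// Copyright 2020\n// Licensed under the Apache License\n// Version 2.0\npackage main"
	m := &Match{Name: "Apache-2.0", Confidence: 0.98, StartLine: 2, EndLine: 3}
	for _, line := range matchedLines(text, m) {
		fmt.Println(line)
	}
}
```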

type Matches

type Matches []*Match

Matches is a sortable slice of Match pointers; its Len, Less, and Swap methods implement sort.Interface.

func (Matches) Len

func (d Matches) Len() int

func (Matches) Less

func (d Matches) Less(i, j int) bool

func (Matches) Swap

func (d Matches) Swap(i, j int)

Swap swaps two elements of Matches.
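Because Len, Less, and Swap together satisfy sort.Interface, a Matches value can be handed to sort.Sort. The sketch below shows the pattern with locally declared copies of the types. The ordering in Less (higher confidence first, then earlier start position) is an assumption chosen for illustration, since this documentation does not describe the ordering the package uses. sortMatches is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Match and Matches are local copies of the documented types,
// declared here only so that this sketch is self-contained.
type Match struct {
	Name            string
	Confidence      float64
	MatchType       string
	StartLine       int
	EndLine         int
	StartTokenIndex int
	EndTokenIndex   int
}

type Matches []*Match

func (d Matches) Len() int      { return len(d) }
func (d Matches) Swap(i, j int) { d[i], d[j] = d[j], d[i] }

// Less uses an ASSUMED ordering: higher confidence first, and for
// equal confidence, the match that starts earlier in the text first.
func (d Matches) Less(i, j int) bool {
	if d[i].Confidence != d[j].Confidence {
		return d[i].Confidence > d[j].Confidence
	}
	return d[i].StartTokenIndex < d[j].StartTokenIndex
}

// sortMatches sorts ms in place using the sort.Interface methods above.
func sortMatches(ms Matches) {
	sort.Sort(ms)
}

func main() {
	ms := Matches{
		{Name: "MIT", Confidence: 0.91, StartTokenIndex: 40},
		{Name: "Apache-2.0", Confidence: 0.99, StartTokenIndex: 0},
	}
	sortMatches(ms)
	for _, m := range ms {
		fmt.Printf("%s %.2f\n", m.Name, m.Confidence)
	}
}
```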

type TraceConfiguration

type TraceConfiguration struct {
	// Comma-separated list of phases to be traced. Can use * for all phases.
	TracePhases string
	// Comma-separated list of licenses to be traced. Can use * as a suffix to
	// match prefixes, or by itself to match all licenses.
	TraceLicenses string

	// Tracer specifies a TraceFunc used to capture tracing information.
	// If not supplied, emits using fmt.Printf
	Tracer TraceFunc
	// contains filtered or unexported fields
}

TraceConfiguration specifies the configuration for tracing execution of the license classifier.
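The comments on TraceLicenses describe a comma-separated pattern list in which * alone matches everything and a trailing * matches by prefix. One way such a check could be written is sketched below. shouldTrace is a hypothetical function, not the package's implementation, and trimming whitespace around each entry is an added assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldTrace reports whether name is selected by a comma-separated
// pattern list. A pattern of "*" matches every name, a pattern ending
// in "*" matches any name with that prefix, and any other pattern must
// equal the name exactly. Empty entries are ignored.
// This is an illustrative sketch, not the library's own logic.
func shouldTrace(patterns, name string) bool {
	for _, p := range strings.Split(patterns, ",") {
		p = strings.TrimSpace(p)
		switch {
		case p == "":
			continue
		case p == "*":
			return true
		case strings.HasSuffix(p, "*"):
			if strings.HasPrefix(name, strings.TrimSuffix(p, "*")) {
				return true
			}
		case p == name:
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldTrace("MIT,Apache*", "Apache-2.0"))
	fmt.Println(shouldTrace("MIT", "Apache-2.0"))
}
```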

type TraceFunc

type TraceFunc func(string, ...interface{})

TraceFunc works like fmt.Printf to emit tracing data for the classifier.
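Since a TraceFunc has the same shape as fmt.Printf, any function or method with that signature can stand in for it, for example to collect trace output in memory during a test. The sketch below declares a local TraceFunc with the documented signature so it compiles on its own. traceLog is a hypothetical helper type.

```go
package main

import "fmt"

// TraceFunc is a local copy of the documented signature, declared
// here only so that this sketch is self-contained.
type TraceFunc func(string, ...interface{})

// traceLog collects formatted trace messages in memory instead of
// printing them.
type traceLog struct {
	lines []string
}

// Trace formats its arguments like fmt.Printf and records the result.
// Its signature matches TraceFunc, so the method value can be assigned
// to a TraceFunc variable or struct field.
func (l *traceLog) Trace(format string, args ...interface{}) {
	l.lines = append(l.lines, fmt.Sprintf(format, args...))
}

func main() {
	log := &traceLog{}
	var tracer TraceFunc = log.Trace

	tracer("phase %s: %d candidates", "frequency", 3)
	tracer("phase %s: done", "score")

	for _, line := range log.lines {
		fmt.Println(line)
	}
}
```

With the real package, a function of this shape would be assigned to the Tracer field of a TraceConfiguration before calling SetTraceConfiguration.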
