classifier

Version: v2.0.0 (latest)
Published: Sep 16, 2022
License: Apache-2.0
Imports: 16
Imported by: 25

Repository

github.com/google/licenseclassifier

README ¶

License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

Glossary

corpus dictionary - contains all the unique tokens stored in the corpus of documents to match. Any tokens in the target document that aren't in the corpus dictionary are mapped to an invalid value.
document - an internal-only data type that contains sequenced token information for source or target content being matched.
source content - a body of text that can be matched by the scanner.
target content - the argument to Match that is scanned for matches with source content.
indexed document - an internal-only data type that maps a document to the corpus dictionary, resulting in a compressed representation suitable for fast text searching and mapping operations. An indexed document is necessarily tightly coupled to its corpus.
frequency table - a lookup table holding, for each token, the number of times it appears in content. Used for fast filtering of target content against different source contents.
q-gram - a substring of content, q tokens long, used to efficiently match ranges of text. For background on the q-gram algorithms used, please see Indexing Methods for Approximate String Matching.
searchset - a data structure that uses q-grams to identify ranges of text in the target that correspond to a range of text in the source. The searchset algorithms tolerate the allowable error in matching, handling additional or missing tokens rather than requiring an exact match.
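
The glossary terms can be illustrated with a small, standard-library-only sketch. It is not the classifier's internal implementation: the tokenizer (lowercasing and splitting on whitespace), the helper names, and the use of -1 as the invalid value are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// invalidID marks tokens that are absent from the corpus dictionary.
// The value -1 is an assumption made for this sketch.
const invalidID = -1

// dictionary is a toy corpus dictionary: each unique token gets an integer ID.
type dictionary map[string]int

// add registers every token of a source text in the dictionary.
func (d dictionary) add(text string) {
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if _, ok := d[tok]; !ok {
			d[tok] = len(d)
		}
	}
}

// index converts text into a sequence of token IDs (a toy "indexed document").
// Tokens unknown to the dictionary map to invalidID.
func (d dictionary) index(text string) []int {
	var ids []int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		id, ok := d[tok]
		if !ok {
			id = invalidID
		}
		ids = append(ids, id)
	}
	return ids
}

// frequencies counts how often each token ID occurs (a toy frequency table).
func frequencies(ids []int) map[int]int {
	freq := make(map[int]int)
	for _, id := range ids {
		freq[id]++
	}
	return freq
}

// qgrams returns every run of q consecutive token IDs.
func qgrams(ids []int, q int) [][]int {
	var out [][]int
	for i := 0; i+q <= len(ids); i++ {
		out = append(out, ids[i:i+q])
	}
	return out
}

func main() {
	dict := dictionary{}
	dict.add("permission is hereby granted free of charge")

	// "by" and "me" are not in the corpus, so they map to invalidID.
	target := dict.index("Permission is hereby granted by me")
	fmt.Println(target) // [0 1 2 3 -1 -1]
	fmt.Println(frequencies(target))
	fmt.Println(qgrams(target, 3))
}
```

Working with integer IDs rather than strings is what makes the indexed form compact, and it is why an indexed document only makes sense relative to the dictionary that produced it.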

Migrating from v1

The API for the classifier versions is quite similar, but there are two key distinctions to be aware of while migrating usages.

First, the confidence threshold of the v2 classifier is applied uniformly to results: it never returns a match whose confidence is below the threshold. In v1, MultipleMatch behaved this way, but NearestMatch returned a value regardless of its confidence. Users often checked for themselves that the confidence was above the threshold; this is no longer necessary.

The second change is that the classifier now returns all matches against the supplied corpus. The v1 classifier allowed filtering on header matches via a boolean field. This can be emulated by creating a license classifier with a reduced corpus if matching against headers is not desired. Alternatively, the user can use the MatchType field in the Match struct to filter out unwanted matches.
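
As a rough illustration of the MatchType approach, the sketch below filters a slice of matches. To stay self-contained it declares a local Match type that mirrors the documented struct; real code would use the package's own type. The helper dropMatchType and the MatchType values "License" and "Header" are assumptions for this example, so check the values your corpus actually produces.

```go
package main

import "fmt"

// Match mirrors the fields of the documented Match struct so that this sketch
// compiles on its own; real code would use the classifier package's type.
type Match struct {
	Name            string
	Confidence      float64
	MatchType       string
	Variant         string
	StartLine       int
	EndLine         int
	StartTokenIndex int
	EndTokenIndex   int
}

// dropMatchType returns the matches whose MatchType differs from unwanted.
// It is a hypothetical helper, not part of the classifier API.
func dropMatchType(in []*Match, unwanted string) []*Match {
	var out []*Match
	for _, m := range in {
		if m.MatchType != unwanted {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// Hypothetical results; "License" and "Header" are assumed MatchType values.
	matches := []*Match{
		{Name: "Apache-2.0", Confidence: 0.98, MatchType: "License"},
		{Name: "Apache-2.0", Confidence: 0.95, MatchType: "Header"},
	}
	for _, m := range dropMatchType(matches, "Header") {
		fmt.Printf("%s (%s) %.2f\n", m.Name, m.MatchType, m.Confidence)
	}
}
```

Filtering after the fact keeps a single classifier and corpus; building a reduced corpus instead avoids spending time on matches that would be discarded anyway.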

Documentation ¶

Overview ¶

Package classifier provides the implementation of the v2 license classifier.

Index ¶

func LicenseName(in string) string
type Classifier
- func NewClassifier(threshold float64) *Classifier
type Match
type Matches
type Results
type TraceConfiguration
type TraceFunc

Functions ¶

func LicenseName ¶

func LicenseName(in string) string

LicenseName produces the output name for a license, removing the internal structure of the filename in use.

Types ¶

type Classifier ¶

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier provides methods for identifying open source licenses in text content.

func NewClassifier ¶

func NewClassifier(threshold float64) *Classifier

NewClassifier creates a classifier with an empty corpus.

func (*Classifier) AddContent ¶

func (c *Classifier) AddContent(category, name, variant string, content []byte)

AddContent incorporates the provided textual content into the classifier for matching. This will not modify the supplied content.

func (*Classifier) LoadLicenses ¶

func (c *Classifier) LoadLicenses(dir string) error

LoadLicenses adds the contents of the supplied directory to the corpus of the classifier.

func (*Classifier) Match ¶

func (c *Classifier) Match(in []byte) Results

Match finds matches within an unknown text. This will not modify the contents of the supplied byte slice.

func (*Classifier) MatchFrom ¶

func (c *Classifier) MatchFrom(in io.Reader) (Results, error)

MatchFrom finds matches within the read content.

func (*Classifier) Normalize ¶

func (c *Classifier) Normalize(in []byte) []byte

Normalize takes input content and applies the following transforms to aid in identifying license content. The return value of this function is line-separated text which is the basis for position values returned by the classifier.

1. Breaks up long lines of text. This helps with detecting licenses like in TODO(wcn):URL reference

2. Removes certain ignorable text to aid matching blocks of text. Introductory lines such as "The MIT License" are removed. Copyright notices are removed since the parties are variable and shouldn't impact matching.

It is NOT necessary to call this function simply to identify licenses in a file. It should only be called to help present this information to the user in context (for example, when creating diffs against canonical licenses).

It is an invariant of the classifier that calling Match(Normalize(in)) will return the same results as Match(in).

func (*Classifier) SetTraceConfiguration ¶

func (c *Classifier) SetTraceConfiguration(in *TraceConfiguration)

SetTraceConfiguration installs a tracing configuration for the classifier.

type Match ¶

type Match struct {
	Name            string
	Confidence      float64
	MatchType       string
	Variant         string
	StartLine       int
	EndLine         int
	StartTokenIndex int
	EndTokenIndex   int
}

Match is the information about a single instance of a detected match.

type Matches ¶

type Matches []*Match

Matches is a sortable slice of Match.

func (Matches) Len ¶

func (d Matches) Len() int

func (Matches) Less ¶

func (d Matches) Less(i, j int) bool

func (Matches) Swap ¶

func (d Matches) Swap(i, j int)

Swap two elements of Matches.
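
Because Matches provides Len, Less, and Swap, it can be passed to sort.Sort. The ordering that Less imposes is not described here, so a caller who needs a particular order may prefer to state it explicitly. The sketch below orders matches by their position in the input, using a local stand-in for Match with only the fields it needs; byPosition is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a reduced stand-in for the documented Match struct.
type Match struct {
	Name      string
	StartLine int
	EndLine   int
}

// byPosition sorts matches in place by StartLine, breaking ties by EndLine.
func byPosition(ms []*Match) {
	sort.SliceStable(ms, func(i, j int) bool {
		if ms[i].StartLine != ms[j].StartLine {
			return ms[i].StartLine < ms[j].StartLine
		}
		return ms[i].EndLine < ms[j].EndLine
	})
}

func main() {
	// Hypothetical matches, listed out of document order.
	ms := []*Match{
		{Name: "MIT", StartLine: 40, EndLine: 58},
		{Name: "Apache-2.0", StartLine: 1, EndLine: 12},
	}
	byPosition(ms)
	for _, m := range ms {
		fmt.Printf("%s: lines %d-%d\n", m.Name, m.StartLine, m.EndLine)
	}
}
```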

type Results ¶

type Results struct {
	Matches         Matches
	TotalInputLines int
}

Results captures the summary information and matches detected by the classifier.

type TraceConfiguration ¶

type TraceConfiguration struct {
	// Comma-separated list of phases to be traced. Can use * for all phases.
	TracePhases string
	// Comma-separated list of licenses to be traced. Can use * as a suffix to
	// match prefixes, or by itself to match all licenses.
	TraceLicenses string

	// Tracer specifies a TraceFunc used to capture tracing information.
	// If not supplied, emits using fmt.Printf
	Tracer TraceFunc
	// contains filtered or unexported fields
}

TraceConfiguration specifies the configuration for tracing execution of the license classifier.
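
The pattern rules described in the field comments (a comma-separated list, * alone for everything, * as a suffix for prefix matching) can be sketched as follows. This is one reading of those comments, not the package's own matching code; the function name traced and the trimming of spaces around entries are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// traced reports whether name is selected by a comma-separated pattern list.
// A pattern of "*" selects everything; a trailing "*" selects by prefix;
// any other pattern must equal name exactly.
func traced(patterns, name string) bool {
	for _, p := range strings.Split(patterns, ",") {
		p = strings.TrimSpace(p) // tolerating spaces is an assumption
		switch {
		case p == "":
			continue
		case p == "*":
			return true
		case strings.HasSuffix(p, "*"):
			if strings.HasPrefix(name, strings.TrimSuffix(p, "*")) {
				return true
			}
		case p == name:
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"Apache-2.0", "MIT", "BSD-3-Clause"} {
		fmt.Println(name, traced("Apache*,MIT", name))
	}
}
```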

type TraceFunc ¶

type TraceFunc func(string, ...interface{})

TraceFunc works like fmt.Printf to emit tracing data for the classifier.
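
Since a TraceFunc has the same shape as fmt.Printf, any function or method with that signature can stand in for it. The sketch below collects trace output in memory instead of printing it. TraceFunc is redeclared locally so the example compiles on its own, and the collector type and the sample trace message are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// TraceFunc mirrors the documented type so the sketch compiles on its own.
type TraceFunc func(string, ...interface{})

// collector accumulates trace output in memory.
type collector struct {
	buf strings.Builder
}

// trace has the TraceFunc signature and formats its arguments like fmt.Printf.
func (c *collector) trace(format string, args ...interface{}) {
	fmt.Fprintf(&c.buf, format, args...)
}

func main() {
	c := &collector{}
	var tracer TraceFunc = c.trace

	// A classifier would invoke the tracer; here it is called directly
	// with a made-up message.
	tracer("phase %s: %d candidates\n", "score", 3)

	fmt.Print(c.buf.String())
}
```

With the real package, a function like this would be assigned to the Tracer field of a TraceConfiguration and installed with SetTraceConfiguration.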

Directories ¶

Path	Synopsis
assets
tools
identify_license (command)	The identify_license program tries to identify the license type of an unknown license.
identify_license/backend	Package backend contains the necessary functions to classify a license.
identify_license/results	Package results contains the result type returned by the classifier backend.
