licenseclassifier

package module
v0.0.0-...-c1ed8fc Latest

Warning: This package is not in the latest version of its module.

Published: Oct 4, 2022 License: Apache-2.0 Imports: 16 Imported by: 37

README

License Classifier

Introduction

The license classifier is a library and set of tools that analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. Such files could be, for example, LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A "confidence level" is associated with each result, indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license matched the text.

Adding a new license

Adding a new license is straightforward:

  1. Create a file in licenses/.

    • The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
    • If the license is the "header" version of the license, append the suffix ".header" to it. See licenses/README.md for more details.
  2. Add the license name to the list in license_type.go.

  3. Regenerate the licenses.db file by running the license serializer:

    $ license_serializer -output licenseclassifier/licenses
    
  4. Create and run appropriate tests to verify that the license is indeed present.

Tools

Identify license

identify_license is a command line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
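
If another program needs to consume this output, each line can be split into its fields. The following is a minimal sketch that assumes the line format stays exactly as in the sample above; the `result` type, its field names, and `parseLine` are hypothetical and not part of the licenseclassifier tools.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// result holds one parsed line of identify_license output.
// The type and its field names are hypothetical, chosen for this sketch.
type result struct {
	File       string
	License    string
	Confidence float64
	Offset     int
	Extent     int
}

// lineRE matches lines shaped like the sample output, for example:
//   LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
var lineRE = regexp.MustCompile(`^(.+): (\S+) \(confidence: ([0-9.]+), offset: (\d+), extent: (\d+)\)$`)

// parseLine converts one output line into a result, or returns an error
// when the line does not have the assumed shape.
func parseLine(line string) (result, error) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return result{}, fmt.Errorf("unrecognized line: %q", line)
	}
	conf, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return result{}, err
	}
	offset, err := strconv.Atoi(m[4])
	if err != nil {
		return result{}, err
	}
	extent, err := strconv.Atoi(m[5])
	if err != nil {
		return result{}, err
	}
	return result{File: m[1], License: m[2], Confidence: conf, Offset: offset, Extent: extent}, nil
}

func main() {
	r, err := parseLine("LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s in %s: confidence %.2f, offset %d, extent %d\n",
		r.License, r.File, r.Confidence, r.Offset, r.Extent)
}
```

A line that does not match the pattern yields an error rather than a partially filled result, so a format change would be noticed instead of silently misread.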

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.

Documentation

Overview

Package licenseclassifier provides methods to identify the open source license that most closely matches an unknown license.

Index

Constants

const (
	// LicenseArchive is the name of the archive containing preprocessed
	// license texts.
	LicenseArchive = "licenses.db"
	// ForbiddenLicenseArchive is the name of the archive containing preprocessed
	// forbidden license texts only.
	ForbiddenLicenseArchive = "forbidden_licenses.db"
)
const (
	// The names come from the https://spdx.org/licenses website, and are
	// also the filenames of the licenses in licenseclassifier/licenses.
	AFL11                       = "AFL-1.1"
	AFL12                       = "AFL-1.2"
	AFL20                       = "AFL-2.0"
	AFL21                       = "AFL-2.1"
	AFL30                       = "AFL-3.0"
	AGPL10                      = "AGPL-1.0"
	AGPL30                      = "AGPL-3.0"
	Apache10                    = "Apache-1.0"
	Apache11                    = "Apache-1.1"
	Apache20                    = "Apache-2.0"
	APSL10                      = "APSL-1.0"
	APSL11                      = "APSL-1.1"
	APSL12                      = "APSL-1.2"
	APSL20                      = "APSL-2.0"
	Artistic10cl8               = "Artistic-1.0-cl8"
	Artistic10Perl              = "Artistic-1.0-Perl"
	Artistic10                  = "Artistic-1.0"
	Artistic20                  = "Artistic-2.0"
	BCL                         = "BCL"
	Beerware                    = "Beerware"
	BSD2ClauseFreeBSD           = "BSD-2-Clause-FreeBSD"
	BSD2ClauseNetBSD            = "BSD-2-Clause-NetBSD"
	BSD2Clause                  = "BSD-2-Clause"
	BSD3ClauseAttribution       = "BSD-3-Clause-Attribution"
	BSD3ClauseClear             = "BSD-3-Clause-Clear"
	BSD3ClauseLBNL              = "BSD-3-Clause-LBNL"
	BSD3Clause                  = "BSD-3-Clause"
	BSD4Clause                  = "BSD-4-Clause"
	BSD4ClauseUC                = "BSD-4-Clause-UC"
	BSDProtection               = "BSD-Protection"
	BSL10                       = "BSL-1.0"
	CC010                       = "CC0-1.0"
	CCBY10                      = "CC-BY-1.0"
	CCBY20                      = "CC-BY-2.0"
	CCBY25                      = "CC-BY-2.5"
	CCBY30                      = "CC-BY-3.0"
	CCBY40                      = "CC-BY-4.0"
	CCBYNC10                    = "CC-BY-NC-1.0"
	CCBYNC20                    = "CC-BY-NC-2.0"
	CCBYNC25                    = "CC-BY-NC-2.5"
	CCBYNC30                    = "CC-BY-NC-3.0"
	CCBYNC40                    = "CC-BY-NC-4.0"
	CCBYNCND10                  = "CC-BY-NC-ND-1.0"
	CCBYNCND20                  = "CC-BY-NC-ND-2.0"
	CCBYNCND25                  = "CC-BY-NC-ND-2.5"
	CCBYNCND30                  = "CC-BY-NC-ND-3.0"
	CCBYNCND40                  = "CC-BY-NC-ND-4.0"
	CCBYNCSA10                  = "CC-BY-NC-SA-1.0"
	CCBYNCSA20                  = "CC-BY-NC-SA-2.0"
	CCBYNCSA25                  = "CC-BY-NC-SA-2.5"
	CCBYNCSA30                  = "CC-BY-NC-SA-3.0"
	CCBYNCSA40                  = "CC-BY-NC-SA-4.0"
	CCBYND10                    = "CC-BY-ND-1.0"
	CCBYND20                    = "CC-BY-ND-2.0"
	CCBYND25                    = "CC-BY-ND-2.5"
	CCBYND30                    = "CC-BY-ND-3.0"
	CCBYND40                    = "CC-BY-ND-4.0"
	CCBYSA10                    = "CC-BY-SA-1.0"
	CCBYSA20                    = "CC-BY-SA-2.0"
	CCBYSA25                    = "CC-BY-SA-2.5"
	CCBYSA30                    = "CC-BY-SA-3.0"
	CCBYSA40                    = "CC-BY-SA-4.0"
	CDDL10                      = "CDDL-1.0"
	CDDL11                      = "CDDL-1.1"
	CommonsClause               = "Commons-Clause"
	CPAL10                      = "CPAL-1.0"
	CPL10                       = "CPL-1.0"
	EGenix                      = "eGenix"
	EPL10                       = "EPL-1.0"
	EPL20                       = "EPL-2.0"
	EUPL10                      = "EUPL-1.0"
	EUPL11                      = "EUPL-1.1"
	Facebook2Clause             = "Facebook-2-Clause"
	Facebook3Clause             = "Facebook-3-Clause"
	FacebookExamples            = "Facebook-Examples"
	FreeImage                   = "FreeImage"
	FTL                         = "FTL"
	GPL10                       = "GPL-1.0"
	GPL20                       = "GPL-2.0"
	GPL20withautoconfexception  = "GPL-2.0-with-autoconf-exception"
	GPL20withbisonexception     = "GPL-2.0-with-bison-exception"
	GPL20withclasspathexception = "GPL-2.0-with-classpath-exception"
	GPL20withfontexception      = "GPL-2.0-with-font-exception"
	GPL20withGCCexception       = "GPL-2.0-with-GCC-exception"
	GPL30                       = "GPL-3.0"
	GPL30withautoconfexception  = "GPL-3.0-with-autoconf-exception"
	GPL30withGCCexception       = "GPL-3.0-with-GCC-exception"
	GUSTFont                    = "GUST-Font-License"
	ImageMagick                 = "ImageMagick"
	IPL10                       = "IPL-1.0"
	ISC                         = "ISC"
	LGPL20                      = "LGPL-2.0"
	LGPL21                      = "LGPL-2.1"
	LGPL30                      = "LGPL-3.0"
	LGPLLR                      = "LGPLLR"
	Libpng                      = "Libpng"
	Lil10                       = "Lil-1.0"
	LinuxOpenIB                 = "Linux-OpenIB"
	LPL102                      = "LPL-1.02"
	LPL10                       = "LPL-1.0"
	LPPL13c                     = "LPPL-1.3c"
	MIT                         = "MIT"
	MPL10                       = "MPL-1.0"
	MPL11                       = "MPL-1.1"
	MPL20                       = "MPL-2.0"
	MSPL                        = "MS-PL"
	NCSA                        = "NCSA"
	NPL10                       = "NPL-1.0"
	NPL11                       = "NPL-1.1"
	OFL11                       = "OFL-1.1"
	OpenSSL                     = "OpenSSL"
	OpenVision                  = "OpenVision"
	OSL10                       = "OSL-1.0"
	OSL11                       = "OSL-1.1"
	OSL20                       = "OSL-2.0"
	OSL21                       = "OSL-2.1"
	OSL30                       = "OSL-3.0"
	PHP301                      = "PHP-3.01"
	PHP30                       = "PHP-3.0"
	PIL                         = "PIL"
	PostgreSQL                  = "PostgreSQL"
	Python20complete            = "Python-2.0-complete"
	Python20                    = "Python-2.0"
	QPL10                       = "QPL-1.0"
	Ruby                        = "Ruby"
	SGIB10                      = "SGI-B-1.0"
	SGIB11                      = "SGI-B-1.1"
	SGIB20                      = "SGI-B-2.0"
	SISSL12                     = "SISSL-1.2"
	SISSL                       = "SISSL"
	Sleepycat                   = "Sleepycat"
	UnicodeTOU                  = "Unicode-TOU"
	UnicodeDFS2015              = "Unicode-DFS-2015"
	UnicodeDFS2016              = "Unicode-DFS-2016"
	Unlicense                   = "Unlicense"
	UPL10                       = "UPL-1.0"
	W3C19980720                 = "W3C-19980720"
	W3C20150513                 = "W3C-20150513"
	W3C                         = "W3C"
	WTFPL                       = "WTFPL"
	X11                         = "X11"
	Xnet                        = "Xnet"
	Zend20                      = "Zend-2.0"
	ZeroBSD                     = "0BSD"
	ZlibAcknowledgement         = "zlib-acknowledgement"
	Zlib                        = "Zlib"
	ZPL11                       = "ZPL-1.1"
	ZPL20                       = "ZPL-2.0"
	ZPL21                       = "ZPL-2.1"
)

Canonical names of the licenses.

const DefaultConfidenceThreshold = 0.80

DefaultConfidenceThreshold is the minimum confidence, expressed as a fraction between 0.0 and 1.0 (0.80 corresponds to 80%), that we're willing to accept in order to say that a match is good.

Variables

var (

	// LicenseTypes is a set of the types of licenses Google recognizes.
	LicenseTypes = sets.NewStringSet(
		"restricted",
		"reciprocal",
		"notice",
		"permissive",
		"unencumbered",
		"by_exception_only",
	)
)
var (
	// Normalizers is a list of functions that get applied to the strings
	// before they are registered with the string classifier.
	Normalizers = []stringclassifier.NormalizeFunc{
		html.UnescapeString,
		removeShebangLine,
		RemoveNonWords,
		NormalizeEquivalentWords,
		NormalizePunctuation,
		strings.ToLower,
		removeIgnorableTexts,
		stringclassifier.FlattenWhitespace,
		strings.TrimSpace,
	}
)
var ReadLicenseDir = licenses.ReadLicenseDir

ReadLicenseDir reads the directory containing the license files.

var ReadLicenseFile = licenses.ReadLicenseFile

ReadLicenseFile locates and reads the license archive file. Absolute paths are used unmodified. Relative paths are expected to be in the licenses directory of the licenseclassifier package.

Functions

func CopyrightHolder

func CopyrightHolder(contents string) string

CopyrightHolder finds a copyright notification, if it exists, and returns the copyright holder.

func LicenseType

func LicenseType(name string) string

LicenseType returns the type of the named license.

func NormalizeEquivalentWords

func NormalizeEquivalentWords(s string) string

NormalizeEquivalentWords normalizes equivalent words that are interchangeable.

func NormalizePunctuation

func NormalizePunctuation(s string) string

NormalizePunctuation takes all hyphens and quotes and normalizes them.
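
As a rough picture of what normalizing hyphens and quotes can look like, the sketch below maps a few typographic variants to plain ASCII with `strings.NewReplacer`. The set of characters handled and the name `normalizePunct` are assumptions made for illustration; the library's actual rules are not shown in this documentation and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// punctuation maps typographic quotes and dashes to ASCII forms.
// The exact character set is an assumption made for this sketch.
var punctuation = strings.NewReplacer(
	"\u2018", "'", // left single quotation mark
	"\u2019", "'", // right single quotation mark
	"\u201c", `"`, // left double quotation mark
	"\u201d", `"`, // right double quotation mark
	"\u2010", "-", // Unicode hyphen
	"\u2013", "-", // en dash
	"\u2014", "-", // em dash
)

// normalizePunct is a hypothetical stand-in, not the library function.
func normalizePunct(s string) string {
	return punctuation.Replace(s)
}

func main() {
	fmt.Println(normalizePunct("\u201cAS IS\u201d \u2013 non\u2010exclusive"))
}
```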

func RemoveNonWords

func RemoveNonWords(s string) string

RemoveNonWords removes non-words from the string.

func TrimExtraneousTrailingText

func TrimExtraneousTrailingText(s string) string

TrimExtraneousTrailingText removes text that follows an obvious end of the license and that contains no substantive license text.

Types

type License

type License struct {

	// Threshold is the lowest confidence percentage acceptable for the
	// classifier.
	Threshold float64
	// contains filtered or unexported fields
}

License is a classifier pre-loaded with known open source licenses.

func New

func New(threshold float64, options ...OptionFunc) (*License, error)

New creates a license classifier and pre-loads it with known open source licenses.

func NewWithForbiddenLicenses

func NewWithForbiddenLicenses(threshold float64, options ...OptionFunc) (*License, error)

NewWithForbiddenLicenses creates a license classifier and pre-loads it with only those known open source licenses that are forbidden.

func (*License) HasPublicDomainNotice

func (c *License) HasPublicDomainNotice(contents string) bool

HasPublicDomainNotice performs a simple regex over the contents to see if a public domain notice is in there. As you can imagine, this isn't 100% definitive, but can be useful if a license match isn't found.
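
For a sense of how such a regex check can behave, here is a minimal sketch. The pattern and the name `looksPublicDomain` are assumptions for illustration only; the expression the library actually uses is not given in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// publicDomainRE is an illustrative pattern: the words "public domain",
// case-insensitive, separated by any whitespace (including a newline).
// The library's real expression may differ.
var publicDomainRE = regexp.MustCompile(`(?i)\bpublic\s+domain\b`)

// looksPublicDomain is a hypothetical stand-in for the method above.
func looksPublicDomain(contents string) bool {
	return publicDomainRE.MatchString(contents)
}

func main() {
	fmt.Println(looksPublicDomain("This file is released into the Public\nDomain."))
	fmt.Println(looksPublicDomain("Licensed under the MIT license."))
}
```

A check like this can report false positives, for instance on text that merely mentions the public domain, which is why it is best treated as a fallback hint.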

func (*License) MultipleMatch

func (c *License) MultipleMatch(contents string, includeHeaders bool) stringclassifier.Matches

MultipleMatch matches all licenses within an unknown text.

func (*License) NearestMatch

func (c *License) NearestMatch(contents string) *stringclassifier.Match

NearestMatch returns the known license that is the "nearest" match to the given contents. The result carries the name of the license and a confidence value indicating how confident the classifier is in the result.

func (*License) WithinConfidenceThreshold

func (c *License) WithinConfidenceThreshold(conf float64) bool

WithinConfidenceThreshold returns true if the confidence value is greater than or equal to the confidence threshold.
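
The comparison described here can be sketched in a few lines. The `classifier` type and `withinThreshold` method below are local stand-ins written for this example; only the 0.80 value is taken from DefaultConfidenceThreshold.

```go
package main

import "fmt"

// defaultThreshold repeats the value of DefaultConfidenceThreshold.
const defaultThreshold = 0.80

// classifier is a hypothetical stand-in holding only the threshold.
type classifier struct {
	Threshold float64
}

// withinThreshold reports whether conf meets or exceeds the threshold.
func (c *classifier) withinThreshold(conf float64) bool {
	return conf >= c.Threshold
}

func main() {
	c := &classifier{Threshold: defaultThreshold}
	for _, conf := range []float64{1.0, 0.80, 0.79, 0.0} {
		fmt.Printf("confidence %.2f accepted: %t\n", conf, c.withinThreshold(conf))
	}
}
```

A confidence exactly equal to the threshold is accepted, since the comparison is inclusive.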

type OptionFunc

type OptionFunc func(l *License) error

OptionFunc sets options on a License struct.

func Archive

func Archive(f string) OptionFunc

Archive is an OptionFunc to specify the location of the license archive file.

func ArchiveBytes

func ArchiveBytes(b []byte) OptionFunc

ArchiveBytes is an OptionFunc that provides the contents of the license archive file. The caller must not overwrite the contents of b as it is not copied.

func ArchiveFunc

func ArchiveFunc(f func() ([]byte, error)) OptionFunc

ArchiveFunc is an OptionFunc that provides a function that must return the contents of the license archive file.
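
The constructor and option signatures above follow what is commonly called the functional options pattern. The sketch below shows the general shape of that pattern with local stand-in types. Everything in it (`classifier`, `option`, `withArchiveBytes`, `withArchiveFunc`, `newClassifier`, and the empty and nil checks) is hypothetical and is not the library's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// classifier is a hypothetical stand-in for a configurable classifier.
type classifier struct {
	threshold float64
	archive   func() ([]byte, error) // supplies the archive contents
}

// option configures a classifier and may reject invalid settings.
type option func(*classifier) error

// withArchiveBytes supplies the archive contents directly. The slice is
// not copied, so the caller must not modify it afterwards.
func withArchiveBytes(b []byte) option {
	return func(c *classifier) error {
		if len(b) == 0 { // validation added for this sketch
			return errors.New("empty archive")
		}
		c.archive = func() ([]byte, error) { return b, nil }
		return nil
	}
}

// withArchiveFunc supplies a function that returns the archive contents.
func withArchiveFunc(f func() ([]byte, error)) option {
	return func(c *classifier) error {
		if f == nil { // validation added for this sketch
			return errors.New("nil archive function")
		}
		c.archive = f
		return nil
	}
}

// newClassifier applies the options in order and stops at the first error.
func newClassifier(threshold float64, opts ...option) (*classifier, error) {
	c := &classifier{threshold: threshold}
	for _, opt := range opts {
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newClassifier(0.80, withArchiveBytes([]byte("stub archive")))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	data, _ := c.archive()
	fmt.Printf("threshold %.2f, archive of %d bytes\n", c.threshold, len(data))
}
```

Because options are plain functions, new configuration knobs can be added later without changing the constructor's signature.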

Directories

Path Synopsis
Package commentparser does a basic parse over a source file and returns all of the comments from the code.
language
Package language contains methods and information about the different programming languages the comment parser supports.
internal
sets
Package sets provides sets for storing collections of unique elements.
Package serializer normalizes the license text and calculates the hash values for all substrings in the license.
Package stringclassifier finds the nearest match between a string and a set of known values.
internal/pq
Package pq provides a priority queue.
searchset
Package searchset generates hashes for all substrings of a text.
searchset/tokenizer
Package tokenizer converts a text into a stream of tokens.
tools
identify_license
The identify_license program tries to identify the license type of an unknown license.
identify_license/backend
Package backend contains the necessary functions to classify a license.
identify_license/results
Package results contains the result type returned by the classifier backend.
license_serializer
The license_serializer program normalizes and serializes the known licenseclassifier licenses into a compressed archive.
