fasttld

package module

Version: v0.2.1 | Published: Jun 2, 2022 | License: BSD-3-Clause | Imports: 17 | Imported by: 6

Repository

github.com/elliotwutingfeng/go-fasttld

README

go-fasttld

go-fasttld is a high-performance top-level domain (TLD) extraction module implemented with compressed tries.

This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs, IPv4 addresses, and IPv6 addresses.

Background

go-fasttld extracts subcomponents like top level domains (TLDs), subdomains and hostnames from URLs efficiently by using the regularly-updated Mozilla Public Suffix List and the compressed trie data structure.

For example, it extracts the com TLD, maps subdomain, and google domain from https://maps.google.com:8080/a/long/path/?query=42.

go-fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List like 'blogspot.co.uk' and 'sinaapp.com', extraction of IPv4 addresses, and extraction of IPv6 addresses.

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element only works for simple TLDs like .com, but not more complex ones like oseto.nagasaki.jp.
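
To make the problem concrete, here is a minimal sketch of the naive approach. The function name naiveTLD is made up for this illustration and is not part of go-fasttld.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveTLD returns the last dot-separated label of host.
// This is the naive approach that the text above warns about.
func naiveTLD(host string) string {
	labels := strings.Split(host, ".")
	return labels[len(labels)-1]
}

func main() {
	// Single-label suffix: the naive answer happens to be right.
	fmt.Println(naiveTLD("maps.google.com")) // com

	// Multi-label suffix: the naive answer is "jp",
	// but the suffix we actually want is oseto.nagasaki.jp.
	fmt.Println(naiveTLD("example.oseto.nagasaki.jp")) // jp
}
```

Knowing where the suffix really starts requires a lookup against the Public Suffix List, which is what the trie described next provides.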

Compressed trie example

Valid TLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label, in reverse order (rightmost label first).

Given the following TLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid TLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted TLD nsw.edu.au.
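
The right-to-left walk can be sketched in a few lines of Go. This is a simplified illustration, not go-fasttld's internal code: it uses a plain map-based trie with one label per node (no compression), the names node, insert and longestSuffix are hypothetical, and the wildcard and exception rules of the real Public Suffix List are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one label in a simplified (uncompressed) suffix trie.
type node struct {
	children map[string]*node
	isSuffix bool // true if the path to this node is a valid TLD (the flag in the diagram)
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds a suffix such as "nsw.edu.au" label by label, rightmost first.
func (n *node) insert(suffix string) {
	labels := strings.Split(suffix, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.isSuffix = true
}

// longestSuffix walks host from right to left and returns the longest
// flagged path, or "" if no suffix matches.
func (n *node) longestSuffix(host string) string {
	labels := strings.Split(host, ".")
	cur := n
	best := len(labels) // index at which the best suffix found so far starts
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break // no more matching nodes
		}
		cur = next
		if cur.isSuffix {
			best = i
		}
	}
	return strings.Join(labels[best:], ".")
}

func main() {
	root := newNode()
	for _, s := range []string{"au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"} {
		root.insert(s)
	}
	fmt.Println(root.longestSuffix("example.nsw.edu.au")) // nsw.edu.au
	fmt.Println(root.longestSuffix("example.edu.au"))     // au
}
```

Note that the walk remembers the last flagged node rather than the last matched node: for example.edu.au the edu node matches but is not a valid TLD on its own, so the result falls back to au.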

Installation

go get github.com/elliotwutingfeng/go-fasttld

Quick Start

Full demo available in the examples folder

Domain

package main

import (
	"fmt"
	"log"

	fasttld "github.com/elliotwutingfeng/go-fasttld"
)

func main() {
	// Initialise fasttld extractor
	extractor, err := fasttld.New(fasttld.SuffixListParams{})
	if err != nil {
		log.Fatal(err)
	}

	// Extract URL subcomponents
	url := "https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42"
	res := extractor.Extract(fasttld.URLParams{URL: url})

	// Display results
	fmt.Println(res.Scheme)           // https://
	fmt.Println(res.UserInfo)         // some-user
	fmt.Println(res.SubDomain)        // a.long.subdomain
	fmt.Println(res.Domain)           // ox
	fmt.Println(res.Suffix)           // ac.uk
	fmt.Println(res.RegisteredDomain) // ox.ac.uk
	fmt.Println(res.Port)             // 5000
	fmt.Println(res.Path)             // /a/b/c/d/e/f/g/h/i?id=42
}

IPv4 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://127.0.0.1:5000"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = 127.0.0.1
// res.Suffix = <no output>
// res.RegisteredDomain = 127.0.0.1
// res.Port = 5000
// res.Path = <no output>

IPv6 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]:5000"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = aBcD:ef01:2345:6789:aBcD:ef01:2345:6789
// res.Suffix = <no output>
// res.RegisteredDomain = aBcD:ef01:2345:6789:aBcD:ef01:2345:6789
// res.Port = 5000
// res.Path = <no output>
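
In both IP examples the address lands in Domain and Suffix stays empty, so the extractor has to recognise IP hosts before any suffix lookup. As a rough illustration of that classification step, the sketch below uses the standard library's net/netip package. This is an assumption made for illustration and not necessarily how go-fasttld detects IP addresses; classifyHost is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// classifyHost reports whether host is an IPv4 address, an IPv6 address
// (optionally wrapped in square brackets), or an ordinary host name.
func classifyHost(host string) string {
	host = strings.TrimSuffix(strings.TrimPrefix(host, "["), "]")
	addr, err := netip.ParseAddr(host)
	switch {
	case err != nil:
		return "name"
	case addr.Is4():
		return "ipv4"
	default:
		return "ipv6"
	}
}

func main() {
	fmt.Println(classifyHost("127.0.0.1"))                                 // ipv4
	fmt.Println(classifyHost("[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]")) // ipv6
	fmt.Println(classifyHost("maps.google.com"))                           // name
}
```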

Internationalised label separators

go-fasttld supports the following internationalised label separators (IETF RFC 3490):

U+002E (full stop)
U+3002 (ideographic full stop)
U+FF0E (fullwidth full stop)
U+FF61 (halfwidth ideographic full stop)

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = brb\u002ei\u3002am\uff0egoing\uff61to
// res.Domain = be
// res.Suffix = a\uff61fk
// res.RegisteredDomain = be\u3002a\uff61fk
// res.Port = <no output>
// res.Path = <no output>
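
Splitting a host on all four separators can be sketched with the standard library alone. The helper names isLabelSeparator and splitLabels are hypothetical; go-fasttld's own handling may differ (the example above shows that it preserves the original separators in its output, which this sketch does not).

```go
package main

import (
	"fmt"
	"strings"
)

// isLabelSeparator reports whether r is one of the four separators listed above.
func isLabelSeparator(r rune) bool {
	switch r {
	case '\u002e', '\u3002', '\uff0e', '\uff61':
		return true
	}
	return false
}

// splitLabels splits host into labels on any supported separator.
func splitLabels(host string) []string {
	return strings.FieldsFunc(host, isLabelSeparator)
}

func main() {
	labels := splitLabels("brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk")
	fmt.Println(labels) // [brb i am going to be a fk]
}
```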

Public Suffix List options

Specify custom public suffix list file

You can use a custom public suffix list file by setting CacheFilePath in fasttld.SuffixListParams{} to its absolute path.

cacheFilePath := "/absolute/path/to/file.dat"
extractor, _ := fasttld.New(fasttld.SuffixListParams{CacheFilePath: cacheFilePath})

Updating the default Public Suffix List cache

Whenever fasttld.New is called without specifying CacheFilePath in fasttld.SuffixListParams{}, the local cache of the default Public Suffix List is updated automatically if it is more than 3 days old. You can also manually update the cache by using Update().

// Automatic update performed if `CacheFilePath` is not specified
// and local cache is more than 3 days old
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Manually update local cache
if err := extractor.Update(); err != nil {
    log.Println(err)
}
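
The "more than 3 days old" rule amounts to comparing a timestamp against a threshold. The sketch below assumes the age is taken from the cache file's modification time, which is a guess for illustration and not confirmed by the documentation. The helpers isStale and cacheIsStale and the constant maxCacheAge are hypothetical and are not part of the fasttld API.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// maxCacheAge mirrors the 3-day threshold described above.
const maxCacheAge = 3 * 24 * time.Hour

// isStale reports whether something last modified at modTime
// is older than maxAge when observed at now.
func isStale(modTime, now time.Time, maxAge time.Duration) bool {
	return now.Sub(modTime) > maxAge
}

// cacheIsStale checks the file at path against maxCacheAge.
// A missing or unreadable file is treated as stale.
func cacheIsStale(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return true
	}
	return isStale(info.ModTime(), time.Now(), maxCacheAge)
}

func main() {
	now := time.Now()
	fmt.Println(isStale(now.Add(-24*time.Hour), now, maxCacheAge))   // false: 1 day old
	fmt.Println(isStale(now.Add(-4*24*time.Hour), now, maxCacheAge)) // true: 4 days old
	fmt.Println(cacheIsStale("/no/such/cache.dat"))                  // true: nothing cached yet
}
```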

Private domains

According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.com and sinaapp.com.

By default, go-fasttld does not treat these private domains as suffixes (i.e. IncludePrivateSuffix = false):

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = google
// res.Domain = blogspot
// res.Suffix = com
// res.RegisteredDomain = blogspot.com
// res.Port = <no output>
// res.Path = <no output>

You can treat private domains as suffixes by setting IncludePrivateSuffix = true:

extractor, _ := fasttld.New(fasttld.SuffixListParams{IncludePrivateSuffix: true})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = blogspot.com
// res.RegisteredDomain = google.blogspot.com
// res.Port = <no output>
// res.Path = <no output>
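
The Public Suffix List file separates ICANN and private rules with "// ===BEGIN PRIVATE DOMAINS===" and "// ===END PRIVATE DOMAINS===" marker comments. The sketch below shows one way an option like IncludePrivateSuffix could be applied while reading the file. The function parseSuffixes is hypothetical and the sample text is a tiny hand-written excerpt; this is not go-fasttld's actual parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSuffixes reads Public Suffix List text and returns its rules.
// Rules inside the private section are kept only if includePrivate is true.
func parseSuffixes(list string, includePrivate bool) []string {
	var suffixes []string
	inPrivate := false
	sc := bufio.NewScanner(strings.NewReader(list))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.Contains(line, "===BEGIN PRIVATE DOMAINS==="):
			inPrivate = true
		case strings.Contains(line, "===END PRIVATE DOMAINS==="):
			inPrivate = false
		case line == "" || strings.HasPrefix(line, "//"):
			// skip blank lines and comments
		case inPrivate && !includePrivate:
			// skip private rules when they are not wanted
		default:
			suffixes = append(suffixes, line)
		}
	}
	return suffixes
}

func main() {
	// A tiny hand-written excerpt in the Public Suffix List format.
	const sample = `// ===BEGIN ICANN DOMAINS===
com
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.com
// ===END PRIVATE DOMAINS===
`
	fmt.Println(parseSuffixes(sample, false)) // [com]
	fmt.Println(parseSuffixes(sample, true))  // [com blogspot.com]
}
```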

Extraction options

Ignore Subdomains

You can ignore subdomains by setting IgnoreSubDomains = true. By default, subdomains are extracted.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://maps.google.com"
res := extractor.Extract(fasttld.URLParams{URL: url, IgnoreSubDomains: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = com
// res.RegisteredDomain = google.com
// res.Port = <no output>
// res.Path = <no output>

Punycode

Convert internationalised URLs to punycode before extraction by setting ConvertURLToPunyCode = true. By default, URLs are not converted to punycode.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://hello.世界.com"
res := extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = xn--rhqv96g
// res.Suffix = com
// res.RegisteredDomain = xn--rhqv96g.com
// res.Port = <no output>
// res.Path = <no output>

res = extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: false})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = 世界
// res.Suffix = com
// res.RegisteredDomain = 世界.com
// res.Port = <no output>
// res.Path = <no output>

Testing

go test -v -coverprofile=test_coverage.out && go tool cover -html=test_coverage.out -o test_coverage.html

Benchmarks

go test -bench=. -benchmem -cpu 1

Modules used

Benchmark Name	Source
GoFastTld	go-fasttld (this module)
JPilloraGoTld	github.com/jpillora/go-tld
JoeGuoTldExtract	github.com/joeguo/tldextract
Mjd2021USATldExtract	github.com/mjd2021usa/tldextract
M507Tlde	github.com/M507/tlde

Results

Benchmarks performed on AMD Ryzen 7 5800X, Manjaro Linux.

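go-fasttld performs especially well on longer URLs: its time per operation stays roughly constant (about 480 to 540 ns/op) across the three URLs below, while every other module is markedly slower on the longest URL (#3). On the shortest URL (#1) it is slightly slower than JPilloraGoTld and M507Tlde.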

#1

https://news.google.com

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	2252048	532.2 ns/op	144 B/op	2 allocs/op
JPilloraGoTld	2409051	493.7 ns/op	224 B/op	2 allocs/op	✔
JoeGuoTldExtract	1330557	899.2 ns/op	208 B/op	7 allocs/op
Mjd2021USATldExtract	1430215	839.8 ns/op	208 B/op	7 allocs/op
M507Tlde	2406926	499.1 ns/op	160 B/op	5 allocs/op

#2

https://iupac.org/iupac-announces-the-2021-top-ten-emerging-technologies-in-chemistry/

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	2510493	484.2 ns/op	144 B/op	2 allocs/op	✔
JPilloraGoTld	1654834	728.2 ns/op	224 B/op	2 allocs/op
JoeGuoTldExtract	1381051	859.7 ns/op	288 B/op	6 allocs/op
Mjd2021USATldExtract	1492414	802.7 ns/op	288 B/op	6 allocs/op
M507Tlde	2134239	565.1 ns/op	272 B/op	5 allocs/op

#3

https://www.google.com/maps/dir/Parliament+Place,+Parliament+House+Of+Singapore,+Singapore/Parliament+St,+London,+UK/@25.2440033,33.6721455,4z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x31da19a0abd4d71d:0xeda26636dc4ea1dc!2m2!1d103.8504863!2d1.2891543!1m5!1m1!1s0x487604c5aaa7da5b:0xf13a2197d7e7dd26!2m2!1d-0.1260826!2d51.5017061!3e4

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	2236302	536.0 ns/op	144 B/op	2 allocs/op	✔
JPilloraGoTld	395188	2760 ns/op	928 B/op	4 allocs/op
JoeGuoTldExtract	779056	1357 ns/op	1120 B/op	6 allocs/op
Mjd2021USATldExtract	785701	1337 ns/op	1120 B/op	6 allocs/op
M507Tlde	803700	1358 ns/op	1120 B/op	6 allocs/op

Documentation

Overview

Package fasttld is a high-performance top-level domain (TLD) extraction module implemented with compressed tries.

This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs, IPv4 addresses, and IPv6 addresses.

Index

Constants
type ExtractResult
type FastTLD
- func New(n SuffixListParams) (*FastTLD, error)
- func (f *FastTLD) Extract(e URLParams) *ExtractResult
- func (t *FastTLD) Update() error
type SuffixListParams
type URLParams

Constants

const (
	IPv4len = 4
	IPv6len = 16
)

IP address lengths (bytes).

Types

type ExtractResult

type ExtractResult struct {
	Scheme, UserInfo, SubDomain, Domain, Suffix, Port, Path, RegisteredDomain string
}

ExtractResult contains components extracted from URL.

type FastTLD (added in v0.0.2)

type FastTLD struct {
	TldTrie *trie
	// contains filtered or unexported fields
}

FastTLD provides the Extract() method, which extracts URL components using the TldTrie generated from the Public Suffix List file at cacheFilePath.

func New

func New(n SuffixListParams) (*FastTLD, error)

New creates a new *FastTLD.

func (*FastTLD) Extract (added in v0.0.2)

func (f *FastTLD) Extract(e URLParams) *ExtractResult

Extract extracts components from the URL given in e.

func (*FastTLD) Update (added in v0.0.2)

func (t *FastTLD) Update() error

Update updates the local cache of the default Public Suffix List. It applies only when no custom file path was specified via CacheFilePath.

type SuffixListParams

type SuffixListParams struct {
	CacheFilePath        string
	IncludePrivateSuffix bool
}

SuffixListParams contains parameters for specifying path to Public Suffix List file and whether to extract private suffixes (e.g. blogspot.com).

type URLParams (added in v0.0.2)

type URLParams struct {
	URL                  string
	IgnoreSubDomains     bool
	ConvertURLToPunyCode bool
}

URLParams specifies URL to extract components from.

If IgnoreSubDomains = true, do not extract SubDomain.

If ConvertURLToPunyCode = true, convert non-ASCII characters like 世界 to punycode.

Directories

Path	Synopsis
examples