sitemapper

README

SiteMapper

Parallel web crawler implemented in Golang for producing site maps

Installation

go get -u github.com/Matt-Esch/sitemapper

Quick Start

You can use the package to read a site map from a given URL or you can compile and use the provided binary.

Basic Usage

package main

import (
	"log"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	siteMap, err := sitemapper.CrawlDomain("https://monzo.com")
	if err != nil {
		log.Fatalf("Error: %s", err)
	}

	siteMap.WriteMap(os.Stdout)
}

Binary usage

The package provides a binary to run the crawler from the command line:

go install github.com/Matt-Esch/sitemapper/cmd/sitemapper
sitemapper -u "http://todomvc.com"

http://todomvc.com
http://todomvc.com/
http://todomvc.com/examples/angular-dart/web
http://todomvc.com/examples/angular-dart/web/
http://todomvc.com/examples/angular2
http://todomvc.com/examples/angular2/
http://todomvc.com/examples/angularjs
http://todomvc.com/examples/angularjs/
http://todomvc.com/examples/angularjs_require
http://todomvc.com/examples/angularjs_require/

...

For a full list of options, run sitemapper -h:

  -c int
        maximum concurrency (default 8)
  -d    enable debug logs
  -k duration
        http keep alive timeout (default 30s)
  -t duration
        http request timeout (default 30s)
  -u string
        url to crawl (required)
  -v    enable verbose logging
  -w duration
        maximum crawl time

Brief implementation outline

  • The bulk of the implementation is found in ./sitemapper.go

  • Tests and benchmarks are defined in ./sitemapper_test.go

  • A test server is defined in ./test/server and is used to create a crawlable website that listens on localhost on a random port. This website adds various traps, such as links to external domains, in order to test the crawler.

  • The binary to run the web crawler from the command line is defined under ./cmd/sitemapper/main.go

Design choices and limitations:

  • The web crawler runs in parallel with bounded concurrency. A channel of URLs is consumed by a fixed number of goroutines. These goroutines make an HTTP GET request to the received URL, parse the response for anchor (<a>) tags, and push previously unseen URLs into the URL channel for further consumption. A simplified sketch of this pattern appears after this list.

  • The web crawler populates the site map with new URLs before making a request to each new URL. This means that non-existent pages (404s) and non-web-page links (e.g. links to PDFs) will appear in the site map.

  • By default, the "same domain" check compares only the "host" portion of the URL. The scheme (http/https) is ignored, even though a scheme mismatch would be considered cross-origin. A universally acceptable definition of "same domain" is hard to pin down; the most accurate approaches may even resort to DNS lookups. For that reason, a sensible default is provided, and it can be overridden by the caller.
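
The sketch below illustrates the bounded-concurrency pattern described in the first point above. It is not the package's internal code: crawl, fetchLinks, maxPending and the toy link graph in main are illustrative stand-ins for the real fetching and parsing logic.

package main

import (
	"fmt"
	"sync"
)

// crawl visits every URL reachable from root, using a fixed number of
// worker goroutines that share one bounded channel of pending URLs.
func crawl(root string, workers, maxPending int, fetchLinks func(string) []string) []string {
	urls := make(chan string, maxPending) // bounded queue of pending URLs
	seen := map[string]bool{root: true}   // URLs already added to the "site map"
	var mu sync.Mutex
	var wg sync.WaitGroup

	wg.Add(1)
	urls <- root

	for i := 0; i < workers; i++ {
		go func() {
			for u := range urls {
				for _, link := range fetchLinks(u) { // stand-in for GET + <a> parsing
					mu.Lock()
					dup := seen[link]
					seen[link] = true
					mu.Unlock()
					if dup {
						continue
					}
					wg.Add(1)
					select {
					case urls <- link: // queue for another worker
					default:
						wg.Done() // queue full: this sketch simply drops the link
					}
				}
				wg.Done()
			}
		}()
	}

	wg.Wait()   // every queued URL has been processed
	close(urls) // let the workers exit

	out := make([]string, 0, len(seen))
	for u := range seen {
		out = append(out, u)
	}
	return out
}

func main() {
	// Toy link graph standing in for real pages.
	pages := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b"},
		"/b": {"/"},
	}
	for _, u := range crawl("/", 8, 8192, func(u string) []string { return pages[u] }) {
		fmt.Println(u)
	}
}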

License

Released under the MIT License.

Documentation

Overview

Package sitemapper provides a parallel site crawler for producing site maps.

Index

Constants

View Source
const DefaultCrawlTimeout = time.Duration(0)

DefaultCrawlTimeout limits the total amount of time spent crawling. When 0 there is no limit.

View Source
const DefaultKeepAlive = time.Second * 30

DefaultKeepAlive is the default keepalive timeout for client connections.

View Source
const DefaultMaxConcurrency = 8

DefaultMaxConcurrency sets the number of goroutines to be used to crawl pages. This default is used to configure the transport of the default http client so that there are enough connections to support the number of goroutines used.

View Source
const DefaultMaxPendingURLS = 8192

DefaultMaxPendingURLS limits the size of the URL queue. This prevents the queue from growing faster than it can be drained. This wouldn't normally be expected to happen, but there could be cases where URLs are poorly designed and contain data that changes on every page load.

View Source
const DefaultTimeout = time.Second * 10

DefaultTimeout is the default timeout used by the http client if no other timeout is specified.

Variables

This section is empty.

Functions

func ValidateHosts

func ValidateHosts(root, link *url.URL) bool

ValidateHosts provides a default domain validation function that compares the host components of the provided URLs.

Types

type Config

type Config struct {
	MaxConcurrency  int
	MaxPendingURLS  int
	CrawlTimeout    time.Duration
	KeepAlive       time.Duration
	Timeout         time.Duration
	Client          *http.Client
	Logger          *zap.Logger
	DomainValidator DomainValidator
}

Config is a struct of crawler configuration options.

func NewConfig

func NewConfig(options ...Option) *Config

NewConfig creates a Config from the specified options and provides defaults for any options that are not specified.

func (*Config) Validate

func (config *Config) Validate() error

Validate checks the configuration options for validation issues.
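
As an illustration, a Config can be assembled from options and validated before use. This is a minimal sketch using only the documented options; the specific values (16 workers, a 5 second request timeout, a 2 minute crawl limit) are arbitrary.

package main

import (
	"log"
	"time"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	// Build a Config from options; unspecified options keep their defaults.
	cfg := sitemapper.NewConfig(
		sitemapper.SetMaxConcurrency(16),          // more workers than the default 8
		sitemapper.SetTimeout(5*time.Second),      // per-request timeout
		sitemapper.SetCrawlTimeout(2*time.Minute), // stop crawling after two minutes
	)

	// Validate reports configuration problems before the crawl starts.
	if err := cfg.Validate(); err != nil {
		log.Fatalf("invalid config: %s", err)
	}
}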

type DomainCrawler

type DomainCrawler struct {
	// contains filtered or unexported fields
}

DomainCrawler contains the state of a domain web crawler. The domain crawler exposes a Crawl method which produces a site map.

func NewDomainCrawler

func NewDomainCrawler(root *url.URL, config *Config) (*DomainCrawler, error)

NewDomainCrawler creates a new DomainCrawler from the root URL and the given configuration.

func (*DomainCrawler) Crawl

func (crawler *DomainCrawler) Crawl() (*SiteMap, error)

Crawl reads all links in the domain with the specified concurrency and returns a site map. Note that Crawl is not thread safe and each caller must create a separate DomainCrawler.
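
A minimal sketch of driving a crawl through DomainCrawler directly; the root URL is a placeholder and the default configuration is used.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	root, err := url.Parse("https://example.com") // placeholder root URL
	if err != nil {
		log.Fatalf("bad url: %s", err)
	}

	crawler, err := sitemapper.NewDomainCrawler(root, sitemapper.NewConfig())
	if err != nil {
		log.Fatalf("crawler: %s", err)
	}

	// Crawl is not thread safe, so this crawler is used by a single caller.
	siteMap, err := crawler.Crawl()
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}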

type DomainValidator

type DomainValidator interface {
	Validate(root *url.URL, link *url.URL) bool
}

A DomainValidator provides a Validate function for comparing two URLs for same-domain inclusion. This allows for custom behavior such as checking the scheme (http vs https) or performing DNS lookups.

type DomainValidatorFunc

type DomainValidatorFunc func(root, link *url.URL) bool

DomainValidatorFunc acts as an adapter for allowing the use of ordinary functions as domain validators.

func (DomainValidatorFunc) Validate

func (v DomainValidatorFunc) Validate(root, link *url.URL) bool

Validate calls v(root, link).

type LinkReader

type LinkReader struct {
	// contains filtered or unexported fields
}

LinkReader is an iterative structure that allows for reading all href attributes in a given URL. The link reader makes the HTTP request to the specified URL and allows for reading through all of the links in the returned page. When there are no more links in the page, Read returns io.EOF. The consumer is responsible for closing the LinkReader when done to ensure any outstanding client HTTP requests are cleaned up.

func NewLinkReader

func NewLinkReader(pageURL *url.URL, client *http.Client) *LinkReader

NewLinkReader returns a LinkReader for the specified URL, fetching the content with the specified client.

func (*LinkReader) Close

func (u *LinkReader) Close() error

Close cleans up any remaining client response. If all links are read from the link reader, the body is closed automatically; however, if only the first N links are required, the body must be closed by the caller.

func (*LinkReader) Read

func (u *LinkReader) Read() (string, error)

Read returns the next href in the HTML document.

func (*LinkReader) URL

func (u *LinkReader) URL() string

URL returns the read-only URL string that was used to make the client request.
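
A minimal sketch of iterating over a page's links with LinkReader, using the default HTTP client; the page URL is a placeholder.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	pageURL, err := url.Parse("https://example.com") // placeholder page URL
	if err != nil {
		log.Fatalf("bad url: %s", err)
	}

	r := sitemapper.NewLinkReader(pageURL, http.DefaultClient)
	defer r.Close() // required if reading stops before the last link

	for {
		href, err := r.Read()
		if err == io.EOF {
			break // no more links in the page
		}
		if err != nil {
			log.Fatalf("read %s: %s", r.URL(), err)
		}
		fmt.Println(href)
	}
}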

type Option

type Option interface {
	// contains filtered or unexported methods
}

Option is used to set optional configuration values.

func SetClient

func SetClient(client *http.Client) Option

SetClient overrides the default client config. Note that if a custom client is set, the KeepAlive and Timeout options have no effect; the keep-alive and timeout settings of the supplied client take precedence.
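
For example, a crawl can be run with a custom client. This sketch only sets the client's request timeout, and the root URL is a placeholder.

package main

import (
	"log"
	"net/http"
	"os"
	"time"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	// Because a custom client is supplied, its own timeout applies and
	// SetKeepAlive/SetTimeout would have no effect.
	client := &http.Client{Timeout: 5 * time.Second}

	siteMap, err := sitemapper.CrawlDomain(
		"https://example.com", // placeholder root URL
		sitemapper.SetClient(client),
	)
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}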

func SetCrawlTimeout

func SetCrawlTimeout(crawlTimeout time.Duration) Option

SetCrawlTimeout sets the maximum time spent crawling URLs. When the timeout is zero or negative, no timeout is applied and the caller will wait for completion. If the timeout fires, the caller will receive the partial site map.

func SetDomainValidator

func SetDomainValidator(validator DomainValidator) Option

SetDomainValidator overrides the default domain validator. The default validator compares only the host component of the URLs; it does not check the scheme or perform DNS lookups.
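
As an illustration, a stricter validator that also compares schemes can be supplied through DomainValidatorFunc; sameOrigin and the root URL below are hypothetical.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

// sameOrigin is stricter than the default host-only check: it also
// requires the scheme to match.
func sameOrigin(root, link *url.URL) bool {
	return root.Scheme == link.Scheme && root.Host == link.Host
}

func main() {
	siteMap, err := sitemapper.CrawlDomain(
		"https://example.com", // placeholder root URL
		sitemapper.SetDomainValidator(sitemapper.DomainValidatorFunc(sameOrigin)),
	)
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}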

func SetKeepAlive

func SetKeepAlive(keepAlive time.Duration) Option

SetKeepAlive sets the http client connection keep alive timeout when the default http client is used.

func SetLogger

func SetLogger(logger *zap.Logger) Option

SetLogger overrides the default logger. The default logger is configured to write warning and error logs to stderr.

func SetMaxConcurrency

func SetMaxConcurrency(maxConcurrency int) Option

SetMaxConcurrency sets the number of goroutines that will be used. This is also used to configure the default http client with enough open connections to support this number of goroutines.

func SetMaxPendingURLS

func SetMaxPendingURLS(maxPendingURLS int) Option

SetMaxPendingURLS sets the maximum number of URLs that can persist in the queue for crawling. This sets the size of the channel of URLs being processed by the goroutines and helps prevent cases where the number of URLs grows indefinitely due to dynamic URLs in page links.

func SetTimeout

func SetTimeout(timeout time.Duration) Option

SetTimeout sets the http client request timeout when the default http client is used.

type SiteMap

type SiteMap struct {
	// contains filtered or unexported fields
}

SiteMap contains the state of a site map.

func CrawlDomain

func CrawlDomain(rootURL string, opts ...Option) (*SiteMap, error)

CrawlDomain crawls a domain provided as a string URL. It wraps a call to CrawlDomainWithURL.

func CrawlDomainWithURL

func CrawlDomainWithURL(root *url.URL, opts ...Option) (*SiteMap, error)

CrawlDomainWithURL crawls a domain provided as a URL and returns the resulting sitemap.

func NewSiteMap

func NewSiteMap(url *url.URL, validator DomainValidator) *SiteMap

NewSiteMap initializes a new SiteMap anchored at the specified URL, using the given DomainValidator to decide which links belong to the domain.

func (*SiteMap) WriteMap

func (s *SiteMap) WriteMap(out io.Writer)

WriteMap writes the ordered site map to a given writer.

Directories

Path Synopsis
cmd
test
