sitemapper

README

SiteMapper

Parallel web crawler implemented in Golang for producing site maps

Installation

go get -u github.com/Matt-Esch/sitemapper

Quick Start

You can use the package to read a site map from a given URL or you can compile and use the provided binary.

Basic Usage

package main

import (
	"log"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	siteMap, err := sitemapper.CrawlDomain("https://monzo.com")
	if err != nil {
		log.Fatalf("Error: %s", err)
	}

	siteMap.WriteMap(os.Stdout)
}

Binary usage

The package provides a binary to run the crawler from the command line:

go install github.com/Matt-Esch/sitemapper/cmd/sitemapper
sitemapper -u "http://todomvc.com"

http://todomvc.com
http://todomvc.com/
http://todomvc.com/examples/angular-dart/web
http://todomvc.com/examples/angular-dart/web/
http://todomvc.com/examples/angular2
http://todomvc.com/examples/angular2/
http://todomvc.com/examples/angularjs
http://todomvc.com/examples/angularjs/
http://todomvc.com/examples/angularjs_require
http://todomvc.com/examples/angularjs_require/

...

For a full list of options, run sitemapper -h:

  -c int
        maximum concurrency (default 8)
  -d    enable debug logs
  -k duration
        http keep alive timeout (default 30s)
  -t duration
        http request timeout (default 30s)
  -u string
        url to crawl (required)
  -v    enable verbose logging
  -w duration
        maximum crawl time

Brief implementation outline

  • The bulk of the implementation is found in ./sitemapper.go

  • Tests and benchmarks are defined in ./sitemapper_test.go

  • A test server is defined in ./test/server and is used to create a crawlable website that listens on localhost on a random port. This website adds various traps, such as links to external domains, in order to test the crawler.

  • The binary to run the web crawler from the command line is defined under ./cmd/sitemapper/main.go

Design choices and limitations:

  • The web crawler runs in parallel with bounded concurrency. A channel of URLs is consumed by a fixed number of goroutines. These goroutines make an HTTP GET request to the received URL, parse the response for anchor (<a>) tags, and push previously unseen URLs into the URL channel for further consumption. A simplified sketch of this pattern appears after this list.

  • The web crawler populates the site map with new URLs before making a request to each new URL. This means that non-existent pages (404s) and non-web-page links (e.g. links to PDFs) will appear in the site map.

  • By default, the "same domain" check compares only the "host" portion of the URL. The scheme (http/https) is ignored, even though a scheme mismatch would be considered cross-origin. A universally acceptable definition of "same domain" is hard to pin down; the most accurate approaches may even resort to DNS lookups. For that reason, a sensible default is provided, and it can be overridden by the caller.
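
The sketch below illustrates the bounded-concurrency pattern described in the first point above. It is not the package's internal code: crawl, fetchLinks, maxPending and the toy link graph in main are illustrative stand-ins for the real fetching and parsing logic.

package main

import (
	"fmt"
	"sync"
)

// crawl visits every URL reachable from root, using a fixed number of
// worker goroutines that share one bounded channel of pending URLs.
func crawl(root string, workers, maxPending int, fetchLinks func(string) []string) []string {
	urls := make(chan string, maxPending) // bounded queue of pending URLs
	seen := map[string]bool{root: true}   // URLs already added to the "site map"
	var mu sync.Mutex
	var wg sync.WaitGroup

	wg.Add(1)
	urls <- root

	for i := 0; i < workers; i++ {
		go func() {
			for u := range urls {
				for _, link := range fetchLinks(u) { // stand-in for GET + <a> parsing
					mu.Lock()
					dup := seen[link]
					seen[link] = true
					mu.Unlock()
					if dup {
						continue
					}
					wg.Add(1)
					select {
					case urls <- link: // queue for another worker
					default:
						wg.Done() // queue full: this sketch simply drops the link
					}
				}
				wg.Done()
			}
		}()
	}

	wg.Wait()   // every queued URL has been processed
	close(urls) // let the workers exit

	out := make([]string, 0, len(seen))
	for u := range seen {
		out = append(out, u)
	}
	return out
}

func main() {
	// Toy link graph standing in for real pages.
	pages := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b"},
		"/b": {"/"},
	}
	for _, u := range crawl("/", 8, 8192, func(u string) []string { return pages[u] }) {
		fmt.Println(u)
	}
}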

License

Released under the MIT License.

Documentation

Overview

Package sitemapper provides a parallel site crawler for producing site maps.

Index

Constants

View Source
const DefaultCrawlTimeout = time.Duration(0)

DefaultCrawlTimeout limits the total amount of time spent crawling. When 0 there is no limit.

View Source
const DefaultKeepAlive = time.Second * 30

DefaultKeepAlive is the default keepalive timeout for client connections.

View Source
const DefaultMaxConcurrency = 8

DefaultMaxConcurrency sets the number of goroutines to be used to crawl pages. This default is used to configure the transport of the default http client so that there are enough connections to support the number of goroutines used.

View Source
const DefaultMaxPendingURLS = 8192

DefaultMaxPendingURLS limits the size of the URL queue. This prevents the queue from growing faster than it can be drained. This wouldn't normally be expected to happen, but there could be cases where URLs are poorly designed and contain data that changes on every page load.

View Source
const DefaultTimeout = time.Second * 10

DefaultTimeout is the default timeout used by the http client if no other timeout is specified.

Variables

This section is empty.

Functions

func ValidateHosts

func ValidateHosts(root, link *url.URL) bool

ValidateHosts provides a default domain validation function that compares the host components of the provided URLs.

Types

type Config

type Config struct {
	MaxConcurrency  int
	MaxPendingURLS  int
	CrawlTimeout    time.Duration
	KeepAlive       time.Duration
	Timeout         time.Duration
	Client          *http.Client
	Logger          *zap.Logger
	DomainValidator DomainValidator
}

Config is a struct of crawler configuration options.

func NewConfig

func NewConfig(options ...Option) *Config

NewConfig creates a Config from the specified options and provides defaults for any options that are not specified.

func (*Config) Validate

func (config *Config) Validate() error

Validate checks the configuration options for validation issues.
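
As an illustration, a Config can be assembled from options and validated before use. This is a minimal sketch using only the documented options; the specific values (16 workers, a 5 second request timeout, a 2 minute crawl limit) are arbitrary.

package main

import (
	"log"
	"time"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	// Build a Config from options; unspecified options keep their defaults.
	cfg := sitemapper.NewConfig(
		sitemapper.SetMaxConcurrency(16),          // more workers than the default 8
		sitemapper.SetTimeout(5*time.Second),      // per-request timeout
		sitemapper.SetCrawlTimeout(2*time.Minute), // stop crawling after two minutes
	)

	// Validate reports configuration problems before the crawl starts.
	if err := cfg.Validate(); err != nil {
		log.Fatalf("invalid config: %s", err)
	}
}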

type DomainCrawler

type DomainCrawler struct {
	// contains filtered or unexported fields
}

DomainCrawler contains the state of a domain web crawler. The domain crawler exposes a Crawl method which produces a site map.

func NewDomainCrawler

func NewDomainCrawler(root *url.URL, config *Config) (*DomainCrawler, error)

NewDomainCrawler creates a new DomainCrawler from the root URL and the given configuration.

func (*DomainCrawler) Crawl

func (crawler *DomainCrawler) Crawl() (*SiteMap, error)

Crawl reads all links in the domain with the specified concurrency and returns a site map. Note that Crawl is not thread safe and each caller must create a separate DomainCrawler.
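
A minimal sketch of driving a crawl through DomainCrawler directly; the root URL is a placeholder and the default configuration is used.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	root, err := url.Parse("https://example.com") // placeholder root URL
	if err != nil {
		log.Fatalf("bad url: %s", err)
	}

	crawler, err := sitemapper.NewDomainCrawler(root, sitemapper.NewConfig())
	if err != nil {
		log.Fatalf("crawler: %s", err)
	}

	// Crawl is not thread safe, so this crawler is used by a single caller.
	siteMap, err := crawler.Crawl()
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}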

type DomainValidator

type DomainValidator interface {
	Validate(root *url.URL, link *url.URL) bool
}

A DomainValidator provides a Validate function for comparing two URLs for same-domain inclusion. This allows for custom behavior such as checking the scheme (http vs https) or performing DNS lookups.

type DomainValidatorFunc

type DomainValidatorFunc func(root, link *url.URL) bool

DomainValidatorFunc acts as an adapter for allowing the use of ordinary functions as domain validators.

func (DomainValidatorFunc) Validate

func (v DomainValidatorFunc) Validate(root, link *url.URL) bool

Validate calls v(root, link).

type LinkReader

type LinkReader struct {
	// contains filtered or unexported fields
}

LinkReader is an iterative structure that allows for reading all href attributes in a given URL. The link reader makes the HTTP request to the specified URL and allows for reading through all of the links in the returned page. When there are no more links in the page, Read returns io.EOF. The consumer is responsible for closing the LinkReader when done to ensure any outstanding client HTTP requests are cleaned up.

func NewLinkReader

func NewLinkReader(pageURL *url.URL, client *http.Client) *LinkReader

NewLinkReader returns a LinkReader for the specified URL, fetching the content with the specified client.

func (*LinkReader) Close

func (u *LinkReader) Close() error

Close cleans up any remaining client response. If all links are read from the link reader, the body is closed automatically; however, if only the first N links are required, the body must be closed by the caller.

func (*LinkReader) Read

func (u *LinkReader) Read() (string, error)

Read returns the next href in the HTML document.

func (*LinkReader) URL

func (u *LinkReader) URL() string

URL returns the read-only URL string that was used to make the client request.
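
A minimal sketch of iterating over a page's links with LinkReader, using the default HTTP client; the page URL is a placeholder.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	pageURL, err := url.Parse("https://example.com") // placeholder page URL
	if err != nil {
		log.Fatalf("bad url: %s", err)
	}

	r := sitemapper.NewLinkReader(pageURL, http.DefaultClient)
	defer r.Close() // required if reading stops before the last link

	for {
		href, err := r.Read()
		if err == io.EOF {
			break // no more links in the page
		}
		if err != nil {
			log.Fatalf("read %s: %s", r.URL(), err)
		}
		fmt.Println(href)
	}
}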

type Option

type Option interface {
	// contains filtered or unexported methods
}

Option is used to set optional configuration values.

func SetClient

func SetClient(client *http.Client) Option

SetClient overrides the default client config. Note that if a custom client is set, the KeepAlive and Timeout options have no effect; the keep-alive and timeout settings of the supplied client take precedence.
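
For example, a crawl can be run with a custom client. This sketch only sets the client's request timeout, and the root URL is a placeholder.

package main

import (
	"log"
	"net/http"
	"os"
	"time"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	// Because a custom client is supplied, its own timeout applies and
	// SetKeepAlive/SetTimeout would have no effect.
	client := &http.Client{Timeout: 5 * time.Second}

	siteMap, err := sitemapper.CrawlDomain(
		"https://example.com", // placeholder root URL
		sitemapper.SetClient(client),
	)
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}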

func SetCrawlTimeout

func SetCrawlTimeout(crawlTimeout time.Duration) Option

SetCrawlTimeout sets the maximum time spent crawling URLs. When the timeout is zero or negative, no timeout is applied and the caller will wait for completion. If the timeout fires, the caller will receive the partial site map.

func SetDomainValidator

func SetDomainValidator(validator DomainValidator) Option

SetDomainValidator overrides the default domain validator. The default validator compares only the host component of the URLs; it does not check the scheme or perform DNS lookups.
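
As an illustration, a stricter validator that also compares schemes can be supplied through DomainValidatorFunc; sameOrigin and the root URL below are hypothetical.

package main

import (
	"log"
	"net/url"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

// sameOrigin is stricter than the default host-only check: it also
// requires the scheme to match.
func sameOrigin(root, link *url.URL) bool {
	return root.Scheme == link.Scheme && root.Host == link.Host
}

func main() {
	siteMap, err := sitemapper.CrawlDomain(
		"https://example.com", // placeholder root URL
		sitemapper.SetDomainValidator(sitemapper.DomainValidatorFunc(sameOrigin)),
	)
	if err != nil {
		log.Fatalf("crawl: %s", err)
	}
	siteMap.WriteMap(os.Stdout)
}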

func SetKeepAlive

func SetKeepAlive(keepAlive time.Duration) Option

SetKeepAlive sets the http client connection keep alive timeout when the default http client is used.

func SetLogger

func SetLogger(logger *zap.Logger) Option

SetLogger overrides the default logger. The default logger is configured to write warning and error logs to stderr.

func SetMaxConcurrency

func SetMaxConcurrency(maxConcurrency int) Option

SetMaxConcurrency sets the number of goroutines that will be used. This is also used to configure the default http client with enough open connections to support this number of goroutines.

func SetMaxPendingURLS

func SetMaxPendingURLS(maxPendingURLS int) Option

SetMaxPendingURLS sets the maximum number of URLs that can persist in the queue for crawling. This sets the size of the channel of URLs being processed by the goroutines and helps prevent cases where the number of URLs grows indefinitely due to dynamic URLs in page links.

func SetTimeout

func SetTimeout(timeout time.Duration) Option

SetTimeout sets the http client request timeout when the default http client is used.

type SiteMap

type SiteMap struct {
	// contains filtered or unexported fields
}

SiteMap contains the state of a site map.

func CrawlDomain

func CrawlDomain(rootURL string, opts ...Option) (*SiteMap, error)

CrawlDomain crawls a domain provided as a string URL. It wraps a call to CrawlDomainWithURL.

func CrawlDomainWithURL

func CrawlDomainWithURL(root *url.URL, opts ...Option) (*SiteMap, error)

CrawlDomainWithURL crawls a domain provided as a URL and returns the resulting sitemap.

func NewSiteMap

func NewSiteMap(url *url.URL, validator DomainValidator) *SiteMap

NewSiteMap initializes a new SiteMap anchored at the specified URL, using the given DomainValidator to decide which links belong to the domain.

func (*SiteMap) WriteMap

func (s *SiteMap) WriteMap(out io.Writer)

WriteMap writes the ordered site map to a given writer.

Directories

Path Synopsis
cmd
test
