crawler

package module
v1.11.2
Published: Nov 13, 2021 License: MIT Imports: 6 Imported by: 0

README

go-crawler

A library that crawls all relative links reachable from the specified start links.

Features

  • crawling of all relative links for the specified start links:
    • the tag and attribute names used to find links are configurable;
    • optional support for an external transformer of the extracted links:
      • data passed to the transformer:
        • the extracted links;
        • service data of the HTTP response;
        • the content of the HTTP response as bytes;
      • built-in transformers:
        • trimming of leading and trailing spaces in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • the tag and attribute names are configurable (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • support for grouping transformers:
        • the transformers are applied sequentially, so one transformer can influence another;
    • optional trimming of leading and trailing spaces in the extracted links:
      • as a transformer of the extracted links (see above);
      • as a wrapper around a link extractor;
    • optional retrying of link extraction on error:
      • at most the specified repeat count;
      • support for a delay between repeats;
    • optional delaying of link extraction:
      • the delay is reduced by the time elapsed since the last request;
      • each thread uses its own individual delay;
    • optional extraction of links from a sitemap.xml file:
      • in-memory caching of the loaded sitemap.xml files;
      • errors on loading a sitemap.xml file are tolerated:
        • the received error is logged;
        • an empty Sitemap is returned instead;
      • support for several sitemap.xml files per link:
        • each sitemap.xml file is processed in a separate goroutine;
        • support for an external generator of the sitemap.xml links:
          • built-in generators:
            • hierarchical generator:
              • returns the suitable sitemap.xml file for each part of the URL path;
              • support for sanitizing the base link before generating the sitemap.xml links;
              • support for restricting the maximal depth;
            • generator based on the robots.txt file;
          • support for grouping generators:
            • the result of group generation is the merged results of each generator in the group;
            • each generator is processed in a separate goroutine;
      • support for a Sitemap index file:
        • support for a delay before loading each sitemap.xml file listed in the index;
      • support for gzip-compressed sitemap.xml files;
    • support for grouping link extractors:
      • the result of group extraction is the merged results of each link extractor in the group;
      • each link extractor is processed in a separate goroutine;
  • calling an external handler for each extracted link (see the minimal sketch after this list):
    • the extracted links are handled directly during crawling, i.e., immediately after they have been extracted;
    • data passed to the handler:
      • the extracted link;
      • the source link of the extracted link;
    • optionally, only the extracted links accepted by a link filter are handled (see below);
    • optional concurrent handling of the extracted links, i.e., in a goroutine pool;
    • support for grouping handlers:
      • each handler is processed in a separate goroutine;
  • filtering of the extracted links by an external link filter:
    • by relativity of the extracted link (optional):
      • support for inverting the result;
    • by uniqueness of the extracted link (optional):
      • support for sanitizing the link before the uniqueness check;
    • by a robots.txt file (optional):
      • customizable user agent;
      • in-memory caching of the loaded robots.txt files;
    • support for grouping link filters:
      • the link filters are applied sequentially, so one link filter can influence another;
      • group filtering succeeds only when all link filters succeed;
      • an empty group of link filters always fails;
  • parallelization possibilities:
    • concurrent crawling of relative links, i.e., in a goroutine pool;
    • simulation of an unbounded channel of links to avoid deadlocks;
    • waiting for the processing of all extracted links to complete;
    • support for stopping all operations via the context.
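
As a quick orientation before the full examples below, here is a minimal sketch of such an external link handler. It assumes only what the examples in this README already show, namely the HandleLink method signature and the Link and SourceLink fields of models.SourcedLink; the PrintingHandler name itself is hypothetical.

package main

import (
	"context"
	"fmt"

	"github.com/thewizardplusplus/go-crawler/models"
)

// PrintingHandler is a minimal external link handler;
// the crawler calls HandleLink for each extracted link.
type PrintingHandler struct{}

func (handler PrintingHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	// link.Link is the extracted link; link.SourceLink is the page it was found on
	fmt.Printf("extracted link %q from page %q\n", link.Link, link.SourceLink)
}

func main() {
	PrintingHandler{}.HandleLink(context.Background(), models.SourcedLink{
		SourceLink: "http://example.com",
		Link:       "http://example.com/1",
	})
}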

Installation

Prepare the directory:

$ mkdir --parents "$(go env GOPATH)/src/github.com/thewizardplusplus/"
$ cd "$(go env GOPATH)/src/github.com/thewizardplusplus/"

Clone this repository:

$ git clone https://github.com/thewizardplusplus/go-crawler.git
$ cd go-crawler

Install dependencies with the dep tool:

$ dep ensure -vendor-only
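
The steps above follow the GOPATH and dep workflow that the project documents. On a modules-aware Go toolchain it should also be possible to fetch the package directly (a hedged alternative; the original instructions only cover dep):

$ go get github.com/thewizardplusplus/go-crawler@v1.11.2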

Examples

crawler.Crawl() with all the features:

package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"
	"time"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	"github.com/thewizardplusplus/go-crawler/registers/sitemap"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf( // nolint: errcheck
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":
			// render an empty Sitemap to avoid error logging
			// for reproducibility of the example
			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close() // nolint: errcheck

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	template, _ := template.New("").Parse(text) // nolint: errcheck
	template.Execute(writer, data)              // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	robotsTXTRegister := registers.NewRobotsTXTRegister(http.DefaultClient)
	crawler.CrawlByConcurrentHandler(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.NewDelayingExtractor(
				time.Second,
				time.Sleep,
				extractors.ExtractorGroup{
					Name: "main extractors",
					LinkExtractors: []models.LinkExtractor{
						extractors.RepeatingExtractor{
							LinkExtractor: extractors.DefaultExtractor{
								HTTPClient: http.DefaultClient,
								Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
									"a": {"href"},
								}),
								LinkTransformer: transformers.TransformerGroup{
									transformers.TrimmingTransformer{
										TrimLink: urlutils.TrimLink,
									},
									transformers.ResolvingTransformer{
										BaseTagSelection: transformers.SelectFirstBaseTag,
										BaseTagFilters:   transformers.DefaultBaseTagFilters,
										BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
										Logger:           wrappedLogger,
									},
								},
							},
							RepeatCount:  5,
							RepeatDelay:  time.Second,
							Logger:       wrappedLogger,
							SleepHandler: time.Sleep,
						},
						extractors.RepeatingExtractor{
							LinkExtractor: extractors.TrimmingExtractor{
								TrimLink: urlutils.TrimLink,
								LinkExtractor: extractors.SitemapExtractor{
									SitemapRegister: registers.NewSitemapRegister(
										time.Second,
										extractors.ExtractorGroup{
											Name: "extractors of Sitemap links",
											LinkExtractors: []models.LinkExtractor{
												sitemap.HierarchicalGenerator{
													SanitizeLink: urlutils.SanitizeLink,
													MaximalDepth: -1,
												},
												sitemap.RobotsTXTGenerator{
													RobotsTXTRegister: robotsTXTRegister,
												},
											},
											Logger: wrappedLogger,
										},
										wrappedLogger,
										sitemap.Loader{HTTPClient: http.DefaultClient}.LoadLink,
									),
									Logger: wrappedLogger,
								},
							},
							RepeatCount:  5,
							RepeatDelay:  time.Second,
							Logger:       wrappedLogger,
							SleepHandler: time.Sleep,
						},
					},
					Logger: wrappedLogger,
				},
			),
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					Logger: wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				checkers.RobotsTXTChecker{
					UserAgent:         "go-crawler",
					RobotsTXTRegister: robotsTXTRegister,
					Logger:            wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't reuse the link register from the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: handlers.HandlerGroup{
					handlers.CheckedHandler{
						LinkChecker: checkers.HostChecker{
							ComparisonResult: urlutils.Same,
							Logger:           wrappedLogger,
						},
						LinkHandler: LinkHandler{
							Name:      "inner",
							ServerURL: server.URL,
						},
					},
					handlers.CheckedHandler{
						LinkChecker: checkers.HostChecker{
							ComparisonResult: urlutils.Different,
							Logger:           wrappedLogger,
						},
						LinkHandler: LinkHandler{
							Name:      "outer",
							ServerURL: server.URL,
						},
					},
				},
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// [inner] received link "http://example.com/1" from page "http://example.com"
	// [inner] received link "http://example.com/1/1" from page "http://example.com/1"
	// [inner] received link "http://example.com/1/2" from page "http://example.com/1"
	// [inner] received link "http://example.com/2" from page "http://example.com"
	// [inner] received link "http://example.com/hidden/1" from page "http://example.com"
	// [inner] received link "http://example.com/hidden/1/test" from page "http://example.com/hidden/1"
	// [inner] received link "http://example.com/hidden/2" from page "http://example.com"
	// [inner] received link "http://example.com/hidden/3" from page "http://example.com"
	// [inner] received link "http://example.com/hidden/4" from page "http://example.com"
	// [inner] received link "http://example.com/hidden/5" from page "http://example.com/hidden/1/test"
	// [inner] received link "http://example.com/hidden/6" from page "http://example.com/hidden/1/test"
	// [outer] received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl():

package main

import (
	"context"
	"fmt"
	"html/template"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"received link %q from page %q\n",
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		var links []string
		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		}

		template, _ := template.New("").Parse( // nolint: errcheck
			`<ul>
				{{ range $link := . }}
					<li><a href="{{ $link }}">{{ $link }}</a></li>
				{{ end }}
			</ul>`,
		)
		template.Execute(writer, links) // nolint: errcheck
	}))
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.HostChecker{
				ComparisonResult: urlutils.Same,
				Logger:           wrappedLogger,
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// received link "http://example.com/1" from page "http://example.com"
	// received link "http://example.com/1/1" from page "http://example.com/1"
	// received link "http://example.com/1/2" from page "http://example.com/1"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2/1" from page "http://example.com/2"
	// received link "http://example.com/2/1" from page "http://example.com/2"
	// received link "http://example.com/2/2" from page "http://example.com/2"
	// received link "http://example.com/2/2" from page "http://example.com/2"
	// received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl() without duplicates during extraction:

package main

import (
	"context"
	"fmt"
	"html/template"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"received link %q from page %q\n",
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		var links []string
		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		}

		template, _ := template.New("").Parse( // nolint: errcheck
			`<ul>
				{{ range $link := . }}
					<li><a href="{{ $link }}">{{ $link }}</a></li>
				{{ end }}
			</ul>`,
		)
		template.Execute(writer, links) // nolint: errcheck
	}))
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// received link "http://example.com/1" from page "http://example.com"
	// received link "http://example.com/1/1" from page "http://example.com/1"
	// received link "http://example.com/1/2" from page "http://example.com/1"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2/1" from page "http://example.com/2"
	// received link "http://example.com/2/2" from page "http://example.com/2"
	// received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl() without duplicates during handling:

package main

import (
	"context"
	"fmt"
	"html/template"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"received link %q from page %q\n",
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		var links []string
		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		}

		template, _ := template.New("").Parse( // nolint: errcheck
			`<ul>
				{{ range $link := . }}
					<li><a href="{{ $link }}">{{ $link }}</a></li>
				{{ end }}
			</ul>`,
		)
		template.Execute(writer, links) // nolint: errcheck
	}))
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't reuse the link register from the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: LinkHandler{
					ServerURL: server.URL,
				},
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// received link "http://example.com/1" from page "http://example.com"
	// received link "http://example.com/1/1" from page "http://example.com/1"
	// received link "http://example.com/1/2" from page "http://example.com/1"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2/1" from page "http://example.com/2"
	// received link "http://example.com/2/2" from page "http://example.com/2"
	// received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl() with processing of a robots.txt file:

package main

import (
	"context"
	"fmt"
	"html/template"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"received link %q from page %q\n",
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			fmt.Fprint( // nolint: errcheck
				writer,
				`
					User-agent: go-crawler
					Disallow: /2
				`,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		}

		template, _ := template.New("").Parse( // nolint: errcheck
			`<ul>
				{{ range $link := . }}
					<li><a href="{{ $link }}">{{ $link }}</a></li>
				{{ end }}
			</ul>`,
		)
		template.Execute(writer, links) // nolint: errcheck
	}))
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.RobotsTXTChecker{
					UserAgent:         "go-crawler",
					RobotsTXTRegister: registers.NewRobotsTXTRegister(http.DefaultClient),
					Logger:            wrappedLogger,
				},
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// received link "http://example.com/1" from page "http://example.com"
	// received link "http://example.com/1/1" from page "http://example.com/1"
	// received link "http://example.com/1/2" from page "http://example.com/1"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl() with processing of a sitemap.xml file:

package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"
	"time"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	"github.com/thewizardplusplus/go-crawler/registers/sitemap"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"received link %q from page %q\n",
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf( // nolint: errcheck
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":
			// render an empty Sitemap to avoid error logging
			// for reproducibility of the example
			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close() // nolint: errcheck

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	template, _ := template.New("").Parse(text) // nolint: errcheck
	template.Execute(writer, data)              // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.ExtractorGroup{
				Name: "main extractors",
				LinkExtractors: []models.LinkExtractor{
					extractors.DefaultExtractor{
						HTTPClient: http.DefaultClient,
						Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
							"a": {"href"},
						}),
						LinkTransformer: transformers.ResolvingTransformer{
							BaseTagSelection: transformers.SelectFirstBaseTag,
							BaseTagFilters:   transformers.DefaultBaseTagFilters,
							BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
							Logger:           wrappedLogger,
						},
					},
					extractors.SitemapExtractor{
						SitemapRegister: registers.NewSitemapRegister(
							time.Second,
							extractors.ExtractorGroup{
								Name: "extractors of Sitemap links",
								LinkExtractors: []models.LinkExtractor{
									sitemap.HierarchicalGenerator{
										SanitizeLink: urlutils.SanitizeLink,
										MaximalDepth: -1,
									},
									sitemap.RobotsTXTGenerator{
										RobotsTXTRegister: registers.NewRobotsTXTRegister(
											http.DefaultClient,
										),
									},
								},
								Logger: wrappedLogger,
							},
							wrappedLogger,
							sitemap.Loader{HTTPClient: http.DefaultClient}.LoadLink,
						),
						Logger: wrappedLogger,
					},
				},
				Logger: wrappedLogger,
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't reuse the link register from the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: LinkHandler{
					ServerURL: server.URL,
				},
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// received link "http://example.com/1" from page "http://example.com"
	// received link "http://example.com/1/1" from page "http://example.com/1"
	// received link "http://example.com/1/2" from page "http://example.com/1"
	// received link "http://example.com/2" from page "http://example.com"
	// received link "http://example.com/2/1" from page "http://example.com/2"
	// received link "http://example.com/2/2" from page "http://example.com/2"
	// received link "http://example.com/hidden/1" from page "http://example.com"
	// received link "http://example.com/hidden/1/test" from page "http://example.com/hidden/1"
	// received link "http://example.com/hidden/2" from page "http://example.com"
	// received link "http://example.com/hidden/3" from page "http://example.com"
	// received link "http://example.com/hidden/4" from page "http://example.com"
	// received link "http://example.com/hidden/5" from page "http://example.com/hidden/1/test"
	// received link "http://example.com/hidden/6" from page "http://example.com/hidden/1/test"
	// received link "https://golang.org/" from page "http://example.com"
}

crawler.Crawl() with a few handlers:

package main

import (
	"context"
	"fmt"
	"html/template"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"
	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	fmt.Printf(
		"[%s] received link %q from page %q\n",
		handler.Name,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		var links []string
		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		}

		template, _ := template.New("").Parse( // nolint: errcheck
			`<ul>
				{{ range $link := . }}
					<li><a href="{{ $link }}">{{ $link }}</a></li>
				{{ end }}
			</ul>`,
		)
		template.Execute(writer, links) // nolint: errcheck
	}))
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.HostChecker{
				ComparisonResult: urlutils.Same,
				Logger:           wrappedLogger,
			},
			LinkHandler: handlers.HandlerGroup{
				handlers.CheckedHandler{
					LinkChecker: checkers.HostChecker{
						ComparisonResult: urlutils.Same,
						Logger:           wrappedLogger,
					},
					LinkHandler: LinkHandler{
						Name:      "inner",
						ServerURL: server.URL,
					},
				},
				handlers.CheckedHandler{
					LinkChecker: checkers.HostChecker{
						ComparisonResult: urlutils.Different,
						Logger:           wrappedLogger,
					},
					LinkHandler: LinkHandler{
						Name:      "outer",
						ServerURL: server.URL,
					},
				},
			},
			Logger: wrappedLogger,
		},
	)

	// Unordered output:
	// [inner] received link "http://example.com/1" from page "http://example.com"
	// [inner] received link "http://example.com/1/1" from page "http://example.com/1"
	// [inner] received link "http://example.com/1/2" from page "http://example.com/1"
	// [inner] received link "http://example.com/2" from page "http://example.com"
	// [inner] received link "http://example.com/2" from page "http://example.com"
	// [inner] received link "http://example.com/2/1" from page "http://example.com/2"
	// [inner] received link "http://example.com/2/1" from page "http://example.com/2"
	// [inner] received link "http://example.com/2/2" from page "http://example.com/2"
	// [inner] received link "http://example.com/2/2" from page "http://example.com/2"
	// [outer] received link "https://golang.org/" from page "http://example.com"
}

License

The MIT License (MIT)

Copyright © 2020-2021 thewizardplusplus

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Crawl

func Crawl(
	ctx context.Context,
	concurrencyConfig ConcurrencyConfig,
	links []string,
	dependencies CrawlDependencies,
)

Crawl ...

Example
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"

	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":
			// render an empty Sitemap to avoid error logging
			// for reproducibility of the example
			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	template, _ := template.New("").Parse(text)
	template.Execute(writer, data)
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.HostChecker{
				ComparisonResult: urlutils.Same,
				Logger:           wrappedLogger,
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)

}
Output:

received link "http://example.com/1" from page "http://example.com"
received link "http://example.com/1/1" from page "http://example.com/1"
received link "http://example.com/1/2" from page "http://example.com/1"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2/1" from page "http://example.com/2"
received link "http://example.com/2/1" from page "http://example.com/2"
received link "http://example.com/2/2" from page "http://example.com/2"
received link "http://example.com/2/2" from page "http://example.com/2"
received link "https://golang.org/" from page "http://example.com"
Example (WithAllFeatures)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"

	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"
	"time"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	"github.com/thewizardplusplus/go-crawler/registers/sitemap"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":
			// render an empty Sitemap to avoid error logging
			// for reproducibility of the example
			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	template, _ := template.New("").Parse(text)
	template.Execute(writer, data)
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	robotsTXTRegister := registers.NewRobotsTXTRegister(http.DefaultClient)
	crawler.CrawlByConcurrentHandler(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.NewDelayingExtractor(
				time.Second,
				time.Sleep,
				extractors.ExtractorGroup{
					Name: "main extractors",
					LinkExtractors: []models.LinkExtractor{
						extractors.RepeatingExtractor{
							LinkExtractor: extractors.DefaultExtractor{
								HTTPClient: http.DefaultClient,
								Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
									"a": {"href"},
								}),
								LinkTransformer: transformers.TransformerGroup{
									transformers.TrimmingTransformer{
										TrimLink: urlutils.TrimLink,
									},
									transformers.ResolvingTransformer{
										BaseTagSelection: transformers.SelectFirstBaseTag,
										BaseTagFilters:   transformers.DefaultBaseTagFilters,
										BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
										Logger:           wrappedLogger,
									},
								},
							},
							RepeatCount:  5,
							RepeatDelay:  time.Second,
							Logger:       wrappedLogger,
							SleepHandler: time.Sleep,
						},
						extractors.RepeatingExtractor{
							LinkExtractor: extractors.TrimmingExtractor{
								TrimLink: urlutils.TrimLink,
								LinkExtractor: extractors.SitemapExtractor{
									SitemapRegister: registers.NewSitemapRegister(
										time.Second,
										extractors.ExtractorGroup{
											Name: "extractors of Sitemap links",
											LinkExtractors: []models.LinkExtractor{
												sitemap.HierarchicalGenerator{
													SanitizeLink: urlutils.SanitizeLink,
													MaximalDepth: -1,
												},
												sitemap.RobotsTXTGenerator{
													RobotsTXTRegister: robotsTXTRegister,
												},
											},
											Logger: wrappedLogger,
										},
										wrappedLogger,
										sitemap.Loader{HTTPClient: http.DefaultClient}.LoadLink,
									),
									Logger: wrappedLogger,
								},
							},
							RepeatCount:  5,
							RepeatDelay:  time.Second,
							Logger:       wrappedLogger,
							SleepHandler: time.Sleep,
						},
					},
					Logger: wrappedLogger,
				},
			),
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					Logger: wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				checkers.RobotsTXTChecker{
					UserAgent:         "go-crawler",
					RobotsTXTRegister: robotsTXTRegister,
					Logger:            wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't reuse the link register from the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: handlers.HandlerGroup{
					handlers.CheckedHandler{
						LinkChecker: checkers.HostChecker{
							ComparisonResult: urlutils.Same,
							Logger:           wrappedLogger,
						},
						LinkHandler: LinkHandler{
							Name:      "inner",
							ServerURL: server.URL,
						},
					},
					handlers.CheckedHandler{
						LinkChecker: checkers.HostChecker{
							ComparisonResult: urlutils.Different,
							Logger:           wrappedLogger,
						},
						LinkHandler: LinkHandler{
							Name:      "outer",
							ServerURL: server.URL,
						},
					},
				},
			},
			Logger: wrappedLogger,
		},
	)

}
Output:

[inner] received link "http://example.com/1" from page "http://example.com"
[inner] received link "http://example.com/1/1" from page "http://example.com/1"
[inner] received link "http://example.com/1/2" from page "http://example.com/1"
[inner] received link "http://example.com/2" from page "http://example.com"
[inner] received link "http://example.com/hidden/1" from page "http://example.com"
[inner] received link "http://example.com/hidden/1/test" from page "http://example.com/hidden/1"
[inner] received link "http://example.com/hidden/2" from page "http://example.com"
[inner] received link "http://example.com/hidden/3" from page "http://example.com"
[inner] received link "http://example.com/hidden/4" from page "http://example.com"
[inner] received link "http://example.com/hidden/5" from page "http://example.com/hidden/1/test"
[inner] received link "http://example.com/hidden/6" from page "http://example.com/hidden/1/test"
[outer] received link "https://golang.org/" from page "http://example.com"
Example (WithFewHandlers)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"

	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"

	"github.com/go-log/log/print"
	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":
			// render an empty Sitemap to avoid error logging
			// for reproducibility of the example
			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	// the template text is static and valid, so Must won't panic;
	// the Execute error is ignored for brevity of the example
	tmpl := template.Must(template.New("").Parse(text))
	tmpl.Execute(writer, data) // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.HostChecker{
				ComparisonResult: urlutils.Same,
				Logger:           wrappedLogger,
			},
			LinkHandler: handlers.HandlerGroup{
				handlers.CheckedHandler{
					LinkChecker: checkers.HostChecker{
						ComparisonResult: urlutils.Same,
						Logger:           wrappedLogger,
					},
					LinkHandler: LinkHandler{
						Name:      "inner",
						ServerURL: server.URL,
					},
				},
				handlers.CheckedHandler{
					LinkChecker: checkers.HostChecker{
						ComparisonResult: urlutils.Different,
						Logger:           wrappedLogger,
					},
					LinkHandler: LinkHandler{
						Name:      "outer",
						ServerURL: server.URL,
					},
				},
			},
			Logger: wrappedLogger,
		},
	)
}
Output:

[inner] received link "http://example.com/1" from page "http://example.com"
[inner] received link "http://example.com/1/1" from page "http://example.com/1"
[inner] received link "http://example.com/1/2" from page "http://example.com/1"
[inner] received link "http://example.com/2" from page "http://example.com"
[inner] received link "http://example.com/2" from page "http://example.com"
[inner] received link "http://example.com/2/1" from page "http://example.com/2"
[inner] received link "http://example.com/2/1" from page "http://example.com/2"
[inner] received link "http://example.com/2/2" from page "http://example.com/2"
[inner] received link "http://example.com/2/2" from page "http://example.com/2"
[outer] received link "https://golang.org/" from page "http://example.com"
Example (WithRobotsTXT)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"

	"github.com/go-log/log/print"

	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":

			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	// the template text is static and valid, so Must won't panic;
	// the Execute error is ignored for brevity of the example
	tmpl := template.Must(template.New("").Parse(text))
	tmpl.Execute(writer, data) // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.RobotsTXTChecker{
					UserAgent:         "go-crawler",
					RobotsTXTRegister: registers.NewRobotsTXTRegister(http.DefaultClient),
					Logger:            wrappedLogger,
				},
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)
}
Output:

received link "http://example.com/1" from page "http://example.com"
received link "http://example.com/1/1" from page "http://example.com/1"
received link "http://example.com/1/2" from page "http://example.com/1"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2" from page "http://example.com"
received link "https://golang.org/" from page "http://example.com"
Example (WithSitemap)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"
	"time"

	"github.com/go-log/log/print"

	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"
	"github.com/thewizardplusplus/go-crawler/registers/sitemap"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":

			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	// the template text is static and valid, so Must won't panic;
	// the Execute error is ignored for brevity of the example
	tmpl := template.Must(template.New("").Parse(text))
	tmpl.Execute(writer, data) // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.ExtractorGroup{
				Name: "main extractors",
				LinkExtractors: []models.LinkExtractor{
					extractors.DefaultExtractor{
						HTTPClient: http.DefaultClient,
						Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
							"a": {"href"},
						}),
						LinkTransformer: transformers.ResolvingTransformer{
							BaseTagSelection: transformers.SelectFirstBaseTag,
							BaseTagFilters:   transformers.DefaultBaseTagFilters,
							BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
							Logger:           wrappedLogger,
						},
					},
					extractors.SitemapExtractor{
						SitemapRegister: registers.NewSitemapRegister(
							time.Second,
							extractors.ExtractorGroup{
								Name: "extractors of Sitemap links",
								LinkExtractors: []models.LinkExtractor{
									sitemap.HierarchicalGenerator{
										SanitizeLink: urlutils.SanitizeLink,
										MaximalDepth: -1,
									},
									sitemap.RobotsTXTGenerator{
										RobotsTXTRegister: registers.NewRobotsTXTRegister(
											http.DefaultClient,
										),
									},
								},
								Logger: wrappedLogger,
							},
							wrappedLogger,
							sitemap.Loader{HTTPClient: http.DefaultClient}.LoadLink,
						),
						Logger: wrappedLogger,
					},
				},
				Logger: wrappedLogger,
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't share the link register with the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: LinkHandler{
					ServerURL: server.URL,
				},
			},
			Logger: wrappedLogger,
		},
	)
}
Output:

received link "http://example.com/1" from page "http://example.com"
received link "http://example.com/1/1" from page "http://example.com/1"
received link "http://example.com/1/2" from page "http://example.com/1"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2/1" from page "http://example.com/2"
received link "http://example.com/2/2" from page "http://example.com/2"
received link "http://example.com/hidden/1" from page "http://example.com"
received link "http://example.com/hidden/1/test" from page "http://example.com/hidden/1"
received link "http://example.com/hidden/2" from page "http://example.com"
received link "http://example.com/hidden/3" from page "http://example.com"
received link "http://example.com/hidden/4" from page "http://example.com"
received link "http://example.com/hidden/5" from page "http://example.com/hidden/1/test"
received link "http://example.com/hidden/6" from page "http://example.com/hidden/1/test"
received link "https://golang.org/" from page "http://example.com"
Example (WithoutDuplicatesOnExtracting)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"

	"github.com/go-log/log/print"

	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":

			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	// the template text is static and valid, so Must won't panic;
	// the Execute error is ignored for brevity of the example
	tmpl := template.Must(template.New("").Parse(text))
	tmpl.Execute(writer, data) // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: LinkHandler{
				ServerURL: server.URL,
			},
			Logger: wrappedLogger,
		},
	)
}
Output:

received link "http://example.com/1" from page "http://example.com"
received link "http://example.com/1/1" from page "http://example.com/1"
received link "http://example.com/1/2" from page "http://example.com/1"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2/1" from page "http://example.com/2"
received link "http://example.com/2/2" from page "http://example.com/2"
received link "https://golang.org/" from page "http://example.com"
Example (WithoutDuplicatesOnHandling)
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"html/template"
	"io"
	stdlog "log"
	"net/http"
	"net/http/httptest"
	"os"
	"path"
	"runtime"
	"strings"

	"github.com/go-log/log/print"

	crawler "github.com/thewizardplusplus/go-crawler"
	"github.com/thewizardplusplus/go-crawler/checkers"
	"github.com/thewizardplusplus/go-crawler/extractors"
	"github.com/thewizardplusplus/go-crawler/extractors/transformers"
	"github.com/thewizardplusplus/go-crawler/handlers"
	"github.com/thewizardplusplus/go-crawler/models"
	"github.com/thewizardplusplus/go-crawler/registers"

	urlutils "github.com/thewizardplusplus/go-crawler/url-utils"

	htmlselector "github.com/thewizardplusplus/go-html-selector"
)

type LinkHandler struct {
	Name      string
	ServerURL string
}

func (handler LinkHandler) HandleLink(
	ctx context.Context,
	link models.SourcedLink,
) {
	var prefix string
	if handler.Name != "" {
		prefix = fmt.Sprintf("[%s] ", handler.Name)
	}

	fmt.Printf(
		"%sreceived link %q from page %q\n",
		prefix,
		handler.replaceServerURL(link.Link),
		handler.replaceServerURL(link.SourceLink),
	)
}

// replace the test server URL for reproducibility of the example
func (handler LinkHandler) replaceServerURL(link string) string {
	return strings.Replace(link, handler.ServerURL, "http://example.com", -1)
}

// nolint: gocyclo
func RunServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(
		writer http.ResponseWriter,
		request *http.Request,
	) {
		if request.URL.Path == "/robots.txt" {
			sitemapLink :=
				completeLinkWithHost("/sitemap_from_robots_txt.xml", request.Host)
			fmt.Fprintf(
				writer,
				`
					User-agent: go-crawler
					Disallow: /2

					Sitemap: %s
				`,
				sitemapLink,
			)

			return
		}

		var links []string
		switch request.URL.Path {
		case "/sitemap.xml":
			links = []string{"/1", "/2", "/hidden/1", "/hidden/2"}
		case "/sitemap_from_robots_txt.xml":
			links = []string{"/hidden/3", "/hidden/4"}
		case "/hidden/1/sitemap.xml":
			links = []string{"/hidden/5", "/hidden/6"}
		case "/1/sitemap.xml", "/2/sitemap.xml", "/hidden/sitemap.xml":

			links = []string{}
		}
		for index := range links {
			links[index] = completeLinkWithHost(links[index], request.Host)
		}

		if links != nil {
			writer.Header().Set("Content-Encoding", "gzip")

			compressingWriter := gzip.NewWriter(writer)
			defer compressingWriter.Close()

			renderTemplate(compressingWriter, links, `
				<?xml version="1.0" encoding="UTF-8" ?>
				<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
					{{ range $link := . }}
						<url>
							<loc>{{ $link }}</loc>
						</url>
					{{ end }}
				</urlset>
			`)

			return
		}

		switch request.URL.Path {
		case "/":
			links = []string{"/1", "/2", "/2", "https://golang.org/"}
		case "/1":
			links = []string{"/1/1", "/1/2"}
		case "/2":
			links = []string{"/2/1", "/2/2"}
		case "/hidden/1":
			links = []string{"/hidden/1/test"}
		}

		renderTemplate(writer, links, `
			<ul>
				{{ range $link := . }}
					<li>
						<a href="{{ $link }}">{{ $link }}</a>
					</li>
				{{ end }}
			</ul>
		`)
	}))
}

func completeLinkWithHost(link string, host string) string {
	return "http://" + path.Join(host, link)
}

func renderTemplate(writer io.Writer, data interface{}, text string) {
	// the template text is static and valid, so Must won't panic;
	// the Execute error is ignored for brevity of the example
	tmpl := template.Must(template.New("").Parse(text))
	tmpl.Execute(writer, data) // nolint: errcheck
}

func main() {
	server := RunServer()
	defer server.Close()

	logger := stdlog.New(os.Stderr, "", stdlog.LstdFlags|stdlog.Lmicroseconds)
	// wrap the standard logger via the github.com/go-log/log package
	wrappedLogger := print.New(logger)

	crawler.Crawl(
		context.Background(),
		crawler.ConcurrencyConfig{
			ConcurrencyFactor: runtime.NumCPU(),
			BufferSize:        1000,
		},
		[]string{server.URL},
		crawler.CrawlDependencies{
			LinkExtractor: extractors.DefaultExtractor{
				HTTPClient: http.DefaultClient,
				Filters: htmlselector.OptimizeFilters(htmlselector.FilterGroup{
					"a": {"href"},
				}),
				LinkTransformer: transformers.ResolvingTransformer{
					BaseTagSelection: transformers.SelectFirstBaseTag,
					BaseTagFilters:   transformers.DefaultBaseTagFilters,
					BaseHeaderNames:  urlutils.DefaultBaseHeaderNames,
					Logger:           wrappedLogger,
				},
			},
			LinkChecker: checkers.CheckerGroup{
				checkers.HostChecker{
					ComparisonResult: urlutils.Same,
					Logger:           wrappedLogger,
				},
				checkers.DuplicateChecker{
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
			},
			LinkHandler: handlers.CheckedHandler{
				LinkChecker: checkers.DuplicateChecker{
					// don't share the link register with the duplicate checker above
					LinkRegister: registers.NewLinkRegister(urlutils.SanitizeLink),
					Logger:       wrappedLogger,
				},
				LinkHandler: LinkHandler{
					ServerURL: server.URL,
				},
			},
			Logger: wrappedLogger,
		},
	)
}
Output:

received link "http://example.com/1" from page "http://example.com"
received link "http://example.com/1/1" from page "http://example.com/1"
received link "http://example.com/1/2" from page "http://example.com/1"
received link "http://example.com/2" from page "http://example.com"
received link "http://example.com/2/1" from page "http://example.com/2"
received link "http://example.com/2/2" from page "http://example.com/2"
received link "https://golang.org/" from page "http://example.com"

func CrawlByConcurrentHandler added in v1.7.1

func CrawlByConcurrentHandler(
	ctx context.Context,
	concurrencyConfig ConcurrencyConfig,
	handlerConcurrencyConfig ConcurrencyConfig,
	links []string,
	dependencies CrawlDependencies,
)

CrawlByConcurrentHandler ...
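
Judging by the signature, this entry point mirrors Crawl but takes a separate ConcurrencyConfig for link handling. A minimal sketch of a call, not a definitive configuration; the dependencies value is assumed to be a crawler.CrawlDependencies built as in the examples above:

// a minimal sketch; "dependencies" is assumed to be a prebuilt
// crawler.CrawlDependencies value (see the examples above)
crawler.CrawlByConcurrentHandler(
	context.Background(),
	// the crawling concurrency config
	crawler.ConcurrencyConfig{
		ConcurrencyFactor: runtime.NumCPU(),
		BufferSize:        1000,
	},
	// a separate concurrency config, presumably for the link handler
	crawler.ConcurrencyConfig{
		ConcurrencyFactor: runtime.NumCPU(),
		BufferSize:        1000,
	},
	[]string{"http://example.com"},
	dependencies,
)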

func HandleLink

func HandleLink(
	ctx context.Context,
	threadID int,
	link string,
	dependencies HandleLinkDependencies,
) []string

HandleLink ...

func HandleLinks

func HandleLinks(
	ctx context.Context,
	threadID int,
	links chan string,
	dependencies HandleLinkDependencies,
)

HandleLinks ...

func HandleLinksConcurrently

func HandleLinksConcurrently(
	ctx context.Context,
	concurrencyFactor int,
	links chan string,
	dependencies HandleLinkDependencies,
)

HandleLinksConcurrently ...
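
A rough sketch of driving this function directly; here "dependencies" is assumed to be a prebuilt crawler.CrawlDependencies value, and "waiter" stands for any implementation of the syncutils.WaitGroup interface required by the HandleLinkDependencies type below:

// a rough sketch; the channel buffer and the concurrency factor play
// the roles of ConcurrencyConfig.BufferSize and .ConcurrencyFactor
links := make(chan string, 1000)
links <- "http://example.com"

crawler.HandleLinksConcurrently(
	context.Background(),
	runtime.NumCPU(),
	links,
	crawler.HandleLinkDependencies{
		CrawlDependencies: dependencies,
		Waiter:            waiter, // any syncutils.WaitGroup implementation
	},
)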

Types

type ConcurrencyConfig added in v1.7.1

type ConcurrencyConfig struct {
	ConcurrencyFactor int
	BufferSize        int
}

ConcurrencyConfig ...

type CrawlDependencies

type CrawlDependencies struct {
	LinkExtractor models.LinkExtractor
	LinkChecker   models.LinkChecker
	LinkHandler   models.LinkHandler
	Logger        log.Logger
}

CrawlDependencies ...

type HandleLinkDependencies

type HandleLinkDependencies struct {
	CrawlDependencies

	Waiter syncutils.WaitGroup
}

HandleLinkDependencies ...

type LinkChecker

type LinkChecker interface {
	models.LinkChecker
}

LinkChecker ...

It's used only for mock generation.

type LinkExtractor

type LinkExtractor interface {
	models.LinkExtractor
}

LinkExtractor ...

It's used only for mock generation.

type LinkHandler

type LinkHandler interface {
	models.LinkHandler
}

LinkHandler ...

It's used only for mock generation.

type Logger

type Logger interface {
	log.Logger
}

Logger ...

It's used only for mock generation.

type Waiter

type Waiter interface {
	syncutils.WaitGroup
}

Waiter ...

It's used only for mock generation.
