crawl

package module
Published: Nov 25, 2019 License: MIT Imports: 17 Imported by: 0

README

crawl


The crawler scrapes a page for links, follows them, and scrapes the linked pages in the same fashion.

You can launch the app with or without a timeout (in seconds), like this:

go run app/crawl.go (-timeout=10) https://bytema.re

However you launched the program, you can interrupt it with Ctrl+C.

Features

  • single-domain scope
  • parallel crawling
  • optional timeout
  • strips queries and fragments from URLs
  • avoids loops over already visited links
  • usable as a package through the FetchLinks(), StreamLinks() and ScrapLinks() functions
  • logs to file in JSON for log aggregation

Get the Crawler: Installation and update

It's as easy as it gets with Go:

go get -u github.com/bytemare/crawl

Usage and examples

The scraper and crawler functions are straightforward to use. The timeout parameter is optional; if you don't need a timeout, just set it to 0 (a sketch at the end of this section shows the no-timeout case).

Calling the crawler from your code

You can call the crawler from your own code with StreamLinks or FetchLinks.

StreamLinks returns a channel you can listen on for continuous results as they arrive

import "github.com/bytemare/crawl"

func myCrawler() {
	
	domain := "https://bytema.re"
	timeout := 10 * time.Second
	
	resultChan, err := crawl.StreamLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}

	for res := range resultChan {
		fmt.Printf("%s -> %s\n", res.URL, *res.Links)
	}
}

FetchLinks blocks, collects, explores, then returns all encountered links

import "github.com/bytemare/crawl"

func myCrawler() {

	domain := "https://bytema.re"
	timeout := 10 * time.Second

	links, err := crawl.FetchLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}
	
	fmt.Printf("Starting from %s, encountered following links :\n%s\n", domain, links)
}

If you simply want to scrape all links from a single web page, use the ScrapLinks function:

import "github.com/bytemare/crawl"

func myScraper() {

	domain := "https://bytema.re"
	timeout := 10 * time.Second

	links, err := crawl.ScrapLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}
	
	fmt.Printf("Found following links on %s :\n%s\n", domain, links)
}
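
As noted earlier, the timeout is optional. Below is a minimal sketch (reusing the same example domain) of disabling the deadline by passing 0, so the crawl stops only on link exhaustion or an interrupt signal:

import (
	"fmt"
	"log"

	"github.com/bytemare/crawl"
)

func myUnboundedCrawler() {

	// A timeout of 0 means no deadline: the crawl runs until the link
	// tree is exhausted or the process receives SIGINT/SIGTERM.
	results, err := crawl.FetchLinks("https://bytema.re", 0)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(results.Links())
}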

Supported Go versions

We support the last two major Go versions, which are 1.12 and 1.13 at the moment.

Contributing

Please feel free to submit issues, fork the repository, and send pull requests! Take a look at the contributing guidelines!

License

This project is licensed under the terms of the MIT license.

Documentation

Overview

Package crawl is a simple link scraper and web crawler with single domain scope. It can be limited with a timeout and interrupted with signals.

Three public functions give access to single-page link scraping (ScrapLinks) and single-host web crawling (FetchLinks and StreamLinks). FetchLinks and StreamLinks have the same behaviour and results, as FetchLinks is a wrapper around StreamLinks. The only difference is that FetchLinks blocks and returns once a stopping condition is reached (link tree exhaustion, timeout, signals), whereas StreamLinks returns immediately with a channel the calling function can listen on to get results as they come.

The return values can be used for a site map.

Some precautions have been taken to prevent infinite loops, such as stripping queries and fragments off URLs.

A sample program calling the package is given in the project repository.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ScrapLinks(url string, timeout time.Duration) ([]string, error)

ScrapLinks returns the links found in the web page pointed to by url

Types

type CrawlerResults

type CrawlerResults struct {
	// contains filtered or unexported fields
}

CrawlerResults is sent back to the caller, containing results and information about the crawling

func FetchLinks(domain string, timeout time.Duration) (*CrawlerResults, error)

FetchLinks is a wrapper around StreamLinks and does the same, except it blocks and accumulates all links before returning them to the caller.

func StreamLinks(domain string, timeout time.Duration) (*CrawlerResults, error)

StreamLinks returns a channel on which it will report links as they come during the crawling. The caller should range over that channel to continuously retrieve messages. StreamLinks will close that channel when all encountered links have been visited and none are left, when the deadline set by the timeout parameter is reached, or if a SIGINT or SIGTERM signal is received.

func (*CrawlerResults) ExitContext

func (cr *CrawlerResults) ExitContext() string

func (*CrawlerResults) Links

func (cr *CrawlerResults) Links() []string

func (*CrawlerResults) Stream

func (cr *CrawlerResults) Stream() <-chan *LinkMap
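
A sketch of how these accessors might fit together, assuming ExitContext reports why the crawl stopped (its exact content is not specified here): range over Stream() until the channel closes, then inspect the exit context.

import (
	"fmt"
	"log"
	"time"

	"github.com/bytemare/crawl"
)

func streamAndReport() {

	cr, err := crawl.StreamLinks("https://bytema.re", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}

	// The stream is closed on link exhaustion, timeout or signal.
	for lm := range cr.Stream() {
		fmt.Printf("visited %s\n", lm.URL)
	}

	// Assumption: ExitContext describes why the crawl stopped.
	fmt.Printf("crawl stopped: %s\n", cr.ExitContext())
}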

type LinkMap

type LinkMap struct {
	URL   string
	Links *[]string
	Error error
}

LinkMap holds the links found on the web page pointed to by URL, restricted to the same host as the URL
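
Since Links is a pointer and a page can fail individually, a consumer would presumably check Error before dereferencing it. A minimal sketch (the helper name is hypothetical):

import (
	"fmt"

	"github.com/bytemare/crawl"
)

// handleLinkMap prints the outcome for one crawled page.
func handleLinkMap(lm *crawl.LinkMap) {
	// Checking Error before dereferencing Links is an assumption about
	// how the two fields are meant to be used together.
	if lm.Error != nil {
		fmt.Printf("%s: %v\n", lm.URL, lm.Error)
		return
	}
	if lm.Links != nil {
		fmt.Printf("%s -> %d links\n", lm.URL, len(*lm.Links))
	}
}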

