crawl

package module
Published: Nov 25, 2019 License: MIT Imports: 17 Imported by: 0

README

crawl


The crawler scrapes a page for links, follows them, and scrapes the linked pages in the same fashion.

You can launch the app with or without a timeout (in seconds), like this:

go run app/crawl.go (-timeout=10) https://bytema.re

However you launched the program, you can interrupt it with Ctrl+C.

Features

  • single-domain scope
  • parallel crawling
  • optional timeout
  • strips queries and fragments from URLs
  • avoids loops over already visited links
  • usable as a package through the FetchLinks(), StreamLinks() and ScrapLinks() functions
  • logs to file in JSON for log aggregation

Get the Crawler: Installation and update

It's as easy as it gets with Go:

go get -u github.com/bytemare/crawl

Usage and examples

The scraper and crawler functions are straightforward to use. The timeout parameter is optional; if you don't need a timeout, just set it to 0 (a sketch at the end of this section shows the no-timeout case).

Calling the crawler from your code

You can call the crawler from your own code with StreamLinks or FetchLinks.

StreamLinks returns a channel you can listen on for continuous results as they arrive

import "github.com/bytemare/crawl"

func myCrawler() {
	
	domain := "https://bytema.re"
	timeout := 10 * time.Second
	
	resultChan, err := crawl.StreamLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}

	for res := range resultChan {
		fmt.Printf("%s -> %s\n", res.URL, *res.Links)
	}
}

FetchLinks blocks, collects, explores, then returns all encountered links

import "github.com/bytemare/crawl"

func myCrawler() {

	domain := "https://bytema.re"
	timeout := 10 * time.Second

	links, err := crawl.FetchLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}
	
	fmt.Printf("Starting from %s, encountered following links :\n%s\n", domain, links)
}

If you simply want to scrape all links from a single web page, use the ScrapLinks function:

import "github.com/bytemare/crawl"

func myScraper() {

	domain := "https://bytema.re"
	timeout := 10 * time.Second

	links, err := crawl.ScrapLinks(domain, timeout)
	if err != nil {
		fmt.Printf("Error : %s\n", err)
		os.Exit(1)
	}
	
	fmt.Printf("Found following links on %s :\n%s\n", domain, links)
}
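
As noted earlier, the timeout is optional. Below is a minimal sketch (reusing the same example domain) of disabling the deadline by passing 0, so the crawl stops only on link exhaustion or an interrupt signal:

import (
	"fmt"
	"log"

	"github.com/bytemare/crawl"
)

func myUnboundedCrawler() {

	// A timeout of 0 means no deadline: the crawl runs until the link
	// tree is exhausted or the process receives SIGINT/SIGTERM.
	results, err := crawl.FetchLinks("https://bytema.re", 0)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(results.Links())
}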

Supported Go versions

We support the last two major Go versions, which are 1.12 and 1.13 at the moment.

Contributing

Please feel free to submit issues, fork the repository, and send pull requests! Take a look at the contributing guidelines!

License

This project is licensed under the terms of the MIT license.

Documentation

Overview

Package crawl is a simple link scraper and web crawler with single domain scope. It can be limited with a timeout and interrupted with signals.

Three public functions give access to single-page link scraping (ScrapLinks) and single-host web crawling (FetchLinks and StreamLinks). FetchLinks and StreamLinks have the same behaviour and results, as FetchLinks is a wrapper around StreamLinks. The only difference is that FetchLinks blocks and returns once a stopping condition is reached (link tree exhaustion, timeout, signals), whereas StreamLinks returns immediately with a channel the calling function can listen on to get results as they come.

The return values can be used for a site map.

Some precautions have been taken to prevent infinite loops, such as stripping queries and fragments off URLs.

A sample program calling the package is given in the project repository.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ScrapLinks(url string, timeout time.Duration) ([]string, error)

ScrapLinks returns the links found in the web page pointed to by url

Types

type CrawlerResults

type CrawlerResults struct {
	// contains filtered or unexported fields
}

CrawlerResults is sent back to the caller, containing results and information about the crawling

func FetchLinks(domain string, timeout time.Duration) (*CrawlerResults, error)

FetchLinks is a wrapper around StreamLinks and does the same, except it blocks and accumulates all links before returning them to the caller.

func StreamLinks(domain string, timeout time.Duration) (*CrawlerResults, error)

StreamLinks returns a channel on which it will report links as they come during the crawling. The caller should range over that channel to continuously retrieve messages. StreamLinks will close that channel when all encountered links have been visited and none are left, when the deadline set by the timeout parameter is reached, or if a SIGINT or SIGTERM signal is received.

func (*CrawlerResults) ExitContext

func (cr *CrawlerResults) ExitContext() string

func (*CrawlerResults) Links

func (cr *CrawlerResults) Links() []string

func (*CrawlerResults) Stream

func (cr *CrawlerResults) Stream() <-chan *LinkMap
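
A sketch of how these accessors might fit together, assuming ExitContext reports why the crawl stopped (its exact content is not specified here): range over Stream() until the channel closes, then inspect the exit context.

import (
	"fmt"
	"log"
	"time"

	"github.com/bytemare/crawl"
)

func streamAndReport() {

	cr, err := crawl.StreamLinks("https://bytema.re", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}

	// The stream is closed on link exhaustion, timeout or signal.
	for lm := range cr.Stream() {
		fmt.Printf("visited %s\n", lm.URL)
	}

	// Assumption: ExitContext describes why the crawl stopped.
	fmt.Printf("crawl stopped: %s\n", cr.ExitContext())
}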

type LinkMap

type LinkMap struct {
	URL   string
	Links *[]string
	Error error
}

LinkMap holds the links found on the web page pointed to by URL, restricted to the same host as the URL
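
Since Links is a pointer and a page can fail individually, a consumer would presumably check Error before dereferencing it. A minimal sketch (the helper name is hypothetical):

import (
	"fmt"

	"github.com/bytemare/crawl"
)

// handleLinkMap prints the outcome for one crawled page.
func handleLinkMap(lm *crawl.LinkMap) {
	// Checking Error before dereferencing Links is an assumption about
	// how the two fields are meant to be used together.
	if lm.Error != nil {
		fmt.Printf("%s: %v\n", lm.URL, lm.Error)
		return
	}
	if lm.Links != nil {
		fmt.Printf("%s -> %d links\n", lm.URL, len(*lm.Links))
	}
}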

