grawler

Version: v0.0.0-...-e441820
Published: May 26, 2017 | License: Apache-2.0 | Imports: 9 | Imported by: 0

README

Web Crawler

A webcrawler library written in Go.

Installation

$ go get -u golang.org/x/net/html
$ go get -u github.com/frobware/grawler/...

The sitemap binary is an example of how to use the library.

Given a URL, sitemap prints a basic sitemap for that domain: for each page it lists the links on the page and the assets found on it. At the moment only img, script and link elements are considered assets. sitemap only downloads links on the same domain as the starting URL. Downloads are concurrent by default, and the -j <N> argument governs the concurrency.

$ sitemap -j 42 http://gopl.io
{
  "URL": "http://gopl.io",
  "Links": [
	"http://www.informit.com/store/go-programming-language-9780134190440",
	"http://www.amazon.com/dp/0134190440",
	"http://www.barnesandnoble.com/w/1121601944",
	"http://gopl.io/ch1.pdf",
	"https://github.com/adonovan/gopl.io/",
	"http://gopl.io/reviews.html",
	"http://gopl.io/translations.html",
	"http://gopl.io/errata.html",
	"http://golang.org/s/oracle-user-manual",
	"http://golang.org/lib/godoc/analysis/help.html",
	"https://github.com/golang/tools/blob/master/refactor/eg/eg.go",
	"https://github.com/golang/tools/blob/master/refactor/rename/rename.go",
	"http://www.amazon.com/dp/0131103628?tracking_id=disfordig-20",
	"http://www.amazon.com/dp/020161586X?tracking_id=disfordig-20"
  ],
  "Assets": [
	"style.css",
	"cover.png",
	"buyfromamazon.png",
	"informit.png",
	"barnesnoble.png"
  ]
},
{
  "URL": "http://gopl.io/errata.html",
  "Links": [
	"https://github.com/golang/proposal/blob/master/design/12416-cgo-pointers.md"
  ],
  "Assets": [
	"style.css"
  ]
},
{
  "URL": "http://gopl.io/reviews.html",
  "Links": [
	"https://www.usenix.org/system/files/login/articles/login_dec15_17_books.pdf",
	"http://lpar.ath0.com/2015/12/03/review-go-programming-language-book",
	"http://www.computingreviews.com/index_dynamic.cfm?CFID=15675338\u0026CFTOKEN=37047869",
	"http://www.infoq.com/articles/the-go-programming-language-book-review",
	"http://www.onebigfluke.com/2016/03/book-review-go-programming-language.html",
	"http://eli.thegreenplace.net/2016/book-review-the-go-programming-language-by-alan-donovan-and-brian-kernighan",
	"http://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/product-reviews/0134190440/ref=cm_cr_dp_see_all_summary"
  ],
  "Assets": [
	"style.css",
	"5stars.png"
  ]
},
{
  "URL": "http://gopl.io/translations.html",
  "Links": [
	"http://www.acornpub.co.kr/book/go-programming",
	"http://www.williamspublishing.com/Books/978-5-8459-2051-5.html",
	"http://helion.pl/ksiazki/jezyk-go-poznaj-i-programuj-alan-a-a-donovan-brian-w-kernighan,jgopop.htm",
	"http://helion.pl/",
	"http://www.amazon.co.jp/exec/obidos/ASIN/4621300253",
	"http://www.maruzen.co.jp/corp/en/services/publishing.html",
	"http://novatec.com.br/",
	"http://www.gotop.com.tw/",
	"http://www.pearsonapac.com/"
  ],
  "Assets": [
	"style.css"
  ]
},
{
  "URL": "http://gopl.io/ch1.pdf"
}

Documentation

Functions

func NewHTTPFetcher

func NewHTTPFetcher() *fetcher

NewHTTPFetcher returns a new Fetcher.

Types

type Fetcher

type Fetcher interface {
	// Fetch returns a reader for the body of the downloaded URL,
	// or error if it could not be downloaded. The caller is
	// responsible for body.Close().
	Fetch(url string) (body io.ReadCloser, err error)
}

Fetcher fetches HTML documents.
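
Because Fetcher is a one-method interface, a crawl can be driven from something other than HTTP, for example in tests. The following is a minimal sketch under that assumption; mapFetcher and readAll are hypothetical names, not part of the package, and Fetcher is redeclared locally only so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// Fetcher is a local copy of the interface documented above.
type Fetcher interface {
	Fetch(url string) (body io.ReadCloser, err error)
}

// mapFetcher is a hypothetical in-memory Fetcher that maps a URL
// to the HTML served for it.
type mapFetcher map[string]string

// Fetch returns the stored HTML for url, or an error if url is unknown.
func (m mapFetcher) Fetch(url string) (io.ReadCloser, error) {
	html, ok := m[url]
	if !ok {
		return nil, errors.New("not found: " + url)
	}
	return io.NopCloser(strings.NewReader(html)), nil
}

// readAll fetches url and returns the whole body as a string.
// It closes the body, as the Fetch contract requires of the caller.
func readAll(f Fetcher, url string) (string, error) {
	body, err := f.Fetch(url)
	if err != nil {
		return "", err
	}
	defer body.Close()
	b, err := io.ReadAll(body)
	return string(b), err
}

func main() {
	var f Fetcher = mapFetcher{
		"http://example.com/": `<a href="/about">About</a>`,
	}
	html, err := readAll(f, "http://example.com/")
	fmt.Println(html, err)
}
```

Such a value could presumably be passed to Crawl in place of the result of NewHTTPFetcher, since Crawl accepts any Fetcher.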

type FollowLink

type FollowLink func(URL string) bool

A FollowLink function returns true if URL should be downloaded.
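
The README says sitemap only downloads links from the same domain. One way to express that rule as a FollowLink is sketched below; sameHost is a hypothetical helper, and how sitemap itself implements the check is not shown in this documentation. FollowLink is redeclared locally so the example is self-contained.

```go
package main

import (
	"fmt"
	"net/url"
)

// FollowLink is a local copy of the type documented above.
type FollowLink func(URL string) bool

// sameHost returns a FollowLink that accepts only absolute URLs
// whose host matches the host of root. Relative URLs have an empty
// host and are therefore rejected by this sketch.
func sameHost(root string) (FollowLink, error) {
	base, err := url.Parse(root)
	if err != nil {
		return nil, err
	}
	return func(URL string) bool {
		u, err := url.Parse(URL)
		return err == nil && u.Host == base.Host
	}, nil
}

func main() {
	follow, err := sameHost("http://gopl.io")
	if err != nil {
		panic(err)
	}
	fmt.Println(follow("http://gopl.io/errata.html"))          // true
	fmt.Println(follow("http://www.amazon.com/dp/0134190440")) // false
}
```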

type Page

type Page struct {
	// The page that was crawled.
	URL string `json:"URL"`

	// The Fetcher.Fetch() error, if any.
	FetcherError error `json:",FetchError,omitempty"`

	// The set of links.
	Links []string `json:",omitempty"`

	// The set of assets.
	Assets []string `json:",omitempty"`
}

Page captures the links and assets in an HTML document.
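
The omitempty tags explain why a page with no links or assets, such as http://gopl.io/ch1.pdf in the sitemap output above, is printed with only its URL. The sketch below illustrates this with a local copy of Page that keeps the URL, Links and Assets fields and deliberately leaves out FetcherError; encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Page is a trimmed local copy of the documented struct
// (the FetcherError field is omitted from this sketch).
type Page struct {
	URL    string   `json:"URL"`
	Links  []string `json:",omitempty"`
	Assets []string `json:",omitempty"`
}

// encode returns the compact JSON encoding of p.
func encode(p Page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	// Nil slices are dropped from the output because of omitempty.
	s, _ := encode(Page{URL: "http://gopl.io/ch1.pdf"})
	fmt.Println(s) // {"URL":"http://gopl.io/ch1.pdf"}

	s, _ = encode(Page{
		URL:    "http://gopl.io/errata.html",
		Assets: []string{"style.css"},
	})
	fmt.Println(s) // {"URL":"http://gopl.io/errata.html","Assets":["style.css"]}
}
```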

func Crawl

func Crawl(URL string, maxWorkers int, fetcher Fetcher, followLink FollowLink) []*Page

Crawl crawls from URL, using up to maxWorkers concurrent download workers. Each discovered link is passed to the followLink filter and is added to the queue only if the filter returns true.

Directories

Path Synopsis
cmd
