grawler

package module

Version: v0.0.0-...-e441820 | Published: May 26, 2017 | License: Apache-2.0 | Imports: 9 | Imported by: 0

Repository

github.com/frobware/grawler

README

Web Crawler

A webcrawler library written in Go.

Installation

$ go get -u golang.org/x/net/html
$ go get -u github.com/frobware/grawler/...

The binary sitemap is an example of using the library.

Given a URL, it prints a basic sitemap for that domain: for each page, the links it contains and the assets found on it. At the moment only img, script and link elements are considered assets. sitemap only downloads links from the same domain. Downloads are concurrent by default; the -j <N> argument controls the concurrency.

$ sitemap -j 42 http://gopl.io

{
  "URL": "http://gopl.io",
  "Links": [
	"http://www.informit.com/store/go-programming-language-9780134190440",
	"http://www.amazon.com/dp/0134190440",
	"http://www.barnesandnoble.com/w/1121601944",
	"http://gopl.io/ch1.pdf",
	"https://github.com/adonovan/gopl.io/",
	"http://gopl.io/reviews.html",
	"http://gopl.io/translations.html",
	"http://gopl.io/errata.html",
	"http://golang.org/s/oracle-user-manual",
	"http://golang.org/lib/godoc/analysis/help.html",
	"https://github.com/golang/tools/blob/master/refactor/eg/eg.go",
	"https://github.com/golang/tools/blob/master/refactor/rename/rename.go",
	"http://www.amazon.com/dp/0131103628?tracking_id=disfordig-20",
	"http://www.amazon.com/dp/020161586X?tracking_id=disfordig-20"
  ],
  "Assets": [
	"style.css",
	"cover.png",
	"buyfromamazon.png",
	"informit.png",
	"barnesnoble.png"
  ]
},
{
  "URL": "http://gopl.io/errata.html",
  "Links": [
	"https://github.com/golang/proposal/blob/master/design/12416-cgo-pointers.md"
  ],
  "Assets": [
	"style.css"
  ]
},
{
  "URL": "http://gopl.io/reviews.html",
  "Links": [
	"https://www.usenix.org/system/files/login/articles/login_dec15_17_books.pdf",
	"http://lpar.ath0.com/2015/12/03/review-go-programming-language-book",
	"http://www.computingreviews.com/index_dynamic.cfm?CFID=15675338\u0026CFTOKEN=37047869",
	"http://www.infoq.com/articles/the-go-programming-language-book-review",
	"http://www.onebigfluke.com/2016/03/book-review-go-programming-language.html",
	"http://eli.thegreenplace.net/2016/book-review-the-go-programming-language-by-alan-donovan-and-brian-kernighan",
	"http://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/product-reviews/0134190440/ref=cm_cr_dp_see_all_summary"
  ],
  "Assets": [
	"style.css",
	"5stars.png"
  ]
},
{
  "URL": "http://gopl.io/translations.html",
  "Links": [
	"http://www.acornpub.co.kr/book/go-programming",
	"http://www.williamspublishing.com/Books/978-5-8459-2051-5.html",
	"http://helion.pl/ksiazki/jezyk-go-poznaj-i-programuj-alan-a-a-donovan-brian-w-kernighan,jgopop.htm",
	"http://helion.pl/",
	"http://www.amazon.co.jp/exec/obidos/ASIN/4621300253",
	"http://www.maruzen.co.jp/corp/en/services/publishing.html",
	"http://novatec.com.br/",
	"http://www.gotop.com.tw/",
	"http://www.pearsonapac.com/"
  ],
  "Assets": [
	"style.css"
  ]
},
{
  "URL": "http://gopl.io/ch1.pdf"
}

Documentation

Index

func NewHTTPFetcher() *fetcher
type Fetcher
type FollowLink
type Page
func Crawl(URL string, maxWorkers int, fetcher Fetcher, followLink FollowLink) []*Page

Functions

func NewHTTPFetcher

func NewHTTPFetcher() *fetcher

NewHTTPFetcher returns a new Fetcher.

Types

type Fetcher

type Fetcher interface {
	// Fetch returns a reader for the body of the downloaded URL,
	// or error if it could not be downloaded. The caller is
	// responsible for body.Close().
	Fetch(url string) (body io.ReadCloser, err error)
}

Fetcher fetches HTML documents.
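
Because Fetcher is a one-method interface, a crawl can be driven from canned documents instead of the network, which is handy in tests. The following is a minimal sketch, not part of the library: mapFetcher, fetchString and the example.test URL are hypothetical names, and the Fetcher interface is redeclared locally only so that the sketch compiles on its own.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Fetcher is a local copy of the interface documented above, so that
// this sketch compiles without importing the library.
type Fetcher interface {
	Fetch(url string) (body io.ReadCloser, err error)
}

// mapFetcher is a hypothetical in-memory Fetcher: it maps a URL to
// the HTML that should be returned for it.
type mapFetcher map[string]string

// Fetch returns the canned document for url, or an error if there is none.
func (m mapFetcher) Fetch(url string) (io.ReadCloser, error) {
	doc, ok := m[url]
	if !ok {
		return nil, fmt.Errorf("fetch %s: not found", url)
	}
	return io.NopCloser(strings.NewReader(doc)), nil
}

// fetchString downloads url through f and returns the body as a string.
// As the interface requires, the caller closes the body.
func fetchString(f Fetcher, url string) (string, error) {
	body, err := f.Fetch(url)
	if err != nil {
		return "", err
	}
	defer body.Close()
	data, err := io.ReadAll(body)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	f := mapFetcher{"http://example.test/": `<a href="/about">About</a>`}
	doc, err := fetchString(f, "http://example.test/")
	fmt.Println(doc, err)
}
```

A value such as mapFetcher should be usable wherever a Fetcher is accepted, for example as the fetcher argument to Crawl.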

type FollowLink

type FollowLink func(URL string) bool

FollowLink should return true if URL should be downloaded.
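
As an illustration, a filter that restricts a crawl to the starting host could look roughly like this. It is a sketch only: sameHost is a hypothetical helper, not a function exported by the library, and it is not necessarily how the sitemap command implements its same-domain rule. The FollowLink type is redeclared locally so that the sketch is self-contained.

```go
package main

import (
	"fmt"
	"net/url"
)

// FollowLink is a local copy of the type documented above.
type FollowLink func(URL string) bool

// sameHost (hypothetical) returns a FollowLink that accepts only
// links whose host name matches that of root. Relative links are
// resolved against root before the comparison.
func sameHost(root string) (FollowLink, error) {
	base, err := url.Parse(root)
	if err != nil {
		return nil, err
	}
	return func(link string) bool {
		u, err := url.Parse(link)
		if err != nil {
			return false // unparsable links are never followed
		}
		return base.ResolveReference(u).Hostname() == base.Hostname()
	}, nil
}

func main() {
	follow, err := sameHost("http://gopl.io")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(follow("http://gopl.io/errata.html"))          // true
	fmt.Println(follow("http://www.amazon.com/dp/0134190440")) // false
}
```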

type Page

type Page struct {
	// The page that was crawled.
	URL string `json:"URL"`

	// The Fetcher.Fetch() error, if any.
	FetcherError error `json:",FetchError,omitempty"`

	// The set of links.
	Links []string `json:",omitempty"`

	// The set of assets.
	Assets []string `json:",omitempty"`
}

Page captures the links and assets in an HTML document.
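
The omitempty tags explain why a page without links or assets, such as http://gopl.io/ch1.pdf in the sitemap output, is printed with only its URL. The sketch below shows the effect with encoding/json. It assumes that the output is produced with that package, and it uses a trimmed local copy of the struct (page, without the FetcherError field) so that it compiles on its own; encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page is a trimmed local copy of the Page struct above; the
// FetcherError field is left out to keep the sketch short.
type page struct {
	URL    string   `json:"URL"`
	Links  []string `json:",omitempty"`
	Assets []string `json:",omitempty"`
}

// encode (hypothetical helper) renders p as compact JSON.
func encode(p page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	pages := []page{
		{URL: "http://gopl.io/errata.html", Assets: []string{"style.css"}},
		{URL: "http://gopl.io/ch1.pdf"}, // nil slices are omitted entirely
	}
	for _, p := range pages {
		s, err := encode(p)
		if err != nil {
			fmt.Println("encode:", err)
			continue
		}
		fmt.Println(s)
	}
}
```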

func Crawl

func Crawl(URL string, maxWorkers int, fetcher Fetcher, followLink FollowLink) []*Page

Crawl crawls from URL, using up to maxWorkers concurrent download workers. Each discovered link is added to the queue only if the followLink filter returns true for it.
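
The general technique described here, bounded concurrency plus a link filter, can be sketched as follows. This is not the library's implementation: crawl and linkGraph are hypothetical, and a map lookup stands in for downloading and parsing a page. It only illustrates how a worker limit, a visited set and a follow filter fit together.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// linkGraph is a hypothetical stand-in for fetching and parsing:
// it maps a URL to the links found on that page.
type linkGraph map[string][]string

// crawl visits every URL reachable from root for which follow returns
// true, looking up at most maxWorkers pages at the same time. It
// returns the visited URLs in sorted order.
func crawl(root string, maxWorkers int, graph linkGraph, follow func(string) bool) []string {
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{root: true}
		slots   = make(chan struct{}, maxWorkers) // counting semaphore
	)
	var visit func(url string)
	visit = func(url string) {
		defer wg.Done()
		slots <- struct{}{} // acquire a worker slot
		links := graph[url] // stands in for download and parse
		<-slots             // release the slot
		for _, link := range links {
			if !follow(link) {
				continue
			}
			mu.Lock()
			seen := visited[link]
			visited[link] = true
			mu.Unlock()
			if !seen {
				wg.Add(1)
				go visit(link)
			}
		}
	}
	wg.Add(1)
	go visit(root)
	wg.Wait()

	urls := make([]string, 0, len(visited))
	for u := range visited {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func main() {
	graph := linkGraph{"a": {"b", "c", "x"}, "b": {"a", "c"}}
	notX := func(u string) bool { return u != "x" }
	fmt.Println(crawl("a", 2, graph, notX))
}
```

The buffered channel caps the number of simultaneous lookups, while the mutex-protected visited set ensures that each URL is queued at most once even when several pages link to it.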

Directories

Path	Synopsis
cmd/sitemap
