crawl

package v0.0.0-...-df4f15f

Published: Dec 20, 2019 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package app provides clamber's crawling functionality.

To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty map of empty structs. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been visited during that crawl, so they are not fetched twice. Store is the relationship store the crawler writes to.

crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Store:          app.DbStore,
}

Create a page object with the starting URL of your crawl.

p := &page.Page{Url: "https://golang.org"}

Call Crawl on the Crawler object, passing in your page. (Depth is not a parameter of Crawl; see the Crawler's BackgroundCrawlDepth field.)

crawler.Crawl(p)

Ensure your Go process does not exit before the crawled data has been saved to dgraph. If other logic needs to run first, place it before the call below, as your application will block on Wait() until the writes are finished.

crawler.DbWaitGroup.Wait()
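
Putting the steps together, a minimal sketch (import paths are omitted, and app.DbStore is the store value from the example above):

crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Store:          app.DbStore,
}
p := &page.Page{Url: "https://golang.org"}
crawler.Crawl(p)
crawler.DbWaitGroup.Wait() // blocks until every write to dgraph has finished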

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// AlreadyCrawled tracks the URLs already visited in this crawl.
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	// DbWaitGroup is waited on so the process does not exit before
	// database writes have finished.
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Store                *relationship.Store
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	Queue                *queue.Queue
}

Crawler holds objects related to the crawler

func New

func New() (c Crawler)
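
As an alternative to the struct literal shown in the overview, New returns a Crawler value directly; which fields it initializes is not documented here:

crawler := app.New()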

func (*Crawler) Crawl

func (crawler *Crawler) Crawl(currentPage *page.Page)

Crawl adds the page to the database in a goroutine (so it doesn't block initiating other crawls), fetches the child pages, then initiates a crawl for each one.
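
For illustration only, that flow could be sketched as follows; this is not the package's implementation, childPages is a hypothetical helper that returns a page's child pages, and the depth and AlreadyCrawled checks are omitted:

func crawlSketch(crawler *app.Crawler, p *page.Page) {
	crawler.DbWaitGroup.Add(1)
	go func() {
		defer crawler.DbWaitGroup.Done()
		_ = crawler.Create(p) // persist without blocking further crawls
	}()
	for _, child := range childPages(p) { // hypothetical helper
		crawlSketch(crawler, child)
	}
}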

func (*Crawler) Create

func (crawler *Crawler) Create(currentPage *page.Page) (err error)

Create checks whether the current page exists and creates it if not, does the same for the parent page, then checks for an edge between them and creates one if it doesn't exist.
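
A hedged sketch of that sequence using the package's own FindOrCreatePage and FindOrCreateLink; the Parent field on page.Page is an assumption:

func createSketch(crawler *app.Crawler, currentPage *page.Page) error {
	ctx := context.Background()
	currentUid, err := crawler.FindOrCreatePage(&ctx, currentPage)
	if err != nil {
		return err
	}
	parentUid, err := crawler.FindOrCreatePage(&ctx, currentPage.Parent) // assumed Parent field
	if err != nil {
		return err
	}
	// Create the edge between parent and current page if it doesn't exist.
	return crawler.FindOrCreateLink(&ctx, parentUid, currentUid)
}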

func (*Crawler) FindOrCreateLink

func (crawler *Crawler) FindOrCreateLink(ctx *context.Context, parentUid string, currentUid string) (err error)

func (*Crawler) FindOrCreatePage

func (crawler *Crawler) FindOrCreatePage(ctx *context.Context, p *page.Page) (uid string, err error)

func (*Crawler) Get

func (crawler *Crawler) Get(currentPage *page.Page) (resp *http.Response, err error)

Get performs the HTTP request for a page.
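
A typical call, based on the signature above; the caller is responsible for closing the response body:

resp, err := crawler.Get(&page.Page{Url: "https://golang.org"})
if err != nil {
	return err
}
defer resp.Body.Close()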

func (*Crawler) Start

func (crawler *Crawler) Start() (err error)
