crawl

package v0.0.0-...-df4f15f

Published: Dec 20, 2019 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package app provides clamber's crawling functionality.

To initiate a crawl, create a Crawler with an empty sync.WaitGroup and an empty map of empty structs. DbWaitGroup ensures the clamber process does not exit before the crawler has finished writing to the database. AlreadyCrawled tracks the URLs that have already been visited during that crawl, so they are not fetched twice. Store is the relationship store the crawler writes to.

crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Store:          app.DbStore,
}

Create a page object with the starting URL of your crawl.

p := &page.Page{Url: "https://golang.org"}

Call Crawl on the Crawler object, passing in your page. (Depth is not a parameter of Crawl; see the Crawler's BackgroundCrawlDepth field.)

crawler.Crawl(p)

Ensure your Go process does not exit before the crawled data has been saved to dgraph. If other logic needs to run first, place it before the call below, as your application will block on Wait() until the writes are finished.

crawler.DbWaitGroup.Wait()
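
Putting the steps together, a minimal sketch (import paths are omitted, and app.DbStore is the store value from the example above):

crawler := app.Crawler{
	DbWaitGroup:    sync.WaitGroup{},
	AlreadyCrawled: make(map[string]struct{}),
	Store:          app.DbStore,
}
p := &page.Page{Url: "https://golang.org"}
crawler.Crawl(p)
crawler.DbWaitGroup.Wait() // blocks until every write to dgraph has finished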

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// AlreadyCrawled tracks the URLs already visited in this crawl.
	AlreadyCrawled map[string]struct{}
	sync.Mutex
	// DbWaitGroup is waited on so the process does not exit before
	// database writes have finished.
	DbWaitGroup          sync.WaitGroup
	BgWaitGroup          sync.WaitGroup
	BgNotified           bool
	BgWaitNotified       bool
	Store                *relationship.Store
	BackgroundCrawlDepth int
	CrawlUid             uuid.UUID
	Queue                *queue.Queue
}

Crawler holds objects related to the crawler

func New

func New() (c Crawler)
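
As an alternative to the struct literal shown in the overview, New returns a Crawler value directly; which fields it initializes is not documented here:

crawler := app.New()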

func (*Crawler) Crawl

func (crawler *Crawler) Crawl(currentPage *page.Page)

Crawl adds the page to the database in a goroutine (so it doesn't block initiating other crawls), fetches the child pages, then initiates a crawl for each one.
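
For illustration only, that flow could be sketched as follows; this is not the package's implementation, childPages is a hypothetical helper that returns a page's child pages, and the depth and AlreadyCrawled checks are omitted:

func crawlSketch(crawler *app.Crawler, p *page.Page) {
	crawler.DbWaitGroup.Add(1)
	go func() {
		defer crawler.DbWaitGroup.Done()
		_ = crawler.Create(p) // persist without blocking further crawls
	}()
	for _, child := range childPages(p) { // hypothetical helper
		crawlSketch(crawler, child)
	}
}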

func (*Crawler) Create

func (crawler *Crawler) Create(currentPage *page.Page) (err error)

Create checks whether the current page exists and creates it if not, does the same for the parent page, then checks for an edge between them and creates one if it doesn't exist.
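
A hedged sketch of that sequence using the package's own FindOrCreatePage and FindOrCreateLink; the Parent field on page.Page is an assumption:

func createSketch(crawler *app.Crawler, currentPage *page.Page) error {
	ctx := context.Background()
	currentUid, err := crawler.FindOrCreatePage(&ctx, currentPage)
	if err != nil {
		return err
	}
	parentUid, err := crawler.FindOrCreatePage(&ctx, currentPage.Parent) // assumed Parent field
	if err != nil {
		return err
	}
	// Create the edge between parent and current page if it doesn't exist.
	return crawler.FindOrCreateLink(&ctx, parentUid, currentUid)
}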

func (*Crawler) FindOrCreateLink

func (crawler *Crawler) FindOrCreateLink(ctx *context.Context, parentUid string, currentUid string) (err error)

func (*Crawler) FindOrCreatePage

func (crawler *Crawler) FindOrCreatePage(ctx *context.Context, p *page.Page) (uid string, err error)

func (*Crawler) Get

func (crawler *Crawler) Get(currentPage *page.Page) (resp *http.Response, err error)

Get performs the HTTP request for a page.
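
A typical call, based on the signature above; the caller is responsible for closing the response body:

resp, err := crawler.Get(&page.Page{Url: "https://golang.org"})
if err != nil {
	return err
}
defer resp.Body.Close()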

func (*Crawler) Start

func (crawler *Crawler) Start() (err error)
