hermes

package module
Published: Jul 3, 2017 License: BSD-3-Clause Imports: 20 Imported by: 0

README

What is Hermes? 🏃💨

Hermes is a combination of a couple of awesome packages, goquery + fetchbot, that will crawl a list of links and scrape the pages.

This package is purely a proof of concept. The storage layer only interacts with Elasticsearch at the moment.

[As of 4-28-2017]: I will be refactoring this entire package into a more idiomatic version. It started as a project to learn more about Go and web crawling/scraping.

I will add more examples of how to use the newer refactor as well.


Hermes

Install

go get github.com/jtaylor32/hermes

API Usage

Runner

A Runner is simply an easier way to configure a web crawler combined with a scraper. Depending on your TopLevelDomain + Subdomain flags, it will run through all of the nested links starting at the URL. The other struct fields let you make your Runner more granular as well. The Tags are the specific HTML tags you would like to pull from the pages you are scraping.

A call to Runner.Crawl() will start your Runner and return a slice of Documents and an error. It handles all the dynamic scraping and crawling behind the scenes based on your Runner fields/values.
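
For example, a minimal crawl might look like the sketch below. The starting URL, tag list, and delay are placeholder values, not package defaults:

package main

import (
	"fmt"
	"log"
	"net/url"
	"time"

	"github.com/jtaylor32/hermes"
)

func main() {
	// example.com is a placeholder starting URL.
	u, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Start from the defaults and override the fields we care about.
	r := hermes.New()
	r.URL = u
	r.Tags = []string{"title", "h1", "p"}
	r.Subdomain = true
	r.CrawlDelay = 1 * time.Second

	docs, err := r.Crawl()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("scraped %d documents\n", len(docs))
}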

Elasticsearch

Elasticsearch is a struct of an Elasticsearch host, index, and type. This is where you specify where the Documents from Crawl() will be stored.
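
Continuing the sketch above, storing the crawled Documents might look like this; the host, index, and type values are placeholders for your own Elasticsearch setup:

es := &hermes.Elasticsearch{
	Host:  "http://localhost:9200",
	Index: "pages",
	Type:  "document",
}
if err := es.Store(len(docs), docs); err != nil {
	log.Fatal(err)
}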

License

The BSD 3-Clause license, the same as the Go language.

Acknowledgments

Huge thanks to Martin Angers @mna and the work he has done on all his projects!

Documentation

Constants

const (
	DefaultUserAgent = "Hermes Bot (github.com/jtaylor32/hermes)"
)

Variables

var (
	// ErrNilHostParameter is returned when the elasticsearch host address is missing
	ErrNilHostParameter = errors.New("missing host parameter")
	// ErrNilIndexParameter is returned when the elasticsearch index name is missing
	ErrNilIndexParameter = errors.New("missing index parameter")
	// ErrNilTypeParameter is returned when the elasticsearch type name is missing
	ErrNilTypeParameter = errors.New("missing type parameters")
	// ErrNegativeNParameter is returned when the number of documents is negative
	ErrNegativeNParameter = errors.New("n parameter cannot be negative")
)
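
These are exported sentinel values, so a caller can compare an error against them directly. A short sketch, assuming Store is what returns them (as the names suggest), with es and docs from the README examples above:

if err := es.Store(len(docs), docs); err != nil {
	switch err {
	case hermes.ErrNilHostParameter, hermes.ErrNilIndexParameter, hermes.ErrNilTypeParameter:
		log.Fatal("incomplete Elasticsearch configuration: ", err)
	case hermes.ErrNegativeNParameter:
		log.Fatal("negative document count: ", err)
	default:
		log.Fatal(err)
	}
}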

Functions

This section is empty.

Types

type CustomSettings

type CustomSettings struct {
	RootLink       string   `json:"link"`
	Tags           []string `json:"tags"`
	Subdomain      bool     `json:"subdomain"`
	TopLevelDomain bool     `json:"top_level_domain"`
}

CustomSettings struct to model the custom settings we want to use when scraping a specific page

type Document

type Document struct {
	ID          string    `json:"id"`
	Title       string    `json:"title"`
	Description string    `json:"description"`
	Content     string    `json:"content"`
	Link        string    `json:"link"`
	Tag         string    `json:"tag"`
	Time        time.Time `json:"time"`
}

Document struct to model the single "Document" we will ingest into the elasticsearch index/type
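
Continuing the earlier sketch, a caller might walk the Documents returned from Crawl() like this:

for _, d := range docs {
	fmt.Printf("%s | %s (scraped %s)\n", d.Title, d.Link, d.Time.Format(time.RFC3339))
}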

type Elasticsearch

type Elasticsearch struct {
	Host, Index, Type string
}

The Elasticsearch struct type models the storage into a single Elasticsearch node. It must have a host, index, and type to ingest data to.

func (*Elasticsearch) Store

func (e *Elasticsearch) Store(n int, docs []Document) error

Store will take the total number of documents and the Documents to be ingested into the receiver's host, index, and type. It returns an error if it faults, or prints stats on the ingestion process (Total, Requests/sec, Time to ingest)

type Index

type Index struct {
	Host      string
	Index     string
	Documents []Document
}

Index struct to model each index ingestion set for our elasticsearch data

type IngestionDocument

type IngestionDocument struct {
	Documents []Document
}

IngestionDocument struct to model an ingestion set of multiple Documents for our index

type Runner

type Runner struct {
	// The CrawlDelay is the set time for the Runner to abide by.
	CrawlDelay time.Duration

	// The CancelDuration is the set time for the Runner to cancel immediately.
	CancelDuration time.Duration

	// The CancelAtURL is the specific URL that the Runner will cancel on.
	CancelAtURL string

	// The StopDuration is the set time for the Runner to stop at while still processing the remaining links in the queue.
	StopDuration time.Duration

	// The StopAtURL is the specific URL that the Runner will stop on. It will still process the remaining links in the queue.
	StopAtURL string

	// The MemStatsInterval is a set time for when the Runner will output memory statistics to standard output.
	MemStatsInterval time.Duration

	// The UserAgent is the Runner's user agent string name. Be polite and identify yourself for people to see.
	UserAgent string

	// The WorkerIdleTTL sets an idle timeout to watch for. Once the Runner has finished its total crawl,
	// it will exit after this timeout.
	WorkerIdleTTL time.Duration

	// AutoClose will make the Runner terminate and successfully exit after the WorkerIdleTTL if set to true.
	AutoClose bool

	// The URL is a reference pointer to a URL type; the Runner starts crawling at this URL.
	URL *url.URL

	// The Tags are the HTML tags you want to scrape with this Runner
	Tags []string

	// The MaximumDocuments caps how many documents the Runner will crawl/scrape.
	// If you don't have a specific preference you can leave it alone or set it to 0.
	MaximumDocuments int

	// The TopLevelDomain is a toggle to determine if you want to limit the Runner to a specific TLD. (i.e. .com, .edu, .gov, etc.)
	// If it is set to true it will make sure it stays to the URL's specific TLD.
	TopLevelDomain bool

	// The Subdomain is a toggle to determine if you want to limit the Runner to a subdomain of the URL. If it is set to true
	// it will make sure it stays on the host's domain. Think of it like a wildcard -- *.github.com -- any link that has
	// github.com will be fetched.
	Subdomain bool
	// contains filtered or unexported fields
}

A Runner defines the parameters for running a single instance of Hermes ETL

func New

func New() *Runner

New returns a Runner with default values. These values can be overwritten as needed after initializing the new Runner reference.

func (*Runner) Crawl

func (r *Runner) Crawl() ([]Document, error)

Crawl starts the Runner at its URL and fires off the crawling functions. It returns the scraped Documents and an error.

type Settings

type Settings struct {
	ElasticsearchHost  string        `json:"es_host"`            // host address for the elasticsearch instance
	ElasticsearchIndex string        `json:"es_index"`           // index name you are going to ingest data into
	ElasticsearchType  string        `json:"es_type"`            // type name you are going to ingest data into
	CrawlDelay         time.Duration `json:"crawl_delay"`        // delay time for the crawler to abide by
	CancelDuration     time.Duration `json:"cancel_duration"`    // time duration for canceling the crawler (immediate cancel)
	CancelAtURL        string        `json:"cancel_url"`         // specific URL to cancel the crawler at
	StopDuration       time.Duration `json:"stop_duration"`      // time duration for stopping the crawler (processes links on queue after duration time)
	StopAtURL          string        `json:"stop_url"`           // specific URL to stop the crawler at for a specific "root"
	MemStatsInterval   time.Duration `json:"mem_stats_interval"` // display memory statistics at a given interval
	UserAgent          string        `json:"user_agent"`         // set the user agent string for the crawler... to be polite and identify yourself
	WorkerIdleTTL      time.Duration `json:"worker_timeout"`     // time-to-live for a host URL's goroutine
	AutoClose          bool          `json:"autoclose"`          // sets the application to terminate if the WorkerIdleTTL time is passed (must be true)
	EnableLogging      bool          `json:"enable_logging"`     // sets whether or not to log to a file
}

Settings struct to model the settings we want to run our hermes application with.

func ParseSettings

func ParseSettings() Settings

ParseSettings will parse a local settings.json file in the same directory as the executable. The JSON file holds all the configuration set by the user for the application.
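
As a sketch, a settings.json mapped from the Settings JSON tags above might look like the following. All values are placeholders; note that with plain encoding/json decoding, the time.Duration fields are integer nanoseconds (1000000000 = 1s):

{
	"es_host": "http://localhost:9200",
	"es_index": "pages",
	"es_type": "document",
	"crawl_delay": 1000000000,
	"cancel_duration": 0,
	"cancel_url": "",
	"stop_duration": 0,
	"stop_url": "",
	"mem_stats_interval": 0,
	"user_agent": "Hermes Bot (github.com/jtaylor32/hermes)",
	"worker_timeout": 5000000000,
	"autoclose": true,
	"enable_logging": false
}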

type Sources

type Sources struct {
	Links []CustomSettings `json:"links"` // an array of all the URL strings we want to start our crawler at
}

Sources struct to model the set of links we want to crawl/scrape so their information can be stored in our elasticsearch index/type

func ParseLinks

func ParseLinks() Sources

ParseLinks will parse the local data.json file in the same directory as the executable. The JSON file is a "master" list of links we are going to crawl through.
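
As a sketch, a data.json mapped from the CustomSettings JSON tags above might look like this; the link and tags are placeholders:

{
	"links": [
		{
			"link": "https://example.com",
			"tags": ["title", "h1", "p"],
			"subdomain": true,
			"top_level_domain": false
		}
	]
}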
