hermes

package module
Published: Jul 3, 2017 License: BSD-3-Clause Imports: 20 Imported by: 0

README

What is Hermes? 🏃💨

Hermes is a combination of a couple of awesome packages, goquery + fetchbot, that will crawl a list of links and scrape the pages.

This package is purely a proof of concept. The storage layer only interacts with Elasticsearch at the moment.

[As of 4-28-2017]: I will be refactoring this entire package into a more idiomatic version. It started as a project to learn more about Go and web crawling/scraping.

I will add more examples of how to use the newer refactor as well.


Hermes

Install

go get github.com/jtaylor32/hermes

API Usage

Runner

A Runner is simply an easier way to configure a web crawler combined with a scraper. Depending on your TopLevelDomain + Subdomain flags, it will run through all of the nested links starting at the URL. The other struct fields let you make your Runner more granular as well. The Tags are the specific HTML tags you would like to pull from the pages you are scraping.

A call to Runner.Crawl() will start your Runner and return a slice of Documents and an error. It handles all the dynamic scraping and crawling behind the scenes based on your Runner fields/values.
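
For example, a minimal crawl might look like the sketch below. The starting URL, tag list, and delay are placeholder values, not package defaults:

package main

import (
	"fmt"
	"log"
	"net/url"
	"time"

	"github.com/jtaylor32/hermes"
)

func main() {
	// example.com is a placeholder starting URL.
	u, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Start from the defaults and override the fields we care about.
	r := hermes.New()
	r.URL = u
	r.Tags = []string{"title", "h1", "p"}
	r.Subdomain = true
	r.CrawlDelay = 1 * time.Second

	docs, err := r.Crawl()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("scraped %d documents\n", len(docs))
}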

Elasticsearch

Elasticsearch is a struct of an Elasticsearch host, index, and type. This is where you specify where the Documents from Crawl() will be stored.
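
Continuing the sketch above, storing the crawled Documents might look like this; the host, index, and type values are placeholders for your own Elasticsearch setup:

es := &hermes.Elasticsearch{
	Host:  "http://localhost:9200",
	Index: "pages",
	Type:  "document",
}
if err := es.Store(len(docs), docs); err != nil {
	log.Fatal(err)
}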

License

The BSD 3-Clause license, the same as the Go language.

Acknowledgments

Huge thanks to Martin Angers @mna and the work he has done on all his projects!

Documentation

Constants

const (
	DefaultUserAgent = "Hermes Bot (github.com/jtaylor32/hermes)"
)

Variables

var (
	// ErrNilHostParameter is returned when the elasticsearch host address is missing
	ErrNilHostParameter = errors.New("missing host parameter")
	// ErrNilIndexParameter is returned when the elasticsearch index name is missing
	ErrNilIndexParameter = errors.New("missing index parameter")
	// ErrNilTypeParameter is returned when the elasticsearch type name is missing
	ErrNilTypeParameter = errors.New("missing type parameters")
	// ErrNegativeNParameter is returned when the number of documents is negative
	ErrNegativeNParameter = errors.New("n parameter cannot be negative")
)
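
These are exported sentinel values, so a caller can compare an error against them directly. A short sketch, assuming Store is what returns them (as the names suggest), with es and docs from the README examples above:

if err := es.Store(len(docs), docs); err != nil {
	switch err {
	case hermes.ErrNilHostParameter, hermes.ErrNilIndexParameter, hermes.ErrNilTypeParameter:
		log.Fatal("incomplete Elasticsearch configuration: ", err)
	case hermes.ErrNegativeNParameter:
		log.Fatal("negative document count: ", err)
	default:
		log.Fatal(err)
	}
}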

Functions

This section is empty.

Types

type CustomSettings

type CustomSettings struct {
	RootLink       string   `json:"link"`
	Tags           []string `json:"tags"`
	Subdomain      bool     `json:"subdomain"`
	TopLevelDomain bool     `json:"top_level_domain"`
}

CustomSettings struct to model the custom settings we want to use when scraping a specific page

type Document

type Document struct {
	ID          string    `json:"id"`
	Title       string    `json:"title"`
	Description string    `json:"description"`
	Content     string    `json:"content"`
	Link        string    `json:"link"`
	Tag         string    `json:"tag"`
	Time        time.Time `json:"time"`
}

Document struct to model the single "Document" we will ingest into the elasticsearch index/type
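
Continuing the earlier sketch, a caller might walk the Documents returned from Crawl() like this:

for _, d := range docs {
	fmt.Printf("%s | %s (scraped %s)\n", d.Title, d.Link, d.Time.Format(time.RFC3339))
}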

type Elasticsearch

type Elasticsearch struct {
	Host, Index, Type string
}

The Elasticsearch struct type models the storage into a single Elasticsearch node. It must have a host, index, and type to ingest data to.

func (*Elasticsearch) Store

func (e *Elasticsearch) Store(n int, docs []Document) error

Store will take the total number of documents and the Documents to be ingested into the receiver's host, index, and type. It returns an error if it faults, or prints stats on the ingestion process (Total, Requests/sec, Time to ingest)

type Index

type Index struct {
	Host      string
	Index     string
	Documents []Document
}

Index struct to model each index ingestion set for our elasticsearch data

type IngestionDocument

type IngestionDocument struct {
	Documents []Document
}

IngestionDocument struct to model an ingestion set of multiple Documents for our index

type Runner

type Runner struct {
	// The CrawlDelay is the set time for the Runner to abide by.
	CrawlDelay time.Duration

	// The CancelDuration is the set time for the Runner to cancel immediately.
	CancelDuration time.Duration

	// The CancelAtURL is the specific URL that the Runner will cancel on.
	CancelAtURL string

	// The StopDuration is the set time for the Runner to stop at while still processing the remaining links in the queue.
	StopDuration time.Duration

	// The StopAtURL is the specific URL that the Runner will stop on. It will still process the remaining links in the queue.
	StopAtURL string

	// The MemStatsInterval is a set time for when the Runner will output memory statistics to standard output.
	MemStatsInterval time.Duration

	// The UserAgent is the Runner's user agent string name. Be polite and identify yourself for people to see.
	UserAgent string

	// The WorkerIdleTTL sets an idle timeout to watch for. Once the Runner has finished its total crawl,
	// it will exit after this timeout.
	WorkerIdleTTL time.Duration

	// AutoClose will make the Runner terminate and successfully exit after the WorkerIdleTTL if set to true.
	AutoClose bool

	// The URL is a reference pointer to a URL type; the Runner starts crawling at this URL.
	URL *url.URL

	// The Tags are the HTML tags you want to scrape with this Runner
	Tags []string

	// The MaximumDocuments caps how many documents the Runner will crawl/scrape.
	// If you don't have a specific preference you can leave it alone or set it to 0.
	MaximumDocuments int

	// The TopLevelDomain is a toggle to determine if you want to limit the Runner to a specific TLD. (i.e. .com, .edu, .gov, etc.)
	// If it is set to true it will make sure it stays to the URL's specific TLD.
	TopLevelDomain bool

	// The Subdomain is a toggle to determine if you want to limit the Runner to a subdomain of the URL. If it is set to true
	// it will make sure it stays on the host's domain. Think of it like a wildcard -- *.github.com -- any link that has
	// github.com will be fetched.
	Subdomain bool
	// contains filtered or unexported fields
}

A Runner defines the parameters for running a single instance of Hermes ETL

func New

func New() *Runner

New returns a Runner with default values. These values can be overwritten as needed after initializing the new Runner reference.

func (*Runner) Crawl

func (r *Runner) Crawl() ([]Document, error)

Crawl starts the Runner at its URL and fires off the crawling functions. It returns the scraped Documents and an error.

type Settings

type Settings struct {
	ElasticsearchHost  string        `json:"es_host"`            // host address for the elasticsearch instance
	ElasticsearchIndex string        `json:"es_index"`           // index name you are going to ingest data into
	ElasticsearchType  string        `json:"es_type"`            // type name you are going to ingest data into
	CrawlDelay         time.Duration `json:"crawl_delay"`        // delay time for the crawler to abide by
	CancelDuration     time.Duration `json:"cancel_duration"`    // time duration for canceling the crawler (immediate cancel)
	CancelAtURL        string        `json:"cancel_url"`         // specific URL to cancel the crawler at
	StopDuration       time.Duration `json:"stop_duration"`      // time duration for stopping the crawler (processes links on queue after duration time)
	StopAtURL          string        `json:"stop_url"`           // specific URL to stop the crawler at for a specific "root"
	MemStatsInterval   time.Duration `json:"mem_stats_interval"` // display memory statistics at a given interval
	UserAgent          string        `json:"user_agent"`         // set the user agent string for the crawler... to be polite and identify yourself
	WorkerIdleTTL      time.Duration `json:"worker_timeout"`     // time-to-live for a host URL's goroutine
	AutoClose          bool          `json:"autoclose"`          // sets the application to terminate if the WorkerIdleTTL time is passed (must be true)
	EnableLogging      bool          `json:"enable_logging"`     // sets whether or not to log to a file
}

Settings struct to model the settings we want to run our hermes application with.

func ParseSettings

func ParseSettings() Settings

ParseSettings will parse a local settings.json file in the same directory as the executable. The JSON file holds all the configuration set by the user for the application.
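
As a sketch, a settings.json mapped from the Settings JSON tags above might look like the following. All values are placeholders; note that with plain encoding/json decoding, the time.Duration fields are integer nanoseconds (1000000000 = 1s):

{
	"es_host": "http://localhost:9200",
	"es_index": "pages",
	"es_type": "document",
	"crawl_delay": 1000000000,
	"cancel_duration": 0,
	"cancel_url": "",
	"stop_duration": 0,
	"stop_url": "",
	"mem_stats_interval": 0,
	"user_agent": "Hermes Bot (github.com/jtaylor32/hermes)",
	"worker_timeout": 5000000000,
	"autoclose": true,
	"enable_logging": false
}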

type Sources

type Sources struct {
	Links []CustomSettings `json:"links"` // an array of all the URL strings we want to start our crawler at
}

Sources struct to model the set of links we want to crawl/scrape so their information can be stored in our elasticsearch index/type

func ParseLinks

func ParseLinks() Sources

ParseLinks will parse the local data.json file in the same directory as the executable. The JSON file is a "master" list of links we are going to crawl through.
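
As a sketch, a data.json mapped from the CustomSettings JSON tags above might look like this; the link and tags are placeholders:

{
	"links": [
		{
			"link": "https://example.com",
			"tags": ["title", "h1", "p"],
			"subdomain": true,
			"top_level_domain": false
		}
	]
}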
