package scraper

v0.1.1
Published: Mar 2, 2018 License: MIT Imports: 19 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

var ErrNotMatchOrigin = errors.New("redirection does not match origin host")

ErrNotMatchOrigin indicates that the end location is external to the host we were originally looking up.

var ErrTooManyRedirects = errors.New("too many redirects (10+)")

ErrTooManyRedirects indicates that the requested origin redirected more than ten times, suggesting a redirect loop.

Functions

func VerifyHostname

func VerifyHostname(c *tls.ConnectionState, host string) error

VerifyHostname verifies that the certificate in the tls.ConnectionState matches the given hostname.

Types

type CertName

type CertName struct {
	Country       string
	Organization  string
	Locality      string
	Province      string
	StreetAddress string
	CommonName    string
}

type Crawler

type Crawler struct {
	Log *log.Logger // output log

	Results []*FetchResult // scan results; should only be accessed once the scan is complete
	Pool    sempool.Pool   // thread pool for fetching main resources
	ResPool sempool.Pool   // thread pool for fetching assets
	Cnf     CrawlerConfig
	// contains filtered or unexported fields
}

Crawler is the higher-level struct that wraps the entire threaded crawl process.

func (*Crawler) Crawl

func (c *Crawler) Crawl()

Crawl provides the higher-level functionality of scraper: it concurrently requests the needed resources for a list of domains, allowing DNS lookups to be bypassed where necessary.

func (*Crawler) Fetch

func (c *Crawler) Fetch(res *FetchResult)

Fetch manages the fetching of the main resource, as well as all child resources, producing a FetchResult struct containing all of the crawl data needed.

func (*Crawler) Get

func (c *Crawler) Get(url string) (*CustomResponse, error)

Get wraps GetHandler, providing an easy interface for making GET requests.

func (*Crawler) IsRemote

func (c *Crawler) IsRemote(host string) bool

IsRemote checks whether host is remote, and whether it should be scanned.

type CrawlerConfig

type CrawlerConfig struct {
	Domains       []*Domain     // list of domains to scan
	Assets        bool          // if we want to pull the assets for the page too
	NoRemote      bool          // ignore all resources that match a remote IP
	AllowInsecure bool          // if SSL errors should be ignored
	Delay         time.Duration // delay before each resource is crawled
	HTTPTimeout   time.Duration // http timeout before a request has become stale
	Threads       int           // total number of threads to run crawls in
}

CrawlerConfig is the configuration that controls Crawler behavior.

type CustomClient

type CustomClient struct {
	URL       string
	Host      string
	ResultURL url.URL  // represents the url for the resulting request, without modifications
	OriginURL *url.URL // represents the url from the original request, without modifications
	// contains filtered or unexported fields
}

CustomClient is the state for our custom http wrapper, which houses the needed data to be able to rewrite the outgoing request during redirects.

type CustomResponse

type CustomResponse struct {
	*http.Response
	Time *utils.TimerResult
	URL  *url.URL
}

CustomResponse is the wrapped response from http.Client.Do() which also includes a timer of how long the request took, and a few other minor extras.

type Domain

type Domain struct {
	URL *url.URL `json:"-"`
	IP  string
}

Domain represents a URL we need to fetch, along with the details needed to fetch it: host, port, IP, scheme, path, etc.

func (*Domain) String

func (d *Domain) String() string

type FetchResult

type FetchResult struct {
	Resource                        // Inherit the Resource struct
	Assets       []*Resource        `json:"-"` // Assets containing the needed resources for the given URL
	ResourceTime *utils.TimerResult // ResourceTime is the time it took to fetch all resources
	TotalTime    *utils.TimerResult // TotalTime is the time it took to crawl the site
}

FetchResult is the struct returned by Crawl(), representing the entire crawl process.

func (*FetchResult) String

func (r *FetchResult) String() string

type HostnameError

type HostnameError struct {
	Certificate *x509.Certificate
	Host        string
}

HostnameError is returned when an invalid SSL certificate is supplied.

func (HostnameError) Error

func (h HostnameError) Error() string

type Resource

type Resource struct {
	URL      string             // the url -- this should exist regardless of failure
	Request  *Domain            // request represents what we were provided before the request
	Response Response           // Response represents the end result/data/status/etc.
	Error    error              // Error represents an error of a completely failed request
	Time     *utils.TimerResult // Time is the time it took to complete the request
}

Resource represents a single entity of many within a given crawl. These should only be static resources: css, js, jpg, png, etc.

func (*Resource) String

func (r *Resource) String() string

type Response

type Response struct {
	Remote        bool         // Remote is true if the origin is remote (unknown ip)
	Code          int          // Code is the numeric HTTP based status code
	URL           *url.URL     `json:"-"` // URL is the resulting static URL derived by the original result page
	Body          string       // Body is the response body. Used for primary requests, ignored for Resource structs.
	Headers       http.Header  // Headers is a map[string][]string of headers
	ContentLength int64        // ContentLength is the number of bytes in the body of the response
	TLS           *TLSResponse // TLS is the SSL/TLS session if the resource was loaded over SSL/TLS
}

Response represents the data for the HTTP-based request, closely matching http.Response.

type ResponseCert

type ResponseCert struct {
	Version        int
	NotBefore      time.Time
	NotAfter       time.Time
	Issuer         *CertName
	Subject        *CertName
	DNSNames       []string
	EmailAddresses []string
	IPAddresses    []net.IP
}

type TLSResponse

type TLSResponse struct {
	HandshakeComplete bool
	PeerCertificates  []*ResponseCert
	VerifiedChains    [][]*ResponseCert
}

TLSResponse is the TLS/SSL handshake response and certificate information.
