wcrawler

package module
Published: Oct 23, 2023 License: MIT Imports: 14 Imported by: 0

README

WCrawler


WCrawler is a simple web crawler CLI tool.

NOTE: This tool was created mainly for practice purposes and therefore doesn't rely on any library that facilitates crawling.

https://user-images.githubusercontent.com/17534422/109546768-85aec680-7ac2-11eb-8c72-2dbf7c7223a8.mp4

Usage

Exploring the Web:

❯ wcrawler explore --help
Explore the web by following links up to a pre-determined depth.
A depth of zero means no limit.

Usage:
  wcrawler explore URL [flags]


Flags:
  -d, --depth uint        depth of recursion (default 5)
  -h, --help              help for explore
  -s, --nostats           don't show live stats
  -o, --output string     file to save results (default "./web_graph.json")
  -r, --retry uint        retry requests when they timeout (default 2)
  -z, --stayinsubdomain   follow links only in the same subdomain
  -t, --timeout uint      HTTP requests timeout in seconds (default 10)
  -m, --treemode          doesn't add links which would point back to known nodes
  -w, --workers uint      number of workers making concurrent requests (default 100)

Visualizing the graph in the browser:

❯ wcrawler view --help
View web links relationships in the browser

Usage:
  wcrawler view [flags]

Flags:
  -h, --help            help for view
  -i, --input string    file containing the data (default "./web_graph.json")
  -n, --noautoopen      don't open browser automatically
  -o, --output string   HTML output file (default "./web_graph.html")

This will generate a webpage and load it in your default browser.

Spheres are coloured based on the URL subdomain. You can pan, tilt and rotate the scene, drag the spheres around, hover over them to see the URL they represent, and click on them to go straight to that URL.

NOTE: If you want to see a nice graph, make sure to run wcrawler explore with the -m flag. Tree mode doesn't add links back to already-known nodes, which makes for much nicer visualizations. Its utility? None, but the graphs are undeniably more beautiful.

Naturally, if you want a proper graph of the links visited and where they point to, just disregard the -m option. Don't try to visualize that, however, cos it's going to look ugly, if not freeze your browser entirely. Consider yourself warned :)

Example

The following command will crawl the web starting at the example.com website up to a maximum depth of 8, using 5 workers with a 6-second timeout per request, and saving the collected data to /tmp/result.json.

wcrawler explore https://example.com -d 8 -w 5 -t 6 -o /tmp/result.json

The following command will then generate an HTML file with a graph view of the collected data and load it in the default web browser. Only try to visualize the graph if you specified the -m option! It's going to be the wrong graph, but it's going to look nice!

wcrawler view -i /tmp/result.json

Considerations

Here I'm going to discuss the design decisions and a few caveats, but only when I'm actually done with the project.

Still have a few more things to do like:

  • Add logic to fetch the website's robots.txt file and adhere to whatever's in there. At the moment we are just crawling everything (feeling like an outlaw here at the minute)
  • Show last 10 errors in the CLI while crawling
  • Make output more colorful
  • Docs, docs and more docs
  • Write more unit tests
  • Increase coverage and run some benchmarks (I'm pretty sure I can speed up some parts and reduce allocations, even though this program is I/O bound more than anything else and won't benefit much from these optimizations, but practice is practice)
  • Add golangci-lint to travis-ci (cos it's quite nice)
  • Organize the code in a way that makes for a useful library (mostly done)

Third party libraries being used (directly):

Could have written the whole thing without using any library, but reusability is not a bad idea at all!

The only rule I had was to not use any library that facilitates crawling.

- github.com/gosuri/uilive     [updating terminal output in realtime]
- github.com/spf13/cobra       [CLI args and flags parsing]
- github.com/stretchr/testify  [writing unit tests]
- golang.org/x/net             [HTML parsing]
- github.com/oleiade/lane      [Provides a Queue data structure implementation]

Staying up to date

To update wcrawler to the latest version, use go get -u github.com/gustavooferreira/wcrawler.


Build

To build this project run:

make build

The wcrawler binary will be placed inside the bin/ folder.


Tests

To run tests:

make test

To get coverage:

make coverage

Free tip

If you run make without any targets, it will display all targets available in the makefile, each followed by a short description.


Contributing

I'd normally be more than happy to accept pull requests, but given that I've created this project with the sole intent of practicing, it doesn't make sense for me to accept other people's work.

However, feel free to fork the project and add whatever new features you feel like.

I'd still be glad if you report any bugs you notice by opening an issue.


License

This project is licensed under the terms of the MIT license.

Documentation

Index

Constants

const (
	// AppState_Unknown represents the 'unknown' state.
	AppState_Unknown = iota
	// AppState_IDLE represents the 'idle' state.
	AppState_IDLE
	// AppState_Running represents the 'run' state.
	AppState_Running
	// AppState_Finished represents the 'finish' state.
	AppState_Finished
)

Variables

This section is empty.

Functions

This section is empty.

Types

type AppState

type AppState int

AppState represents the current state of the App.

func (*AppState) Parse

func (as *AppState) Parse(state string) error

Parse parses a string into an AppState, returning an error if the string passed cannot be parsed into a valid state.

func (AppState) String

func (as AppState) String() string

String returns the string representation of AppState.

type Connector

type Connector interface {
	GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)
}

Connector describes the connector interface.
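
Because Connector is such a small interface, it is easy to satisfy with a stub, for example to feed a Crawler canned links in tests. The sketch below is hypothetical (fakeConnector is not part of the package) and assumes the root package is imported as github.com/gustavooferreira/wcrawler:

package wcrawler_test

import (
	"time"

	"github.com/gustavooferreira/wcrawler"
)

// fakeConnector is a hypothetical stub that satisfies the Connector interface.
// It returns the same canned links for every URL requested.
type fakeConnector struct {
	links []wcrawler.URLEntity
}

func (f *fakeConnector) GetLinks(rawURL string) (int, []wcrawler.URLEntity, time.Duration, error) {
	// Pretend every page responds with 200 OK instantly.
	return 200, f.links, 0, nil
}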

type Crawler

type Crawler struct {

	// Read-only vars
	InitialURL string

	Stats           bool
	ShowErrors      bool
	WorkersCount    int
	Depth           int
	StayInSubdomain bool
	TreeMode        bool
	SubDomain       string
	Retry           int
	// contains filtered or unexported fields
}

Crawler brings everything together and is responsible for starting goroutines and managing them.

func NewCrawler

func NewCrawler(connector Connector, initialURL string, retry int, linksWriter io.Writer, stats bool, showErrors bool, stayinsubdomain bool, treemode bool, workersCount int, depth int) (*Crawler, error)

NewCrawler returns a new Crawler.
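
As a rough sketch of how the pieces might be wired together outside the CLI. This assumes the root package import path, that linksWriter is the destination for the collected link data (as its name suggests), and that Run blocks until crawling finishes:

package main

import (
	"log"
	"net/http"
	"os"
	"time"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	// WebClient implements the Connector interface on top of a standard http.Client.
	connector := wcrawler.NewWebClient(&http.Client{Timeout: 10 * time.Second})

	// File receiving the collected link data (mirrors the README example).
	out, err := os.Create("/tmp/result.json")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	// Arguments follow the NewCrawler signature: connector, initial URL, retry count,
	// linksWriter, stats, showErrors, stayInSubdomain, treeMode, workersCount, depth.
	crawler, err := wcrawler.NewCrawler(connector, "https://example.com", 2, out,
		true, false, false, false, 10, 3)
	if err != nil {
		log.Fatalf("creating crawler: %v", err)
	}

	crawler.Run()
}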

func (*Crawler) Merger

func (c *Crawler) Merger(wg *sync.WaitGroup)

Merger gets the results (links) from the workers, keeps all the relevant information, and feeds the new links back to the workers via another channel.

func (*Crawler) Run

func (c *Crawler) Run()

Run starts crawling.

func (*Crawler) StatsWriter

func (c *Crawler) StatsWriter(wg *sync.WaitGroup)

StatsWriter writes stats to an io.Writer (e.g. os.Stdout).

func (*Crawler) WorkerRun

func (c *Crawler) WorkerRun(wg *sync.WaitGroup)

WorkerRun represents a worker crawling links in a goroutine. It receives tasks on a channel and returns results on another. When the tasks channel is closed, the workers return.

type EdgesSet

type EdgesSet map[int]struct{}

func NewEdgesSet

func NewEdgesSet() EdgesSet

func (EdgesSet) Add

func (es EdgesSet) Add(elems ...int)

func (EdgesSet) Count

func (es EdgesSet) Count() int

func (EdgesSet) Dump

func (es EdgesSet) Dump() []int

func (EdgesSet) MarshalJSON

func (es EdgesSet) MarshalJSON() ([]byte, error)

func (EdgesSet) Remove

func (es EdgesSet) Remove(elem int)

func (*EdgesSet) UnmarshalJSON

func (es *EdgesSet) UnmarshalJSON(b []byte) error
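
EdgesSet is a set of record indexes backed by a map. A quick usage sketch, assuming the root package import path:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	es := wcrawler.NewEdgesSet()
	es.Add(1, 2, 3)
	es.Remove(2)

	fmt.Println(es.Count()) // 2
	fmt.Println(es.Dump())  // the remaining indexes, in no particular order

	// EdgesSet marshals to JSON, which is how Record serializes its edges.
	b, _ := json.Marshal(es) // error ignored for brevity
	fmt.Println(string(b))
}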

type RMEntry

type RMEntry struct {
	ParentURL  string
	URL        URLEntity
	Depth      int
	StatusCode int
	ErrString  string
}

RMEntry represents an entry in the RecordManager (external interface).

type Record

type Record struct {
	// Index allows easy referencing of records (used in the edges)
	Index int `json:"index"`
	// This indicates whether this is the start of the graph
	// i.e., URL provided.
	InitPoint bool   `json:"initPoint"`
	URL       string `json:"url"`
	Host      string `json:"host"`
	Depth     int    `json:"depth"`
	// Edges      []uint `json:"edges"`
	// This is supposed to be mimicking a hashset
	// We use a struct as a value as it's a bit more space efficient
	Edges      EdgesSet `json:"edges"`
	StatusCode int      `json:"statusCode"`
	ErrString  string   `json:"errString,omitempty"`
}

Record represents an entry in the RecordManager (internal state).

type RecordManager

type RecordManager struct {
	// Keeps a table of Records. Key is the URL (scheme,authority,path,query)
	Records    map[string]Record
	IndexCount int
}

RecordManager keeps track of the links visited along with some metadata, like the depth level and each link's children.

func NewRecordManager

func NewRecordManager() *RecordManager

NewRecordManager returns a new Record Manager.

func (*RecordManager) AddEdge

func (rm *RecordManager) AddEdge(fromURL string, toURL string) error

AddEdge adds a new edge to a record if not already present.

func (*RecordManager) AddRecord

func (rm *RecordManager) AddRecord(entry RMEntry)

AddRecord adds a record to the RecordManager.

func (*RecordManager) Count

func (rm *RecordManager) Count() int

Count counts the number of records.

func (*RecordManager) Dump

func (rm *RecordManager) Dump() map[string]Record

Dump returns all records in the RecordManager.

func (*RecordManager) Exists

func (rm *RecordManager) Exists(rawURL string) bool

Exists checks whether this URL exists in the table.

func (*RecordManager) Get

func (rm *RecordManager) Get(rawURL string) (Record, bool)

Get returns a record from the Record Manager.

func (*RecordManager) LoadFromReader

func (rm *RecordManager) LoadFromReader(r io.Reader) error

LoadFromReader reads the records from a Reader in JSON format. You can pass an os.File to read from a file.

func (*RecordManager) SaveToWriter

func (rm *RecordManager) SaveToWriter(w io.Writer, indent bool) error

SaveToWriter dumps the records map into a Writer in JSON format. You can pass an os.File to write to a file.

func (*RecordManager) Update

func (rm *RecordManager) Update(rawURL string, statusCode int, err error) error

Update updates an entry in the table.
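
A sketch of driving the RecordManager directly. The URLs are made up, and it is assumed that records are keyed by their Raw URL, as the struct comment suggests:

package main

import (
	"log"
	"os"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	rm := wcrawler.NewRecordManager()

	// Add the starting page and one of its children.
	rm.AddRecord(wcrawler.RMEntry{
		URL:        wcrawler.URLEntity{NetLoc: "example.com", Raw: "https://example.com"},
		Depth:      0,
		StatusCode: 200,
	})
	rm.AddRecord(wcrawler.RMEntry{
		ParentURL:  "https://example.com",
		URL:        wcrawler.URLEntity{NetLoc: "example.com", Raw: "https://example.com/about"},
		Depth:      1,
		StatusCode: 200,
	})

	// Link the two records.
	if err := rm.AddEdge("https://example.com", "https://example.com/about"); err != nil {
		log.Fatal(err)
	}

	// Persist to JSON (an os.File would normally be used instead of os.Stdout).
	if err := rm.SaveToWriter(os.Stdout, true); err != nil {
		log.Fatal(err)
	}
}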

type Result

type Result struct {
	ParentURL  string
	StatusCode int
	Links      []URLEntity
	// Depth of the ParentURL
	Depth int
	Err   error
}

Result is what workers return in a channel.

type StatsCLIOutWriter

type StatsCLIOutWriter struct {
	// contains filtered or unexported fields
}

StatsCLIOutWriter keeps track of stats and writes up-to-date stats to a writer.

func NewStatsCLIOutWriter

func NewStatsCLIOutWriter(writer io.Writer, showErrors bool, totalWorkersCount int, depth int) *StatsCLIOutWriter

NewStatsCLIOutWriter returns a new StatsCLIOutWriter.

func (*StatsCLIOutWriter) AddErrorEntry

func (sm *StatsCLIOutWriter) AddErrorEntry(value string)

func (*StatsCLIOutWriter) AddLatencySample

func (sm *StatsCLIOutWriter) AddLatencySample(value time.Duration)

func (*StatsCLIOutWriter) IncDecDepth

func (sm *StatsCLIOutWriter) IncDecDepth(value int)

func (*StatsCLIOutWriter) IncDecErrorsCount

func (sm *StatsCLIOutWriter) IncDecErrorsCount(value int)

func (*StatsCLIOutWriter) IncDecLinksCount

func (sm *StatsCLIOutWriter) IncDecLinksCount(value int)

func (*StatsCLIOutWriter) IncDecLinksInQueue

func (sm *StatsCLIOutWriter) IncDecLinksInQueue(value int)

func (*StatsCLIOutWriter) IncDecTotalRequestsCount

func (sm *StatsCLIOutWriter) IncDecTotalRequestsCount(value int)

func (*StatsCLIOutWriter) IncDecWorkersRunning

func (sm *StatsCLIOutWriter) IncDecWorkersRunning(value int)

func (*StatsCLIOutWriter) RunOutputFlusher

func (sm *StatsCLIOutWriter) RunOutputFlusher()

RunOutputFlusher writes the updated stats to an io.Writer. Run this in a goroutine.
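
A minimal sketch of driving the stats writer by hand (normally the Crawler does this internally); the counts and durations below are arbitrary:

package main

import (
	"os"
	"time"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	// Arguments: writer, showErrors, total number of workers, max depth.
	sm := wcrawler.NewStatsCLIOutWriter(os.Stdout, false, 10, 5)

	// RunOutputFlusher keeps writing the current stats, so it gets its own goroutine.
	go sm.RunOutputFlusher()

	sm.SetAppState(wcrawler.AppState_Running)
	sm.IncDecWorkersRunning(10)
	sm.IncDecLinksCount(1)
	sm.AddLatencySample(120 * time.Millisecond)

	time.Sleep(time.Second)
	sm.SetAppState(wcrawler.AppState_Finished)
}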

func (*StatsCLIOutWriter) SetAppState

func (sm *StatsCLIOutWriter) SetAppState(state AppState)

func (*StatsCLIOutWriter) SetDepth

func (sm *StatsCLIOutWriter) SetDepth(value int)

func (*StatsCLIOutWriter) SetErrorsCount

func (sm *StatsCLIOutWriter) SetErrorsCount(value int)

func (*StatsCLIOutWriter) SetLinksCount

func (sm *StatsCLIOutWriter) SetLinksCount(value int)

func (*StatsCLIOutWriter) SetLinksInQueue

func (sm *StatsCLIOutWriter) SetLinksInQueue(value int)

func (*StatsCLIOutWriter) SetTotalRequestsCount

func (sm *StatsCLIOutWriter) SetTotalRequestsCount(value int)

func (*StatsCLIOutWriter) SetWorkersRunning

func (sm *StatsCLIOutWriter) SetWorkersRunning(value int)

type StatsManager

type StatsManager interface {
	SetAppState(state AppState)
	SetLinksInQueue(value int)
	IncDecLinksInQueue(value int)
	SetLinksCount(value int)
	IncDecLinksCount(value int)
	SetErrorsCount(value int)
	IncDecErrorsCount(value int)
	SetWorkersRunning(value int)
	IncDecWorkersRunning(value int)
	SetTotalRequestsCount(value int)
	IncDecTotalRequestsCount(value int)
	SetDepth(value int)
	IncDecDepth(value int)
	AddLatencySample(value time.Duration)
	RunOutputFlusher()
}

StatsManager represents a tracker of statistics related to the crawler. This interface is unfortunately quite big as it needs to support several operations on the statistics it keeps track of.

type Task

type Task struct {
	URL   string
	Depth int
}

Task is what gets sent to the channel for workers to pull data from the web.

type URLEntity

type URLEntity struct {
	// NetLoc represents the NetLoc portion of the URL
	NetLoc string
	// Raw represents the entire URL
	Raw string
}

URLEntity represents a URL.

func ExtractURL

func ExtractURL(rawURL string) (urlEntity URLEntity, err error)

ExtractURL takes any URL and returns a URLEntity containing only the scheme, authority and path, ready to be used as a parent URL.

func JoinURLs

func JoinURLs(baseURL string, rawURL string) (URLEntity, error)

JoinURLs behaves the same way as ExtractURL, except that it also includes query params. If the URL provided is relative, it is joined with the base URL. It returns an error if the URL is of an unwanted type, like 'mailto'.
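
For example (the URLs are arbitrary, and the exact normalized output is not shown, as it depends on the package's internals):

package main

import (
	"fmt"
	"log"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	// ExtractURL normalizes a URL down to scheme, authority and path.
	parent, err := wcrawler.ExtractURL("https://example.com/docs?page=2#intro")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(parent.Raw, parent.NetLoc)

	// JoinURLs resolves a relative link against a base URL, keeping query params.
	child, err := wcrawler.JoinURLs("https://example.com/docs", "../about?lang=en")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(child.Raw)
}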

type WebClient

type WebClient struct {
	// contains filtered or unexported fields
}

WebClient is responsible for connecting to links and managing connections to websites. It implements the Connector interface.

func NewWebClient

func NewWebClient(client *http.Client) *WebClient

NewWebClient returns a new WebClient.

func (*WebClient) GetLinks

func (c *WebClient) GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)

GetLinks returns all the links found in the webpage.
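
A quick sketch of using WebClient on its own, assuming the root package import path:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gustavooferreira/wcrawler"
)

func main() {
	wc := wcrawler.NewWebClient(&http.Client{Timeout: 10 * time.Second})

	statusCode, links, latency, err := wc.GetLinks("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("status: %d, latency: %s\n", statusCode, latency)
	for _, l := range links {
		fmt.Println(l.Raw)
	}
}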

Directories

Path Synopsis
cmd
internal
ring
Package ring provides an implementation of a ring buffer containing strings.
