skrapa

command module v1.0.1-0...-86d6993

Published: Dec 3, 2018 License: MIT

README
Skrapa: Web Scraping Utility

Skrapa is a web scraping tool designed to be as easy as possible for non-technical users. It combines the powerful Colly scraping library with a simple configuration format: write out a pipeline of commands that instructs Skrapa to follow links and collect data from pages.

To use Skrapa, download the latest release (macOS) and create a configuration file for it to follow. Check out the examples folder for inspiration.

Run Skrapa from the command line:

$ skrapa --help

$ skrapa collect examples/github_stars.toml

$ skrapa export json github_stars.db

$ skrapa export csv github_stars.db

Skrapa Configuration Documentation

Skrapa configuration is in TOML format. It has two primary parts, the main configuration block and the pipeline. The main block tells Skrapa what URL to scrape and where to save data. The pipeline is a repeatable configuration block that consists of commands for Skrapa to follow.

# primary configuration block
[main]
url = "https://example.com" # the url to scrape
user_agent = "Skrapa" # the user agent sent to websites
allowed_domains = ["example.com"] # restrict any follow actions to these domains
delay = 1 # introduce a delay in seconds

# multiple pipeline blocks instruct Skrapa what to do
# there are currently two types of action: follow and collect

[[pipeline]] # Follow example
selector = "a.link-class" # the 'selector' field allows Skrapa to use css selectors to find elements
action = "follow" # the 'action' field tells Skrapa what action to perform, in this case, follow a link
attr = "href" # the 'attr' field tells Skrapa which attribute of this element to use as a url to follow
visit_once = true # the 'visit_once' field prevents looping pipelines: if the link you are following could appear again on subsequent pages, this flag instructs Skrapa to visit a given URL only once

[[pipeline]] # Collect example
selector = "span.title"
action = "collect" # the collect action tells Skrapa this is data we want to save
column = "title" # the 'column' field tells Skrapa what column/field we should save this data under
attr = "text" # the 'attr' field tells Skrapa which attribute of this element we want to save

[[pipeline]] # add more pipeline blocks as needed...
selector = "span.name"
action = "collect"
column = "name"
attr = "text"
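Putting the pieces together, a complete configuration might look like the sketch below. The URL, selectors, and column names here are hypothetical, made up for illustration; they are not from the shipped examples.

```toml
# hypothetical configuration: collect book titles and prices from a
# listing site, following pagination links within the same domain

[main]
url = "https://books.example.com"       # starting page (hypothetical)
user_agent = "Skrapa"
allowed_domains = ["books.example.com"] # never follow links off this domain
delay = 1                               # wait 1 second between requests

[[pipeline]] # follow each "next page" link, visiting each URL only once
selector = "a.next"
action = "follow"
attr = "href"
visit_once = true

[[pipeline]] # save each book title under the 'title' column
selector = "h3.book-title"
action = "collect"
column = "title"
attr = "text"

[[pipeline]] # save each price under the 'price' column
selector = "p.price"
action = "collect"
column = "price"
attr = "text"
```

Judging from the commands shown above, saving this as books.toml and running `skrapa collect books.toml` would crawl the site, and the collected rows could then be exported with `skrapa export csv books.db`.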
