sqrape

package module
Version: v0.0.0-...-679141b Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 21, 2016 License: AGPL-3.0 Imports: 8 Imported by: 1

README

Sqrape - Simple Query Scraping with CSS and Go Reflection

by Cathal Garvey, ©2016, Released under the GNU AGPLv3

Go Report Card

What

When scraping web content, one usually hopes that the content is laid out logically, and that proper or at least consistent web annotation exists. This means well-nested HTML, appropriate use of tags, descriptive CSS classes and unique CSS IDs. Ideally it also means that a given CSS selector will yield a consistent datatype, also.

In such cases, it's possible to define exactly what you want using only CSS and a type. For a scraping job, then, it would be ideal to just make a struct defining the content you want, and to scrape a page directly from that, right?

So, something like this:

type Tweet struct {
	Author  string `csss:"div.original-tweet;attr=data-screen-name"`
	TweetID int64  `csss:"div.original-tweet;attr=data-tweet-id"`
	Content string `csss:"p.js-tweet-text;text"`
}

type TwitterProfile struct {
	Tweets []Tweet `csss:"li.js-stream-item;obj"`
}

func main() {
	resp, _ := http.Get("https://twitter.com/onetruecathal")
	tp := new(TwitterProfile)
	csstostruct.ExtractHTMLReader(resp.Body, tp)
	for _, tweet := range tp.Tweets {
		fmt.Printf("@%s: %s\n", tweet.Author, tweet.Content)
	}
}

..well that's Sqrape. In fact, see examples/tweetgrab.go for the above as a CLI tool.

Note; that struct tag is csss, not css. It's "css selector", because I didn't want to clobber any preexisting css struct tag libs that may exist.

How?

Basics

Sqrape uses struct tags to figure out how to access and extract data. These tags consist of two portions; a CSS selector, and a data extractor, separated by a semicolon.. The former are an exercise for the reader and are well documented. CSS selectors are passed to goquery, under the hood, so consult docs there if in doubt.

One difference from goquery: Empty selectors are OK, and indicate "extract data from the entire selection"; these are more commonly useful for embedded structs or slices, where the passed data may be ready for extraction and require no further CSS searching.

The second portion simply indicates what part or form of the selected data is desired, and can take four forms, three of which are trivial:

  • text: The text contents of matched data are returned.
  • html: The HTML contents of matched data are returned
  • attr=<attribute name>: Extract the value of an attribute on the matched selection.
  • obj: This indicates a struct or array field that is parsed recursively.

Therefore, to extract the data-foo value from a div, use csss:"div[data-foo];attr=data-foo": this selects any div with a data-foo attribute, and extracts the value of that attribute.

To extract values other than strings, simply set the field type in your struct to the desired type; this magic is handled by mapstructure! So, if data-foo is a number, then the field the above tag annotates can be an int or int64.

If your field is a singleton, then the first value will be extracted in the case of attributes, and the concatenation of all values in the case of text or HTML. If your field is a slice, then the values will be added iteratively from the goquery selection.

If your field is a struct or slice of structs, then the extractor portion of the tag should be obj, to indicate that parsing data from extracted structs should be deferred to the embedded struct fields. See the Twitter example, above.

More Advanced: Optional Methods

Sometimes a datatype needs to be filled from multiple sources, or has fields that should only be filled under certain other conditions, or should have conditional or context-aware behaviour... for this, you can define optional methods that alter Sqrape's behaviour and allow you to selectively fill fields, or to perform post-processing or post-scrape data validation on your struct.

The methods supported so far include:

  • SqrapeFieldSelect(fieldName string, context...interface{}) (doField bool, cancelScrape error)
  • SqrapePostFlight(context... interface{}) error

The context argument in either case is a variadic list of arbitrary datatypes which are passed by you to the entrypoint functions when operating a scrape.

So, for example, you could implement multi-page scraping by passing the current URL to your scrape and defining a SqrapeFieldSelect method that fills fields only for relevant URLs.

Or, you could perform data validation on your expected output with a SqrapePostFlight method, either with hardcoded regex/validation or by passing per-job regex or callbacks. Any error you raise in PostFlight will be returned from the job to you.

What's Supported?

Nested structs and array fields containing either basic values or struct values. This means that, aside from map-fields, most stuff should just work. File an Issue for cases that fail and I'll try to get it working.

Take a look at the test cases for an example of what works well. Feel free to volunteer test cases.

What's Not Supported?

Pointer fields! If your field has a nested struct as a pointer, right now it will crash, and for reasons unknown to me you'll get no informative error while panic-catching is enabled in the entrypoint functions. I'm working on a fix that will initially just abort informatively on pointer fields, and later will work.

Credits Reel

Obviously, GoQuery deserves a huge slice of the credit for this.

A lot of the magic behind field-filling is thanks to mapstructure, which handles "weakly typed" field-filling for structs.

There's a lot of reflective magic in this code; right now that's predictably messy and due re-writing in pure reflect code. Meanwhile, thanks to structs and reflections for tiding me over this much of the project, by offering handy higher-level abstractions for reflect.

Reflection may give you the shivers; you're right, this code is potentially explosive right now! Caveat emptor. However, the entry point functions do have a blanket-recover deferred, so this code shouldn't panic, merely return an error on panicky behaviour. Please report any panic errors you encounter, to help me make this more stable.

Why?

I scrape content a lot. Weekly, sometimes daily, as part of my job or for personal research. Web scraping is just another way of consuming web content! I do most of my scraping in the IPython shell, but for something "important" I'll write something more permanent and use that whenever the need arises.

For this, one typically uses a scraping framework. But, permanence has disadvantages. If your scraping framework requires a lot of overhead for very basic tasks, then that means the maintenance burden when things change is also high.

I wanted something where creating and maintaining a scraper could be trivial, a matter of just defining the data I want and mapping it to the HTML. If or when the HTML changes, then I only need to change the datatypes or CSS rules and get back to using the data.

Documentation

Overview

Package sqrape provides a way to fill struct objects from raw HTML using CSS struct tags.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractHTMLReader

func ExtractHTMLReader(reader io.Reader, dest interface{}, context ...interface{}) (err error)

ExtractHTMLReader provides an entry point for parsing a HTML document in reader-form into a destination struct.

func ExtractHTMLString

func ExtractHTMLString(document string, dest interface{}, context ...interface{}) error

ExtractHTMLString provides an entry point for parsing a HTML document

Types

type FieldSelecter

type FieldSelecter interface {
	SqrapeFieldSelect(fieldName string, context ...interface{}) (bool, error)
}

FieldSelecter is an optional method set; if defined, then prior to collecting data for a field from the scraped content this method will be called; it should return (true, nil) to fill that field, (false, nil) to skip that field, and (_, error) to cancel scraping. The context argument is completely user-defined, and is not Sqrape's business. It is passed through from the variadic optional argument to Sqrape's entry functions, and is exactly as passed to those functions.

type PostFlighter

type PostFlighter interface {
	SqrapePostFlight(context ...interface{}) error
}

PostFlighter is an optional method for performing post-scrape operations on a struct. For example, scraped data might pull a set of data into one field, and a postflight operation might summarise those data and set another field. A more specific example might be a scraper that harvests view, favourite, and repost counts for social media posts, and then post-flight summarises the three counts as "interactions". The context arguments are exactly as passed to Sqrape's entry functions, and are there for user-defined behaviours. This method will only be called for finalised objects, so if some error or behaviour cancels a scrape this method will not be called.

Source Files

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL