scraper

command module
v0.3.0 Latest
Published: Dec 21, 2023 License: MIT Imports: 8 Imported by: 0

README

scraper

A dual-interface Go module for building simple web scrapers

Features
  • Go struct-tag interface
  • Command-line interface
    • HTML⇒JSON API server
    • Single binary
    • Simple configuration
    • Zero-downtime config reload with kill -s SIGHUP <scraper-pid>
Install

Binaries

See the latest release or download it with this one-liner: curl https://i.jpillora.com/scraper | bash

Source

$ go get -v github.com/jpillora/scraper
Go Example
package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}

	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}

	g := google{Query: "hello world"}

	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}

	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/
CLI Example

Given google.json

{
  "/search": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": "http://www.sanrio.com/"
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": "http://en.wikipedia.org/wiki/Hello_Kitty"
  },
  ...
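A client can decode this response with the standard library. The following is a minimal sketch, assuming the server returns a JSON array of objects with the title and url fields defined in google.json; searchResult and parseResults are hypothetical names introduced for illustration and are not part of scraper. The sample body is hard-coded so the program runs without a server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// searchResult mirrors the "result" object configured in google.json.
// (Hypothetical type, not provided by the scraper package.)
type searchResult struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

// parseResults decodes a JSON array of results, as returned by /search.
func parseResults(body []byte) ([]searchResult, error) {
	var results []searchResult
	if err := json.Unmarshal(body, &results); err != nil {
		return nil, err
	}
	return results, nil
}

func main() {
	// Sample body in the shape shown above. In practice it would be read
	// from the response to http://localhost:3000/search?query=hellokitty.
	body := []byte(`[{"title":"Hello Kitty - Wikipedia, the free encyclopedia","url":"http://en.wikipedia.org/wiki/Hello_Kitty"}]`)

	results, err := parseResults(body)
	if err != nil {
		log.Fatal(err)
	}
	for i, r := range results {
		fmt.Printf("#%d: %q => %s\n", i+1, r.Title, r.URL)
	}
}
```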
JSON API
{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}
  • <path> - Required. The path of the scraper endpoint
    • Accessible at http://<host>:<port>/<path>
    • You may define path variables, for example my/path/:var; a request to /my/path/foo then sets :var = "foo"
  • <url> - Required. The URL of the remote server to scrape
    • It may contain template variables in the form {{ var }}. scraper first looks for a path variable named var and, if none is found, falls back to a query parameter named var
  • result - Required. Represents the resulting JSON object, produced by executing each <extractor> on the current DOM context. A field may use a sequence of <extractor>s to perform more complex queries.
  • <method> - Optional. The HTTP request method (defaults to GET)
  • <extractor> - A string which must be one of:
    • a regex in the form /abc/ - searches the text of the current DOM context (extracts the first group when provided).
    • a regex in the form s/abc/xyz/ - searches the text of the current DOM context and replaces matches with the provided text (sed-like syntax).
    • an attribute in the form @abc - gets the attribute abc from the DOM context.
    • a function in the form html() - gets the DOM context as a string.
    • a function in the form trim() - trims whitespace from the beginning and the end of the string.
    • a query param in the form query-param(abc) - parses the current context as a URL and extracts the provided param.
    • a CSS selector abc (any string not in the forms above) - alters the DOM context.
  • list - Optional. A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
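To make the extractor forms more concrete, here is a rough sketch of how the text-only forms (/abc/, s/abc/xyz/, trim() and query-param(abc)) could be applied in sequence using only the Go standard library. It is not scraper's implementation: applyTextExtractor and runExtractors are hypothetical helpers, the DOM-based forms (@abc, html() and CSS selectors) are omitted, and replacement text follows Go's regexp syntax ($1) rather than sed's.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// applyTextExtractor applies one text-only extractor to in.
// Hypothetical helper; DOM-based extractors are not handled here.
func applyTextExtractor(ext, in string) (string, error) {
	switch {
	case ext == "trim()":
		return strings.TrimSpace(in), nil
	case strings.HasPrefix(ext, "query-param(") && strings.HasSuffix(ext, ")"):
		name := strings.TrimSuffix(strings.TrimPrefix(ext, "query-param("), ")")
		u, err := url.Parse(in)
		if err != nil {
			return "", err
		}
		return u.Query().Get(name), nil
	case len(ext) > 3 && strings.HasPrefix(ext, "s/") && strings.HasSuffix(ext, "/"):
		// s/pattern/replacement/ (assumes no "/" inside the pattern).
		parts := strings.SplitN(ext[2:len(ext)-1], "/", 2)
		if len(parts) != 2 {
			return "", fmt.Errorf("malformed substitution %q", ext)
		}
		re, err := regexp.Compile(parts[0])
		if err != nil {
			return "", err
		}
		return re.ReplaceAllString(in, parts[1]), nil
	case len(ext) >= 2 && strings.HasPrefix(ext, "/") && strings.HasSuffix(ext, "/"):
		re, err := regexp.Compile(ext[1 : len(ext)-1])
		if err != nil {
			return "", err
		}
		m := re.FindStringSubmatch(in)
		switch {
		case m == nil:
			return "", nil // no match
		case len(m) > 1:
			return m[1], nil // first capture group
		default:
			return m[0], nil // whole match
		}
	}
	return "", fmt.Errorf("unsupported extractor %q", ext)
}

// runExtractors feeds the output of each extractor into the next one.
func runExtractors(in string, exts ...string) (string, error) {
	var err error
	for _, ext := range exts {
		if in, err = applyTextExtractor(ext, in); err != nil {
			return "", err
		}
	}
	return in, nil
}

func main() {
	out, err := runExtractors("  /url?q=https://example.com/&sa=U  ",
		"trim()", "query-param(q)", "/://([^/]+)/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```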
Go API

Replace <variable> with your configuration, documented above.

  1. Define your endpoint struct:
type endpoint struct {
  Method string   `scraper:"<method>"`
  URL    string   `scraper:"<url>"`
  Result []result `scraper:"<list>"`
  <param>  string `scraper:"<param>"`
}

Method, URL, Result and Debug are special fields; the remaining string fields are treated as input parameters. By default, an input parameter is named after its field with the first character lowercased.
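The naming rule can be illustrated with a small reflection sketch. This is not scraper's code: paramNames and lowerFirst are hypothetical helpers, and the idea that a non-empty scraper tag overrides the default name is an assumption inferred from the phrase "by default" and from the Query field in the Go example.

```go
package main

import (
	"fmt"
	"reflect"
	"unicode"
	"unicode/utf8"
)

// endpoint is a sample struct for illustration only.
type endpoint struct {
	Method  string
	URL     string `scraper:"https://example.com/?q={{myParam}}&p={{p}}"`
	MyParam string
	Page    string `scraper:"p"`
	Debug   bool
}

// lowerFirst lowercases the first character of s.
func lowerFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if size == 0 {
		return s
	}
	return string(unicode.ToLower(r)) + s[size:]
}

// paramNames lists the input parameter names of a struct (or pointer to one):
// every string field except the special ones. A non-empty tag is assumed to
// override the default lowercased field name.
func paramNames(v interface{}) []string {
	special := map[string]bool{"Method": true, "URL": true, "Result": true, "Debug": true}
	t := reflect.TypeOf(v)
	if t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	var names []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if special[f.Name] || f.Type.Kind() != reflect.String {
			continue
		}
		name := f.Tag.Get("scraper")
		if name == "" {
			name = lowerFirst(f.Name)
		}
		names = append(names, name)
	}
	return names
}

func main() {
	fmt.Println(paramNames(&endpoint{}))
}
```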

  2. Define your result struct:
type result struct {
  <field> string `scraper:"<extractor>"`
  <field> string `scraper:"<extractor> | <extractor>"`
}

The result struct defines field-to-extractor mappings. All fields must be strings. Struct tags cannot contain arrays, so multiple extractors are joined with | instead.
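As an illustration of the | convention, the sketch below reads the scraper tag from each field and splits it into an ordered list of extractors. splitExtractors is a hypothetical helper, not scraper's parser, and the naive split assumes that no extractor (for example a regex) contains a literal | character.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// result is the struct from the Go example above.
type result struct {
	Title string `scraper:"h3 span"`
	URL   string `scraper:"a[href] | @href"`
}

// splitExtractors splits a tag value on "|" and trims each part.
// Hypothetical helper; it does not handle "|" inside an extractor.
func splitExtractors(tag string) []string {
	var out []string
	for _, part := range strings.Split(tag, "|") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	t := reflect.TypeOf(result{})
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		fmt.Printf("%s: %q\n", f.Name, splitExtractors(f.Tag.Get("scraper")))
	}
}
```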

  3. Execute it:
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
  ...
}
// e.Result is now set