creeper

Version: v0.0.0-...-eb1753d
Published: May 16, 2017 License: Apache-2.0 Imports: 11 Imported by: 1

README

Creeper

About

Creeper is a next-generation crawler that fetches web pages according to a creeper script. It is a cross-platform, embeddable crawler that you can use in a news app, a subscription program, and similar projects.

Warning: This project is still at an early stage of development. Please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	// Load and parse the creeper script.
	c := creeper.Open("./hacker_news.crs")
	// Visit every item of the news[] array node and print its fields.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression that stores a mutable or immutable string. Most of the time it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need a town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed in this call. That is because it is a special parameter.

In a town definition line, an expression such as name="something" means that the parameter name has the default value "something".

Incidentally, @page is a parameter that is incremented automatically when the current page has no more content.
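
The way a town resolves its parameters can be pictured with a small Go sketch. This is not creeper's implementation: expandTown and its map arguments are hypothetical names, and the sketch only models the idea described above, namely that defaults from the definition line are overridden by call arguments and then substituted into the {name} placeholders. It does no URL escaping, and a parameter with neither a default nor an argument is left as a literal placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// expandTown is a hypothetical helper, not part of creeper's API.
// It fills every {name} placeholder in tmpl, preferring a value from args
// and falling back to the default declared in the town definition.
func expandTown(tmpl string, defaults, args map[string]string) string {
	merged := make(map[string]string, len(defaults)+len(args))
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range args {
		merged[k] = v // call arguments win over defaults
	}

	// Build "{name}", value pairs and substitute them in a single pass.
	pairs := make([]string, 0, 2*len(merged))
	for k, v := range merged {
		pairs = append(pairs, "{"+k+"}", v)
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	tmpl := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	defaults := map[string]string{"@page": "1"}

	// Only ext is passed; @page falls back to its default.
	fmt.Println(expandTown(tmpl, defaults, map[string]string{"ext": "Hello"}))
	// prints: https://news.ycombinator.com/news?p=1&ext=Hello

	// Moving to the next page amounts to supplying a larger @page value.
	fmt.Println(expandTown(tmpl, defaults, map[string]string{"@page": "2", "ext": "Hello"}))
	// prints: https://news.ycombinator.com/news?p=2&ext=Hello
}
```

In creeper itself the @page value is advanced by the library, so a script author never passes it by hand; the second call above only imitates that step.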

Node

Nodes form a tree structure that represents the data structure you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

As in YAML, the hierarchy of nodes is determined by indentation.
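
To illustrate how indentation can be turned into a hierarchy, here is a minimal Go sketch. It is an assumption about the general technique, not creeper's parser: entry, indentOf and buildTree are hypothetical names, the input is expected to contain node lines only (no town definitions), and tabs and spaces each count as one indentation unit.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical record for one node line.
type entry struct {
	Name   string
	Indent int
	Parent int // index of the parent entry, or -1 for a top-level node
}

// indentOf counts the leading spaces and tabs of a line.
func indentOf(line string) int {
	return len(line) - len(strings.TrimLeft(line, " \t"))
}

// buildTree assigns each non-empty line a parent based on indentation,
// using a stack of the currently open ancestors.
func buildTree(lines []string) []entry {
	var out []entry
	var stack []int // indices into out
	for _, ln := range lines {
		if strings.TrimSpace(ln) == "" {
			continue
		}
		ind := indentOf(ln)
		// Close every ancestor that is indented at least as deep as this line.
		for len(stack) > 0 && out[stack[len(stack)-1]].Indent >= ind {
			stack = stack[:len(stack)-1]
		}
		parent := -1
		if len(stack) > 0 {
			parent = stack[len(stack)-1]
		}
		// The node name is everything before the first colon.
		name := strings.TrimSpace(strings.SplitN(ln, ":", 2)[0])
		out = append(out, entry{Name: name, Indent: ind, Parent: parent})
		stack = append(stack, len(out)-1)
	}
	return out
}

func main() {
	lines := []string{
		`news[]: page -> $("tr.athing")`,
		"\t" + `title: $(".title a.storylink").text`,
		"\t" + `site: $(".title span.sitestr").text`,
		"\t" + `link: $(".title a.storylink").href`,
	}
	for i, e := range buildTree(lines) {
		fmt.Printf("%d: name=%s parent=%d\n", i, e.Name, e.Parent)
	}
}
```

Here title, site and link all report parent 0, the news[] line, because they are indented one level deeper than it.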

Node Name

Every node has a name. title is a field name, which represents a plain string value. news[] is an array name, which represents a parent structure with multiple child items.

Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference.

Field reference is an advanced usage of Node; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun goes on the right, that is: page -> fun
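
The layout of a node line (name, optional [] suffix, optional page, then fun) can be sketched in Go as follows. This is a simplified illustration, not creeper's ParseNode: nodeLine and splitNodeLine are hypothetical names, a body without -> is assumed to be a fun, and a -> that appears inside a quoted string is not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeLine is a hypothetical view of one parsed node line.
type nodeLine struct {
	Name    string
	IsArray bool
	Page    string
	Fun     string
}

// splitNodeLine breaks "name[]: page -> fun" into its parts.
func splitNodeLine(line string) nodeLine {
	var n nodeLine
	parts := strings.SplitN(strings.TrimSpace(line), ":", 2)
	n.Name = strings.TrimSpace(parts[0])
	if strings.HasSuffix(n.Name, "[]") {
		n.IsArray = true
		n.Name = strings.TrimSuffix(n.Name, "[]")
	}
	if len(parts) < 2 {
		return n
	}
	body := strings.TrimSpace(parts[1])
	if i := strings.Index(body, "->"); i >= 0 {
		// Page on the left of the arrow, fun on the right.
		n.Page = strings.TrimSpace(body[:i])
		n.Fun = strings.TrimSpace(body[i+2:])
	} else {
		// Simplification: without an arrow, treat the whole body as a fun.
		n.Fun = body
	}
	return n
}

func main() {
	fmt.Printf("%+v\n", splitNodeLine(`news[]: page(ext="Hello World!") -> $("tr.athing")`))
	fmt.Printf("%+v\n", splitNodeLine(`title: $(".title a.storylink").text`))
}
```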

Fun

A fun describes how the fetched data is processed.

The supported funs are:

Name       Parameters                          Description
$          (selector: string)                  Relative CSS selector (select from parent node)
$root      (selector: string)                  Absolute CSS selector (select from body)
html                                           inner HTML
text                                           inner text
outerHTML                                      outer HTML
attr       (attr: string)                      attribute value
style                                          style attribute value
href                                           href attribute value
src                                            src attribute value
class                                          class attribute value
id                                             id attribute value
calc       (prec: int)                         calculate arithmetic expression
match      (regexp: string)                    match first sub-string via regular expression
expand     (regexp: string, target: string)    expand matched strings to target string
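
The match and expand rows can be made concrete with Go's standard regexp package. The sketch below shows one plausible reading of those two descriptions and is not taken from creeper's source: matchFirst and expandMatch are hypothetical helpers, and the $1-style group references in target are an assumption borrowed from regexp's own template syntax.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s matched by pattern,
// or "" when nothing matches.
func matchFirst(pattern, s string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	return re.FindString(s), nil
}

// expandMatch finds the first match of pattern in s and renders target,
// where $1, $2, ... stand for the capture groups of that match.
func expandMatch(pattern, target, s string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	loc := re.FindStringSubmatchIndex(s)
	if loc == nil {
		return "", nil
	}
	return string(re.ExpandString(nil, target, s, loc)), nil
}

func main() {
	m, _ := matchFirst(`\d+`, "42 points by foo")
	fmt.Println(m) // prints: 42

	e, _ := expandMatch(`(\d+) points`, "score=$1", "42 points by foo")
	fmt.Println(e) // prints: score=42
}
```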

Author

Plutonist

impl.moe · Github @wspl

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MD5

func MD5(s string) string

func PowerfulFind

func PowerfulFind(s *goquery.Selection, q string) *goquery.Selection

Types

type Creeper

type Creeper struct {
	Nodes []*Node
	Towns []*Town

	CacheGet func(string) (string, bool)
	CacheSet func(string, string)

	Node *Node
}
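
The CacheGet and CacheSet fields have the signatures func(string) (string, bool) and func(string, string). A minimal in-memory store with matching method signatures might look like the sketch below. memCache and newMemCache are hypothetical names, and what creeper passes as key and value (for example a URL and a page body) is an assumption that this page does not confirm.

```go
package main

import (
	"fmt"
	"sync"
)

// memCache is a hypothetical in-memory store whose Get and Set methods
// match the signatures of the CacheGet and CacheSet fields.
type memCache struct {
	mu   sync.RWMutex
	data map[string]string
}

func newMemCache() *memCache {
	return &memCache{data: make(map[string]string)}
}

// Get returns the stored value and whether the key was present.
func (m *memCache) Get(key string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

// Set stores value under key, replacing any previous value.
func (m *memCache) Set(key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

func main() {
	cache := newMemCache()

	// With creeper imported, the method values could be assigned directly:
	//   c.CacheGet = cache.Get
	//   c.CacheSet = cache.Set
	// Here they are bound to plain variables of the same function types.
	var get func(string) (string, bool) = cache.Get
	var set func(string, string) = cache.Set

	set("https://example.com/", "<html></html>")
	body, ok := get("https://example.com/")
	fmt.Println(body, ok)
}
```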

func New

func New(raw string) *Creeper

func NewByFormatted

func NewByFormatted(f *Formatted) *Creeper

func Open

func Open(path string) *Creeper

func (*Creeper) Array

func (c *Creeper) Array(key string) *Creeper

func (*Creeper) Each

func (c *Creeper) Each(cle func(*Creeper))

func (*Creeper) Next

func (c *Creeper) Next() *Creeper

func (*Creeper) String

func (c *Creeper) String(key string) string

func (*Creeper) StringE

func (c *Creeper) StringE(key string) (string, error)

type Formatted

type Formatted struct {
	Raw string

	Towns []*Town
	Nodes []*Node
}

func Formatting

func Formatting(s string) *Formatted

type Fun

type Fun struct {
	Raw  string
	Node *Node

	Name   string
	Params []string

	Document  *goquery.Document
	Selection *goquery.Selection
	Result    string

	TempSkip bool

	BundleSize int

	PrevFun *Fun
	NextFun *Fun
}

func ParseFun

func ParseFun(n *Node, s string) *Fun

func (*Fun) Append

func (f *Fun) Append(s string) (*Fun, *Fun)

func (*Fun) InitSelector

func (f *Fun) InitSelector(root bool) error

func (*Fun) Invoke

func (f *Fun) Invoke() (string, error)

func (*Fun) PageBody

func (f *Fun) PageBody() (*goquery.Document, error)

type MonoStack

type MonoStack struct {
	// contains filtered or unexported fields
}

func (*MonoStack) Has

func (o *MonoStack) Has() bool

func (*MonoStack) Set

func (o *MonoStack) Set(s string)

func (*MonoStack) Value

func (o *MonoStack) Value() string

type Node

type Node struct {
	Raw     string
	Creeper *Creeper

	Name      string
	IsArray   bool
	IsPrimary bool
	IndentLen int

	Page *Page
	Fun  *Fun

	Index int

	Sn map[int]string

	PrevNode       *Node
	NextNode       *Node
	ParentNode     *Node
	FirstChildNode *Node
	LastChildNode  *Node
}

func ParseNode

func ParseNode(ln []string) []*Node

func (*Node) ChildFilter

func (n *Node) ChildFilter(cb func(*Node) bool) *Node

func (*Node) Filter

func (n *Node) Filter(cb func(*Node) bool) []*Node

func (*Node) Next

func (n *Node) Next()

func (*Node) NextDirectorNode

func (n *Node) NextDirectorNode() *Node

func (*Node) PrimaryNode

func (n *Node) PrimaryNode() *Node

func (*Node) Reset

func (n *Node) Reset()

func (*Node) Search

func (n *Node) Search(name string) *Node

func (*Node) SearchFlatScope

func (n *Node) SearchFlatScope(name string) *Node

func (*Node) SearchRef

func (n *Node) SearchRef(name string) *Node

func (*Node) Value

func (n *Node) Value() (string, error)

type Page

type Page struct {
	Raw  string
	Node *Node

	Town *Town
	Ref  string

	NextMode       bool
	NextUrl        string
	NextPendingUrl string
	NextReady      bool
	NextNoMore     bool

	Index int
}

func ParsePage

func ParsePage(n *Node, s string) *Page

func (*Page) Body

func (p *Page) Body() (string, error)

func (*Page) Inc

func (p *Page) Inc()

func (*Page) IsDynamic

func (p *Page) IsDynamic() bool

func (*Page) Url

func (p *Page) Url() (string, error)

type Town

type Town struct {
	Raw     string
	Creeper *Creeper
	Node    *Node

	Name     string
	Params   map[string]string
	Template string
}

func ParseTown

func ParseTown(ln []string) []*Town

func ParseTownLine

func ParseTownLine(l string) *Town

func Town_New

func Town_New() *Town

func (*Town) Attach

func (t *Town) Attach() bool

func (*Town) Get

func (t *Town) Get(k string) (string, bool)

func (*Town) HasParam

func (t *Town) HasParam(k string) bool

func (*Town) PreSet

func (t *Town) PreSet(k string) bool

func (*Town) Set

func (t *Town) Set(k string, v string) bool

func (*Town) Value

func (t *Town) Value() string
