go_spider

A crawler for vertical communities, written in Go.

Latest stable release: Version 1.2 (Sep 23, 2014).

  • go_spider discussion group, QQ group number: 337344607

Features

  • Concurrent
  • Suited to vertical communities
  • Flexible and modular
  • Native Go implementation
  • Easy to extend into a customized crawler

Requirements

  • Go 1.2 or higher

Documentation

Chinese documentation and FAQ.

Installation

go get github.com/hu17889/go_spider
go get github.com/PuerkitoBio/goquery
go get github.com/bitly/go-simplejson
go get golang.org/x/net/html/charset

This project depends on goquery and simplejson.

Users in China can download the packages from http://gopm.io/.

Usage example

Here is an example that crawls GitHub content; give the crawl process a try:

  • go install github.com/hu17889/go_spider/example/github_repo_page_processor
  • ./bin/github_repo_page_processor

More examples here: examples.

Make your spider

    // Spider input:
    //   a PageProcesser;
    //   a task name, used in the Pipeline for record keeping.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // print results to the screen
        SetThreadnum(3).                                               // crawl requests with three goroutines
        Run()
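
For reference, here is what a minimal complete program built around this call could look like. This is a sketch rather than verbatim project code: the import paths follow the repository layout in the Directories section, and the exact PageProcesser interface (a Process method plus a Finish hook) is an assumption to verify against the examples.

    package main

    import (
        "github.com/hu17889/go_spider/core/common/page"
        "github.com/hu17889/go_spider/core/pipeline"
        "github.com/hu17889/go_spider/core/spider"
    )

    // MyPageProcesser fills the PageProcesser role described below.
    // The two-method shape (Process, Finish) is assumed from the examples.
    type MyPageProcesser struct{}

    func NewMyPageProcesser() *MyPageProcesser {
        return &MyPageProcesser{}
    }

    // Process parses one downloaded page; see the PageProcesser module below.
    func (pp *MyPageProcesser) Process(p *page.Page) {
        // Extract data here, e.g. with p.GetHtmlParser() and p.AddField(...).
    }

    // Finish runs once after the crawl completes.
    func (pp *MyPageProcesser) Finish() {}

    func main() {
        spider.NewSpider(NewMyPageProcesser(), "TaskName").
            AddUrl("https://github.com/hu17889?tab=repositories", "html").
            AddPipeline(pipeline.NewPipelineConsole()).
            SetThreadnum(3).
            Run()
    }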
  • Use the default modules:

      • Downloader: HttpDownloader
      • Scheduler: QueueScheduler
      • Pipeline: PipelineConsole, PipelineFile

  • Use your own modules: just copy a default module and modify it!

If you make a Downloader module, you can use it by Spider.SetDownloader(your_downloader).

If you make a Pipeline module, you can use it by Spider.AddPipeline(your_pipeline).

If you make a Scheduler module, you can use it by Spider.SetScheduler(your_scheduler).
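
Putting those three setters together, here is a hedged sketch of wiring custom modules into a spider. YourDownloader, YourScheduler, and YourPipeline (and their constructors) are hypothetical names for your own implementations:

    // All names starting with "Your" are hypothetical placeholders for
    // your own implementations of the corresponding module interfaces.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html").
        SetDownloader(NewYourDownloader()). // replaces the default HttpDownloader
        SetScheduler(NewYourScheduler()).   // replaces the default QueueScheduler
        AddPipeline(NewYourPipeline()).     // adds to, rather than replaces, the pipelines
        Run()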

Extensions

The extension folder contains modules and other tools shared by contributors. You are welcome to submit your own code, provided it is free of bugs.

Modules

Spider

Summary: crawler initialization, concurrency management, default modules, module management, and configuration.

Functions:

  • Crawler startup functions: Get, GetAll, Run
  • Add requests: AddUrl, AddUrls, AddRequest, AddRequests
  • Set main modules: AddPipeline (several pipeline modules may be added), SetScheduler, SetDownloader
  • Set config: SetExitWhenComplete, SetThreadnum (concurrency level), SetSleepTime (sleep time after each crawl); see the sketch after this list
  • Monitor: OpenFileLog, OpenFileLogDefault (enable file logging via the mlog package), CloseFileLog, OpenStrace (print tracing info to stderr), CloseStrace
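
A hedged sketch combining several of these setters; the names come from the list above, but the exact parameter types and signatures are assumptions to check against the package documentation:

    // A minimal configuration sketch; signatures are assumed, not verified.
    s := spider.NewSpider(NewMyPageProcesser(), "TaskName")
    s.AddUrl("https://github.com/hu17889?tab=repositories", "html")
    s.SetThreadnum(3)           // number of concurrent crawl goroutines
    s.SetExitWhenComplete(true) // stop once the request queue drains (assumed bool arg)
    s.OpenStrace()              // print tracing info to stderr (assumed to take no args)
    s.Run()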
Downloader

Summary: the Spider fetches from the Scheduler a Request holding a URL to crawl; the Downloader then downloads the result (html, json, jsonp, or text) for that Request. The result is saved in a Page for parsing in the PageProcesser. HTML parsing is based on the goquery package and JSON parsing on the simplejson package; JSONP is converted to JSON, and the text type carries plain text content that is not parsed.

Functions:

  • Download: download the content of the crawl target; see the sketch below. The result contains the data body, headers, cookies, and request info.
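
A custom Downloader plugs in via Spider.SetDownloader. Below is a hedged sketch of a decorator around the default HttpDownloader that logs every URL; the single-method interface shape, the NewHttpDownloader constructor, and req.GetUrl() are assumptions to verify against the downloader and request packages.

    import (
        "fmt"

        "github.com/hu17889/go_spider/core/common/page"
        "github.com/hu17889/go_spider/core/common/request"
        "github.com/hu17889/go_spider/core/downloader"
    )

    // LoggingDownloader wraps the default HttpDownloader and logs each URL
    // before delegating to it. Interface shape and helpers are assumed.
    type LoggingDownloader struct {
        inner *downloader.HttpDownloader
    }

    func NewLoggingDownloader() *LoggingDownloader {
        return &LoggingDownloader{inner: downloader.NewHttpDownloader()}
    }

    func (d *LoggingDownloader) Download(req *request.Request) *page.Page {
        fmt.Println("fetching:", req.GetUrl()) // GetUrl is assumed
        return d.inner.Download(req)
    }

Enable it with Spider.SetDownloader(NewLoggingDownloader()), as described above.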
PageProcesser

Summary: the PageProcesser module only parses results. It extracts result key-value pairs and the URLs to crawl in the next step; the key-value pairs are saved in PageItems, and the URLs are pushed into the Scheduler.

Functions:

  • Process: parse the crawled target, as sketched below.
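
As a hedged sketch, a Process implementation might look like the following. The accessors used (GetHtmlParser, AddField, AddTargetRequest) are listed under the Page module below; the goquery selector is purely illustrative.

    // An expanded Process for the MyPageProcesser defined earlier
    // (requires the goquery and core/common/page imports).
    func (pp *MyPageProcesser) Process(p *page.Page) {
        doc := p.GetHtmlParser() // *goquery.Document for "html" responses
        // Save a key-value pair into PageItems for the pipelines.
        p.AddField("title", doc.Find("title").Text())
        // Queue each repository link for the next crawl stage;
        // the "a.repo" selector is illustrative only.
        doc.Find("a.repo").Each(func(i int, s *goquery.Selection) {
            if href, ok := s.Attr("href"); ok {
                p.AddTargetRequest(href, "html")
            }
        })
    }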
Page

Summary: saves the information of a request and its downloaded result.

Functions:

  • Get results: GetJson, GetHtmlParser, GetBodyStr (plain text)
  • Get information about the target: GetRequest, GetCookies, GetHeader
  • Get the status of the crawl process: IsSucc (whether the download succeeded), Errormsg (get the error info from the Downloader); see the guard sketch after this list
  • Set config: SetSkip, GetSkip (if skip is true, the result is not output in the Pipeline), AddTargetRequest, AddTargetRequests (save URLs to crawl in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save key-value pairs after parsing)
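
As referenced in the list above, the status accessors support a guard at the top of Process. A minimal hedged sketch (requires the fmt import):

    // Skip failed downloads so pipelines do not receive empty results.
    func (pp *MyPageProcesser) Process(p *page.Page) {
        if !p.IsSucc() {
            fmt.Println("download failed:", p.Errormsg())
            p.SetSkip(true) // per SetSkip above, the Pipeline ignores this page
            return
        }
        // ... parse as in the PageProcesser sketch above ...
    }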
Scheduler

Summary: the Scheduler module is a Request queue; URLs parsed in the PageProcesser are pushed into it. A sketch of a custom scheduler follows the function list below.

Functions:

  • Push
  • Poll
  • Count
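
Those three functions suggest the interface a custom Scheduler must satisfy. Here is a hedged sketch of a drop-in FIFO scheduler, assuming the Push/Poll/Count shape above and that Poll returns nil on an empty queue; compare with the default QueueScheduler before relying on it:

    import (
        "sync"

        "github.com/hu17889/go_spider/core/common/request"
    )

    // FifoScheduler is a hypothetical mutex-guarded FIFO request queue.
    type FifoScheduler struct {
        mu    sync.Mutex
        queue []*request.Request
    }

    func (s *FifoScheduler) Push(req *request.Request) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.queue = append(s.queue, req)
    }

    func (s *FifoScheduler) Poll() *request.Request {
        s.mu.Lock()
        defer s.mu.Unlock()
        if len(s.queue) == 0 {
            return nil // assumed: nil signals an empty queue
        }
        req := s.queue[0]
        s.queue = s.queue[1:]
        return req
    }

    func (s *FifoScheduler) Count() int {
        s.mu.Lock()
        defer s.mu.Unlock()
        return len(s.queue)
    }

Wire it in with Spider.SetScheduler(&FifoScheduler{}), as described in the module section above.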
Pipeline

Summary: the Pipeline module outputs the results and saves them wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to a file). A sketch of a custom pipeline follows the function list below.

Functions:

  • Process
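
A custom Pipeline receives the PageItems collected via AddField. The sketch below is heavily hedged: the Process signature (PageItems plus a Task carrying the task name) and the GetAll accessor are assumptions; copy the exact signature from the default PipelineConsole.

    import (
        "fmt"

        "github.com/hu17889/go_spider/core/common/com_interfaces"
        "github.com/hu17889/go_spider/core/common/page_items"
    )

    // StdoutPipeline is a hypothetical pipeline printing every parsed field.
    type StdoutPipeline struct{}

    func (pl *StdoutPipeline) Process(items *page_items.PageItems, t com_interfaces.Task) {
        // GetAll (assumed) returns the key-value pairs saved by AddField.
        for k, v := range items.GetAll() {
            fmt.Printf("%s=%s\n", k, v)
        }
    }

Attach it with Spider.AddPipeline(&StdoutPipeline{}); several pipelines may run side by side.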
Request

Summary: the Request module holds the configuration of an HTTP request, such as the URL, headers, and cookies.

Functions:

  • Process

License

go_spider is licensed under the Mozilla Public License Version 2.0.

Mozilla summarizes the license scope as follows:

MPL: The copyleft applies to any files containing MPLed code.

That means:

  • You can use the unchanged source code both privately and commercially.
  • You need not publish the source code of your library, as long as the files licensed under the MPL 2.0 remain unchanged.
  • You must publish the source code of any changed files licensed under the MPL 2.0, under either a) the MPL 2.0 itself or b) a compatible license (e.g., GPL 3.0 or Apache License 2.0).

Please read the MPL 2.0 FAQ if you have further questions regarding the license.

You can read the full terms here: LICENSE.

Directories

Path Synopsis

core
common/com_interfaces
Package com_interfaces contains some common interfaces of the GO_SPIDER project.
common/config
Package config provides config-file parsing.
common/etc_config
Package etc_config implements the config initialization of one spider.
common/mlog
Package mlog implements log operations.
common/page
Package page contains the results fetched by the Downloader.
common/page_items
Package page_items contains the results parsed by the PageProcesser.
common/request
Package request implements the request entity, which contains the URL and other relevant information.
common/resource_manage
Package resource_manage implements resource management.
common/util
Package util contains some common functions of the GO_SPIDER project.
downloader
Package downloader is the main module of GO_SPIDER for downloading pages.
pipeline
Package pipeline is the persistence and offline-processing part of the crawler.
scheduler
The package is useless
spider
Crawl master module.
example
sina_stock_json_processor
The example gets stock news from sina.com (http://live.sina.com.cn/zt/f/v/finance/globalnews1).
extension
