Documentation ¶
Overview ¶
Package app provides the main entry and task scheduling for the crawler application.
Index ¶
- Variables
- type App
- type Logic
- func (l *Logic) CountNodes() int
- func (l *Logic) GetAppConf(k ...string) interface{}
- func (l *Logic) GetMode() int
- func (l *Logic) GetOutputLib() []string
- func (l *Logic) GetSpiderByName(name string) option.Option[*spider.Spider]
- func (l *Logic) GetSpiderLib() []*spider.Spider
- func (l *Logic) GetSpiderQueue() crawler.SpiderQueue
- func (l *Logic) GetTaskJar() *distribute.TaskJar
- func (l *Logic) Init(mode int, port int, master string, w ...io.Writer) App
- func (l *Logic) IsPaused() bool
- func (l *Logic) IsRunning() bool
- func (l *Logic) IsStopped() bool
- func (l *Logic) LogGoOn() App
- func (l *Logic) LogRest() App
- func (l *Logic) PauseRecover()
- func (l *Logic) ReInit(mode int, port int, master string, w ...io.Writer) App
- func (l *Logic) Run()
- func (l *Logic) SetAppConf(k string, v interface{}) App
- func (l *Logic) SetLog(w io.Writer) App
- func (l *Logic) SpiderPrepare(original []*spider.Spider) App
- func (l *Logic) Status() int
- func (l *Logic) Stop()
Constants ¶
This section is empty.
Variables ¶
var LogicApp = New()
LogicApp is the global singleton core interface instance.
Functions ¶
This section is empty.
Types ¶
type App ¶
type App interface {
SetLog(io.Writer) App // Set the global log output destination
LogGoOn() App // Resume log output
LogRest() App // Pause log output
Init(mode int, port int, master string, w ...io.Writer) App // Must call Init before using App (except SetLog)
ReInit(mode int, port int, master string, w ...io.Writer) App // Switch run mode and reset log output target
GetAppConf(k ...string) interface{} // Get global config
SetAppConf(k string, v interface{}) App // Set global config (not called in client mode)
SpiderPrepare(original []*spider.Spider) App // Must call after setting global params and before Run() (not called in client mode)
Run() // Block until task completes (call after all config is done)
Stop() // Terminate task mid-run in Offline mode (blocks until current task stops)
IsRunning() bool // Check if task is running
IsPaused() bool // Check if task is paused
IsStopped() bool // Check if task has stopped
PauseRecover() // Pause or resume task in Offline mode
Status() int // Return current status
GetSpiderLib() []*spider.Spider // Get all spider species
GetSpiderByName(string) option.Option[*spider.Spider] // Get spider by name
GetSpiderQueue() crawler.SpiderQueue // Get spider queue interface
GetOutputLib() []string // Get all output methods
GetTaskJar() *distribute.TaskJar // Return task jar
distribute.Distributor // Implements distributed interface
}
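The interface above implies a fixed call order: SetLog may be called at any time, Init must come before everything else, configuration and SpiderPrepare follow, and Run blocks last. The fluent style, where every setup method returns App so calls can be chained, can be sketched with a self-contained toy (the `app` type and its methods below are illustrative stand-ins, not the package's actual implementation):

```go
package main

import "fmt"

// app is a toy stand-in for the App interface's fluent style:
// every setup method returns the receiver so calls can be chained.
type app struct {
	steps []string // records call order for illustration
}

func (a *app) Init(mode, port int, master string) *app {
	a.steps = append(a.steps, "Init")
	return a
}

func (a *app) SetAppConf(k string, v interface{}) *app {
	a.steps = append(a.steps, "SetAppConf("+k+")")
	return a
}

func (a *app) SpiderPrepare() *app {
	a.steps = append(a.steps, "SpiderPrepare")
	return a
}

// Run blocks until the task completes; here it only records the call.
func (a *app) Run() {
	a.steps = append(a.steps, "Run")
}

func main() {
	a := &app{}
	// Init first, then configuration, then SpiderPrepare, then Run.
	a.Init(0, 2015, "").
		SetAppConf("ThreadNum", 50).
		SpiderPrepare().
		Run()
	fmt.Println(a.steps)
}
```

The chaining is why the setup methods return App while the terminal methods (Run, Stop, PauseRecover) return nothing.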
type Logic ¶
type Logic struct {
*cache.AppConf // Global config
*spider.SpiderSpecies // All spider species
crawler.SpiderQueue // Spider queue for current task
*distribute.TaskJar // Task storage passed between server and client
crawler.CrawlerPool // Crawler pool
teleport.Teleport // Socket duplex communication, JSON transport
sync.RWMutex
// contains filtered or unexported fields
}
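Embedding sync.RWMutex lets the status-query methods (IsRunning, IsPaused, IsStopped, Status) take a read lock while state transitions take the write lock. A minimal self-contained sketch of that pattern (the status constants and `logic` type here are hypothetical, not the package's real values):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical status values; the real package defines its own.
const (
	statusStopped = iota
	statusRunning
	statusPaused
)

// logic mirrors the pattern of embedding sync.RWMutex in Logic:
// readers take RLock, the writer takes Lock.
type logic struct {
	sync.RWMutex
	status int
}

func (l *logic) setStatus(s int) {
	l.Lock()
	defer l.Unlock()
	l.status = s
}

func (l *logic) Status() int {
	l.RLock()
	defer l.RUnlock()
	return l.status
}

func (l *logic) IsRunning() bool { return l.Status() == statusRunning }
func (l *logic) IsPaused() bool  { return l.Status() == statusPaused }
func (l *logic) IsStopped() bool { return l.Status() == statusStopped }

func main() {
	l := &logic{}
	l.setStatus(statusRunning)
	fmt.Println(l.IsRunning(), l.IsPaused(), l.IsStopped()) // true false false
}
```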
func (*Logic) CountNodes ¶
func (l *Logic) CountNodes() int
CountNodes returns the number of connected nodes in server and client modes.
func (*Logic) GetAppConf ¶
func (l *Logic) GetAppConf(k ...string) interface{}
GetAppConf returns global config value(s).
func (*Logic) GetOutputLib ¶
func (l *Logic) GetOutputLib() []string
GetOutputLib returns all output methods.
func (*Logic) GetSpiderByName ¶
func (l *Logic) GetSpiderByName(name string) option.Option[*spider.Spider]
GetSpiderByName returns a spider by name.
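Returning an option type instead of a (value, bool) pair makes absence explicit in the signature. A minimal sketch of a by-name lookup with a tiny hand-rolled generic Option (illustrative only; the package uses the option package's own type, not this one):

```go
package main

import "fmt"

// Option is a minimal stand-in for a generic option type.
type Option[T any] struct {
	value T
	ok    bool
}

func Some[T any](v T) Option[T] { return Option[T]{value: v, ok: true} }
func None[T any]() Option[T]    { return Option[T]{} }

func (o Option[T]) IsSome() bool { return o.ok }

// spiderT stands in for *spider.Spider with only the field we need.
type spiderT struct{ Name string }

// getSpiderByName scans the species list and wraps the result.
func getSpiderByName(lib []*spiderT, name string) Option[*spiderT] {
	for _, s := range lib {
		if s.Name == name {
			return Some(s)
		}
	}
	return None[*spiderT]()
}

func main() {
	lib := []*spiderT{{Name: "news"}, {Name: "forum"}}
	fmt.Println(getSpiderByName(lib, "news").IsSome())    // true
	fmt.Println(getSpiderByName(lib, "missing").IsSome()) // false
}
```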
func (*Logic) GetSpiderLib ¶
func (l *Logic) GetSpiderLib() []*spider.Spider
GetSpiderLib returns all spider species.
func (*Logic) GetSpiderQueue ¶
func (l *Logic) GetSpiderQueue() crawler.SpiderQueue
GetSpiderQueue returns the spider queue interface.
func (*Logic) GetTaskJar ¶
func (l *Logic) GetTaskJar() *distribute.TaskJar
GetTaskJar returns the task jar.
func (*Logic) PauseRecover ¶
func (l *Logic) PauseRecover()
PauseRecover pauses or resumes the task in Offline mode.
func (*Logic) SetAppConf ¶
func (l *Logic) SetAppConf(k string, v interface{}) App
SetAppConf sets a global config value.
func (*Logic) SpiderPrepare ¶
func (l *Logic) SpiderPrepare(original []*spider.Spider) App
SpiderPrepare must be called after all global parameters are set and immediately before Run(). The original argument is the raw spider species from the spider package, before any assignment; spiders with an explicit Keyin are not reassigned. It is not called in client mode.
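The Keyin rule above can be pictured as: copy each raw spider into the task queue, filling in the user-supplied keyword only where the spider has no explicit Keyin of its own. A self-contained sketch under that reading (the `spiderT` struct and `prepare` function are illustrative stand-ins, not the package's code):

```go
package main

import "fmt"

// spiderT stands in for *spider.Spider with only the field we need.
type spiderT struct {
	Name  string
	Keyin string
}

// prepare copies the raw species into a queue, assigning the global
// keyword only to spiders without an explicit Keyin.
func prepare(original []spiderT, globalKeyin string) []spiderT {
	queue := make([]spiderT, 0, len(original))
	for _, s := range original {
		if s.Keyin == "" {
			s.Keyin = globalKeyin // no explicit Keyin: use the global one
		}
		// spiders with an explicit Keyin are left as-is
		queue = append(queue, s)
	}
	return queue
}

func main() {
	raw := []spiderT{{Name: "a"}, {Name: "b", Keyin: "fixed"}}
	q := prepare(raw, "golang")
	fmt.Println(q[0].Keyin, q[1].Keyin) // golang fixed
}
```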
Directories ¶
| Path | Synopsis |
|---|---|
| aid | |
| aid/history | Package history provides persistence and inheritance of success and failure request records. |
| aid/proxy | Package proxy provides proxy IP pool management and online filtering. |
| crawler | Package crawler provides the core crawler engine for request scheduling and page downloading. |
| distribute | Package distribute provides distributed task scheduling and master-slave node communication. |
| distribute/teleport | Package teleport provides a high-concurrency API framework for distributed systems. |
| downloader | Package downloader defines the page downloader interface. |
| downloader/request | Package request provides encapsulation and deduplication of crawl requests. |
| downloader/surfer | Package surfer provides a high-concurrency web downloader written in Go. |
| downloader/surfer/agent | Package agent generates user agent strings for well-known browsers and for custom browsers. |
| downloader/surfer/example | command |
| pipeline | Package pipeline provides the data collection and output pipeline. |
| pipeline/collector | Package collector implements result collection and output. |
| pipeline/collector/data | Package data provides storage structure definitions for data and file cells. |
| scheduler | Package scheduler provides crawl task scheduling and resource allocation. |
| spider | Package spider provides spider rule definition, species registration, and parsing. |
| spider/common | Package common provides HTML cleaning, form parsing, and other utility functions for spider rules. |