crawler

package
Version: v0.9.20
Published: Mar 19, 2026 License: MIT Imports: 25 Imported by: 1

Documentation

Index

Constants

This section is empty.

Variables

var ValidateJobs chan proxyinabox.Proxy

Functions

func BrowserEval

func BrowserEval(expression string) (string, error)

BrowserEval evaluates a JavaScript expression on the page of the current session.

func BrowserFetch

func BrowserFetch(targetURL string) (string, error)

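BrowserFetch starts a temporary lightpanda instance, navigates to the URL, waits for JavaScript rendering, and returns the HTML. It first launches the browser through a random proxy from the proxy pool (lightpanda --http_proxy); if navigation fails, it destroys the session and retries over a direct connection.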

func CleanupStaleProxies added in v0.8.0

func CleanupStaleProxies()

func FetchAllSources

func FetchAllSources(sources []Source)

FetchAllSources starts a goroutine per source to continuously fetch proxies
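The goroutine-per-source pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: `startFetchers`, the `stop` channel, and the single shared interval are hypothetical (the real `Source` carries its own `Interval` string), and the `fetch` callback stands in for whatever work is done per source.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startFetchers (hypothetical) launches one goroutine per source name. Each
// goroutine calls fetch immediately and then once per interval until stop is
// closed. The returned function blocks until every goroutine has exited.
// interval must be positive, because time.NewTicker panics otherwise.
func startFetchers(names []string, interval time.Duration, stop <-chan struct{}, fetch func(name string)) (wait func()) {
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for {
				fetch(name)
				select {
				case <-ticker.C: // time for the next round
				case <-stop: // shutdown requested
					return
				}
			}
		}(name)
	}
	return wg.Wait
}

func main() {
	stop := make(chan struct{})
	wait := startFetchers([]string{"source-a", "source-b"}, 50*time.Millisecond, stop, func(name string) {
		fmt.Println("fetching", name)
	})
	time.Sleep(120 * time.Millisecond)
	close(stop)
	wait()
}
```

Closing `stop` and then calling `wait` gives a clean shutdown, which is one way a caller could avoid leaking fetch goroutines.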

func GetDocFromURL

func GetDocFromURL(url string, customHeaders ...http.Header) (string, error)

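GetDocFromURL fetches a URL body as a string, optionally through a random proxy. It first fetches through a random proxy from the proxy pool; if that attempt fails, it falls back to a direct connection and retries, to maximize reachability of the source site.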
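The proxy-first, direct-fallback behavior described for GetDocFromURL (and, similarly, for BrowserFetch) might look like the sketch below. The names `fetchFunc`, `fetchWithFallback`, and `errProxyDown` are hypothetical and the two fetchers are injected as plain functions, so this shows only the control flow, not how the package actually picks a proxy or performs the request.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc is a hypothetical signature for "fetch this URL somehow".
type fetchFunc func(url string) (string, error)

// errProxyDown is a placeholder error used by the demo below.
var errProxyDown = errors.New("proxy refused connection")

// fetchWithFallback tries viaProxy first. If that fails, it retries with
// direct and reports both errors when the direct attempt fails as well.
func fetchWithFallback(url string, viaProxy, direct fetchFunc) (string, error) {
	body, proxyErr := viaProxy(url)
	if proxyErr == nil {
		return body, nil
	}
	body, directErr := direct(url)
	if directErr != nil {
		return "", errors.Join(proxyErr, directErr)
	}
	return body, nil
}

func main() {
	failing := func(string) (string, error) { return "", errProxyDown }
	direct := func(u string) (string, error) { return "<html>" + u + "</html>", nil }

	body, err := fetchWithFallback("https://example.com", failing, direct)
	fmt.Println(body, err)
}
```

Joining the two errors keeps the proxy failure visible even when the direct attempt is the one that finally fails, which can help when diagnosing an unreachable source.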

func GetURLThroughProxyWithRetry

func GetURLThroughProxyWithRetry(u string, timeout time.Duration, proxyAddr string, retry int, customHeaders ...http.Header) ([]byte, error)

GetURLThroughProxyWithRetry fetches a URL through the given proxy with retry logic
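As a rough illustration of fetching through a proxy with retries, the sketch below uses only the standard library. It is not the package's code: `newProxyClient`, `withRetry`, and `fetchThroughProxy` are hypothetical names, the proxy address is assumed to be a full URL such as `http://127.0.0.1:8080`, and the retry count is interpreted here as the total number of attempts, which the documentation above does not specify.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient builds an HTTP client that routes requests through proxyAddr.
func newProxyClient(proxyAddr string, timeout time.Duration) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, fmt.Errorf("parse proxy address: %w", err)
	}
	return &http.Client{
		Timeout:   timeout,
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}, nil
}

// withRetry calls fn up to attempts times (at least once) and returns nil on
// the first success, or the last error if every attempt fails.
func withRetry(attempts int, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// fetchThroughProxy is a hypothetical stand-in for a proxied GET with retries.
func fetchThroughProxy(u string, timeout time.Duration, proxyAddr string, attempts int) ([]byte, error) {
	client, err := newProxyClient(proxyAddr, timeout)
	if err != nil {
		return nil, err
	}
	var body []byte
	err = withRetry(attempts, func() error {
		resp, err := client.Get(u)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status %s", resp.Status)
		}
		body, err = io.ReadAll(resp.Body)
		return err
	})
	return body, err
}

func main() {
	// Placeholder proxy address; nothing is expected to listen there.
	_, err := fetchThroughProxy("https://example.com", 2*time.Second, "http://127.0.0.1:9", 2)
	fmt.Println("error:", err)
}
```

Keeping the retry loop separate from the HTTP call makes the retry policy easy to test without any network access.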

func Init

func Init()

func ReleaseBrowser added in v0.4.1

func ReleaseBrowser()

ReleaseBrowser stops the current lightpanda instance and releases all resources.

func StartLightpanda added in v0.9.14

func StartLightpanda() error

StartLightpanda is a pre-start interface kept for compatibility with the test-source subcommand.

func StopLightpanda added in v0.9.14

func StopLightpanda()

StopLightpanda is a stop interface kept for compatibility with the test-source subcommand and with signal handling.

func TestFetchSource

func TestFetchSource(src Source) ([]proxyinabox.Proxy, error)

TestFetchSource performs a single fetch for testing purposes (does not send to ValidateJobs)

func UpdateSourceAvailableCounts added in v0.8.0

func UpdateSourceAvailableCounts(proxies []proxyinabox.Proxy)

UpdateSourceAvailableCounts updates each source's available-proxy count from a snapshot of the proxy pool.

func ValidateProxy added in v0.8.0

func ValidateProxy(p proxyinabox.Proxy) (country string, delay int64, err error)

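ValidateProxy verifies that a proxy is usable by requesting the Cloudflare trace endpoint through it, and returns the validation result. It does not depend on the DB or cache and performs network validation only; it is intended for the test-source command.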
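A Cloudflare trace response is a plain-text body with one `key=value` pair per line, including a `loc` field holding a country code. The sketch below shows one way such a body could be parsed to obtain the country. `parseTrace` is a hypothetical helper and the sample body is made up; how ValidateProxy itself parses the response, and which unit it uses for `delay`, is not stated in this documentation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseTrace turns a body with one "key=value" pair per line into a map.
// Lines without an "=" are ignored.
func parseTrace(body string) map[string]string {
	fields := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), "=")
		if ok {
			fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return fields
}

func main() {
	// Made-up sample in the trace format; the IP is from a documentation range.
	sample := "fl=000f000\nip=203.0.113.7\nloc=DE\nwarp=off\n"
	fields := parseTrace(sample)
	fmt.Println(fields["ip"], fields["loc"]) // prints: 203.0.113.7 DE
}
```

A caller could compare the reported `ip` with the proxy's address and measure elapsed time around the request to derive a delay value.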

func Verify

func Verify()

Types

type BrowserSession added in v0.4.1

type BrowserSession struct {
	// contains filtered or unexported fields
}

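BrowserSession manages the full lifecycle of a single browser fetch (lightpanda process + CDP connection). Each runScript call creates an independent session that is destroyed after use, to avoid resource leaks.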

type Source

type Source struct {
	Name          string            `yaml:"name"`
	Type          string            `yaml:"type"` // text, json, script
	URL           string            `yaml:"url"`
	Protocol      string            `yaml:"protocol"`
	Headers       map[string]string `yaml:"headers"`
	Interval      string            `yaml:"interval"`
	IPField       string            `yaml:"ip_field"`
	PortField     string            `yaml:"port_field"`
	ProtocolField string            `yaml:"protocol_field"`
	Script        string            `yaml:"script"`
}

Source represents a YAML-driven proxy source configuration
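To make the role of `IPField` and `PortField` more concrete, here is a sketch of how a source of type `json` could be turned into `host:port` addresses. This is an assumption about how those fields are used, not confirmed by the documentation: `extractAddrs` and `fieldString` are hypothetical, the local `Source` type copies only a subset of the real struct, and the response is assumed to be a JSON array of flat objects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// Source mirrors a subset of the fields documented above.
type Source struct {
	Name      string
	Type      string
	URL       string
	IPField   string
	PortField string
}

// fieldString renders a decoded JSON value as text. JSON numbers decode to
// float64, so ports given as numbers are converted to integers first.
func fieldString(v any) (string, bool) {
	switch t := v.(type) {
	case string:
		return t, t != ""
	case float64:
		return strconv.Itoa(int(t)), true
	default:
		return "", false
	}
}

// extractAddrs (hypothetical) reads a JSON array of objects and joins the
// values named by src.IPField and src.PortField. Incomplete items are skipped.
func extractAddrs(src Source, body []byte) ([]string, error) {
	var items []map[string]any
	if err := json.Unmarshal(body, &items); err != nil {
		return nil, fmt.Errorf("source %q: %w", src.Name, err)
	}
	var addrs []string
	for _, item := range items {
		ip, okIP := fieldString(item[src.IPField])
		port, okPort := fieldString(item[src.PortField])
		if okIP && okPort {
			addrs = append(addrs, net.JoinHostPort(ip, port))
		}
	}
	return addrs, nil
}

func main() {
	src := Source{
		Name:      "example-json",
		Type:      "json",
		URL:       "https://example.com/proxies.json", // placeholder URL
		IPField:   "ip",
		PortField: "port",
	}
	body := []byte(`[{"ip":"198.51.100.4","port":8080},{"ip":"198.51.100.5","port":"3128"},{"port":1}]`)
	addrs, err := extractAddrs(src, body)
	fmt.Println(addrs, err) // prints: [198.51.100.4:8080 198.51.100.5:3128] <nil>
}
```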

func LoadSources

func LoadSources(dir string) ([]Source, error)

LoadSources reads all .yaml files from the given directory and returns parsed sources

type SourceStatus added in v0.2.0

type SourceStatus struct {
	Name       string    `json:"name"`
	Type       string    `json:"type"`
	LastFetch  time.Time `json:"last_fetch"`
	ProxyCount int       `json:"proxy_count"`
	Error      string    `json:"error"`
	Interval   string    `json:"interval"`
	// AvailableCount is the number of proxies from this source currently validated as usable in the proxy pool (counted live from the cache).
	AvailableCount int `json:"available_count"`
}

SourceStatus records the most recent fetch status of each proxy source, for display on the dashboard.

func GetSourceStatuses added in v0.2.0

func GetSourceStatuses() []SourceStatus

GetSourceStatuses returns a snapshot copy of all source statuses (thread-safe).
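A thread-safe snapshot of this kind is commonly built by copying values out of a map while holding a read lock. The sketch below illustrates that idea under assumptions: `statusRegistry` and its methods are hypothetical, the local `SourceStatus` keeps only two of the documented fields, and sorting by name is added purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// SourceStatus mirrors a subset of the fields documented above.
type SourceStatus struct {
	Name       string
	ProxyCount int
}

// statusRegistry (hypothetical) guards a map of statuses with an RWMutex.
type statusRegistry struct {
	mu       sync.RWMutex
	statuses map[string]SourceStatus
}

func newStatusRegistry() *statusRegistry {
	return &statusRegistry{statuses: make(map[string]SourceStatus)}
}

// set stores or replaces the status for one source.
func (r *statusRegistry) set(s SourceStatus) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.statuses[s.Name] = s
}

// snapshot returns copies of all statuses, sorted by name. Callers may modify
// the returned slice without affecting the registry.
func (r *statusRegistry) snapshot() []SourceStatus {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]SourceStatus, 0, len(r.statuses))
	for _, s := range r.statuses {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	reg := newStatusRegistry()
	reg.set(SourceStatus{Name: "source-b", ProxyCount: 12})
	reg.set(SourceStatus{Name: "source-a", ProxyCount: 3})

	snap := reg.snapshot()
	snap[0].ProxyCount = 999                  // only the copy changes
	fmt.Println(reg.snapshot()[0].ProxyCount) // prints 3
}
```

Because the struct holds only value fields, a shallow copy is enough; a status containing slices or maps would need a deeper copy to stay isolated.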
