Documentation
¶
Overview ¶
Package collect provides a data collection subsystem for gathering information from multiple sources including GitHub, BitcoinTalk, CoinGecko, and academic paper repositories. It supports rate limiting, incremental state tracking, and event-driven progress reporting.
Index ¶
- Constants
- func FormatMarketSummary(data *coinData) string
- func FormatPaperMarkdown(title string, authors []string, date, paperURL, source, abstract string) string
- func FormatPostMarkdown(num int, author, date, content string) string
- func HTMLToMarkdown(content string) (string, error)
- func JSONToMarkdown(content string) (string, error)
- func ParsePostsFromHTML(htmlContent string) ([]btPost, error)
- func SetHTTPClient(c *http.Client)
- type BitcoinTalkCollector
- type BitcoinTalkCollectorWithFetcher
- type Collector
- type Config
- type Dispatcher
- func (d *Dispatcher) Emit(event Event)
- func (d *Dispatcher) EmitComplete(source, message string, data any)
- func (d *Dispatcher) EmitError(source, message string, data any)
- func (d *Dispatcher) EmitItem(source, message string, data any)
- func (d *Dispatcher) EmitProgress(source, message string, data any)
- func (d *Dispatcher) EmitStart(source, message string)
- func (d *Dispatcher) On(eventType string, handler EventHandler)
- type Event
- type EventHandler
- type Excavator
- type FetchPageFunc
- type GitHubCollector
- type MarketCollector
- type PapersCollector
- type Processor
- type RateLimiter
- type Result
- type State
- type StateEntry
Constants ¶
const ( // EventStart is emitted when a collector begins its run. EventStart = "start" // EventProgress is emitted to report incremental progress. EventProgress = "progress" // EventItem is emitted when a single item is collected. EventItem = "item" // EventError is emitted when an error occurs during collection. EventError = "error" // EventComplete is emitted when a collector finishes its run. EventComplete = "complete" )
Event types used by the collection subsystem.
const ( PaperSourceIACR = "iacr" PaperSourceArXiv = "arxiv" PaperSourceAll = "all" )
Paper source identifiers.
Variables ¶
This section is empty.
Functions ¶
func FormatMarketSummary ¶
func FormatMarketSummary(data *coinData) string
FormatMarketSummary is exported for testing.
func FormatPaperMarkdown ¶
func FormatPaperMarkdown(title string, authors []string, date, paperURL, source, abstract string) string
FormatPaperMarkdown is exported for testing.
func FormatPostMarkdown ¶
FormatPostMarkdown is exported for testing purposes.
func HTMLToMarkdown ¶
HTMLToMarkdown is exported for testing.
func JSONToMarkdown ¶
JSONToMarkdown is exported for testing.
func ParsePostsFromHTML ¶
ParsePostsFromHTML parses BitcoinTalk posts from raw HTML content. This is exported for testing purposes.
func SetHTTPClient ¶
SetHTTPClient replaces the package-level HTTP client. Use this in tests to inject a custom transport or timeout.
Types ¶
type BitcoinTalkCollector ¶
type BitcoinTalkCollector struct {
// TopicID is the numeric topic identifier.
TopicID string
// URL is a full URL to a BitcoinTalk topic page. If set, TopicID is
// extracted from it.
URL string
// Pages limits collection to this many pages. 0 means all pages.
Pages int
}
BitcoinTalkCollector collects forum posts from BitcoinTalk.
func (*BitcoinTalkCollector) Name ¶
func (b *BitcoinTalkCollector) Name() string
Name returns the collector name.
type BitcoinTalkCollectorWithFetcher ¶
type BitcoinTalkCollectorWithFetcher struct {
BitcoinTalkCollector
Fetcher FetchPageFunc
}
BitcoinTalkCollectorWithFetcher wraps BitcoinTalkCollector with a custom fetcher for testing.
type Collector ¶
type Collector interface {
// Name returns a human-readable name for this collector.
Name() string
// Collect gathers data from the source and writes it to the configured output.
Collect(ctx context.Context, cfg *Config) (*Result, error)
}
Collector is the interface all collection sources implement.
type Config ¶
type Config struct {
// Output is the storage medium for writing collected data.
Output io.Medium
// OutputDir is the base directory for all collected data.
OutputDir string
// Limiter provides per-source rate limiting.
Limiter *RateLimiter
// State tracks collection progress for incremental runs.
State *State
// Dispatcher manages event dispatch for progress reporting.
Dispatcher *Dispatcher
// Verbose enables detailed logging output.
Verbose bool
// DryRun simulates collection without writing files.
DryRun bool
}
Config holds shared configuration for all collectors.
type Dispatcher ¶
type Dispatcher struct {
// contains filtered or unexported fields
}
Dispatcher manages event dispatch. Handlers are registered per event type and are called synchronously when an event is emitted.
func (*Dispatcher) Emit ¶
func (d *Dispatcher) Emit(event Event)
Emit dispatches an event to all registered handlers for that event type. If no handlers are registered for the event type, the event is silently dropped. The event's Time field is set to now if it is zero.
func (*Dispatcher) EmitComplete ¶
func (d *Dispatcher) EmitComplete(source, message string, data any)
EmitComplete emits a complete event.
func (*Dispatcher) EmitError ¶
func (d *Dispatcher) EmitError(source, message string, data any)
EmitError emits an error event.
func (*Dispatcher) EmitItem ¶
func (d *Dispatcher) EmitItem(source, message string, data any)
EmitItem emits an item event.
func (*Dispatcher) EmitProgress ¶
func (d *Dispatcher) EmitProgress(source, message string, data any)
EmitProgress emits a progress event.
func (*Dispatcher) EmitStart ¶
func (d *Dispatcher) EmitStart(source, message string)
EmitStart emits a start event for the given source.
func (*Dispatcher) On ¶
func (d *Dispatcher) On(eventType string, handler EventHandler)
On registers a handler for an event type. Multiple handlers can be registered for the same event type and will be called in order.
type Event ¶
type Event struct {
// Type is one of the Event* constants.
Type string `json:"type"`
// Source identifies the collector that emitted the event.
Source string `json:"source"`
// Message is a human-readable description of the event.
Message string `json:"message"`
// Data carries optional event-specific payload.
Data any `json:"data,omitempty"`
// Time is when the event occurred.
Time time.Time `json:"time"`
}
Event represents a collection event.
type Excavator ¶
type Excavator struct {
// Collectors is the list of collectors to run.
Collectors []Collector
// ScanOnly reports what would be collected without performing collection.
ScanOnly bool
// Resume enables incremental collection using saved state.
Resume bool
}
Excavator runs multiple collectors as a coordinated operation. It provides sequential execution with rate limit respect, state tracking for resume support, and aggregated results.
type FetchPageFunc ¶
FetchPageFunc is an injectable function type for fetching pages, used in testing.
type GitHubCollector ¶
type GitHubCollector struct {
// Org is the GitHub organisation.
Org string
// Repo is the repository name. If empty and Org is set, all repos are collected.
Repo string
// IssuesOnly limits collection to issues (excludes PRs).
IssuesOnly bool
// PRsOnly limits collection to PRs (excludes issues).
PRsOnly bool
}
GitHubCollector collects issues and PRs from GitHub repositories.
func (*GitHubCollector) Name ¶
func (g *GitHubCollector) Name() string
Name returns the collector name.
type MarketCollector ¶
type MarketCollector struct {
// CoinID is the CoinGecko coin identifier (e.g. "bitcoin", "ethereum").
CoinID string
// Historical enables collection of historical market chart data.
Historical bool
// FromDate is the start date for historical data in YYYY-MM-DD format.
FromDate string
}
MarketCollector collects market data from CoinGecko.
func (*MarketCollector) Name ¶
func (m *MarketCollector) Name() string
Name returns the collector name.
type PapersCollector ¶
type PapersCollector struct {
// Source is one of PaperSourceIACR, PaperSourceArXiv, or PaperSourceAll.
Source string
// Category is the arXiv category (e.g. "cs.CR" for cryptography).
Category string
// Query is the search query string.
Query string
}
PapersCollector collects papers from IACR and arXiv.
func (*PapersCollector) Name ¶
func (p *PapersCollector) Name() string
Name returns the collector name.
type Processor ¶
type Processor struct {
// Source identifies the data source directory to process.
Source string
// Dir is the directory containing files to process.
Dir string
}
Processor converts collected data to clean markdown.
type RateLimiter ¶
type RateLimiter struct {
// contains filtered or unexported fields
}
RateLimiter tracks per-source rate limiting to avoid overwhelming APIs.
func NewRateLimiter ¶
func NewRateLimiter() *RateLimiter
NewRateLimiter creates a limiter with default delays.
func (*RateLimiter) CheckGitHubRateLimit ¶
func (r *RateLimiter) CheckGitHubRateLimit() (used, limit int, err error)
CheckGitHubRateLimit checks GitHub API rate limit status via gh api. Returns used and limit counts. Auto-pauses at 75% usage by increasing the GitHub rate limit delay.
func (*RateLimiter) GetDelay ¶
func (r *RateLimiter) GetDelay(source string) time.Duration
GetDelay returns the delay configured for a source.
type Result ¶
type Result struct {
// Source identifies which collector produced this result.
Source string
// Items is the number of items successfully collected.
Items int
// Errors is the number of errors encountered during collection.
Errors int
// Skipped is the number of items skipped (e.g. already collected).
Skipped int
// Files lists the paths of all files written.
Files []string
}
Result holds the output of a collection run.
func MergeResults ¶
MergeResults combines multiple results into a single aggregated result.
type State ¶
type State struct {
// contains filtered or unexported fields
}
State tracks collection progress for incremental runs. It persists entries to disk so that subsequent runs can resume where they left off.
func NewState ¶
NewState creates a state tracker that persists to the given path using the provided storage medium.
func (*State) Get ¶
func (s *State) Get(source string) (*StateEntry, bool)
Get returns a copy of the state for a source. The second return value indicates whether the entry was found.
func (*State) Load ¶
Load reads state from disk. If the file does not exist, the state is initialised as empty without error.
func (*State) Set ¶
func (s *State) Set(source string, entry *StateEntry)
Set updates state for a source.
type StateEntry ¶
type StateEntry struct {
// Source identifies the collector.
Source string `json:"source"`
// LastRun is the timestamp of the last successful run.
LastRun time.Time `json:"last_run"`
// LastID is an opaque identifier for the last item processed.
LastID string `json:"last_id,omitempty"`
// Items is the total number of items collected so far.
Items int `json:"items"`
// Cursor is an opaque pagination cursor for resumption.
Cursor string `json:"cursor,omitempty"`
}
StateEntry tracks state for one source.