collect

package
v0.0.4-alpha.13 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 4, 2026 License: EUPL-1.2 Imports: 16 Imported by: 0

Documentation

Overview

Package collect provides a data collection subsystem for gathering information from multiple sources including GitHub, BitcoinTalk, CoinGecko, and academic paper repositories. It supports rate limiting, incremental state tracking, and event-driven progress reporting.

Index

Constants

View Source
const (
	// EventStart is emitted when a collector begins its run.
	EventStart = "start"

	// EventProgress is emitted to report incremental progress.
	EventProgress = "progress"

	// EventItem is emitted when a single item is collected.
	EventItem = "item"

	// EventError is emitted when an error occurs during collection.
	EventError = "error"

	// EventComplete is emitted when a collector finishes its run.
	EventComplete = "complete"
)

Event types used by the collection subsystem.

View Source
const (
	PaperSourceIACR  = "iacr"
	PaperSourceArXiv = "arxiv"
	PaperSourceAll   = "all"
)

Paper source identifiers.

Variables

This section is empty.

Functions

func FormatMarketSummary

func FormatMarketSummary(data *coinData) string

FormatMarketSummary is exported for testing.

func FormatPaperMarkdown

func FormatPaperMarkdown(title string, authors []string, date, paperURL, source, abstract string) string

FormatPaperMarkdown is exported for testing.

func FormatPostMarkdown

func FormatPostMarkdown(num int, author, date, content string) string

FormatPostMarkdown is exported for testing purposes.

func HTMLToMarkdown

func HTMLToMarkdown(content string) (string, error)

HTMLToMarkdown is exported for testing.

func JSONToMarkdown

func JSONToMarkdown(content string) (string, error)

JSONToMarkdown is exported for testing.

func ParsePostsFromHTML

func ParsePostsFromHTML(htmlContent string) ([]btPost, error)

ParsePostsFromHTML parses BitcoinTalk posts from raw HTML content. This is exported for testing purposes.

func SetHTTPClient

func SetHTTPClient(c *http.Client)

SetHTTPClient replaces the package-level HTTP client. Use this in tests to inject a custom transport or timeout.

Types

type BitcoinTalkCollector

type BitcoinTalkCollector struct {
	// TopicID is the numeric topic identifier.
	TopicID string

	// URL is a full URL to a BitcoinTalk topic page. If set, TopicID is
	// extracted from it.
	URL string

	// Pages limits collection to this many pages. 0 means all pages.
	Pages int
}

BitcoinTalkCollector collects forum posts from BitcoinTalk.

func (*BitcoinTalkCollector) Collect

func (b *BitcoinTalkCollector) Collect(ctx context.Context, cfg *Config) (*Result, error)

Collect gathers posts from a BitcoinTalk topic.

func (*BitcoinTalkCollector) Name

func (b *BitcoinTalkCollector) Name() string

Name returns the collector name.

type BitcoinTalkCollectorWithFetcher

type BitcoinTalkCollectorWithFetcher struct {
	BitcoinTalkCollector
	Fetcher FetchPageFunc
}

BitcoinTalkCollectorWithFetcher wraps BitcoinTalkCollector with a custom fetcher for testing.

type Collector

type Collector interface {
	// Name returns a human-readable name for this collector.
	Name() string

	// Collect gathers data from the source and writes it to the configured output.
	Collect(ctx context.Context, cfg *Config) (*Result, error)
}

Collector is the interface all collection sources implement.

type Config

type Config struct {
	// Output is the storage medium for writing collected data.
	Output io.Medium

	// OutputDir is the base directory for all collected data.
	OutputDir string

	// Limiter provides per-source rate limiting.
	Limiter *RateLimiter

	// State tracks collection progress for incremental runs.
	State *State

	// Dispatcher manages event dispatch for progress reporting.
	Dispatcher *Dispatcher

	// Verbose enables detailed logging output.
	Verbose bool

	// DryRun simulates collection without writing files.
	DryRun bool
}

Config holds shared configuration for all collectors.

func NewConfig

func NewConfig(outputDir string) *Config

NewConfig creates a Config with sensible defaults. It initialises a MockMedium for output if none is provided, sets up a rate limiter, state tracker, and event dispatcher.

func NewConfigWithMedium

func NewConfigWithMedium(m io.Medium, outputDir string) *Config

NewConfigWithMedium creates a Config using the specified storage medium.

type Dispatcher

type Dispatcher struct {
	// contains filtered or unexported fields
}

Dispatcher manages event dispatch. Handlers are registered per event type and are called synchronously when an event is emitted.

func NewDispatcher

func NewDispatcher() *Dispatcher

NewDispatcher creates a new event dispatcher.

func (*Dispatcher) Emit

func (d *Dispatcher) Emit(event Event)

Emit dispatches an event to all registered handlers for that event type. If no handlers are registered for the event type, the event is silently dropped. The event's Time field is set to now if it is zero.

func (*Dispatcher) EmitComplete

func (d *Dispatcher) EmitComplete(source, message string, data any)

EmitComplete emits a complete event.

func (*Dispatcher) EmitError

func (d *Dispatcher) EmitError(source, message string, data any)

EmitError emits an error event.

func (*Dispatcher) EmitItem

func (d *Dispatcher) EmitItem(source, message string, data any)

EmitItem emits an item event.

func (*Dispatcher) EmitProgress

func (d *Dispatcher) EmitProgress(source, message string, data any)

EmitProgress emits a progress event.

func (*Dispatcher) EmitStart

func (d *Dispatcher) EmitStart(source, message string)

EmitStart emits a start event for the given source.

func (*Dispatcher) On

func (d *Dispatcher) On(eventType string, handler EventHandler)

On registers a handler for an event type. Multiple handlers can be registered for the same event type and will be called in order.

type Event

type Event struct {
	// Type is one of the Event* constants.
	Type string `json:"type"`

	// Source identifies the collector that emitted the event.
	Source string `json:"source"`

	// Message is a human-readable description of the event.
	Message string `json:"message"`

	// Data carries optional event-specific payload.
	Data any `json:"data,omitempty"`

	// Time is when the event occurred.
	Time time.Time `json:"time"`
}

Event represents a collection event.

type EventHandler

type EventHandler func(Event)

EventHandler handles collection events.

type Excavator

type Excavator struct {
	// Collectors is the list of collectors to run.
	Collectors []Collector

	// ScanOnly reports what would be collected without performing collection.
	ScanOnly bool

	// Resume enables incremental collection using saved state.
	Resume bool
}

Excavator runs multiple collectors as a coordinated operation. It provides sequential execution with rate limit respect, state tracking for resume support, and aggregated results.

func (*Excavator) Name

func (e *Excavator) Name() string

Name returns the orchestrator name.

func (*Excavator) Run

func (e *Excavator) Run(ctx context.Context, cfg *Config) (*Result, error)

Run executes all collectors sequentially, respecting rate limits and using state for resume support. Results are aggregated from all collectors.

type FetchPageFunc

type FetchPageFunc func(ctx context.Context, url string) ([]btPost, error)

FetchPageFunc is an injectable function type for fetching pages, used in testing.

type GitHubCollector

type GitHubCollector struct {
	// Org is the GitHub organisation.
	Org string

	// Repo is the repository name. If empty and Org is set, all repos are collected.
	Repo string

	// IssuesOnly limits collection to issues (excludes PRs).
	IssuesOnly bool

	// PRsOnly limits collection to PRs (excludes issues).
	PRsOnly bool
}

GitHubCollector collects issues and PRs from GitHub repositories.

func (*GitHubCollector) Collect

func (g *GitHubCollector) Collect(ctx context.Context, cfg *Config) (*Result, error)

Collect gathers issues and/or PRs from GitHub repositories.

func (*GitHubCollector) Name

func (g *GitHubCollector) Name() string

Name returns the collector name.

type MarketCollector

type MarketCollector struct {
	// CoinID is the CoinGecko coin identifier (e.g. "bitcoin", "ethereum").
	CoinID string

	// Historical enables collection of historical market chart data.
	Historical bool

	// FromDate is the start date for historical data in YYYY-MM-DD format.
	FromDate string
}

MarketCollector collects market data from CoinGecko.

func (*MarketCollector) Collect

func (m *MarketCollector) Collect(ctx context.Context, cfg *Config) (*Result, error)

Collect gathers market data from CoinGecko.

func (*MarketCollector) Name

func (m *MarketCollector) Name() string

Name returns the collector name.

type PapersCollector

type PapersCollector struct {
	// Source is one of PaperSourceIACR, PaperSourceArXiv, or PaperSourceAll.
	Source string

	// Category is the arXiv category (e.g. "cs.CR" for cryptography).
	Category string

	// Query is the search query string.
	Query string
}

PapersCollector collects papers from IACR and arXiv.

func (*PapersCollector) Collect

func (p *PapersCollector) Collect(ctx context.Context, cfg *Config) (*Result, error)

Collect gathers papers from the configured sources.

func (*PapersCollector) Name

func (p *PapersCollector) Name() string

Name returns the collector name.

type Processor

type Processor struct {
	// Source identifies the data source directory to process.
	Source string

	// Dir is the directory containing files to process.
	Dir string
}

Processor converts collected data to clean markdown.

func (*Processor) Name

func (p *Processor) Name() string

Name returns the processor name.

func (*Processor) Process

func (p *Processor) Process(ctx context.Context, cfg *Config) (*Result, error)

Process reads files from the source directory, converts HTML or JSON to clean markdown, and writes the results to the output directory.

type RateLimiter

type RateLimiter struct {
	// contains filtered or unexported fields
}

RateLimiter tracks per-source rate limiting to avoid overwhelming APIs.

func NewRateLimiter

func NewRateLimiter() *RateLimiter

NewRateLimiter creates a limiter with default delays.

func (*RateLimiter) CheckGitHubRateLimit

func (r *RateLimiter) CheckGitHubRateLimit() (used, limit int, err error)

CheckGitHubRateLimit checks GitHub API rate limit status via gh api. Returns used and limit counts. Auto-pauses at 75% usage by increasing the GitHub rate limit delay.

func (*RateLimiter) GetDelay

func (r *RateLimiter) GetDelay(source string) time.Duration

GetDelay returns the delay configured for a source.

func (*RateLimiter) SetDelay

func (r *RateLimiter) SetDelay(source string, d time.Duration)

SetDelay sets the delay for a source.

func (*RateLimiter) Wait

func (r *RateLimiter) Wait(ctx context.Context, source string) error

Wait blocks until the rate limit allows the next request for the given source. It respects context cancellation.

type Result

type Result struct {
	// Source identifies which collector produced this result.
	Source string

	// Items is the number of items successfully collected.
	Items int

	// Errors is the number of errors encountered during collection.
	Errors int

	// Skipped is the number of items skipped (e.g. already collected).
	Skipped int

	// Files lists the paths of all files written.
	Files []string
}

Result holds the output of a collection run.

func MergeResults

func MergeResults(source string, results ...*Result) *Result

MergeResults combines multiple results into a single aggregated result.

type State

type State struct {
	// contains filtered or unexported fields
}

State tracks collection progress for incremental runs. It persists entries to disk so that subsequent runs can resume where they left off.

func NewState

func NewState(m io.Medium, path string) *State

NewState creates a state tracker that persists to the given path using the provided storage medium.

func (*State) Get

func (s *State) Get(source string) (*StateEntry, bool)

Get returns a copy of the state for a source. The second return value indicates whether the entry was found.

func (*State) Load

func (s *State) Load() error

Load reads state from disk. If the file does not exist, the state is initialised as empty without error.

func (*State) Save

func (s *State) Save() error

Save writes state to disk.

func (*State) Set

func (s *State) Set(source string, entry *StateEntry)

Set updates state for a source.

type StateEntry

type StateEntry struct {
	// Source identifies the collector.
	Source string `json:"source"`

	// LastRun is the timestamp of the last successful run.
	LastRun time.Time `json:"last_run"`

	// LastID is an opaque identifier for the last item processed.
	LastID string `json:"last_id,omitempty"`

	// Items is the total number of items collected so far.
	Items int `json:"items"`

	// Cursor is an opaque pagination cursor for resumption.
	Cursor string `json:"cursor,omitempty"`
}

StateEntry tracks state for one source.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL