Documentation
¶
Overview ¶
Package ccrawl is a Go library for working with Common Crawl data: the collection list, the CDX URL index, WARC/WAT/WET archive files, the columnar Parquet index, CC-NEWS, and the host/domain ranks. It is the engine behind the ccrawl command line tool but is usable on its own.
Index ¶
- Constants
- Variables
- func CDXNumPages(ctx context.Context, h *HTTPClient, crawlID string, q CDXQuery) (int, error)
- func CDXStream(ctx context.Context, h *HTTPClient, crawlID string, q CDXQuery, ...) error
- func CanonicalURL(raw string) string
- func ColumnarParquetURLs(ctx context.Context, h *HTTPClient, cache *Cache, crawlID, subset string, ...) ([]string, error)
- func ColumnarSource(crawlID, subset string, src Source) string
- func ConfigDir() string
- func DownloadFiles(ctx context.Context, h *HTTPClient, src Source, paths []string, ...) error
- func DuckDBAvailable() bool
- func ExtractMarkdown(body []byte) (string, error)
- func ExtractText(body []byte) string
- func ExtractTitle(body []byte) string
- func FetchPaths(ctx context.Context, h *HTTPClient, cache *Cache, crawlID, kind string) ([]string, error)
- func FileURL(path string, src Source) string
- func HTTPBody(block []byte) []byte
- func HTTPHeaders(block []byte) []byte
- func HTTPSURL(path string) string
- func HostOf(raw string) string
- func InferMatchType(pattern string) (cleanURL, matchType string)
- func IterateWARC(r io.Reader, fn func(WARCRecord) error) error
- func IterateWAT(r io.Reader, crawlID string, fn func(WATRecord) error) error
- func IterateWET(r io.Reader, crawlID string, fn func(WETRecord) error) error
- func LibraryDir() string
- func ParquetListLiteral(urls []string) string
- func ResolveCrawl(ctx context.Context, h *HTTPClient, cache *Cache, ref string) (string, error)
- func RunColumnarDuckDB(ctx context.Context, sql string, emit func(map[string]any) error) error
- func RunDuckDBJSON(ctx context.Context, dbPath, sql string, emit func(map[string]any) error) error
- func SURT(raw string) string
- func StreamPaths(ctx context.Context, h *HTTPClient, crawlID, kind string, ...) error
- type CDXQuery
- type CDXRecord
- type Cache
- type ColumnarQuery
- type Config
- type Crawl
- type DownloadResult
- type HTTPClient
- func (h *HTTPClient) FetchBytes(ctx context.Context, url string) ([]byte, error)
- func (h *HTTPClient) Get(ctx context.Context, url string) (*http.Response, error)
- func (h *HTTPClient) GetDownload(ctx context.Context, url string) (*http.Response, error)
- func (h *HTTPClient) GetRange(ctx context.Context, url string, offset, length int64) (*http.Response, error)
- type Library
- type Location
- type NewsFile
- type ParquetWriter
- type Rank
- type Source
- type WARCHeader
- type WARCParquetRow
- type WARCRecord
- type WATLink
- type WATMeta
- type WATParquetRow
- type WATRecord
- type WETParquetRow
- type WETRecord
Constants ¶
const ( CollInfoURL = "https://index.commoncrawl.org/collinfo.json" DataBaseURL = "https://data.commoncrawl.org/" CDXBaseURL = "https://index.commoncrawl.org/" S3BaseURL = "s3://commoncrawl/" // ColumnarPrefix is the root of the columnar (Parquet) index. ColumnarPrefix = "cc-index/table/cc-main/warc/" // UserAgent identifies the client politely to Common Crawl's CDN. UserAgent = "ccrawl/1.0 (+https://github.com/tamnd/ccrawl-cli)" )
Common Crawl endpoints.
const ( DefaultTimeout = 120 * time.Second DefaultRetries = 5 DefaultDelay = 200 * time.Millisecond )
Defaults for the client and downloader.
const DuckDBPrelude = "INSTALL httpfs; LOAD httpfs; SET enable_progress_bar=false; SET allow_asterisks_in_http_paths=true;"
DuckDBPrelude is prepended to every statement ccrawl sends to the duckdb binary. httpfs reads remote Parquet over HTTPS or S3; the progress bar is noise on a pipe; and allow_asterisks_in_http_paths is required because the columnar index is addressed with a glob (subset=warc/*.parquet) over HTTP, which duckdb refuses by default.
Variables ¶
var DefaultColumnarColumns = []string{
"url", "url_host_registered_domain", "fetch_status",
"content_mime_detected", "content_languages",
"warc_filename", "warc_record_offset", "warc_record_length",
}
DefaultColumnarColumns are the columns selected when none are given.
var LocationColumns = []string{"url", "warc_filename", "warc_record_offset", "warc_record_length"}
LocationColumns return just the fields needed to range-fetch a record.
var PathKinds = []string{"warc", "wat", "wet", "robotstxt", "non200responses", "cc-index", "cc-index-table", "segment"}
PathKinds are the file manifests published per crawl.
Functions ¶
func CDXNumPages ¶
CDXNumPages returns the number of result pages for a query.
func CDXStream ¶
func CDXStream(ctx context.Context, h *HTTPClient, crawlID string, q CDXQuery, fn func(CDXRecord) error) error
CDXStream runs a query and calls fn for each matching record, paginating through the server's pages and stopping at q.Limit.
func CanonicalURL ¶
CanonicalURL applies light canonicalization: ensure a scheme, lower-case the host, and drop a fragment. It does not reorder query parameters.
func ColumnarParquetURLs ¶
func ColumnarParquetURLs(ctx context.Context, h *HTTPClient, cache *Cache, crawlID, subset string, src Source) ([]string, error)
ColumnarParquetURLs resolves the columnar index glob into the explicit list of parquet file URLs for one crawl and subset. Common Crawl's bucket does not allow anonymous listing, so a duckdb run cannot expand the `*.parquet` glob over HTTPS (or anonymous S3) on its own. The crawl publishes the full file list in cc-index-table.paths.gz, so we read that manifest (cached) and turn each entry into a fetchable URL for the chosen source.
func ColumnarSource ¶
ColumnarSource returns the parquet glob for one crawl's columnar index subset (subset is warc, crawldiagnostics, or robotstxt).
func DownloadFiles ¶
func DownloadFiles(ctx context.Context, h *HTTPClient, src Source, paths []string, localDir string, workers int, flat bool, progress func(DownloadResult)) error
DownloadFiles fetches a list of Common Crawl relative paths into localDir, concurrently and resumably. progress is called once per file when non-nil.
func DuckDBAvailable ¶
func DuckDBAvailable() bool
DuckDBAvailable reports whether a duckdb binary is on PATH.
func ExtractMarkdown ¶
ExtractMarkdown converts an HTML document to a compact Markdown approximation. It is intentionally light: headings, paragraphs, list items, links, and emphasis, which covers the bulk of crawled article content.
func ExtractText ¶
ExtractText returns readable plain text from an HTML document, dropping the contents of script and style elements and collapsing whitespace.
func ExtractTitle ¶
ExtractTitle returns the <title> text of an HTML document.
func FetchPaths ¶
func FetchPaths(ctx context.Context, h *HTTPClient, cache *Cache, crawlID, kind string) ([]string, error)
FetchPaths downloads and decompresses a crawl's path manifest.
func FileURL ¶
FileURL turns a Common Crawl relative path into a fetchable URL for the given source. HTTPS uses the CloudFront mirror; S3 uses the bucket URI.
func HTTPBody ¶
HTTPBody splits a response block at the header/body boundary and returns the body. It returns the whole block when no boundary is found.
func HTTPHeaders ¶
HTTPHeaders returns the header section (status line + headers) of a response block, without the body.
func HTTPSURL ¶
HTTPSURL always returns the HTTPS mirror URL (used for control-plane fetches like manifests and collinfo regardless of the bulk source).
func InferMatchType ¶
InferMatchType guesses the CDX matchType from a user-supplied URL pattern. "*.example.com" -> domain, "example.com/*" -> prefix, otherwise exact unless the caller already chose host/domain/prefix.
func IterateWARC ¶
func IterateWARC(r io.Reader, fn func(WARCRecord) error) error
IterateWARC reads a WARC file (a multi-member gzip stream where each member is one record) and calls fn for every record. The parser lives in pkg/warc.
func IterateWAT ¶
IterateWAT reads a WAT file and calls fn for each parsed record. The parser lives in pkg/wat.
func IterateWET ¶
IterateWET reads a WET file (WARC conversion records holding plain text) and calls fn for each record. The parser lives in pkg/wet.
func LibraryDir ¶
func LibraryDir() string
LibraryDir is the root of the structured dataset library that the --library flag downloads into and processes from. It is deliberately separate from the data dir: the data dir (see Config) holds ad-hoc downloads, the cache, and the local DuckDB file, while the library is a curated, browsable corpus you build up over time. CCRAWL_LIBRARY overrides the default of ~/notes/ccrawl.
func ParquetListLiteral ¶
ParquetListLiteral renders parquet URLs as a duckdb list literal, e.g. ['https://a', 'https://b'], suitable as the argument to read_parquet.
func ResolveCrawl ¶
ResolveCrawl turns a loose reference into a canonical crawl ID.
"latest" -> newest crawl "CC-MAIN-2024-51" -> itself "2024-51" -> "CC-MAIN-2024-51" "2024" -> newest crawl whose ID starts with CC-MAIN-2024
func RunColumnarDuckDB ¶
RunColumnarDuckDB executes sql with the local duckdb binary, installing the httpfs extension for S3/HTTPS parquet access, and streams JSON rows to emit.
func RunDuckDBJSON ¶
RunDuckDBJSON runs sql with the local duckdb binary and streams JSON rows to emit. An empty dbPath runs against an in-memory database; a path opens (and creates) a persistent database file. httpfs is loaded so remote parquet over HTTPS or S3 works either way.
func SURT ¶
SURT converts a URL into a Sort-friendly URI Reordering Transform key, the canonical form Common Crawl uses to sort and group its index. For example "https://www.example.com/a/b?q=1" becomes "com,example,www)/a/b?q=1".
The transform lower-cases the scheme and host, reverses the host labels, drops a leading "www.", strips the default port, and keeps the path and query.
func StreamPaths ¶
func StreamPaths(ctx context.Context, h *HTTPClient, crawlID, kind string, fn func(string) error) error
StreamPaths streams a crawl's path manifest one path at a time.
Types ¶
type CDXQuery ¶
type CDXQuery struct {
URL string // URL or pattern
Match string // exact|prefix|host|domain (empty -> inferred from URL)
From string // 14-digit (or loose) lower time bound
To string // 14-digit (or loose) upper time bound
Status string // HTTP status filter (e.g. "200")
MIME string // mime-detected filter
Lang string // languages filter (ISO-639-3)
Filter []string
Limit int
}
CDXQuery describes a query against the CDX URL index.
type CDXRecord ¶
type CDXRecord struct {
CrawlID string `json:"crawl,omitempty"`
URLKey string `json:"urlkey"`
Timestamp string `json:"timestamp"` // 14-digit YYYYMMDDHHmmss
URL string `json:"url"`
MIME string `json:"mime"`
MIMEDetected string `json:"mime-detected"`
Status string `json:"status"`
Digest string `json:"digest"`
Length string `json:"length"`
Offset string `json:"offset"`
Filename string `json:"filename"`
Charset string `json:"charset,omitempty"`
Languages string `json:"languages,omitempty"`
Truncated string `json:"truncated,omitempty"`
Redirect string `json:"redirect,omitempty"`
}
CDXRecord is one capture from the URL index. Numeric fields stay as strings because that is how the CDX server returns them; helpers convert on demand.
type Cache ¶
type Cache struct {
// contains filtered or unexported fields
}
Cache is a tiny on-disk blob cache keyed by an arbitrary string, with a TTL per entry. It is safe for the simple single-process use the CLI makes of it.
func NewCache ¶
NewCache returns a cache rooted under dir. If dir is empty or enabled is false, all operations are no-ops (cache miss on every Get).
type ColumnarQuery ¶
type ColumnarQuery struct {
Crawl string
Subset string // warc (default) | crawldiagnostics | robotstxt
Domain string // url_host_registered_domain
Host string // url_host_name
TLD string // url_host_tld
MIME string // content_mime_detected
Lang string // content_languages (substring match)
PathPrefix string // url_path prefix
Status int // fetch_status (0 = any)
Select []string
Limit int
}
ColumnarQuery builds SQL against the columnar (Parquet) index. The zero value selects everything; set fields to add WHERE clauses.
func (ColumnarQuery) SQL ¶
func (q ColumnarQuery) SQL(src Source) string
SQL renders the query as a runnable DuckDB statement reading parquet over the given source. The same text runs in Athena or Spark after swapping read_parquet for the engine's table reference.
type Config ¶
type Config struct {
DataDir string
CacheDir string
DBPath string
Source Source
Workers int
Timeout time.Duration
Delay time.Duration
Retries int
UserAgent string
CrawlID string
}
Config controls library behaviour. The zero value is not usable; call DefaultConfig and adjust.
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns a Config rooted at the XDG data/cache directories, with the most recent crawl resolved lazily (CrawlID == "latest").
func (Config) ParquetDir ¶
ParquetDir is where converted Parquet files land.
type Crawl ¶
type Crawl struct {
ID string `json:"id"`
Name string `json:"name"`
CDXAPI string `json:"cdx-api"`
From string `json:"from,omitempty"`
To string `json:"to,omitempty"`
}
Crawl is one Common Crawl collection as published in collinfo.json.
func ListCrawls ¶
ListCrawls fetches and parses collinfo.json. Results are cached when a cache is supplied (pass nil to skip).
type DownloadResult ¶
DownloadResult is the outcome of fetching one file.
type HTTPClient ¶
type HTTPClient struct {
// contains filtered or unexported fields
}
HTTPClient is a polite, retrying HTTP client for Common Crawl. It rate-limits requests, retries on 429/5xx with linear backoff, and supports byte-range requests for single-record retrieval.
func NewHTTPClient ¶
func NewHTTPClient(cfg Config) *HTTPClient
NewHTTPClient builds an HTTPClient from cfg.
func (*HTTPClient) FetchBytes ¶
FetchBytes fetches url and returns the whole body.
func (*HTTPClient) GetDownload ¶
GetDownload fetches url with no client timeout (relies on ctx cancellation), for large archive bodies.
type Library ¶
Library is a structured corpus of Common Crawl archive files for one crawl, rooted at Root. The layout is predictable so a directory listing tells you exactly what you have:
<root>/<crawl>/<kind>/<file>.gz raw downloaded archives <root>/<crawl>/<format>/<kind>/<file>.<ext> processed output (parquet|jsonl)
Files are stored flat under each kind by their base name. A Common Crawl file name already encodes its segment and timestamp and is unique within a crawl, so the base name alone is a safe, stable key with no risk of collision.
func NewLibrary ¶
NewLibrary returns a Library rooted at root (or LibraryDir() when root is empty) for the given crawl ID.
func (Library) CrawlDir ¶
CrawlDir is the per-crawl root, the parent of every kind and format directory.
func (Library) ProcessedDir ¶
ProcessedDir is where processed output of a kind lives, grouped by format (parquet or jsonl) so the same archives can be materialised more than one way side by side.
type Location ¶
type Location struct {
Filename string `json:"filename"`
Offset int64 `json:"offset"`
Length int64 `json:"length"`
URL string `json:"url,omitempty"`
}
Location is the WARC file plus byte span needed to range-fetch this capture.
type NewsFile ¶
NewsFile describes one CC-NEWS WARC file.
func ListNewsFiles ¶
ListNewsFiles returns the CC-NEWS WARC files for a year and month. Pass month 0 to list every month of a year; pass year 0 to list everything found via the index page.
type ParquetWriter ¶
type ParquetWriter[T any] struct { // contains filtered or unexported fields }
ParquetWriter writes rows of type T to a zstd-compressed Parquet file.
func NewParquetWriter ¶
func NewParquetWriter[T any](path string) (*ParquetWriter[T], error)
NewParquetWriter creates a Parquet writer for path.
func (*ParquetWriter[T]) Close ¶
func (p *ParquetWriter[T]) Close() error
Close flushes and closes the file.
func (*ParquetWriter[T]) Rows ¶
func (p *ParquetWriter[T]) Rows() int64
Rows returns the number of rows written.
func (*ParquetWriter[T]) Write ¶
func (p *ParquetWriter[T]) Write(row T) error
Write appends one row.
type Rank ¶
type Rank struct {
Key string `json:"key"` // host or domain (forward form)
HarmonicPos int64 `json:"harmonic_pos"`
HarmonicVal float64 `json:"harmonic_val"`
PageRankPos int64 `json:"pagerank_pos"`
PageRankVal float64 `json:"pagerank_val"`
}
Rank is a host/domain entry from the web-graph rank tables.
func RankLookup ¶
RankLookup streams a gzipped rank table from url and returns the entry whose reversed key matches the given host or domain, or a not-found error.
type WARCParquetRow ¶
type WARCParquetRow struct {
RecordID string `parquet:"record_id,dict"`
CrawlID string `parquet:"crawl_id,dict"`
WARCType string `parquet:"warc_type,dict"`
TargetURI string `parquet:"target_uri"`
Date time.Time `parquet:"date,timestamp(microsecond)"`
IPAddress string `parquet:"ip_address,dict"`
PayloadDigest string `parquet:"payload_digest"`
ContentType string `parquet:"content_type,dict"`
ContentLength int64 `parquet:"content_length"`
Truncated string `parquet:"truncated,dict"`
HTTPStatus int32 `parquet:"http_status"`
HTTPMIME string `parquet:"http_mime,dict"`
WARCFilename string `parquet:"warc_filename,dict"`
WARCOffset int64 `parquet:"warc_offset"`
WARCLength int64 `parquet:"warc_length"`
Title string `parquet:"title"`
Language string `parquet:"language,dict"`
Markdown string `parquet:"markdown"`
Text string `parquet:"text"`
}
WARCParquetRow is the columnar schema for parsed WARC record metadata. When a response body is converted, the content fields are populated too.
type WARCRecord ¶
WARCRecord is a parsed WARC record: its header and the raw block bytes. For a response record the block is the full HTTP message (status line, headers, body).
func FetchWARCRecord ¶
func FetchWARCRecord(ctx context.Context, h *HTTPClient, filename string, offset, length int64) (WARCRecord, error)
FetchWARCRecord retrieves a single WARC record from the given file using a byte-range request. This is how a capture's content is pulled without downloading the whole multi-gigabyte WARC.
type WATLink ¶
WATLink is a hyperlink extracted from page HTML.
func ExtractLinks ¶
ExtractLinks returns the outbound hyperlinks of an HTML document, resolved against base when possible.
type WATParquetRow ¶
type WATParquetRow struct {
RecordID string `parquet:"record_id,dict"`
CrawlID string `parquet:"crawl_id,dict"`
URL string `parquet:"url"`
Date time.Time `parquet:"date,timestamp(microsecond)"`
HTTPStatus int32 `parquet:"http_status"`
ContentType string `parquet:"content_type,dict"`
Title string `parquet:"title"`
LinksCount int32 `parquet:"links_count"`
Links string `parquet:"links"` // JSON
Metas string `parquet:"metas"` // JSON
}
WATParquetRow is the columnar schema for WAT link and metadata records.
type WETParquetRow ¶
type WETParquetRow struct {
RecordID string `parquet:"record_id,dict"`
CrawlID string `parquet:"crawl_id,dict"`
URL string `parquet:"url"`
Date time.Time `parquet:"date,timestamp(microsecond)"`
ContentLanguage string `parquet:"content_language,dict"`
TextLength int32 `parquet:"text_length"`
Text string `parquet:"text"`
}
WETParquetRow is the columnar schema for WET plain-text records.