Documentation ¶
Overview ¶
Package warc reads WARC files the way Common Crawl stores them: a stream of gzip members, one record per member. It decodes records without buffering the whole file, so a single record can be pulled from an HTTP byte range, and it exposes the small helpers needed to split an HTTP response block into its headers and body.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func HTTPBody ¶
HTTPBody splits a response block at the header/body boundary and returns the body. It returns the whole block when no boundary is found.
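The fallback behaviour described above can be sketched as follows. This is not the package's implementation: `bodyOf` is a hypothetical name, and the sketch assumes the boundary is the usual blank line encoded as CRLF CRLF. The real helper may accept other line endings.

```go
package main

import (
	"bytes"
	"fmt"
)

// bodyOf is a hypothetical stand-in for HTTPBody. It assumes the
// header/body boundary is "\r\n\r\n" and returns the block unchanged
// when that boundary is absent.
func bodyOf(block []byte) []byte {
	if _, body, found := bytes.Cut(block, []byte("\r\n\r\n")); found {
		return body
	}
	return block
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello")
	fmt.Printf("%q\n", bodyOf(block))

	// No boundary: the whole block comes back.
	fmt.Printf("%q\n", bodyOf([]byte("no headers here")))
}
```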
func HTTPHeaders ¶
HTTPHeaders returns the header section (status line + headers) of a response block, without the body.
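A header section of this shape (status line followed by header fields) can be parsed with the standard library alone. The sketch below shows one way to recover a status code and media type from it. `statusAndMIME` is a hypothetical helper, not part of this package, and the package may extract these fields differently.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"mime"
	"net/textproto"
	"strconv"
	"strings"
)

// statusAndMIME is a hypothetical helper: it reads the status line and
// the Content-Type field from a header section with no body attached.
func statusAndMIME(headers []byte) (int, string, error) {
	tp := textproto.NewReader(bufio.NewReader(bytes.NewReader(headers)))

	// Status line, e.g. "HTTP/1.1 200 OK".
	line, err := tp.ReadLine()
	if err != nil {
		return 0, "", err
	}
	parts := strings.Fields(line)
	if len(parts) < 2 {
		return 0, "", fmt.Errorf("malformed status line %q", line)
	}
	status, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", err
	}

	// The section has no terminating blank line, so io.EOF is expected.
	h, err := tp.ReadMIMEHeader()
	if err != nil && !errors.Is(err, io.EOF) {
		return status, "", err
	}
	ct := h.Get("Content-Type")
	if ct == "" {
		return status, "", nil
	}
	mediaType, _, err := mime.ParseMediaType(ct)
	if err != nil {
		return status, "", err
	}
	return status, mediaType, nil
}

func main() {
	hdr := []byte("HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n")
	status, mediaType, err := statusAndMIME(hdr)
	fmt.Println(status, mediaType, err)
}
```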
func Iterate ¶
Iterate reads a WARC file (a multi-member gzip stream where each member is one record) and calls fn for every record.
The whole input is wrapped in a single *bufio.Reader, and that same reader is passed to gz.Reset at each member boundary. klauspost/compress/gzip keeps the buffered reader it is given (z.r = rb), so bytes read ahead past the end of one member stay in the buffer and are consumed as the start of the next. No full-file buffering is needed, which is what makes fetching a single record over an HTTP byte range work.
Types ¶
type Header ¶
type Header struct {
    Type          string // warcinfo|request|response|metadata|revisit|conversion|resource
    Date          time.Time
    RecordID      string
    TargetURI     string
    IPAddress     string
    ConcurrentTo  string
    WarcinfoID    string
    BlockDigest   string
    PayloadDigest string
    RefersTo      string
    Truncated     string
    ContentType   string
    ContentLength int64
    Language      string // WARC-Identified-Content-Language (WET records)

    // Response records only: extracted HTTP fields.
    HTTPStatus int
    HTTPMIME   string

    // Source location for range-request retrieval.
    WARCFilename string
    WARCOffset   int64
    WARCLength   int64
}
Header holds parsed WARC record headers.
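The three source-location fields are enough to build an HTTP range request for a single record. The sketch below assumes that WARCOffset is the byte offset of the record's gzip member within the compressed file and that WARCLength is the member's compressed length. The documentation above does not state this, so treat it as an assumption. The `location` type, the `rangeRequest` helper, and the file path are hypothetical. The base URL is Common Crawl's public data host.

```go
package main

import (
	"fmt"
	"net/http"
)

// location is a hypothetical mirror of the WARCFilename, WARCOffset and
// WARCLength fields of Header.
type location struct {
	Filename string
	Offset   int64
	Length   int64
}

// rangeRequest builds (but does not send) a GET request for one record.
// HTTP byte ranges are inclusive, so the last byte is Offset+Length-1.
func rangeRequest(baseURL string, loc location) (*http.Request, error) {
	if loc.Offset < 0 || loc.Length <= 0 {
		return nil, fmt.Errorf("invalid range: offset=%d length=%d", loc.Offset, loc.Length)
	}
	req, err := http.NewRequest(http.MethodGet, baseURL+loc.Filename, nil)
	if err != nil {
		return nil, err
	}
	last := loc.Offset + loc.Length - 1
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", loc.Offset, last))
	return req, nil
}

func main() {
	// Hypothetical path, for illustration only.
	loc := location{Filename: "crawl-data/example/sample.warc.gz", Offset: 100, Length: 50}
	req, err := rangeRequest("https://data.commoncrawl.org/", loc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

The response body of such a request would then be a single gzip member that a multi-member reader can decode on its own.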