warc

package

Version: v0.1.0 (latest) | Published: Jun 13, 2026 | License: Apache-2.0 | Imports: 9 | Imported by: 0

Repository

github.com/tamnd/ccrawl-cli

Documentation ¶

Overview ¶

Package warc reads WARC files the way Common Crawl stores them: a stream of gzip members, one record per member. It decodes records without buffering the whole file, so a single record can be pulled from an HTTP byte range, and it exposes the small helpers needed to split an HTTP response block into its headers and body.

Index ¶

func HTTPBody(block []byte) []byte
func HTTPHeaders(block []byte) []byte
func Iterate(r io.Reader, fn func(Record) error) error
func TrimURI(s string) string
type Header
type Record

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func HTTPBody ¶

func HTTPBody(block []byte) []byte

HTTPBody splits a response block at the header/body boundary and returns the body. It returns the whole block when no boundary is found.

func HTTPHeaders ¶

func HTTPHeaders(block []byte) []byte

HTTPHeaders returns the header section (status line + headers) of a response block, without the body.
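The split that HTTPBody and HTTPHeaders describe can be sketched with the standard library alone. splitHTTPBlock below is a hypothetical stand-in, not the package's implementation: it assumes the boundary is the first CRLF CRLF sequence, and it assumes an empty header section when no boundary exists (the documentation only states the HTTPBody fallback).

```go
package main

import (
	"bytes"
	"fmt"
)

// headerBodySep is the blank line that ends an HTTP/1.x header section.
var headerBodySep = []byte("\r\n\r\n")

// splitHTTPBlock is a hypothetical helper that mimics the documented
// behaviour of HTTPHeaders and HTTPBody in a single call.
func splitHTTPBlock(block []byte) (headers, body []byte) {
	i := bytes.Index(block, headerBodySep)
	if i < 0 {
		// No boundary: return the whole block as the body, matching the
		// documented HTTPBody fallback. The nil header result is an assumption.
		return nil, block
	}
	return block[:i], block[i+len(headerBodySep):]
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>")
	headers, body := splitHTTPBlock(block)
	fmt.Printf("headers: %q\n", headers)
	fmt.Printf("body:    %q\n", body)
}
```

Both results are subslices of the input, so nothing is copied; callers that keep them past the lifetime of the block would need to copy.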

func Iterate ¶

func Iterate(r io.Reader, fn func(Record) error) error

Iterate reads a WARC file (a multi-member gzip stream where each member is one record) and calls fn for every record.

The whole input is wrapped in a single *bufio.Reader, and that same reader is handed to gz.Reset at every member boundary. klauspost/compress/gzip keeps the buffered reader it is given (z.r = rb), so bytes read ahead past the end of one member are consumed as the start of the next, and the file never has to be buffered in full. This is what makes it possible to fetch a single record over an HTTP byte range.

func TrimURI ¶

func TrimURI(s string) string

TrimURI removes the angle brackets WARC sometimes wraps URIs in.
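A minimal sketch of this kind of trimming, for illustration only: trimAngle is a hypothetical stand-in, and how TrimURI itself treats surrounding whitespace or a lone bracket is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// trimAngle strips one leading "<" and one trailing ">" if present.
// Assumption: an unbracketed string is returned unchanged.
func trimAngle(s string) string {
	return strings.TrimSuffix(strings.TrimPrefix(s, "<"), ">")
}

func main() {
	fmt.Println(trimAngle("<urn:uuid:0a1b2c3d>"))
	fmt.Println(trimAngle("https://example.com/"))
}
```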

Types ¶

type Header ¶

type Header struct {
	Type          string // warcinfo|request|response|metadata|revisit|conversion|resource
	Date          time.Time
	RecordID      string
	TargetURI     string
	IPAddress     string
	ConcurrentTo  string
	WarcinfoID    string
	BlockDigest   string
	PayloadDigest string
	RefersTo      string
	Truncated     string
	ContentType   string
	ContentLength int64
	Language      string // WARC-Identified-Content-Language (WET records)
	// Response records only: extracted HTTP fields.
	HTTPStatus int
	HTTPMIME   string
	// Source location for range-request retrieval.
	WARCFilename string
	WARCOffset   int64
	WARCLength   int64
}

Header holds parsed WARC record headers.
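The three source-location fields are enough to build an HTTP range request for one record. The sketch below assumes WARCOffset is the byte offset where the record's gzip member starts and WARCLength is its compressed length; the location type, rangeValue, newRangeRequest, and the base URL are all hypothetical placeholders.

```go
package main

import (
	"fmt"
	"net/http"
)

// location mirrors the source-location fields of Header for this sketch.
type location struct {
	WARCFilename string
	WARCOffset   int64
	WARCLength   int64
}

// rangeValue formats an HTTP Range header value. Byte ranges are inclusive,
// so the last byte is offset+length-1.
func rangeValue(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newRangeRequest builds (but does not send) a GET request for one record.
func newRangeRequest(baseURL string, loc location) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+loc.WARCFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeValue(loc.WARCOffset, loc.WARCLength))
	return req, nil
}

func main() {
	loc := location{WARCFilename: "path/to/file.warc.gz", WARCOffset: 100, WARCLength: 50}
	// The base URL is a placeholder, not a real data endpoint.
	req, err := newRangeRequest("https://example.com/", loc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String(), req.Header.Get("Range"))
}
```

A server that honours the range replies with 206 Partial Content, and the response body is then a single gzip member that a reader such as Iterate can decode on its own.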

type Record ¶

type Record struct {
	Header Header
	Block  []byte
}

Record is a parsed WARC record: its header and the raw block bytes. For a response record the block is the full HTTP message (status line, headers, body).
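Since a response block is a complete HTTP message, it can also be parsed with the standard library. The sketch below is one possible way to recover values like those in HTTPStatus and HTTPMIME using net/http and mime; parseResponse is a hypothetical helper, and the package may well extract these fields differently.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"mime"
	"net/http"
)

// parseResponse reads a raw HTTP response block and returns its status code,
// media type (without parameters such as charset), and body.
func parseResponse(block []byte) (status int, mimeType string, body []byte, err error) {
	// A nil request makes ReadResponse assume the response answers a GET.
	resp, err := http.ReadResponse(bufio.NewReader(bytes.NewReader(block)), nil)
	if err != nil {
		return 0, "", nil, err
	}
	defer resp.Body.Close()
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", nil, err
	}
	// A missing or malformed Content-Type leaves mimeType empty.
	mimeType, _, _ = mime.ParseMediaType(resp.Header.Get("Content-Type"))
	return resp.StatusCode, mimeType, body, nil
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\n" +
		"Content-Type: text/html; charset=utf-8\r\n" +
		"Content-Length: 5\r\n\r\nhello")
	status, mimeType, body, err := parseResponse(block)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, mimeType, string(body))
}
```

Full parsing is stricter than a plain byte split: it rejects malformed status lines and honours Content-Length, so archived responses with inconsistent headers may fail here while a boundary split still succeeds.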

Source Files ¶

warc.go
