warc

package
Version: v0.1.0
Published: Jun 13, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Overview

Package warc reads WARC files the way Common Crawl stores them: a stream of gzip members, one record per member. It decodes records without buffering the whole file, so a single record can be pulled from an HTTP byte range, and it exposes the small helpers needed to split an HTTP response block into its headers and body.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HTTPBody

func HTTPBody(block []byte) []byte

HTTPBody splits a response block at the header/body boundary and returns the body. It returns the whole block when no boundary is found.

func HTTPHeaders

func HTTPHeaders(block []byte) []byte

HTTPHeaders returns the header section (status line + headers) of a response block, without the body.
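The split that HTTPBody and HTTPHeaders describe can be sketched as follows. This is not the package's source: `splitHTTP` is a hypothetical helper, and it assumes the boundary is the first CRLF CRLF sequence. Returning nil headers when no boundary is found is also an assumption; the documentation above only specifies that case for the body.

```go
package main

import (
	"bytes"
	"fmt"
)

// headerEnd is the blank line that ends an HTTP/1.x header section.
var headerEnd = []byte("\r\n\r\n")

// splitHTTP is a hypothetical helper, not part of the package API.
// It returns the header section (status line + headers) and the body.
// When no boundary is found, headers is nil and body is the whole block.
func splitHTTP(block []byte) (headers, body []byte) {
	i := bytes.Index(block, headerEnd)
	if i < 0 {
		return nil, block
	}
	return block[:i], block[i+len(headerEnd):]
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello")
	headers, body := splitHTTP(block)
	fmt.Printf("headers: %q\n", headers)
	fmt.Printf("body:    %q\n", body)
}
```

Both results are subslices of the input, so nothing is copied; a caller that keeps them after modifying the block would need to copy first.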

func Iterate

func Iterate(r io.Reader, fn func(Record) error) error

Iterate reads a WARC file (a multi-member gzip stream where each member is one record) and calls fn for every record.

The whole input is wrapped in a single *bufio.Reader, and that same reader is passed to gz.Reset at each member boundary. klauspost/compress/gzip keeps that buffered reader (z.r = rb), so any bytes read ahead past the end of one member are still available as the start of the next. No full-file buffering is needed, which is what makes fetching a single record over an HTTP byte range work.

func TrimURI

func TrimURI(s string) string

TrimURI removes the angle brackets WARC sometimes wraps URIs in.
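For illustration, a value such as `<https://example.com/page>` would come back as `https://example.com/page`. The sketch below is a hypothetical reimplementation named `trimURI`; it assumes the brackets are only removed when both are present, which the package may or may not do.

```go
package main

import "fmt"

// trimURI is a hypothetical stand-in for TrimURI.
// It strips one pair of enclosing angle brackets, if both are present.
func trimURI(s string) string {
	if len(s) >= 2 && s[0] == '<' && s[len(s)-1] == '>' {
		return s[1 : len(s)-1]
	}
	return s
}

func main() {
	fmt.Println(trimURI("<https://example.com/page>"))
	fmt.Println(trimURI("https://example.com/page"))
}
```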

Types

type Header

type Header struct {
	Type          string // warcinfo|request|response|metadata|revisit|conversion|resource
	Date          time.Time
	RecordID      string
	TargetURI     string
	IPAddress     string
	ConcurrentTo  string
	WarcinfoID    string
	BlockDigest   string
	PayloadDigest string
	RefersTo      string
	Truncated     string
	ContentType   string
	ContentLength int64
	Language      string // WARC-Identified-Content-Language (WET records)
	// Response records only: extracted HTTP fields.
	HTTPStatus int
	HTTPMIME   string
	// Source location for range-request retrieval.
	WARCFilename string
	WARCOffset   int64
	WARCLength   int64
}

Header holds parsed WARC record headers.
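The three source-location fields are enough to build a range request for one record. The sketch below shows one way to do that with net/http; `location`, `rangeRequest`, the base URL, and the file name are all hypothetical, and it assumes WARCOffset and WARCLength describe the compressed gzip member within the file.

```go
package main

import (
	"fmt"
	"net/http"
)

// location mirrors the three source-location fields of Header.
type location struct {
	WARCFilename string
	WARCOffset   int64
	WARCLength   int64
}

// rangeRequest builds (but does not send) a GET for exactly one record.
// baseURL is an assumption: the prefix under which the WARC files are served.
func rangeRequest(baseURL string, loc location) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+loc.WARCFilename, nil)
	if err != nil {
		return nil, err
	}
	// HTTP byte ranges are inclusive at both ends.
	last := loc.WARCOffset + loc.WARCLength - 1
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", loc.WARCOffset, last))
	return req, nil
}

func main() {
	loc := location{
		WARCFilename: "crawl/segment/example.warc.gz", // placeholder path
		WARCOffset:   1000,
		WARCLength:   500,
	}
	req, err := rangeRequest("https://data.example.org/", loc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL, req.Header.Get("Range"))
}
```

A server that honors the range answers with 206 Partial Content, and the response body is then a single gzip member that can be passed to Iterate.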

type Record

type Record struct {
	Header Header
	Block  []byte
}

Record is a parsed WARC record: its header and the raw block bytes. For a response record the block is the full HTTP message (status line, headers, body).
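Because a response block is a complete HTTP message, the standard library can parse it directly. The sketch below shows how a status code and media type (the kind of values held in HTTPStatus and HTTPMIME) might be extracted; `parseBlock` is a hypothetical helper, and this is not necessarily how the package fills those fields.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"mime"
	"net/http"
)

// parseBlock is a hypothetical helper, not part of the package API.
// It returns the status code, the media type without parameters, and the body.
func parseBlock(block []byte) (status int, mediaType string, body []byte, err error) {
	resp, err := http.ReadResponse(bufio.NewReader(bytes.NewReader(block)), nil)
	if err != nil {
		return 0, "", nil, err
	}
	defer resp.Body.Close()
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", nil, err
	}
	// The parse error is ignored: with no Content-Type header,
	// mediaType is simply left empty.
	mediaType, _, _ = mime.ParseMediaType(resp.Header.Get("Content-Type"))
	return resp.StatusCode, mediaType, body, nil
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\n" +
		"Content-Type: text/html; charset=utf-8\r\n" +
		"Content-Length: 5\r\n\r\nhello")
	status, mediaType, body, err := parseBlock(block)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, mediaType, string(body))
}
```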
