warc

package module
v0.8.76 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 11, 2025 License: CC0-1.0 Imports: 36 Imported by: 4

README

warc

GoDoc Go Report Card

A Go library for reading and writing WARC files, with advanced features for web archiving.

Features

  • Read and write WARC files with support for multiple compression formats (GZIP, ZSTD)
  • HTTP client with built-in WARC recording capabilities
  • Content deduplication (local URL-agnostic and CDX-based)
  • Configurable file rotation and size limits
  • DNS caching and custom DNS resolution (with DNS archiving)
  • Support for socks5 proxies and custom TLS configurations
  • Random local IP assignment for distributed crawling (including Linux kernel AnyIP feature)
  • Smart memory management with disk spooling options
  • IPv4/IPv6 support with configurable preferences

Installation

go get github.com/CorentinB/warc

Usage

This library's biggest feature is to provide a standard HTTP client through which you can execute requests that will be recorded automatically to WARC files. It's the basis of Zeno.

HTTP Client with WARC Recording
package main

import (
    "github.com/CorentinB/warc"
    "net/http"
    "time"
)

func main() {
    // Configure WARC settings
    rotatorSettings := &warc.RotatorSettings{
        WarcinfoContent: warc.Header{
            "software": "My WARC writing client v1.0",
        },
        Prefix: "WEB",
        Compression: "gzip",
        WARCWriterPoolSize: 4, // Records will be written to 4 WARC files in parallel, it helps maximize the disk IO on some hardware. To be noted, even if we have multiple WARC writers, WARCs are ALWAYS written by pair in the same file. (req/resp pair)
    }

    // Configure HTTP client settings
    clientSettings := warc.HTTPClientSettings{
        RotatorSettings: rotatorSettings,
        Proxy: "socks5://proxy.example.com:1080",
        TempDir: "./temp",
        DNSServers: []string{"8.8.8.8", "8.8.4.4"},
        DedupeOptions: warc.DedupeOptions{
            LocalDedupe: true,
            CDXDedupe: false,
            SizeThreshold: 2048, // Only payloads above that threshold will be deduped
        },
        DialTimeout: 10 * time.Second,
        ResponseHeaderTimeout: 30 * time.Second,
        DNSResolutionTimeout: 5 * time.Second,
        DNSRecordsTTL: 5 * time.Minute,
        DNSCacheSize: 10000,
        MaxReadBeforeTruncate: 1000000000,
        DecompressBody: true,
        FollowRedirects: true,
        VerifyCerts: true,
        RandomLocalIP: true,
    }

    // Create HTTP client
    client, err := warc.NewWARCWritingHTTPClient(clientSettings)
    if err != nil {
        panic(err)
    }
    defer client.Close()

    // The error channel NEED to be consumed, else it will block the 
    // execution of the WARC module
    go func() {
		for err := range client.Client.ErrChan {
			fmt.Errorf("WARC writer error: %s", err.Err.Error())
		}
	}()

    // This is optional but the module give a feedback on a channel passed as context value "feedback" to the
    // request, this helps knowing when the record has been written to disk. If this is not used, the WARC 
    // writing is asynchronous
	req, err := http.NewRequest("GET", "https://archive.org", nil)
	if err != nil {
		panic(err)
	}

    feedbackChan := make(chan struct{}, 1)
	req := req.WithContext(context.WithValue(req.Context(), "feedback", feedbackChan))

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Process response
    // Note: the body NEED to be consumed to be written to the WARC file.
    io.Copy(io.Discard, resp.Body)

    // Will block until records are actually written to the WARC file
    <-feedbackChan
}

License

This module is released under CC0 license. You can find a copy of the CC0 License in the LICENSE file.

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
View Source
var (

	// Create a counter to keep track of the number of bytes written to WARC files
	// and the number of bytes deduped
	DataTotal         *ratecounter.Counter
	RemoteDedupeTotal *ratecounter.Counter
	LocalDedupeTotal  *ratecounter.Counter
)
View Source
var CDXHTTPClient = http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		Dial: (&net.Dialer{
			Timeout: 5 * time.Second,
		}).Dial,
		TLSHandshakeTimeout: 5 * time.Second,
	},
}

Functions

func GetNextIP added in v0.8.56

func GetNextIP(availableIPs *availableIPs) net.IP

func GetSHA1

func GetSHA1(r io.Reader) string

func GetSHA256 added in v0.8.37

func GetSHA256(r io.Reader) string

func GetSHA256Base16 added in v0.8.37

func GetSHA256Base16(r io.Reader) string

func NewDecompressionReader added in v0.8.41

func NewDecompressionReader(r io.Reader) (io.Reader, error)

NewDecompressionReader will return a new reader transparently doing decompression of GZip, BZip2, XZ, and ZStd.

Types

type CustomHTTPClient added in v0.7.0

type CustomHTTPClient struct {
	WaitGroup *WaitGroupWithCount

	ErrChan    chan *Error
	WARCWriter chan *RecordBatch

	http.Client
	TempDir                string
	WARCWriterDoneChannels []chan bool
	DiscardHook            DiscardHook

	TLSHandshakeTimeout   time.Duration
	MaxReadBeforeTruncate int

	FullOnDisk bool

	// MaxRAMUsageFraction is the fraction of system RAM above which we'll force spooling to disk. For example, 0.5 = 50%.
	// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
	MaxRAMUsageFraction float64
	// contains filtered or unexported fields
}

func NewWARCWritingHTTPClient added in v0.7.0

func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)

func (*CustomHTTPClient) Close added in v0.7.0

func (c *CustomHTTPClient) Close() error

func (*CustomHTTPClient) WriteRecord added in v0.8.50

func (c *CustomHTTPClient) WriteRecord(WARCTargetURI, WARCType, contentType, payloadString string, payloadReader io.Reader)

type DedupeOptions added in v0.8.0

type DedupeOptions struct {
	CDXURL        string
	CDXCookie     string
	SizeThreshold int
	LocalDedupe   bool
	CDXDedupe     bool
}

type DiscardHook added in v0.8.76

type DiscardHook func(resp *http.Response) (bool, string)

DiscardHook is a hook function that is called for each response. (if set) It can be used to determine if the response should be discarded. Returns:

  • bool: should the response be discarded
  • string: (optional) why the response was discarded or not

type DiscardHookError added in v0.8.76

type DiscardHookError struct {
	URL    string
	Reason string // reason for discarding
	Err    error  // nil: discarded successfully
}

func (*DiscardHookError) Error added in v0.8.76

func (e *DiscardHookError) Error() string

func (*DiscardHookError) Unwrap added in v0.8.76

func (e *DiscardHookError) Unwrap() error

type Error added in v0.8.29

type Error struct {
	Err  error
	Func string
}

type HTTPClientSettings added in v0.8.14

type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	Proxy                 string
	TempDir               string
	DNSServer             string
	DiscardHook           DiscardHook
	DNSServers            []string
	DedupeOptions         DedupeOptions
	DialTimeout           time.Duration
	ResponseHeaderTimeout time.Duration
	DNSResolutionTimeout  time.Duration
	DNSRecordsTTL         time.Duration
	DNSCacheSize          int
	TLSHandshakeTimeout   time.Duration
	TCPTimeout            time.Duration
	MaxReadBeforeTruncate int
	DecompressBody        bool
	FollowRedirects       bool
	FullOnDisk            bool
	MaxRAMUsageFraction   float64
	VerifyCerts           bool
	RandomLocalIP         bool
	DisableIPv4           bool
	DisableIPv6           bool
	IPv6AnyIP             bool
}
type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. If there is no value associated with the key, Get returns "".

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader store the bufio.Reader and gzip.Reader for a WARC file

func NewReader

func NewReader(reader io.ReadCloser) (*Reader, error)

NewReader returns a new WARC reader

func (*Reader) ReadRecord

func (r *Reader) ReadRecord() (*Record, bool, error)

ReadRecord reads the next record from the opened WARC file returns:

  • Record: if an error occurred, record **may be** nil. if eol is true, record **must be** nil.
  • bool (eol): if true, we readed all records successfully.
  • error: error

type Record

type Record struct {
	Header  Header
	Content spooledtempfile.ReadWriteSeekCloser
	Version string // WARC/1.0, WARC/1.1 ...
}

Record represents a WARC record.

func NewRecord

func NewRecord(tempDir string, fullOnDisk bool) *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	FeedbackChan chan struct{}
	CaptureTime  string
	Records      []*Record
}

RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp. FeedbackChan is used to signal when the records have been written.

func NewRecordBatch

func NewRecordBatch(feedbackChan chan struct{}) *RecordBatch

NewRecordBatch creates a record batch, it also initialize the capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
	CompressionDictionary string
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WarcSize is in Megabytes
	WarcSize float64
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}

RotatorSettings is used to store the settings needed by recordWriter to write WARC files

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initialize it with default values

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)

NewWARCRotator creates and return a channel that can be used to communicate records to be written to WARC files to the recordWriter function running in a goroutine

type WaitGroupWithCount added in v0.8.18

type WaitGroupWithCount struct {
	sync.WaitGroup
	// contains filtered or unexported fields
}

func (*WaitGroupWithCount) Add added in v0.8.18

func (wg *WaitGroupWithCount) Add(delta int)

func (*WaitGroupWithCount) Done added in v0.8.18

func (wg *WaitGroupWithCount) Done()

func (*WaitGroupWithCount) Size added in v0.8.18

func (wg *WaitGroupWithCount) Size() int

type Writer

type Writer struct {
	GZIPWriter   *gzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	FileName     string
	Compression  string
	ParallelGZIP bool
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) CloseCompressedWriter added in v0.8.20

func (w *Writer) CloseCompressedWriter() (err error)

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord method can be used to write informations record to the WARC file

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header followed by a record content block and two newlines:

Version CLRF
Header-Key: Header-Value CLRF
CLRF
Content
CLRF
CLRF

Directories

Path Synopsis
pkg

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL