warc

v0.8.13
Warning

This package is not in the latest version of its module.
Published: Jun 30, 2022 | License: CC0-1.0 | Imports: 23 | Imported by: 4

README

warc

WARNING: This project is still a WIP. It is NOT ready to be used in any project.

Introduction

warc provides methods for reading and writing WARC files in Go. This module is based on nlevitt's WARC module.

Install

go get github.com/CorentinB/warc

License

warc is released under the CC0 license. You can find a copy of the license in the LICENSE file.

Documentation

Constants

const MaxInMemorySize = 512000

MaxInMemorySize is the maximum number of bytes (currently 512,000, i.e. 500 KiB) to hold in memory before starting to write to disk.

Variables

This section is empty.

Functions

func GetSHA1

func GetSHA1(r io.Reader) string
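
GetSHA1 has no description on this page. As a rough illustration of what a digest helper with this signature might do, the sketch below streams a reader through SHA-1 using only the standard library. The names sha1Hex and sha1HexString are hypothetical, and the hex output encoding is an assumption: the encoding GetSHA1 actually returns is not stated here.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// sha1Hex streams r through SHA-1 and returns the digest as lowercase hex.
// Hypothetical helper: the output encoding used by GetSHA1 is not documented
// on this page, so hex is only an assumption.
func sha1Hex(r io.Reader) (string, error) {
	h := sha1.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sha1HexString is a convenience wrapper for in-memory strings.
func sha1HexString(s string) string {
	sum, _ := sha1Hex(strings.NewReader(s)) // reading from a string cannot fail
	return sum
}

func main() {
	fmt.Println(sha1HexString("abc"))
}
```

Streaming with io.Copy avoids loading the whole payload into memory, which matters for large record bodies.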

Types

type CustomHTTPClient added in v0.7.0

type CustomHTTPClient struct {
	http.Client
	WARCWriter       chan *RecordBatch
	WARCWriterFinish chan bool
	WaitGroup        *sync.WaitGroup

	WARCTempDir string
	// contains filtered or unexported fields
}

func NewWARCWritingHTTPClient added in v0.7.0

func NewWARCWritingHTTPClient(rotatorSettings *RotatorSettings, proxy string, decompressBody bool, dedupeOptions DedupeOptions, skipHTTPStatusCodes []int, verifyCerts bool, WARCTempDir string) (httpClient *CustomHTTPClient, err error, errChan chan error)

func (*CustomHTTPClient) Close added in v0.7.0

func (c *CustomHTTPClient) Close() error

type DedupeOptions added in v0.8.0

type DedupeOptions struct {
	LocalDedupe bool
	CDXDedupe   bool
	CDXURL      string
}

type Header

type Header map[string]string

Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.

func NewHeader

func NewHeader() Header

NewHeader creates a new WARC header.

func (Header) Del

func (h Header) Del(key string)

Del deletes the value associated with key.

func (Header) Get

func (h Header) Get(key string) string

Get returns the value associated with the given key. If there is no value associated with the key, Get returns "".

func (Header) Set

func (h Header) Set(key, value string)

Set sets the header field associated with key to value.
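
To make the case-insensitive behaviour concrete, here is a minimal sketch of a map-backed header with the same three methods. The type header and the function normalize are hypothetical stand-ins, and lower-casing keys is an assumption; the package may canonicalize field names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in for a case-insensitive field map.
type header map[string]string

// normalize folds a field name to lower case so lookups ignore case.
// Assumption: the real package may use a different canonical form.
func normalize(key string) string { return strings.ToLower(key) }

// Set stores value under the normalized key.
func (h header) Set(key, value string) { h[normalize(key)] = value }

// Get returns the stored value, or "" when the key is absent.
func (h header) Get(key string) string { return h[normalize(key)] }

// Del removes the key if present.
func (h header) Del(key string) { delete(h, normalize(key)) }

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type"))

	h.Del("Warc-Type")
	fmt.Println(h.Get("WARC-Type") == "")
}
```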

type ReadSeekCloser added in v0.8.9

type ReadSeekCloser interface {
	io.Reader
	io.Seeker
	ReaderAt
	io.Closer
	FileName() string
}

ReadSeekCloser is an io.Reader + ReaderAt + io.Seeker + io.Closer + FileName.

type ReadWriteSeekCloser added in v0.8.9

type ReadWriteSeekCloser interface {
	ReadSeekCloser
	io.Writer
}

ReadWriteSeekCloser is a ReadSeekCloser (io.Reader + ReaderAt + io.Seeker + io.Closer + FileName) that is also an io.Writer.

func NewSpooledTempFile added in v0.8.9

func NewSpooledTempFile(filePrefix string, tempDir string) ReadWriteSeekCloser

NewSpooledTempFile returns a ReadWriteSeekCloser with an important constraint: you can Write into it, but once you call Read or Seek on it, any further Write is forbidden and returns an error.
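
The write-then-read constraint can be illustrated with a small standard-library sketch. The type spool, the helper roundTrip, and the error errWriteAfterRead are hypothetical; the spill-to-disk threshold logic is an assumption modelled on the MaxInMemorySize description, not the package's actual implementation, and Seek, ReadAt and FileName are left out for brevity.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

var errWriteAfterRead = errors.New("spool: write after read")

// spool is a hypothetical buffer that keeps data in memory up to limit
// bytes, then moves everything to a temp file. Writes fail after any Read.
type spool struct {
	limit   int
	mem     bytes.Buffer
	file    *os.File
	reading bool
}

func (s *spool) Write(p []byte) (int, error) {
	if s.reading {
		return 0, errWriteAfterRead
	}
	if s.file == nil && s.mem.Len()+len(p) > s.limit {
		// Threshold crossed: copy the in-memory data to a temp file.
		f, err := os.CreateTemp("", "spool-*")
		if err != nil {
			return 0, err
		}
		if _, err := f.Write(s.mem.Bytes()); err != nil {
			f.Close()
			os.Remove(f.Name())
			return 0, err
		}
		s.mem.Reset()
		s.file = f
	}
	if s.file != nil {
		return s.file.Write(p)
	}
	return s.mem.Write(p)
}

func (s *spool) Read(p []byte) (int, error) {
	if !s.reading {
		s.reading = true // from now on, Write is rejected
		if s.file != nil {
			if _, err := s.file.Seek(0, io.SeekStart); err != nil {
				return 0, err
			}
		}
	}
	if s.file != nil {
		return s.file.Read(p)
	}
	return s.mem.Read(p)
}

// Close releases the temp file, if one was created.
func (s *spool) Close() error {
	if s.file == nil {
		return nil
	}
	name := s.file.Name()
	err := s.file.Close()
	os.Remove(name)
	return err
}

// roundTrip writes chunks, reads everything back, then tries a late write.
func roundTrip(limit int, chunks ...string) (data string, spilled bool, lateWriteErr error) {
	s := &spool{limit: limit}
	defer s.Close()
	for _, c := range chunks {
		if _, err := s.Write([]byte(c)); err != nil {
			return "", false, err
		}
	}
	spilled = s.file != nil
	b, err := io.ReadAll(s)
	if err != nil {
		return "", spilled, err
	}
	_, lateWriteErr = s.Write([]byte("x"))
	return string(b), spilled, lateWriteErr
}

func main() {
	data, spilled, err := roundTrip(8, "hello ", "world")
	fmt.Println(data, spilled, err)
}
```

Forbidding writes after the first read keeps the design simple: the read position never has to be reconciled with an append position.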

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader stores the bufio.Reader and gzip.Reader for a WARC file.

func NewReader

func NewReader(reader io.Reader) (*Reader, error)

NewReader returns a new WARC reader.

func (*Reader) Close

func (r *Reader) Close()

Close closes the reader.

func (*Reader) ReadRecord

func (r *Reader) ReadRecord() (*Record, error)

ReadRecord reads the next record from the opened WARC file. If onDisk is set to true, the record's payload is written to a temp file on disk and its path is stored in *Record.PayloadPath; otherwise, everything happens in memory.
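
For intuition about what reading a record involves, the sketch below parses one uncompressed record from a bufio.Reader using only the standard library: a version line, header fields up to an empty line, then Content-Length bytes of content and the trailing two CRLF pairs. The names rawRecord, readRecord, readLine and parseOne are hypothetical, and this is not the package's implementation, which also handles gzip.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// rawRecord is a hypothetical, simplified view of one WARC record.
type rawRecord struct {
	Version string
	Fields  map[string]string // keys are lower-cased
	Content []byte
}

// readLine reads one line and strips the trailing CRLF (or LF).
func readLine(br *bufio.Reader) (string, error) {
	line, err := br.ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(line, "\r\n"), nil
}

// readRecord parses a single uncompressed record from br.
func readRecord(br *bufio.Reader) (*rawRecord, error) {
	version, err := readLine(br)
	if err != nil {
		return nil, err
	}
	rec := &rawRecord{Version: version, Fields: map[string]string{}}
	for {
		line, err := readLine(br)
		if err != nil {
			return nil, err
		}
		if line == "" { // empty line ends the header
			break
		}
		// Split on the first colon only: values such as URIs contain colons.
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed header line %q", line)
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		rec.Fields[key] = strings.TrimSpace(parts[1])
	}
	n, err := strconv.Atoi(rec.Fields["content-length"])
	if err != nil || n < 0 {
		return nil, fmt.Errorf("bad Content-Length %q", rec.Fields["content-length"])
	}
	rec.Content = make([]byte, n)
	if _, err := io.ReadFull(br, rec.Content); err != nil {
		return nil, err
	}
	// Skip the two CRLF pairs that terminate the record.
	if _, err := br.Discard(4); err != nil {
		return nil, err
	}
	return rec, nil
}

// parseOne is a small wrapper for parsing a record held in a string.
func parseOne(s string) (*rawRecord, error) {
	return readRecord(bufio.NewReader(strings.NewReader(s)))
}

func main() {
	rec, err := parseOne("WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.Version, rec.Fields["warc-type"], string(rec.Content))
}
```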

type ReaderAt added in v0.8.9

type ReaderAt interface {
	ReadAt(p []byte, off int64) (n int, err error)
}

ReaderAt is the interface for ReadAt: reading at a given offset without moving the current read position.

type Record

type Record struct {
	Header  Header
	Content ReadWriteSeekCloser
}

Record represents a WARC record.

func NewRecord

func NewRecord(tempDir string) *Record

NewRecord creates a new WARC record.

type RecordBatch

type RecordBatch struct {
	Records     []*Record
	Done        chan bool
	CaptureTime string
}

RecordBatch is a structure that contains a set of records to be written at the same time, along with a common capture timestamp.

func NewRecordBatch

func NewRecordBatch() *RecordBatch

NewRecordBatch creates a record batch and initializes its capture time.

type RotatorSettings

type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// WarcSize is in MegaBytes
	WarcSize float64
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
}

RotatorSettings stores the settings needed by recordWriter to write WARC files.
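
The naming pattern and size limit described in the struct comments can be sketched as follows. The functions warcFileName and needsRotation are hypothetical; the five-digit serial, the timestamp layout, and treating one megabyte as 1,000,000 bytes are all assumptions, since this page does not specify them.

```go
package main

import "fmt"

// warcFileName builds a name following Prefix-Timestamp-Serial-Crawlhost.warc.gz.
// Assumption: the serial is zero-padded to five digits.
func warcFileName(prefix, timestamp string, serial int, crawlHost string) string {
	return fmt.Sprintf("%s-%s-%05d-%s.warc.gz", prefix, timestamp, serial, crawlHost)
}

// needsRotation reports whether the bytes written so far reach the limit.
// Assumption: one megabyte is 1,000,000 bytes.
func needsRotation(writtenBytes int64, warcSizeMB float64) bool {
	return float64(writtenBytes) >= warcSizeMB*1000*1000
}

func main() {
	fmt.Println(warcFileName("CRAWL", "20220630120000", 1, "host1"))
	fmt.Println(needsRotation(1000000, 1))
}
```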

func NewRotatorSettings

func NewRotatorSettings() *RotatorSettings

NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.

func (*RotatorSettings) NewWARCRotator

func (s *RotatorSettings) NewWARCRotator() (recordWriterChannel chan *RecordBatch, done chan bool, err error)

NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.
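
The channel-based hand-off can be illustrated with a standard-library sketch of the general pattern: a goroutine receives batches, writes their records, and signals completion per batch and at shutdown. The names batch, startWriter and writeAll are hypothetical, and closing the input channel to stop the writer is an assumption about how such a pattern is usually finished, not a statement about this package.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a group of records written together.
type batch struct {
	Records []string
	Done    chan bool // signalled once the batch has been written
}

// startWriter launches a goroutine that passes every record of every
// received batch to write. The returned done channel is signalled after
// the input channel has been closed and drained.
func startWriter(write func(record string)) (chan *batch, chan bool) {
	in := make(chan *batch)
	done := make(chan bool)
	go func() {
		for b := range in {
			for _, r := range b.Records {
				write(r)
			}
			b.Done <- true
		}
		done <- true
	}()
	return in, done
}

// writeAll sends each group of records as one batch and waits for it.
func writeAll(batches ...[]string) []string {
	var written []string
	in, done := startWriter(func(r string) { written = append(written, r) })
	for _, records := range batches {
		b := &batch{Records: records, Done: make(chan bool)}
		in <- b
		<-b.Done // block until this batch is written
	}
	close(in)
	<-done // wait for the writer goroutine to finish
	return written
}

func main() {
	fmt.Println(writeAll([]string{"request", "response"}, []string{"metadata"}))
}
```

Funnelling all writes through one goroutine means the output file needs no locking, and the per-batch Done channel lets callers know when their records are safely written.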

type Writer

type Writer struct {
	FileName    string
	Compression string
	GZIPWriter  *gzip.Writer
	ZSTDWriter  *zstd.Encoder
	FileWriter  *bufio.Writer
}

Writer writes WARC records to WARC files.

func NewWriter

func NewWriter(writer io.Writer, fileName string, compression string) (*Writer, error)

NewWriter creates a new WARC writer.

func (*Writer) WriteInfoRecord

func (w *Writer) WriteInfoRecord(payload map[string]string) (recordID string, err error)

WriteInfoRecord writes a warcinfo record to the WARC file.

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(r *Record) (recordID string, err error)

WriteRecord writes a record to the underlying WARC file. A record consists of a version line, the record header fields, an empty line, the record content block, and two trailing CRLF sequences:

Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content
CRLF
CRLF
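
The layout above can be reproduced with a few lines of standard-library Go. The function frameRecord is a hypothetical helper that serializes one uncompressed record; sorting the field names is only done here to make the output deterministic and is not something this page says WriteRecord does.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// frameRecord serializes a version line, header fields, an empty line,
// the content block, and two trailing CRLF sequences.
// Hypothetical helper: no compression and no generated fields.
func frameRecord(version string, fields map[string]string, content []byte) []byte {
	var buf bytes.Buffer
	buf.WriteString(version + "\r\n")

	// Sort field names so the output is stable (an assumption for the sketch).
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(&buf, "%s: %s\r\n", k, fields[k])
	}

	buf.WriteString("\r\n") // empty line separating header and content
	buf.Write(content)
	buf.WriteString("\r\n\r\n") // record terminator
	return buf.Bytes()
}

func main() {
	rec := frameRecord("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, []byte("hello"))
	fmt.Printf("%q\n", rec)
}
```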
