Documentation ¶
Overview ¶
Package warc provides methods for reading and writing WARC files (https://iipc.github.io/warc-specifications/) in Go. This module is based on nlevitt's WARC module (https://github.com/nlevitt/warc).
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GetSHA1 ¶
GetSHA1 returns the SHA-1 digest of a []byte. It can be used to fill the WARC-Block-Digest header.
func GetSHA1FromFile ¶
GetSHA1FromFile returns the SHA-1 digest of a file. It can be used to fill the WARC-Block-Digest header.
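As a rough illustration of what such a digest can look like, the sketch below hashes a byte slice with the standard library and formats the result as a labelled digest. The helper name blockDigest, the "sha1:" label, and the Base32 encoding are assumptions made for this example. This page does not state which encoding GetSHA1 actually returns.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// blockDigest is a hypothetical helper, not part of this package.
// It hashes data with SHA-1 and returns "sha1:" followed by the
// Base32-encoded digest (an assumed format for WARC-Block-Digest).
func blockDigest(data []byte) string {
	sum := sha1.Sum(data)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(blockDigest([]byte("hello")))
}
```

A SHA-1 digest is 20 bytes, so its Base32 form is always 32 characters with no padding.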
Types ¶
type Header ¶
Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.
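One common way to get case-insensitive lookups is to normalize field names before touching the underlying map. The sketch below shows that idea with a stand-in type. The type name header and its Set, Get, and Del methods are hypothetical and only mirror the behavior described above, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in for the package's Header type.
// Field names are lowercased on every access, so lookups ignore case.
type header map[string]string

// Set stores value under the normalized field name.
func (h header) Set(key, value string) { h[strings.ToLower(key)] = value }

// Get returns the value for key, or "" if the field is absent.
func (h header) Get(key string) string { return h[strings.ToLower(key)] }

// Del removes the field, whatever casing the caller uses.
func (h header) Del(key string) { delete(h, strings.ToLower(key)) }

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type")) // prints "response"
}
```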
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader stores the bufio.Reader and gzip.Reader for a WARC file.
func (*Reader) ReadRecord ¶
ReadRecord reads the next record from the opened WARC file. If onDisk is true, the record's payload is written to a temporary file on disk and its path is stored in *Record.PayloadPath; otherwise, everything happens in memory.
type RecordBatch ¶
RecordBatch holds a set of records to be written at the same time, along with a common capture timestamp.
func NewRecordBatch ¶
func NewRecordBatch() *RecordBatch
NewRecordBatch creates a record batch and initializes its capture time.
type RotatorSettings ¶
type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// WarcSize is in MegaBytes
	WarcSize float64
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
}
RotatorSettings is used to store the settings needed by recordWriter to write WARC files
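To make the naming scheme and the size limit concrete, here is a minimal sketch of how a file name following Prefix-Timestamp-Serial-Crawlhost.warc.gz could be built and how a size threshold might be checked. Both helpers (warcFileName, shouldRotate) are hypothetical. The five-digit serial, the "gzip" and "zstd" compression values, and the 1024*1024 bytes-per-megabyte factor are assumptions not confirmed by this page.

```go
package main

import (
	"fmt"
	"time"
)

// warcFileName is a hypothetical helper that builds a name of the form
// Prefix-Timestamp-Serial-Crawlhost.warc plus a compression suffix.
// The compression values "gzip" and "zstd" are assumed for illustration.
func warcFileName(prefix, timestamp string, serial int, host, compression string) string {
	name := fmt.Sprintf("%s-%s-%05d-%s.warc", prefix, timestamp, serial, host)
	switch compression {
	case "gzip":
		return name + ".gz"
	case "zstd":
		return name + ".zst"
	}
	return name
}

// shouldRotate reports whether the bytes written so far reach the limit.
// It assumes one megabyte is 1024*1024 bytes.
func shouldRotate(writtenBytes int64, warcSizeMB float64) bool {
	return float64(writtenBytes) >= warcSizeMB*1024*1024
}

func main() {
	ts := time.Now().UTC().Format("20060102150405")
	fmt.Println(warcFileName("CRAWL", ts, 1, "crawler01", "gzip"))
	fmt.Println(shouldRotate(512, 1000))
}
```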
func NewRotatorSettings ¶
func NewRotatorSettings() *RotatorSettings
NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.
func (*RotatorSettings) NewWARCRotator ¶
func (s *RotatorSettings) NewWARCRotator() (recordWriterChannel chan *RecordBatch, done chan bool, err error)
NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.
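The channel-based design described above can be sketched with the standard library alone. In the example below, the batch type and the startWriter function are hypothetical stand-ins that only mimic the shape of the signature shown here (an input channel plus a done channel). The shutdown protocol, where closing the input channel ends the writer and a value on done confirms it, is an assumption.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for *RecordBatch.
type batch struct {
	records []string
}

// startWriter starts a goroutine that drains batches from the returned
// channel and passes each record to sink. When the input channel is
// closed, the goroutine sends true on done and exits.
func startWriter(sink func(string)) (chan *batch, chan bool) {
	in := make(chan *batch)
	done := make(chan bool)
	go func() {
		for b := range in {
			for _, r := range b.records {
				sink(r)
			}
		}
		done <- true
	}()
	return in, done
}

func main() {
	var written []string
	in, done := startWriter(func(r string) { written = append(written, r) })

	in <- &batch{records: []string{"request", "response"}}
	close(in) // no more batches
	<-done    // wait until everything has been handled

	fmt.Println(len(written)) // prints 2
}
```

Funneling all writes through one goroutine keeps file access serialized, so producers never need to lock the output file themselves.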
type Writer ¶
type Writer struct {
	FileName    string
	Compression string
	GZIPWriter  *gzip.Writer
	ZSTDWriter  *zstd.Encoder
	FileWriter  *bufio.Writer
}
Writer writes WARC records to WARC files.
func (*Writer) WriteInfoRecord ¶
WriteInfoRecord writes an information (warcinfo) record to the WARC file.
func (*Writer) WriteRecord ¶
WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header followed by a record content block and two newlines:
Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content CRLF
CRLF
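The layout above can be sketched as a small serializer. The function serializeRecord is hypothetical and is not the package's WriteRecord: it takes plain Go values instead of the package's record types, and it sorts header names only so that the example output is deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// serializeRecord is a hypothetical helper that lays out one record as:
// version line, header lines, blank line, content, then two CRLFs.
func serializeRecord(version string, headers map[string]string, content []byte) string {
	var buf bytes.Buffer
	buf.WriteString(version + "\r\n")

	// Sort header names so the output order is stable for this example.
	keys := make([]string, 0, len(headers))
	for k := range headers {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(&buf, "%s: %s\r\n", k, headers[k])
	}

	buf.WriteString("\r\n") // blank line ends the header section
	buf.Write(content)
	buf.WriteString("\r\n\r\n") // record terminator
	return buf.String()
}

func main() {
	rec := serializeRecord("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, []byte("hello"))
	fmt.Printf("%q\n", rec)
}
```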