Documentation ¶
Index ¶
- Constants
- Variables
- func GenerateWarcFileName(prefix string, compression string, serial int) (fileName string)
- func GetSHA1(r io.Reader) string
- type CustomHTTPClient
- type DedupeOptions
- type HTTPClientSettings
- type Header
- type ReadSeekCloser
- type ReadWriteSeekCloser
- type Reader
- type ReaderAt
- type Record
- type RecordBatch
- type RotatorSettings
- type WaitGroupWithCount
- type Writer
Constants ¶
const MaxInMemorySize = 1000000
MaxInMemorySize is the maximum number of bytes (currently 1 MB) to hold in memory before starting to write to disk.
Variables ¶
var (
	// Create a counter to keep track of the number of bytes written to WARC files
	DataTotal *ratecounter.Counter
)
Functions ¶
func GenerateWarcFileName ¶ added in v0.8.26
GenerateWarcFileName generates a WARC file name following the naming recommendation of the WARC specification: Prefix-Timestamp-Serial-Crawlhost.warc.gz
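As an illustration of the Prefix-Timestamp-Serial-Crawlhost.warc.gz pattern, here is a minimal sketch using only the standard library. It is not the package's implementation: the helper name sketchWarcFileName, the "GZIP" and "ZSTD" compression values, the 14-digit timestamp, and the 5-digit zero-padded serial are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// sketchWarcFileName is a hypothetical helper (not part of the package).
// It joins the four recommended components and picks an extension from
// the compression name. The accepted values "GZIP" and "ZSTD" are assumed.
func sketchWarcFileName(prefix, compression string, serial int, host, timestamp string) string {
	ext := ".warc"
	switch compression {
	case "GZIP":
		ext += ".gz"
	case "ZSTD":
		ext += ".zst"
	}
	// Assumption: the serial is zero-padded to five digits.
	return fmt.Sprintf("%s-%s-%05d-%s%s", prefix, timestamp, serial, host, ext)
}

func main() {
	// Assumption: a 14-digit UTC timestamp (YYYYMMDDhhmmss).
	ts := time.Now().UTC().Format("20060102150405")
	fmt.Println(sketchWarcFileName("CRAWL", "GZIP", 1, "crawler01", ts))
}
```

Passing the timestamp and host name as arguments keeps the helper deterministic, which makes it easy to check against an expected file name.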
Types ¶
type CustomHTTPClient ¶ added in v0.7.0
type CustomHTTPClient struct {
	http.Client
	WARCWriter             chan *RecordBatch
	WARCWriterDoneChannels []chan bool
	WaitGroup              *WaitGroupWithCount
	TempDir                string
	FullOnDisk             bool
	MaxReadBeforeTruncate  int
	DataTotal              *ratecounter.Counter
	// contains filtered or unexported fields
}
func NewWARCWritingHTTPClient ¶ added in v0.7.0
func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, errChan chan error, err error)
func (*CustomHTTPClient) Close ¶ added in v0.7.0
func (c *CustomHTTPClient) Close() error
type DedupeOptions ¶ added in v0.8.0
type HTTPClientSettings ¶ added in v0.8.14
type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	DedupeOptions         DedupeOptions
	Proxy                 string
	DecompressBody        bool
	SkipHTTPStatusCodes   []int
	VerifyCerts           bool
	TempDir               string
	FullOnDisk            bool
	MaxReadBeforeTruncate int
}
type Header ¶
Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.
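Case-insensitive field access can be pictured with a small sketch. The sketchHeader type below is hypothetical and is not the package's Header definition (which is not shown here); it simply normalizes keys to lower case on every access.

```go
package main

import (
	"fmt"
	"strings"
)

// sketchHeader is a hypothetical stand-in for a case-insensitive field store.
// Keys are lowercased on both Set and Get, so lookups ignore case.
type sketchHeader map[string]string

// Set stores value under the lowercased key.
func (h sketchHeader) Set(key, value string) { h[strings.ToLower(key)] = value }

// Get returns the value for key regardless of the case used by the caller.
func (h sketchHeader) Get(key string) string { return h[strings.ToLower(key)] }

func main() {
	h := sketchHeader{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type")) // prints "response"
}
```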
type ReadSeekCloser ¶ added in v0.8.9
ReadSeekCloser is an io.Reader + ReaderAt + io.Seeker + io.Closer + Stat
type ReadWriteSeekCloser ¶ added in v0.8.9
type ReadWriteSeekCloser interface {
	ReadSeekCloser
	io.Writer
}
ReadWriteSeekCloser is an io.Writer + io.Reader + io.Seeker + io.Closer.
func NewSpooledTempFile ¶ added in v0.8.9
func NewSpooledTempFile(filePrefix string, tempDir string, fullOnDisk bool) ReadWriteSeekCloser
NewSpooledTempFile returns a ReadWriteSeekCloser with one important constraint: you can Write into it, but once you call Read or Seek on it, any further Write is forbidden and returns an error.
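The write-then-read constraint, together with the idea of holding data in memory up to a limit before writing to disk, can be sketched as follows. This is an illustrative model only, not the package's code: sketchSpool, its limit field, and errWriteAfterRead are hypothetical names, and Seek is left out for brevity.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

// errWriteAfterRead is a hypothetical error value for this sketch.
var errWriteAfterRead = errors.New("write after read is forbidden")

// sketchSpool buffers writes in memory until limit bytes would be exceeded,
// then moves everything to a temporary file. After the first Read, Write fails.
type sketchSpool struct {
	limit   int
	mem     bytes.Buffer
	file    *os.File
	reading bool
}

func (s *sketchSpool) Write(p []byte) (int, error) {
	if s.reading {
		return 0, errWriteAfterRead
	}
	if s.file == nil && s.mem.Len()+len(p) > s.limit {
		// Spill: copy the in-memory bytes to a temp file and continue there.
		f, err := os.CreateTemp("", "spool-*")
		if err != nil {
			return 0, err
		}
		if _, err := f.Write(s.mem.Bytes()); err != nil {
			f.Close()
			os.Remove(f.Name())
			return 0, err
		}
		s.mem.Reset()
		s.file = f
	}
	if s.file != nil {
		return s.file.Write(p)
	}
	return s.mem.Write(p)
}

func (s *sketchSpool) Read(p []byte) (int, error) {
	if !s.reading {
		s.reading = true
		if s.file != nil {
			// Rewind so reading starts at the first written byte.
			if _, err := s.file.Seek(0, io.SeekStart); err != nil {
				return 0, err
			}
		}
	}
	if s.file != nil {
		return s.file.Read(p)
	}
	return s.mem.Read(p)
}

// Close removes the temporary file, if one was created.
func (s *sketchSpool) Close() error {
	if s.file == nil {
		return nil
	}
	name := s.file.Name()
	err := s.file.Close()
	os.Remove(name)
	return err
}

func main() {
	s := &sketchSpool{limit: 8}
	defer s.Close()
	s.Write([]byte("hello, "))
	s.Write([]byte("spooled world")) // exceeds the limit, so data moves to disk
	data, _ := io.ReadAll(s)
	fmt.Println(string(data))
	_, err := s.Write([]byte("x"))
	fmt.Println(err)
}
```

The tiny limit of 8 bytes is chosen only to trigger the spill in the example; a realistic limit would be on the order of MaxInMemorySize.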
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader stores the bufio.Reader and gzip.Reader for a WARC file.
func (*Reader) ReadRecord ¶
ReadRecord reads the next record from the opened WARC file
type ReaderAt ¶ added in v0.8.9
ReaderAt is the interface for ReadAt - read at position, without moving pointer.
type Record ¶
type Record struct {
	Header  Header
	Content ReadWriteSeekCloser
}
Record represents a WARC record.
type RecordBatch ¶
RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp
func NewRecordBatch ¶
func NewRecordBatch() *RecordBatch
NewRecordBatch creates a record batch and initializes its capture time.
type RotatorSettings ¶
type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// WarcSize is in MegaBytes
	WarcSize float64
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}
RotatorSettings is used to store the settings needed by recordWriter to write WARC files
func NewRotatorSettings ¶
func NewRotatorSettings() *RotatorSettings
NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.
func (*RotatorSettings) NewWARCRotator ¶
func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)
NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.
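The general shape of this pattern (a shared channel feeding writer goroutines, plus one done channel per writer) can be sketched with the standard library alone. Everything below is hypothetical: sketchBatch, startSketchWriters, and the shutdown protocol of closing the channel and then draining each done channel are assumptions for illustration, not the package's confirmed behavior.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sketchBatch is a hypothetical stand-in for a batch of records.
type sketchBatch struct{ records []string }

// startSketchWriters launches poolSize goroutines that consume batches from
// one shared channel. Each goroutine signals on its own done channel once
// the batch channel is closed and drained. "Writing" is simulated by counting.
func startSketchWriters(poolSize int, written *int64) (chan *sketchBatch, []chan bool) {
	batches := make(chan *sketchBatch)
	done := make([]chan bool, poolSize)
	for i := range done {
		done[i] = make(chan bool)
		go func(d chan bool) {
			for b := range batches {
				atomic.AddInt64(written, int64(len(b.records)))
			}
			d <- true
		}(done[i])
	}
	return batches, done
}

func main() {
	var written int64
	batches, done := startSketchWriters(2, &written)
	for i := 0; i < 3; i++ {
		batches <- &sketchBatch{records: []string{"request", "response"}}
	}
	close(batches) // no more batches: writers finish and signal
	for _, d := range done {
		<-d
	}
	fmt.Println(atomic.LoadInt64(&written)) // prints 6
}
```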
type WaitGroupWithCount ¶ added in v0.8.18
func (*WaitGroupWithCount) Add ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Add(delta int)
func (*WaitGroupWithCount) Done ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Done()
func (*WaitGroupWithCount) Size ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Size() int
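The listing gives no description for WaitGroupWithCount, but the method set suggests a wait group that also reports how many tasks are outstanding. A minimal sketch of that idea, assuming this reading is right, could look like the following; sketchWaitGroup and its Wait method are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sketchWaitGroup is a hypothetical type pairing a sync.WaitGroup with an
// atomic counter so callers can query the number of pending tasks.
type sketchWaitGroup struct {
	wg    sync.WaitGroup
	count int64
}

// Add registers delta tasks on both the counter and the wait group.
func (w *sketchWaitGroup) Add(delta int) {
	atomic.AddInt64(&w.count, int64(delta))
	w.wg.Add(delta)
}

// Done marks one task as finished.
func (w *sketchWaitGroup) Done() {
	atomic.AddInt64(&w.count, -1)
	w.wg.Done()
}

// Wait blocks until all registered tasks are done.
func (w *sketchWaitGroup) Wait() { w.wg.Wait() }

// Size reports the number of tasks not yet marked done.
func (w *sketchWaitGroup) Size() int { return int(atomic.LoadInt64(&w.count)) }

func main() {
	var w sketchWaitGroup
	w.Add(2)
	fmt.Println(w.Size()) // prints 2
	go w.Done()
	go w.Done()
	w.Wait()
	fmt.Println(w.Size()) // prints 0
}
```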
type Writer ¶
type Writer struct {
	FileName     string
	Compression  string
	GZIPWriter   *gzip.Writer
	PGZIPWriter  *pgzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	ParallelGZIP bool
}
Writer writes WARC records to WARC files.
func NewWriter ¶
func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string) (*Writer, error)
NewWriter creates a new WARC writer.
func (*Writer) CloseCompressedWriter ¶ added in v0.8.20
func (w *Writer) CloseCompressedWriter()
func (*Writer) WriteInfoRecord ¶
WriteInfoRecord can be used to write an information record to the WARC file.
func (*Writer) WriteRecord ¶
WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, the record content block, and two trailing CRLF sequences:
Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content CRLF
CRLF
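To make the layout concrete, the sketch below serializes a record into a string using only the standard library. It is not the package's WriteRecord: sketchSerializeRecord is a hypothetical helper, the "WARC/1.1" version string is chosen for the example, and header keys are sorted purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sketchSerializeRecord is a hypothetical helper that lays out one record as:
// version CRLF, one "Key: Value" CRLF line per header field, an empty CRLF
// line, the content block, then two CRLF sequences.
func sketchSerializeRecord(version string, header map[string]string, content []byte) string {
	var b strings.Builder
	b.WriteString(version + "\r\n")

	// Sort keys so the output order is stable (map iteration order is random).
	keys := make([]string, 0, len(header))
	for k := range header {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s: %s\r\n", k, header[k])
	}

	b.WriteString("\r\n") // blank line separating header from content
	b.Write(content)
	b.WriteString("\r\n\r\n") // two CRLFs terminate the record
	return b.String()
}

func main() {
	rec := sketchSerializeRecord("WARC/1.1", map[string]string{
		"WARC-Type":      "resource",
		"Content-Length": "5",
	}, []byte("hello"))
	fmt.Printf("%q\n", rec)
}
```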