Documentation
Index ¶
- Constants
- Variables
- func GenerateWarcFileName(prefix string, compression string, atomicSerial *int64) (fileName string)
- func GetSHA1(r io.Reader) string
- func GetSHA256(r io.Reader) string
- func GetSHA256Base16(r io.Reader) string
- func NewDecompressionReader(r io.Reader) (io.Reader, error)
- type CustomHTTPClient
- type DedupeOptions
- type Error
- type HTTPClientSettings
- type Header
- type ReadSeekCloser
- type ReadWriteSeekCloser
- type Reader
- type ReaderAt
- type Record
- type RecordBatch
- type RotatorSettings
- type WaitGroupWithCount
- type Writer
Constants ¶
const MaxInMemorySize = 1000000
MaxInMemorySize is the max number of bytes (currently 1MB) to hold in memory before starting to write to disk
Variables ¶
var (
	IPv6 *availableIPs
	IPv4 *availableIPs
)
var (
	// Create a counter to keep track of the number of bytes written to WARC files
	// and the number of bytes deduped
	DataTotal         *ratecounter.Counter
	RemoteDedupeTotal *ratecounter.Counter
	LocalDedupeTotal  *ratecounter.Counter
)
Functions ¶
func GenerateWarcFileName ¶ added in v0.8.26
GenerateWarcFileName generates a WARC file name following the naming recommendation of the WARC specification: Prefix-Timestamp-Serial-Crawlhost.warc.gz
func GetSHA256Base16 ¶ added in v0.8.37
Types ¶
type CustomHTTPClient ¶ added in v0.7.0
type CustomHTTPClient struct {
	WARCWriter chan *RecordBatch
	WaitGroup  *WaitGroupWithCount
	ErrChan    chan *Error
	http.Client
	TempDir                string
	WARCWriterDoneChannels []chan bool
	MaxReadBeforeTruncate  int
	TLSHandshakeTimeout    time.Duration
	FullOnDisk             bool
	// contains filtered or unexported fields
}
func NewWARCWritingHTTPClient ¶ added in v0.7.0
func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)
func (*CustomHTTPClient) Close ¶ added in v0.7.0
func (c *CustomHTTPClient) Close() error
func (*CustomHTTPClient) WriteMetadataRecord ¶ added in v0.8.36
func (c *CustomHTTPClient) WriteMetadataRecord(WARCTargetURI, contentType, payload string)
type DedupeOptions ¶ added in v0.8.0
type HTTPClientSettings ¶ added in v0.8.14
type HTTPClientSettings struct {
	RotatorSettings       *RotatorSettings
	Proxy                 string
	TempDir               string
	SkipHTTPStatusCodes   []int
	DedupeOptions         DedupeOptions
	DialTimeout           time.Duration
	ResponseHeaderTimeout time.Duration
	TLSHandshakeTimeout   time.Duration
	MaxReadBeforeTruncate int
	TCPTimeout            time.Duration
	DecompressBody        bool
	FollowRedirects       bool
	FullOnDisk            bool
	VerifyCerts           bool
	RandomLocalIP         bool
}
type Header ¶
Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.
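Case-insensitive field access can be sketched as follows. This is an illustration only: `exampleHeader` is a hypothetical type, and normalizing keys with `strings.ToLower` is one assumed way to obtain the behavior, not necessarily what the library does.

```go
package main

import (
	"fmt"
	"strings"
)

// exampleHeader is a hypothetical stand-in for a case-insensitive field map.
type exampleHeader map[string]string

// Set stores value under the lower-cased field name.
func (h exampleHeader) Set(key, value string) { h[strings.ToLower(key)] = value }

// Get looks the field up by its lower-cased name, so any casing matches.
func (h exampleHeader) Get(key string) string { return h[strings.ToLower(key)] }

func main() {
	h := exampleHeader{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type"))
	fmt.Println(h.Get("WARC-TYPE"))
}
```

Both lookups print `response`, because every key is normalized before it touches the map.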
type ReadSeekCloser ¶ added in v0.8.9
ReadSeekCloser is an io.Reader + ReaderAt + io.Seeker + io.Closer + Stat
type ReadWriteSeekCloser ¶ added in v0.8.9
type ReadWriteSeekCloser interface {
	ReadSeekCloser
	io.Writer
}
ReadWriteSeekCloser is an io.Writer + io.Reader + io.Seeker + io.Closer.
func NewSpooledTempFile ¶ added in v0.8.9
func NewSpooledTempFile(filePrefix string, tempDir string, fullOnDisk bool) ReadWriteSeekCloser
NewSpooledTempFile returns a ReadWriteSeekCloser with one important constraint: you can Write into it, but once you call Read or Seek on it, further Writes are forbidden and return an error.
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader stores the bufio.Reader and gzip.Reader for a WARC file
func NewReader ¶
func NewReader(reader io.ReadCloser) (*Reader, error)
NewReader returns a new WARC reader
func (*Reader) ReadRecord ¶
ReadRecord reads the next record from the opened WARC file. It returns:
- Record: the record that was read. If an error occurred, record **may be** nil. If eol is true, record **must be** nil.
- bool (eol): true when all records have been read successfully.
- error: the error encountered, if any.
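The three return values suggest a read loop that checks the error first, then eol, and only then touches the record. The sketch below shows that ordering against stand-in types so that it runs on its own: `record`, `recordSource`, `sliceSource`, and `collectURIs` are hypothetical, and the exact method signature is assumed from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// record and recordSource are hypothetical stand-ins for a WARC record
// and for anything exposing a ReadRecord method with the described results.
type record struct{ TargetURI string }

type recordSource interface {
	ReadRecord() (*record, bool, error)
}

// sliceSource serves records from a slice, then reports eol.
// A nil entry simulates a record that cannot be read.
type sliceSource struct {
	records []*record
	pos     int
}

func (s *sliceSource) ReadRecord() (*record, bool, error) {
	if s.pos >= len(s.records) {
		return nil, true, nil // eol: the record must be nil
	}
	r := s.records[s.pos]
	s.pos++
	if r == nil {
		return nil, false, errors.New("corrupt record")
	}
	return r, false, nil
}

// collectURIs reads until eol or the first error.
func collectURIs(src recordSource) ([]string, error) {
	var uris []string
	for {
		rec, eol, err := src.ReadRecord()
		if err != nil {
			return uris, err // rec may be nil here, so it is not used
		}
		if eol {
			return uris, nil // every record was read
		}
		uris = append(uris, rec.TargetURI)
	}
}

func main() {
	src := &sliceSource{records: []*record{
		{TargetURI: "http://example.com/a"},
		{TargetURI: "http://example.com/b"},
	}}
	uris, err := collectURIs(src)
	fmt.Println(uris, err)
}
```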
type ReaderAt ¶ added in v0.8.9
ReaderAt is the interface that wraps ReadAt: it reads at a given position without moving the read pointer.
type Record ¶
type Record struct {
	Header  Header
	Content ReadWriteSeekCloser
	Version string // WARC/1.0, WARC/1.1 ...
}
Record represents a WARC record.
type RecordBatch ¶
RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp
func NewRecordBatch ¶
func NewRecordBatch() *RecordBatch
NewRecordBatch creates a record batch and initializes its capture time
type RotatorSettings ¶
type RotatorSettings struct {
	// Content of the warcinfo record that will be written
	// to all WARC files
	WarcinfoContent Header
	// Prefix used for WARC filenames, WARC 1.1 specifications
	// recommend to name files this way:
	// Prefix-Timestamp-Serial-Crawlhost.warc.gz
	Prefix string
	// Compression algorithm to use
	Compression string
	// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
	CompressionDictionary string
	// Directory where the created WARC files will be stored,
	// default will be the current directory
	OutputDirectory string
	// WarcSize is in Megabytes
	WarcSize float64
	// WARCWriterPoolSize defines the number of parallel WARC writers
	WARCWriterPoolSize int
}
RotatorSettings is used to store the settings needed by recordWriter to write WARC files
func NewRotatorSettings ¶
func NewRotatorSettings() *RotatorSettings
NewRotatorSettings creates a RotatorSettings structure and initializes it with default values
func (*RotatorSettings) NewWARCRotator ¶
func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)
NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files
type WaitGroupWithCount ¶ added in v0.8.18
func (*WaitGroupWithCount) Add ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Add(delta int)
func (*WaitGroupWithCount) Done ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Done()
func (*WaitGroupWithCount) Size ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Size() int
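The Add, Done, and Size methods suggest a wait group that also exposes how many tasks are outstanding. A minimal sketch of that idea follows. `countingWaitGroup` is a hypothetical type written for illustration, and tracking the count with an atomic counter next to an embedded `sync.WaitGroup` is an assumption about one reasonable design, not a description of the library's code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingWaitGroup is a hypothetical wait group that tracks its own size.
type countingWaitGroup struct {
	count int64 // kept first so 64-bit atomic access is aligned on 32-bit platforms
	sync.WaitGroup
}

// Add adjusts both the visible counter and the underlying WaitGroup.
func (wg *countingWaitGroup) Add(delta int) {
	atomic.AddInt64(&wg.count, int64(delta))
	wg.WaitGroup.Add(delta)
}

// Done marks one task as finished.
func (wg *countingWaitGroup) Done() {
	atomic.AddInt64(&wg.count, -1)
	wg.WaitGroup.Done()
}

// Size reports how many tasks are still outstanding.
func (wg *countingWaitGroup) Size() int {
	return int(atomic.LoadInt64(&wg.count))
}

func main() {
	var wg countingWaitGroup
	wg.Add(3)
	fmt.Println(wg.Size())
	wg.Done()
	fmt.Println(wg.Size())
	wg.Done()
	wg.Done()
	wg.Wait()
	fmt.Println(wg.Size())
}
```

A plain `sync.WaitGroup` does not expose its counter, which is why a wrapper like this is needed to report the number of in-flight tasks.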
type Writer ¶
type Writer struct {
	GZIPWriter   *gzip.Writer
	ZSTDWriter   *zstd.Encoder
	FileWriter   *bufio.Writer
	FileName     string
	Compression  string
	ParallelGZIP bool
}
Writer writes WARC records to WARC files.
func NewWriter ¶
func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)
NewWriter creates a new WARC writer.
func (*Writer) CloseCompressedWriter ¶ added in v0.8.20
func (w *Writer) CloseCompressedWriter()
func (*Writer) WriteInfoRecord ¶
WriteInfoRecord writes an information record to the WARC file
func (*Writer) WriteRecord ¶
WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header, a record content block, and two trailing CRLF sequences:
Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content
CRLF
CRLF