Documentation
Constants ¶
This section is empty.
Variables ¶
var BAD_PDF_SHA1HEX = []string{}/* 152 elements not displayed */
BAD_PDF_SHA1HEX lists the SHA1 hex digests of PDFs that are known to time out when processed with poppler. It is a workaround: for some reason the usual Kafka timeout catcher does not fire on these files, possibly due to threading.
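A minimal sketch of how such a list might be consulted before processing. The names `badPDFs`, `newBlocklist`, and `isBlocked` are hypothetical and not part of this package, and the two digests are made-up placeholders. The slice is turned into a set once so that each lookup is cheap.

```go
package main

import (
	"fmt"
	"strings"
)

// badPDFs stands in for BAD_PDF_SHA1HEX. Both digests are placeholders
// invented for this example.
var badPDFs = []string{
	"0000000000000000000000000000000000000000",
	"ffffffffffffffffffffffffffffffffffffffff",
}

// newBlocklist builds a set from a slice of hex digests.
// Digests are normalized to lowercase.
func newBlocklist(digests []string) map[string]struct{} {
	set := make(map[string]struct{}, len(digests))
	for _, d := range digests {
		set[strings.ToLower(d)] = struct{}{}
	}
	return set
}

// isBlocked reports whether sha1hex is in the set, ignoring case.
func isBlocked(set map[string]struct{}, sha1hex string) bool {
	_, ok := set[strings.ToLower(sha1hex)]
	return ok
}

func main() {
	set := newBlocklist(badPDFs)
	if isBlocked(set, "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF") {
		fmt.Println("skipping known-bad PDF")
	}
}
```

Whether the package itself uses a set or a linear scan over the slice is not stated in the documentation; with 152 entries either would work.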
var ErrNoData = errors.New("no data")
Functions ¶
This section is empty.
Types ¶
type FileInfo ¶
type FileInfo struct {
	Size      int64  `json:"size"`
	SHA1Hex   string `json:"sha1hex"`
	SHA256Hex string `json:"sha256hex"`
	MD5Hex    string `json:"md5hex"`
	Mimetype  string `json:"mimetype"`
}
FileInfo groups the size, checksums, and MIME type of a file. All checksums should be lowercase hex digests.
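As an illustration of what "lowercase hex digests" means in practice, the following sketch fills a FileInfo from an in-memory blob using only the standard library. The helper `fileInfoFromBytes` is hypothetical, and using `http.DetectContentType` for the MIME type is an assumption; the package may detect the type differently.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
)

// FileInfo mirrors the struct documented above.
type FileInfo struct {
	Size      int64  `json:"size"`
	SHA1Hex   string `json:"sha1hex"`
	SHA256Hex string `json:"sha256hex"`
	MD5Hex    string `json:"md5hex"`
	Mimetype  string `json:"mimetype"`
}

// fileInfoFromBytes is a hypothetical helper that computes size,
// checksums, and a sniffed MIME type for blob.
func fileInfoFromBytes(blob []byte) FileInfo {
	s1 := sha1.Sum(blob)
	s256 := sha256.Sum256(blob)
	m5 := md5.Sum(blob)
	return FileInfo{
		Size: int64(len(blob)),
		// hex.EncodeToString always emits lowercase digits.
		SHA1Hex:   hex.EncodeToString(s1[:]),
		SHA256Hex: hex.EncodeToString(s256[:]),
		MD5Hex:    hex.EncodeToString(m5[:]),
		// Assumption: content sniffing is good enough for this sketch.
		Mimetype: http.DetectContentType(blob),
	}
}

func main() {
	fi := fileInfoFromBytes([]byte("%PDF-1.4\n"))
	fmt.Printf("%+v\n", fi)
}
```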
type Result ¶
type Result struct {
	SHA1Hex        string            `json:"sha1hex,omitempty"`        // The SHA1 of the PDF, used later as key.
	Status         string            `json:"status,omitempty"`         // A free-form status string.
	Err            error             `json:"err,omitempty"`            // Any error we encountered.
	FileInfo       *FileInfo         `json:"fileinfo,omitempty"`       // Size and checksums.
	Text           string            `json:"text,omitempty"`           // Fulltext as parsed with a tool, e.g. pdftotext.
	Page0Thumbnail []byte            `json:"page0thumbnail,omitempty"` // Thumbnail image, JPEG format.
	MetaXML        string            `json:"metaxml,omitempty"`        // Unassigned.
	Metadata       *pdfinfo.Metadata `json:"metadata,omitempty"`       // Information about the PDF, grouped by tool.
	PDFExtra       *pdfinfo.PDFExtra `json:"pdfextra,omitempty"`       // Extra PDF information, as provided by sandcrawler.
	Source         json.RawMessage   `json:"source,omitempty"`         // Unassigned.
	Weblinks       []string          `json:"weblinks,omitempty"`       // Link candidates extracted from the fulltext.
}
Result is the result of a text and thumbnail extraction from a PDF. The two are combined because the previous implementation used the poppler library to produce both in one pass, for performance. The first processing error encountered is recorded in Err.
func ProcessBlob ¶
ProcessBlob takes a blob and returns a PDF extraction result. TODO(martin): we can make this faster by running the various subprocesses in parallel. TODO(martin): we take a blob from memory only to persist it and run the CLI tools over it; we should not require that much memory.
func ProcessFile ¶
ProcessFile turns a PDF file into structured output.
func (*Result) HasPage0Thumbnail ¶
HasPage0Thumbnail is a derived property.
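The documentation does not show the method body. One plausible reading of "derived property" is that the value is computed from the Page0Thumbnail field rather than stored; the sketch below assumes exactly that, using a trimmed-down Result that carries only the relevant field.

```go
package main

import "fmt"

// Result is reduced to the single field this sketch needs.
type Result struct {
	Page0Thumbnail []byte `json:"page0thumbnail,omitempty"`
}

// HasPage0Thumbnail reports whether any thumbnail bytes are present.
// Assumption: the real method may apply a different check.
func (r *Result) HasPage0Thumbnail() bool {
	return len(r.Page0Thumbnail) > 0
}

func main() {
	with := &Result{Page0Thumbnail: []byte{0xff, 0xd8, 0xff}}
	without := &Result{}
	fmt.Println(with.HasPage0Thumbnail(), without.HasPage0Thumbnail())
}
```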