Documentation ¶
Constants ¶
const (
	DefaultURLMapHttpHeader = "X-BLOBPROC-URL"
	ExpectedSHA1Length      = 40
)
const Version = "0.4.0"
Version of the library and CLI tools.
Variables ¶
var (
	ErrFileTooLarge = errors.New("file too large")
	ErrInvalidHash  = errors.New("invalid hash")
	DefaultBucket   = "sandcrawler" // DefaultBucket for S3
)
Functions ¶
This section is empty.
Types ¶
type BlobRequestOptions ¶
type BlobRequestOptions struct {
Folder string
Blob []byte
SHA1Hex string
Ext string
Prefix string
Bucket string
}
BlobRequestOptions wraps the blob request options, both for setting and retrieving a blob.
Currently used folder names:
- "pdf" for thumbnails
- "xml_doc" for TEI-XML
- "html_body" for HTML TEI-XML
- "unknown" for generic
Default bucket is DefaultBucket ("sandcrawler"), other buckets via infra:
- "sandcrawler" for sandcrawler_grobid_bucket
- "sandcrawler" for sandcrawler_text_bucket
- "thumbnail" for sandcrawler_thumbnail_bucket
type BlobStore ¶
BlobStore is a thin wrapper around I/O against our S3 store, with convenience methods.
func NewBlobStore ¶
func NewBlobStore(endpoint string, opts *BlobStoreOptions) (*BlobStore, error)
NewBlobStore creates a new, slim wrapper around S3.
func (*BlobStore) PutBlob ¶
func (bs *BlobStore) PutBlob(ctx context.Context, req *BlobRequestOptions) (*PutBlobResponse, error)
PutBlob puts data into S3 with a key derived from the given options. If the options do not contain the SHA1 of the content, it gets computed here. If no bucket name is given, a default bucket name is used. If the bucket does not exist, it gets created.
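The "compute the SHA1 if it is missing" step can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: resolveSHA1 is a hypothetical helper, the lowercase constant and error only mirror ExpectedSHA1Length and ErrInvalidHash, and whether PutBlob validates a caller-supplied digest this way is an assumption.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
)

// expectedSHA1Length mirrors ExpectedSHA1Length (40 hex characters).
const expectedSHA1Length = 40

// errInvalidHash mirrors ErrInvalidHash; redeclared to keep the sketch self-contained.
var errInvalidHash = errors.New("invalid hash")

// resolveSHA1 is a hypothetical helper. It computes the hex digest of blob
// when sha1Hex is empty; otherwise it checks that sha1Hex has the expected
// length and is valid hex, and returns it unchanged.
func resolveSHA1(blob []byte, sha1Hex string) (string, error) {
	if sha1Hex == "" {
		sum := sha1.Sum(blob)
		return hex.EncodeToString(sum[:]), nil
	}
	if len(sha1Hex) != expectedSHA1Length {
		return "", errInvalidHash
	}
	if _, err := hex.DecodeString(sha1Hex); err != nil {
		return "", errInvalidHash
	}
	return sha1Hex, nil
}

func main() {
	digest, err := resolveSHA1([]byte("hello"), "")
	fmt.Println(digest, err)
}
```

Validating a supplied digest before use keeps malformed values out of any key derived from it.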
type BlobStoreOptions ¶
BlobStoreOptions mostly contains pass-through options for the minio client. Keys are taken from the environment, e.g. ...BLOB_ACCESS_KEY
type LimitedReader ¶
LimitedReader wraps an io.Reader and limits the number of bytes that can be read.
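The fields and exact behavior of LimitedReader are not shown here, so the following is only a sketch of one way such a reader could work. The type limitedReader, its constructor, and the choice to fail with a "file too large" error once the limit is exceeded are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errFileTooLarge mirrors ErrFileTooLarge; redeclared for a self-contained sketch.
var errFileTooLarge = errors.New("file too large")

// limitedReader is a hypothetical stand-in for LimitedReader.
type limitedReader struct {
	r         io.Reader
	remaining int64
}

// newLimitedReader allows limit+1 bytes internally, so that a stream of
// exactly limit bytes still ends with io.EOF while anything longer fails.
func newLimitedReader(r io.Reader, limit int64) *limitedReader {
	return &limitedReader{r: r, remaining: limit + 1}
}

// Read forwards to the wrapped reader until the byte budget is used up.
func (l *limitedReader) Read(p []byte) (int, error) {
	if l.remaining <= 0 {
		return 0, errFileTooLarge
	}
	if int64(len(p)) > l.remaining {
		p = p[:l.remaining]
	}
	n, err := l.r.Read(p)
	l.remaining -= int64(n)
	return n, err
}

// readAllLimited reads s through a limitedReader with the given limit.
func readAllLimited(s string, limit int64) ([]byte, error) {
	return io.ReadAll(newLimitedReader(strings.NewReader(s), limit))
}

func main() {
	_, err := readAllLimited("abcdef", 3)
	fmt.Println(err)
	b, err := readAllLimited("abc", 3)
	fmt.Println(string(b), err)
}
```

Unlike io.LimitReader, which silently reports EOF at the limit, this variant surfaces an error, so a caller can tell a truncated read from a complete one.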
type Payload ¶
Payload is what we pass to workers. Since the worker needs file size information, we pass it along, as the expensive stat has already been performed.
type ProcessPDFParams ¶
type ProcessPDFParams struct {
Path string
Size int64
Grobid *grobidclient.Grobid
S3 *BlobStore
GrobidMaxFileSize int64
Logger *slog.Logger
}
ProcessPDFParams configures a single PDF processing run.
Grobid and S3 are both optional; a nil client causes the corresponding derivative step to be logged and skipped. Logger defaults to slog.Default().
type ProcessPDFResult ¶
type ProcessPDFResult struct {
SHA1Hex string // sha1 of the input file
Thumbnail []byte // page-0 JPEG thumbnail
Text string // extracted plain text
TEI []byte // GROBID TEI XML body
}
ProcessPDFResult collects the derivatives extracted from a PDF. Library callers can use this directly (e.g. with S3 = nil) instead of (or in addition to) the S3 uploads ProcessPDF performs when an S3 client is configured. Any field may be empty if the corresponding step was skipped or failed; consult the returned errors for details.
func ProcessPDF ¶
func ProcessPDF(ctx context.Context, p ProcessPDFParams) (*ProcessPDFResult, []error)
ProcessPDF runs the full per-file pipeline against a PDF on disk: pdfextract for text + page-0 thumbnail, then GROBID for structured TEI. When an S3 client is configured each derivative is uploaded to its conventional bucket/folder; the same data is also returned in the result so callers can use ProcessPDF as a library function. The returned result is always non-nil, with fields populated as they become available. The errors slice collects every error encountered; an empty (or nil) slice means the run was fully successful. The caller is responsible for stats accounting and removing the file from the spool.
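The return contract described above (a result that is never nil, plus a slice collecting every error) can be illustrated with a small stand-alone sketch. The types and functions below (result, step, runAll, demo) are hypothetical and are not part of this package; the sketch also assumes that later steps still run after an earlier one fails.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a hypothetical, trimmed-down analogue of ProcessPDFResult.
type result struct {
	Text string
	TEI  []byte
}

// step is a hypothetical pipeline stage that fills in part of the result.
type step func(r *result) error

// runAll runs every step, collecting errors instead of stopping at the first
// one. The returned result is always non-nil; fields belonging to failed
// steps simply stay empty.
func runAll(steps ...step) (*result, []error) {
	res := &result{}
	var errs []error
	for _, s := range steps {
		if err := s(res); err != nil {
			errs = append(errs, err)
		}
	}
	return res, errs
}

// demo simulates a successful text extraction followed by a failing TEI step.
func demo() (*result, []error) {
	return runAll(
		func(r *result) error { r.Text = "plain text"; return nil },
		func(r *result) error { return errors.New("tei step failed") },
	)
}

func main() {
	res, errs := demo()
	fmt.Printf("text=%q tei=%d bytes errors=%d\n", res.Text, len(res.TEI), len(errs))
	for _, err := range errs {
		fmt.Println("error:", err)
	}
}
```

A caller following this contract checks len(errs) == 0 for full success and can still use whichever fields were populated otherwise.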
type PutBlobResponse ¶
PutBlobResponse wraps a blob put request response.
type URLMap ¶
type URLMap struct {
Path string
// contains filtered or unexported fields
}
URLMap wraps an sqlite3 database for URL and SHA1 lookups.
type WalkFast ¶
type WalkFast struct {
Dir string
NumWorkers int
KeepSpool bool
GrobidMaxFileSize int64
Timeout time.Duration
Grobid *grobidclient.Grobid
S3 *BlobStore
// contains filtered or unexported fields
}
WalkFast is a walker that runs postprocessing in parallel.
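How WalkFast distributes work internally is not documented here. As a rough sketch of the general idea behind a NumWorkers setting, the following fans a list of payloads out to a fixed number of goroutines. The payload type, processAll, and demoWalk are hypothetical names, and the counters merely hint at the kind of bookkeeping WalkStats might do.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// payload is a hypothetical analogue of Payload: a path plus its known size.
type payload struct {
	Path string
	Size int64
}

// processAll sends payloads to numWorkers goroutines and returns how many
// calls to fn succeeded and how many were made in total.
func processAll(payloads []payload, numWorkers int, fn func(payload) error) (ok, total int64) {
	if numWorkers < 1 {
		numWorkers = 1
	}
	queue := make(chan payload)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				atomic.AddInt64(&total, 1)
				if err := fn(p); err == nil {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	for _, p := range payloads {
		queue <- p
	}
	close(queue) // lets the workers' range loops finish
	wg.Wait()
	return ok, total
}

// demoWalk processes three payloads and rejects the one above a size cap.
func demoWalk() (ok, total int64) {
	payloads := []payload{{"a.pdf", 10}, {"b.pdf", 500}, {"c.pdf", 20}}
	return processAll(payloads, 2, func(p payload) error {
		if p.Size > 100 {
			return errors.New("file too large")
		}
		return nil
	})
}

func main() {
	ok, total := demoWalk()
	fmt.Printf("%d of %d succeeded\n", ok, total)
}
```

An unbuffered channel keeps the producer in step with the workers, so memory use stays flat regardless of how many files are queued.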
type WalkStats ¶
WalkStats is a poor man's metrics.
func (*WalkStats) SuccessRatio ¶
SuccessRatio calculates the ratio of successful to total processed files.
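The fields of WalkStats are not listed here, so the following only sketches the arithmetic. The type walkStats and its Processed and Success fields are hypothetical, as is the choice to report 0 when nothing has been processed yet.

```go
package main

import "fmt"

// walkStats is a hypothetical stand-in for WalkStats with assumed field names.
type walkStats struct {
	Processed int64 // total files handled
	Success   int64 // files handled without error
}

// SuccessRatio returns Success / Processed, or 0 if nothing was processed,
// which avoids a division by zero (NaN for floats).
func (ws *walkStats) SuccessRatio() float64 {
	if ws.Processed == 0 {
		return 0
	}
	return float64(ws.Success) / float64(ws.Processed)
}

func main() {
	ws := &walkStats{Processed: 4, Success: 3}
	fmt.Println(ws.SuccessRatio())
}
```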
type WebSpoolService ¶
type WebSpoolService struct {
Dir string
ListenAddr string
// TODO: add a (optional) reference to a store for url content hashes; it
// would be good to keep it optional (so one may just copy files into the
// spool folder), and maybe to provide a simple interface that can be
// easily fulfilled by different backend.
URLMap *URLMap
// The HTTP header to look for a URL associated with a pdf blob payload.
URLMapHttpHeader string
// Minimum required free disk space percentage (default 10%)
MinFreeDiskPercent int
// Maximum allowed file size (default 0 = no limit)
MaxFileSize int64
}
WebSpoolService saves web payloads to a configured directory. TODO: add a size limit (e.g. 80% of disk or an absolute value)
func (*WebSpoolService) BlobHandler ¶
func (svc *WebSpoolService) BlobHandler(w http.ResponseWriter, r *http.Request)
BlobHandler receives binary blobs and saves them on disk. This handler returns as soon as the file has been written into the spool directory of the service, using a sharded SHA1 as path.
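The exact sharding layout is not specified in this documentation. One common scheme, shown here purely as an assumption, uses the first two hex characters of the digest as a subdirectory so that no single directory grows too large. The helper shardedPath and its layout are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"path/filepath"
)

// errInvalidHash mirrors ErrInvalidHash; redeclared for a self-contained sketch.
var errInvalidHash = errors.New("invalid hash")

// shardedPath is a hypothetical helper that maps a 40-character hex SHA1 to
// dir/<first 2 chars>/<remaining 38 chars>. The real layout may differ.
func shardedPath(dir, sha1Hex string) (string, error) {
	if len(sha1Hex) != 40 {
		return "", errInvalidHash
	}
	// Rejecting non-hex input also rules out path separators and "..".
	if _, err := hex.DecodeString(sha1Hex); err != nil {
		return "", errInvalidHash
	}
	return filepath.Join(dir, sha1Hex[:2], sha1Hex[2:]), nil
}

func main() {
	p, err := shardedPath("spool", "da39a3ee5e6b4b0d3255bfef95601890afd80709")
	fmt.Println(p, err)
}
```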
func (*WebSpoolService) SpoolListHandler ¶
func (svc *WebSpoolService) SpoolListHandler(w http.ResponseWriter, r *http.Request)
SpoolListHandler returns a single, long jsonlines response with information about all files in the spool directory.
func (*WebSpoolService) SpoolStatusHandler ¶
func (svc *WebSpoolService) SpoolStatusHandler(w http.ResponseWriter, r *http.Request)
SpoolStatusHandler returns HTTP 200 if the given file is in the spool directory and HTTP 404 if it is not.
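As an illustration of the 200/404 behavior, here is a minimal stand-alone handler. It is a sketch under several assumptions and not the service's code: the route, the "id" query parameter, the flat (unsharded) directory lookup, and the names statusHandler, probe, and demo are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"regexp"
)

// sha1Pattern matches a lowercase hex SHA1 digest (40 characters).
var sha1Pattern = regexp.MustCompile(`^[0-9a-f]{40}$`)

// statusHandler is a hypothetical handler. It looks for a file named after
// the "id" query parameter directly inside dir (no sharding, for brevity).
func statusHandler(dir string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.URL.Query().Get("id")
		if !sha1Pattern.MatchString(id) {
			http.Error(w, "invalid id", http.StatusBadRequest)
			return
		}
		// Any stat error is treated as "not in spool" in this sketch.
		if _, err := os.Stat(filepath.Join(dir, id)); err != nil {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe issues an in-memory GET request and returns the status code.
func probe(h http.Handler, id string) int {
	req := httptest.NewRequest(http.MethodGet, "/status?id="+id, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// demo creates a temporary spool with one file and probes a hit and a miss.
func demo() (hit, miss int, err error) {
	dir, err := os.MkdirTemp("", "spool-*")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	present := "da39a3ee5e6b4b0d3255bfef95601890afd80709"
	absent := "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"
	if err := os.WriteFile(filepath.Join(dir, present), nil, 0o644); err != nil {
		return 0, 0, err
	}
	h := statusHandler(dir)
	return probe(h, present), probe(h, absent), nil
}

func main() {
	hit, miss, err := demo()
	fmt.Println(hit, miss, err)
}
```

Validating the identifier before touching the filesystem keeps path traversal attempts from ever reaching os.Stat.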
Directories ¶
| Path | Synopsis |
|---|---|
| | Package cdx wraps CDX records. |
| cmd | |
| blobfetch (command) | blobfetch finds and fetches files from archive collections to be put into a spool folder for postprocessing. |
| blobproc (command) | |
| | Package dedent: https://github.com/lithammer/dedent |
