blobproc

package module
v0.3.20 Latest
Published: Aug 28, 2024 License: MIT Imports: 21 Imported by: 0

README

BLOBPROC

status: testing

BLOBPROC is a shrink-wrapped version of the PDF blob postprocessing found in sandcrawler. Specifically, it is designed to process and persist documents without any extra components, like a database or a separate queuing system, and to do this in a performant, reliable, boring and observable way.

BLOBPROC contains two components:

  • blobprocd exposes an HTTP server that receives binary data and stores it in a spool folder (see the sketch after this list)
  • blobproc is a process that scans the spool folder, executes postprocessing tasks on each PDF, and removes the file from the spool if all processing succeeded
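
A minimal client sketch for the first component: pushing one PDF into the spool over HTTP with the Go standard library. The route is an assumption; the port matches the default -addr of blobprocd shown below.

package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    f, err := os.Open("file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // Route "/" is an assumption; the port matches the default -addr of blobprocd.
    resp, err := http.Post("http://localhost:8000/", "application/octet-stream", f)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    io.Copy(io.Discard, resp.Body)
    log.Println(resp.Status)
}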

In our case blobproc will execute the following tasks:

  • send PDF to grobid and store the result in S3
  • generate text from PDF and store the result in S3
  • generate a thumbnail from a PDF and store the result in S3
  • find all weblinks in PDF text and send them to a crawl API

More tasks can be added by extending blobproc itself. A focus remains on simple deployment via an OS distribution package.
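
A hypothetical shape for such an extension, using the Payload type documented further down; the Task signature and the module path are assumptions for illustration, not an existing API.

package tasks

import (
    "context"

    "github.com/miku/blobproc" // module path assumed
)

// Task is a made-up signature for a per-file postprocessing step; blobproc
// does not expose such a hook today.
type Task func(ctx context.Context, p *blobproc.Payload) error

// logSize is a toy task; a real one would derive data from p.Path and
// persist the result to S3.
func logSize(ctx context.Context, p *blobproc.Payload) error {
    _ = p.FileInfo.Size() // the walker already did the stat
    return nil
}

// compile-time check that the toy task satisfies the made-up Task type
var _ Task = logSize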

Mode of operation

  • receive blobs over HTTP, e.g. from heritrix, curl, or some backfill process
  • regularly scan the spool dir and process the files found there

Usage

Server component.

$ blobprocd -h
Usage of blobprocd:
  -T duration
        server timeout (default 15s)
  -access-log string
        server access logfile, none if empty
  -addr string
        host port to listen on (default "0.0.0.0:8000")
  -debug
        switch to log level DEBUG
  -log string
        structured log output file, stderr if empty
  -spool string
         (default "/home/tir/.local/share/blobproc/spool")
  -version
        show version

Processing command line tool.

$ blobproc -h
blobproc - process and persist PDF documents derivations

Emit JSON with locally extracted data:

  $ blobproc -f file.pdf | jq .

Flags

  -T duration
        subprocess timeout (default 5m0s)
  -debug
        more verbose output
  -f string
        process a single file (local tools only), for testing
  -grobid-host string
        grobid host, cf. https://is.gd/3wnssq (default "http://localhost:8070")
  -grobid-max-filesize int
        max file size to send to grobid in bytes (default 268435456)
  -k    keep files in spool after processing, mainly for debugging
  -logfile string
        structured log output file, stderr if empty
  -s3-access-key string
        S3 access key (default "minioadmin")
  -s3-endpoint string
        S3 endpoint (default "localhost:9000")
  -s3-secret-key string
        S3 secret key (default "minioadmin")
  -spool string
         (default "/home/tir/.local/share/blobproc/spool")
  -version
        show version

Performance data points

The initial, unoptimized version would process about 25 PDF docs/minute or 36K PDFs/day. We were able to crawl much faster than that, e.g. we reached 63G of captured data (not all PDF) after about 4 hours. GROBID should be able to handle up to 10 docs/s.

Scaling

  • TODO: tasks will run in parallel, e.g. text extraction, thumbnail generation and grobid all run in parallel, but we process files one by one for now
  • TODO: we should be able to configure a pool of grobid hosts to send requests to

Backfill

  • point to a CDX file, a crawl collection or similar and have all PDF files sent to BLOBPROC, even if this may take days or weeks; a rough sketch follows below
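
A rough backfill sketch, assuming a plain CDX file and the public wayback replay endpoint; an internal replay service would work the same way, and the blobprocd route is again an assumption.

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("captures.cdx")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        // Assumed plain CDX layout: fields[1]=timestamp, fields[2]=URL, fields[3]=mimetype.
        if len(fields) < 4 || fields[3] != "application/pdf" {
            continue
        }
        // The id_ flag asks wayback for the unmodified capture bytes.
        replay := fmt.Sprintf("https://web.archive.org/web/%sid_/%s", fields[1], fields[2])
        resp, err := http.Get(replay)
        if err != nil {
            log.Println(err)
            continue
        }
        if resp.StatusCode == 200 {
            // Forward the bytes to blobprocd; the route is an assumption.
            post, err := http.Post("http://localhost:8000/", "application/octet-stream", resp.Body)
            if err != nil {
                log.Println(err)
            } else {
                io.Copy(io.Discard, post.Body)
                post.Body.Close()
            }
        }
        resp.Body.Close()
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}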

TODO

  • pluggable write backend for testing, e.g. just log what would happen
  • log performance measures
  • grafana

Notes

This tool should cover most of the following areas from sandcrawler:

  • run_grobid_extract
  • run_pdf_extract
  • run_persist_grobid
  • run_persist_pdftext
  • run_persist_thumbnail

Including the references workers.

Performance: processing 1605 PDFs in parallel via fd ... -x, 1515 succeeded, at 2.23 docs/s, or about 200K docs per day.

real    11m0.767s
user    73m57.763s
sys     5m55.393s

Documentation

Index

Constants

const Version = "0.3.19"

Version of the library and CLI tools.

Variables

var (
	ErrFileTooLarge = errors.New("file too large")
	ErrInvalidHash  = errors.New("invalid hash")
	DefaultBucket   = "sandcrawler" // DefaultBucket for S3
)

Functions

This section is empty.

Types

type BlobRequestOptions added in v0.3.5

type BlobRequestOptions struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}

BlobRequestOptions wraps the blob request options, both for setting and retrieving a blob.

Currently used folder names:

- "pdf" for thumbnails - "xml_doc" for TEI-XML - "html_body" for HTML TEI-XML - "unknown" for generic

Default bucket is "sandcrawler-dev", other buckets via infra:

- "sandcrawler" for sandcrawler_grobid_bucket - "sandcrawler" for sandcrawler_text_bucket - "thumbnail" for sandcrawler_thumbnail_bucket

type Payload added in v0.3.16

type Payload struct {
	Path     string
	FileInfo fs.FileInfo
}

Payload is what we pass to workers. Since the worker needs file size information, we pass it along, as the expensive stat has already been performed.
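
For example, a test might construct a Payload by hand (module path assumed):

package main

import (
    "log"
    "os"

    "github.com/miku/blobproc" // module path assumed
)

func main() {
    info, err := os.Stat("file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    // Hand the fs.FileInfo over together with the path, so workers do not stat again.
    p := blobproc.Payload{Path: "file.pdf", FileInfo: info}
    log.Println(p.Path, p.FileInfo.Size())
}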

type PutBlobResponse

type PutBlobResponse struct {
	Bucket     string
	ObjectPath string
}

PutBlobResponse wraps a blob put request response.

type WalkFast added in v0.3.19

type WalkFast struct {
	Dir               string
	NumWorkers        int
	KeepSpool         bool
	GrobidMaxFileSize int64
	Timeout           time.Duration
	Grobid            *grobidclient.Grobid
	S3                *WrapS3
	// contains filtered or unexported fields
}

WalkFast is a walker that runs postprocessing in parallel.

func (*WalkFast) Run added in v0.3.19

func (w *WalkFast) Run(ctx context.Context) error

Run starts processing files. It does some basic sanity checks before setting up workers, as we do not have a constructor function.
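
A sketch of driving the walker directly, assuming the module path github.com/miku/blobproc; the grobidclient constructor shown is an assumption, check that package for its actual API.

package main

import (
    "context"
    "log"
    "time"

    "github.com/miku/blobproc"     // module path assumed
    "github.com/miku/grobidclient" // constructor below is an assumption
)

func main() {
    wrap, err := blobproc.NewWrapS3("localhost:9000", &blobproc.WrapS3Options{
        AccessKey: "minioadmin",
        SecretKey: "minioadmin",
    })
    if err != nil {
        log.Fatal(err)
    }
    walker := &blobproc.WalkFast{
        Dir:               "/var/spool/blobproc", // spool dir, adjust to deployment
        NumWorkers:        4,
        GrobidMaxFileSize: 256 * 1024 * 1024,
        Timeout:           5 * time.Minute,
        Grobid:            grobidclient.New("http://localhost:8070"), // signature assumed
        S3:                wrap,
    }
    if err := walker.Run(context.Background()); err != nil {
        log.Fatal(err)
    }
}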

type WalkStats added in v0.3.16

type WalkStats struct {
	Processed int
	OK        int
}

WalkStats are a poor man's metrics.

func (*WalkStats) SuccessRatio added in v0.3.16

func (ws *WalkStats) SuccessRatio() float64

SuccessRatio calculates the ratio of successful to total processed files.
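
For example, with the numbers from the performance note above (module path assumed):

package main

import (
    "log"

    "github.com/miku/blobproc" // module path assumed
)

func main() {
    // Numbers taken from the performance note in the README.
    ws := &blobproc.WalkStats{Processed: 1605, OK: 1515}
    log.Printf("success ratio: %.2f", ws.SuccessRatio())
}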

type WebSpoolService added in v0.3.5

type WebSpoolService struct {
	Dir        string
	ListenAddr string
}

WebSpoolService saves web payloads to a configured directory. TODO: add a size limit (e.g. 80% of disk or an absolute value)

func (*WebSpoolService) BlobHandler added in v0.3.5

func (svc *WebSpoolService) BlobHandler(w http.ResponseWriter, r *http.Request)

BlobHandler receives binary blobs and saves them on disk. This handler returns as soon as the file has been written into the spool directory of the service, using a sharded SHA1 as path.

func (*WebSpoolService) SpoolListHandler added in v0.3.5

func (svc *WebSpoolService) SpoolListHandler(w http.ResponseWriter, r *http.Request)

SpoolListHandler returns a single, long jsonlines response with information about all files in the spool directory.

func (*WebSpoolService) SpoolStatusHandler added in v0.3.5

func (svc *WebSpoolService) SpoolStatusHandler(w http.ResponseWriter, r *http.Request)

SpoolStatusHandler returns HTTP 200 if a given file is in the spool directory and HTTP 404 if it is not.
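
A sketch of serving the spool handlers with net/http; the routes chosen here are assumptions, blobprocd wires up its own (module path assumed).

package main

import (
    "log"
    "net/http"

    "github.com/miku/blobproc" // module path assumed
)

func main() {
    svc := &blobproc.WebSpoolService{
        Dir:        "/var/spool/blobproc", // adjust to deployment
        ListenAddr: "0.0.0.0:8000",
    }
    mux := http.NewServeMux()
    mux.HandleFunc("/", svc.BlobHandler)              // POST or PUT a blob
    mux.HandleFunc("/spool", svc.SpoolListHandler)    // list spooled files
    mux.HandleFunc("/status", svc.SpoolStatusHandler) // check for a single file
    log.Fatal(http.ListenAndServe(svc.ListenAddr, mux))
}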

type WrapS3 added in v0.3.5

type WrapS3 struct {
	Client *minio.Client
}

WrapS3 thinly wraps our S3 store with convenience I/O methods.

func NewWrapS3 added in v0.3.5

func NewWrapS3(endpoint string, opts *WrapS3Options) (*WrapS3, error)

NewWrapS3 creates a new, slim wrapper around S3.

func (*WrapS3) GetBlob added in v0.3.7

func (wrap *WrapS3) GetBlob(ctx context.Context, req *BlobRequestOptions) ([]byte, error)

GetBlob returns the object bytes given a blob request.

func (*WrapS3) PutBlob added in v0.3.7

func (wrap *WrapS3) PutBlob(ctx context.Context, req *BlobRequestOptions) (*PutBlobResponse, error)

PutBlob puts data into S3 with a key derived from the given options. If the options do not contain the SHA1 of the content, it gets computed here. If no bucket name is given, a default bucket name is used. If the bucket does not exist, it gets created.
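
Reading a blob back is symmetric; the sketch below assumes the object key is derived from folder, SHA1 and extension, mirroring PutBlob (module path assumed).

package main

import (
    "context"
    "crypto/sha1"
    "encoding/hex"
    "log"

    "github.com/miku/blobproc" // module path assumed
)

func main() {
    wrap, err := blobproc.NewWrapS3("localhost:9000", &blobproc.WrapS3Options{
        AccessKey: "minioadmin",
        SecretKey: "minioadmin",
    })
    if err != nil {
        log.Fatal(err)
    }
    digest := sha1.Sum([]byte("<TEI/>")) // SHA1 of the previously stored content
    b, err := wrap.GetBlob(context.Background(), &blobproc.BlobRequestOptions{
        Folder:  "xml_doc",
        SHA1Hex: hex.EncodeToString(digest[:]),
        Ext:     ".tei.xml", // extension value is an assumption
        Bucket:  "sandcrawler",
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println(len(b), "bytes")
}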

type WrapS3Options added in v0.3.5

type WrapS3Options struct {
	AccessKey     string
	SecretKey     string
	DefaultBucket string
	UseSSL        bool
}

WrapS3Options mostly contains pass-through options for the minio client. Keys come from the environment, e.g. ...BLOB_ACCESS_KEY

Directories

Path Synopsis
cmd
blobprocd
blobprocd takes blobs via HTTP POST or PUT and saves them to disk.
