Redirected from github.com/miku/blobproc.

blobrun


Version: v0.3.1 (latest). Published: Aug 9, 2024. License: MIT. Imports: 7. Imported by: 0.


Repository

github.com/miku/blobrun

README

blobrun

status: not implemented, just a sketch and notes

For a constant influx of PDF files, we wanted to have a tiny, event-based component that would apply processing to those files, using a hotfolder.

blobrun is a webhook server that can receive raw bytes and execute commands. Original use case: receiving scholarly PDF documents and running a few derivations on them.

This service does not implement any generic features for now.

Mode of operation

blobrun saves each incoming file in a spool folder and then returns, so handling a request should take no longer than the time it takes to write the file to disk.

A periodic scan of the spool directory picks up new files and processes them, e.g. sends the content to grobid, runs pdftotext, and similar.

These derivations can fail and be retried; there is no time pressure, as long as the "spool" directory does not exceed a given limit, e.g. 80% of the free space on the disk.

Once all derivations have run successfully, the file is deleted from the "spool" directory. If the server dies and comes back up, the files in the "spool" directory represent the state.

Derivations

For S3, the key will be the content SHA1.

pdftotext, store in S3
grobid, store in S3
thumbnail, store in S3
find all links in fulltext, send to SPNv2
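Deriving the S3 key from the content can be sketched with the standard library alone. The helper name `contentKey` is hypothetical, and any prefix, folder or extension added around the hash in the real object path is not shown.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// contentKey returns the lowercase hex SHA1 of blob, usable as an object key.
func contentKey(blob []byte) string {
	sum := sha1.Sum(blob)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(contentKey([]byte("hello"))) // aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
}
```

Identical files map to the same key, so storing a derivation twice overwrites the same object instead of creating a duplicate.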

Backfill

Given a CLI tool that fetches a list of PDFs from PB, we can complete missing derivations.

Documentation


Variables

var ErrFileTooLarge = errors.New("file too large")


Types

type BlobS3

type BlobS3 struct {
	HostURL       string
	AccessKey     string
	SecretKey     string
	DefaultBucket string
}

BlobS3 slightly wraps I/O around our S3 store.

type ProcessFulltextResult

type ProcessFulltextResult struct {
	StatusCode int
	Status     string
	Error      error
	TEIXML     string
}

ProcessFulltextResult is a wrapped grobid response.

type PutBlobRequest

type PutBlobRequest struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}

type PutBlobResponse

type PutBlobResponse struct {
	Bucket     string
	ObjectPath string
}

type Runner

type Runner struct {
	// SpoolDir is a directory where we expect PDF files.
	SpoolDir string
	// Grobid client wraps grobid service API access.
	Grobid            *grobidclient.Grobid
	MaxGrobidFileSize int64
	ConsolidateMode   bool
	// S3Client wraps access to seaweedfs.
	S3Client *minio.Client
}

Runner runs derivations on a file and also stores the results in S3.

func (*Runner) RunGrobid

func (sr *Runner) RunGrobid(filename string) error

func (*Runner) RunPdfThumbnail

func (sr *Runner) RunPdfThumbnail(filename string) error

func (*Runner) RunPdfToText

func (sr *Runner) RunPdfToText(filename string) error

Source Files

runner.go

Directories

Path	Synopsis
cmd
spoolrun
webspoold	webspoold takes binary blobs via HTTP POST and saves them to disk.
grobid
pidfile	Package pidfile provides a structure and helper functions to create and remove a PID file.
