blobrun

package module
v0.3.1
Published: Aug 9, 2024 License: MIT Imports: 7 Imported by: 0

README

blobrun

status: not implemented, just a sketch and notes

For a constant influx of PDF files, we wanted to have a tiny, event-based component that would apply processing to those files, using a hotfolder.

blobrun is a webhook server that can receive raw bytes and execute commands. Original use case: receiving scholarly PDF documents and running a few derivations on them.

This service does not implement any generic features for now.

Mode of operation

blobrun saves each incoming file in a spool folder and then returns, so handling a request should take no longer than writing the file to disk.

A periodic scan of the spool directory picks up new files and processes them, e.g. sends the content to grobid, runs pdftotext, and similar.

These derivations can fail and be retried; there is no time pressure as long as the "spool" directory does not exceed a given limit, e.g. 80% of the free space on the disk.

Once all derivations have run successfully, the file is deleted from the "spool" directory. If the server dies and comes back up, the files in the "spool" directory represent the state.

Derivations

For S3, the key will be the content SHA1.

  • pdftotext, store in S3
  • grobid, store in S3
  • thumbnail, store in S3
  • find all links in fulltext, send to SPNv2
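Deriving the S3 key from the content SHA1 needs only the standard library. The function name `contentKey` is hypothetical, and any prefix, folder, or extension added around the digest (compare the fields of `PutBlobRequest`) is not covered here.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// contentKey returns the hex-encoded SHA1 digest of b, usable as an object key.
func contentKey(b []byte) string {
	sum := sha1.Sum(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(contentKey([]byte("hello")))
	// prints aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
}
```

Because the key depends only on the bytes, re-running a derivation for the same file writes to the same location.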

Backfill

Given a CLI tool to fetch a list of PDFs from PB, we can complete missing derivations.

Documentation

Index

Constants

This section is empty.

Variables

var ErrFileTooLarge = errors.New("file too large")

Functions

This section is empty.

Types

type BlobS3

type BlobS3 struct {
	HostURL       string
	AccessKey     string
	SecretKey     string
	DefaultBucket string
}

BlobS3 slightly wraps I/O around our S3 store.

type ProcessFulltextResult

type ProcessFulltextResult struct {
	StatusCode int
	Status     string
	Error      error
	TEIXML     string
}

ProcessFulltextResult is a wrapped grobid response.

type PutBlobRequest

type PutBlobRequest struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}

type PutBlobResponse

type PutBlobResponse struct {
	Bucket     string
	ObjectPath string
}

type Runner

type Runner struct {
	// SpoolDir is a directory where we expect PDF files.
	SpoolDir string
	// Grobid client wraps grobid service API access.
	Grobid            *grobidclient.Grobid
	MaxGrobidFileSize int64
	ConsolidateMode   bool
	// S3Client wraps access to seaweedfs.
	S3Client *minio.Client
}

Runner runs derivations of a file and stores the results in S3.

func (*Runner) RunGrobid

func (sr *Runner) RunGrobid(filename string) error

func (*Runner) RunPdfThumbnail

func (sr *Runner) RunPdfThumbnail(filename string) error

func (*Runner) RunPdfToText

func (sr *Runner) RunPdfToText(filename string) error
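Since all three methods share the signature `func(filename string) error`, a caller could treat them uniformly. The sketch below declares a hypothetical `deriver` interface that `*Runner` would satisfy, and uses a `fakeRunner` stand-in so it runs without grobid or S3. The order of the steps and the names `deriver`, `deriveAll`, and `fakeRunner` are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// deriver lists the three documented Runner methods.
type deriver interface {
	RunPdfToText(filename string) error
	RunPdfThumbnail(filename string) error
	RunGrobid(filename string) error
}

// deriveAll runs the derivations in a fixed (assumed) order and stops at
// the first failure, labeling the error with the step name.
func deriveAll(d deriver, filename string) error {
	steps := []struct {
		name string
		run  func(string) error
	}{
		{"pdftotext", d.RunPdfToText},
		{"thumbnail", d.RunPdfThumbnail},
		{"grobid", d.RunGrobid},
	}
	for _, s := range steps {
		if err := s.run(filename); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// fakeRunner stands in for a configured Runner in this sketch.
type fakeRunner struct{ failGrobid bool }

func (f fakeRunner) RunPdfToText(string) error    { return nil }
func (f fakeRunner) RunPdfThumbnail(string) error { return nil }
func (f fakeRunner) RunGrobid(string) error {
	if f.failGrobid {
		return errors.New("service unavailable")
	}
	return nil
}

func main() {
	fmt.Println(deriveAll(fakeRunner{}, "a.pdf"))
	fmt.Println(deriveAll(fakeRunner{failGrobid: true}, "a.pdf"))
}
```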

Directories

Path Synopsis
cmd/webspoold
webspoold takes binary blobs via HTTP POST and saves them to disk.
Package pidfile provides structure and helper functions to create and remove PID file.
