Documentation
Constants ¶
const Version = "0.3.19"
Version of library and cli tools.
Variables ¶
var (
	ErrFileTooLarge = errors.New("file too large")
	ErrInvalidHash  = errors.New("invalid hash")
	DefaultBucket   = "sandcrawler" // DefaultBucket for S3
)
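Sentinel errors like these are usually wrapped with context and matched with errors.Is. The following is a minimal sketch of that pattern, not the package's own code: the variables are redeclared locally to keep the example self-contained, and checkSize and checkSHA1Hex are hypothetical helpers (the package may validate sizes and hashes differently).

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the sentinels, so that this sketch compiles on its own.
var (
	ErrFileTooLarge = errors.New("file too large")
	ErrInvalidHash  = errors.New("invalid hash")
)

// checkSize is a hypothetical helper: it wraps ErrFileTooLarge with context
// when size exceeds max.
func checkSize(size, max int64) error {
	if size > max {
		return fmt.Errorf("size %d exceeds limit %d: %w", size, max, ErrFileTooLarge)
	}
	return nil
}

// checkSHA1Hex is a hypothetical validator that accepts exactly 40 lowercase
// hex characters and wraps ErrInvalidHash otherwise.
func checkSHA1Hex(s string) error {
	if len(s) != 40 {
		return fmt.Errorf("%q: %w", s, ErrInvalidHash)
	}
	for _, c := range s {
		isDigit := c >= '0' && c <= '9'
		isHexLetter := c >= 'a' && c <= 'f'
		if !isDigit && !isHexLetter {
			return fmt.Errorf("%q: %w", s, ErrInvalidHash)
		}
	}
	return nil
}

func main() {
	err := checkSize(2048, 1024)
	// errors.Is sees through the wrapping added by %w.
	fmt.Println(errors.Is(err, ErrFileTooLarge))
	fmt.Println(errors.Is(checkSHA1Hex("not-a-hash"), ErrInvalidHash))
}
```

Wrapping with %w keeps the sentinel comparable while still telling the caller which size or hash triggered the failure.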
Functions ¶
This section is empty.
Types ¶
type BlobRequestOptions ¶ added in v0.3.5
type BlobRequestOptions struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}
BlobRequestOptions wraps the blob request options, both for setting and retrieving a blob.
Currently used folder names:
- "pdf" for thumbnails
- "xml_doc" for TEI-XML
- "html_body" for HTML TEI-XML
- "unknown" for generic
Default bucket is "sandcrawler-dev", other buckets via infra:
- "sandcrawler" for sandcrawler_grobid_bucket
- "sandcrawler" for sandcrawler_text_bucket
- "thumbnail" for sandcrawler_thumbnail_bucket
type Payload ¶ added in v0.3.16
Payload is what we pass to workers. Since the worker needs file size information, we pass it along, as the expensive stat has already been performed.
type PutBlobResponse ¶
PutBlobResponse wraps a blob put request response.
type WalkFast ¶ added in v0.3.19
type WalkFast struct {
	Dir               string
	NumWorkers        int
	KeepSpool         bool
	GrobidMaxFileSize int64
	Timeout           time.Duration
	Grobid            *grobidclient.Grobid
	S3                *WrapS3
	// contains filtered or unexported fields
}
WalkFast is a walker that runs postprocessing in parallel.
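To illustrate the general shape of a parallel walker that passes size information to its workers, here is a minimal sketch using only the standard library. It is not the WalkFast implementation: the payload type, its field names, walkParallel and demo are all hypothetical, and the actual GROBID and S3 processing is replaced by a callback.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// payload is a hypothetical stand-in for Payload: the path plus the size
// obtained during the walk, so workers do not need to stat the file again.
type payload struct {
	Path string
	Size int64
}

// walkParallel walks dir, stats each regular file once and hands payloads to
// numWorkers goroutines. Files larger than maxSize are skipped.
func walkParallel(dir string, numWorkers int, maxSize int64, process func(payload)) (processed, skipped int, err error) {
	queue := make(chan payload)
	var (
		wg sync.WaitGroup
		mu sync.Mutex // guards processed
	)
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				process(p)
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	err = filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // directories, symlinks and the like
		}
		info, err := d.Info() // the one stat per file
		if err != nil {
			return err
		}
		if info.Size() > maxSize {
			skipped++
			return nil
		}
		queue <- payload{Path: path, Size: info.Size()}
		return nil
	})
	close(queue) // lets the workers drain the queue and exit
	wg.Wait()
	return processed, skipped, err
}

// demo runs the walk over a throwaway directory with one small and one
// large file.
func demo() (processed, skipped int, err error) {
	dir, err := os.MkdirTemp("", "walk-sketch-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "small.pdf"), make([]byte, 10), 0o644); err != nil {
		return 0, 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "large.pdf"), make([]byte, 100), 0o644); err != nil {
		return 0, 0, err
	}
	return walkParallel(dir, 4, 50, func(p payload) {
		fmt.Printf("%s (%d bytes)\n", filepath.Base(p.Path), p.Size)
	})
}

func main() {
	fmt.Println(demo())
}
```

The unbuffered channel applies backpressure: the walk only advances as fast as the workers can take payloads.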
type WalkStats ¶ added in v0.3.16
WalkStats is a poor man's metrics.
func (*WalkStats) SuccessRatio ¶ added in v0.3.16
SuccessRatio calculates the ratio of successfully processed files to the total number of processed files.
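The fields of WalkStats are not listed here, so the following sketch uses a hypothetical walkStats type with assumed field names to show how such a ratio might be computed, including the case where nothing has been processed yet.

```go
package main

import "fmt"

// walkStats is a hypothetical stand-in; the field names are assumptions.
type walkStats struct {
	Processed int // total number of files handled
	OK        int // files handled without error
}

// SuccessRatio returns OK divided by Processed. It returns 0 when nothing
// has been processed, which avoids the NaN that 0/0 would yield.
func (s *walkStats) SuccessRatio() float64 {
	if s.Processed == 0 {
		return 0
	}
	return float64(s.OK) / float64(s.Processed)
}

func main() {
	s := &walkStats{Processed: 8, OK: 6}
	fmt.Printf("%.2f\n", s.SuccessRatio()) // prints 0.75
}
```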
type WebSpoolService ¶ added in v0.3.5
WebSpoolService saves web payload to a configured directory. TODO: add limit in size (e.g. 80% of disk or absolute value)
func (*WebSpoolService) BlobHandler ¶ added in v0.3.5
func (svc *WebSpoolService) BlobHandler(w http.ResponseWriter, r *http.Request)
BlobHandler receives binary blobs and saves them on disk. This handler returns as soon as the file has been written into the spool directory of the service, using a sharded SHA1 as path.
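A sharded SHA1 path spreads files over subdirectories so that no single directory grows too large. The sketch below shows one common layout; the two-level, two-character sharding and the shardedPath helper are assumptions, as the exact scheme used by the service is not documented here.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// shardedPath returns dir/aa/bb/<sha1hex>, where aa and bb are the first and
// second pair of hex characters of the SHA1 of data. The layout is an
// assumption made for illustration.
func shardedPath(dir string, data []byte) string {
	sum := sha1.Sum(data)
	digest := hex.EncodeToString(sum[:])
	return filepath.Join(dir, digest[0:2], digest[2:4], digest)
}

func main() {
	// "spool" is a hypothetical spool directory.
	fmt.Println(shardedPath("spool", []byte("hello")))
}
```

Because the path depends only on the content, uploading the same blob twice resolves to the same file.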
func (*WebSpoolService) SpoolListHandler ¶ added in v0.3.5
func (svc *WebSpoolService) SpoolListHandler(w http.ResponseWriter, r *http.Request)
SpoolListHandler returns a single, long jsonlines response with information about all files in the spool directory.
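A jsonlines response carries one JSON document per line, which lets clients process the listing as a stream. The following sketch shows the encoding step only; the spoolEntry type and its fields are hypothetical, since the actual response fields are not documented here.

```go
package main

import (
	"encoding/json"
	"io"
	"os"
	"strings"
)

// spoolEntry is a hypothetical record type; the real response fields may differ.
type spoolEntry struct {
	Name string `json:"name"`
	Size int64  `json:"size"`
}

// writeJSONLines writes one JSON object per line to w. json.Encoder appends
// a newline after each value, which is exactly the jsonlines framing.
func writeJSONLines(w io.Writer, entries []spoolEntry) error {
	enc := json.NewEncoder(w)
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	return nil
}

// jsonLinesString renders the entries into a string, for demonstration.
func jsonLinesString(entries []spoolEntry) (string, error) {
	var sb strings.Builder
	err := writeJSONLines(&sb, entries)
	return sb.String(), err
}

func main() {
	entries := []spoolEntry{{Name: "a", Size: 1}, {Name: "b", Size: 2}}
	// An http.ResponseWriter is an io.Writer, so a handler could pass w here.
	if err := writeJSONLines(os.Stdout, entries); err != nil {
		panic(err)
	}
}
```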
func (*WebSpoolService) SpoolStatusHandler ¶ added in v0.3.5
func (svc *WebSpoolService) SpoolStatusHandler(w http.ResponseWriter, r *http.Request)
SpoolStatusHandler returns HTTP 200 if the given file is in the spool directory and HTTP 404 if it is not.
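A handler with this contract can be sketched with os.Stat and exercised with net/http/httptest. This is an illustration under assumptions, not the service's code: the "id" query parameter, the flat directory layout, and the statusHandler, probe and demo helpers are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"path/filepath"
)

// statusHandler reports whether the file named by the hypothetical "id"
// query parameter exists as a regular file in dir.
func statusHandler(dir string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.URL.Query().Get("id")
		// Reject empty names and anything that could leave dir.
		if id == "" || id == "." || id == ".." || id != filepath.Base(id) {
			http.NotFound(w, r)
			return
		}
		info, err := os.Stat(filepath.Join(dir, id))
		if err != nil || info.IsDir() {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe sends a GET request to h and returns the response status code.
func probe(h http.Handler, id string) int {
	req := httptest.NewRequest(http.MethodGet, "/status?id="+url.QueryEscape(id), nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// demo checks one present and one absent file in a throwaway directory.
func demo() (found, missing int, err error) {
	dir, err := os.MkdirTemp("", "spool-sketch-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "abc123"), []byte("x"), 0o644); err != nil {
		return 0, 0, err
	}
	h := statusHandler(dir)
	return probe(h, "abc123"), probe(h, "nope"), nil
}

func main() {
	fmt.Println(demo())
}
```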
type WrapS3 ¶ added in v0.3.5
type WrapS3 struct {
Client *minio.Client
}
WrapS3 is a thin wrapper around our S3 store that adds convenience methods for I/O.
func NewWrapS3 ¶ added in v0.3.5
func NewWrapS3(endpoint string, opts *WrapS3Options) (*WrapS3, error)
NewWrapS3 creates a new, slim wrapper around S3.
func (*WrapS3) PutBlob ¶ added in v0.3.7
func (wrap *WrapS3) PutBlob(ctx context.Context, req *BlobRequestOptions) (*PutBlobResponse, error)
PutBlob puts data into S3 with a key derived from the given options. If the options do not contain the SHA1 of the content, it gets computed here. If no bucket name is given, a default bucket name is used. If the bucket does not exist, it gets created.
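The option handling described above can be sketched without an S3 client. The example below fills in a missing SHA1 and bucket and derives an object key; it is an illustration only. The blobOptions type mirrors the fields of BlobRequestOptions, but the resolve helper and the key layout (prefix, folder, two shard levels, digest, extension) are assumptions, and the bucket creation step is left out.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
)

// blobOptions mirrors the fields of BlobRequestOptions for this sketch.
type blobOptions struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}

// defaultBucket uses the value of DefaultBucket shown in the variables section.
const defaultBucket = "sandcrawler"

// resolve is a hypothetical helper. It computes the SHA1 if it is missing,
// falls back to the default bucket, and returns the bucket and object key.
// The key layout is an assumption, not the documented one.
func resolve(opts *blobOptions) (bucket, key string, err error) {
	if opts.SHA1Hex == "" {
		sum := sha1.Sum(opts.Blob)
		opts.SHA1Hex = hex.EncodeToString(sum[:])
	}
	if len(opts.SHA1Hex) != 40 {
		return "", "", errors.New("invalid hash")
	}
	if opts.Bucket == "" {
		opts.Bucket = defaultBucket
	}
	h := opts.SHA1Hex
	key = fmt.Sprintf("%s%s/%s/%s/%s%s", opts.Prefix, opts.Folder, h[0:2], h[2:4], h, opts.Ext)
	return opts.Bucket, key, nil
}

func main() {
	opts := &blobOptions{Folder: "xml_doc", Blob: []byte("hello"), Ext: ".tei.xml"}
	fmt.Println(resolve(opts))
}
```

With a minio client, the remaining steps would typically be a bucket existence check, bucket creation if needed, and the object upload under the derived key.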