Documentation
Constants ¶
const Version = "0.3.18"
Version of library and cli tools.
Variables ¶
var (
	ErrFileTooLarge = errors.New("file too large")
	ErrInvalidHash  = errors.New("invalid hash")
	DefaultBucket   = "sandcrawler" // DefaultBucket for S3
)
Functions ¶
This section is empty.
Types ¶
type BlobRequestOptions ¶ added in v0.3.5
type BlobRequestOptions struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}
BlobRequestOptions wraps the blob request options, both for setting and retrieving a blob.
Currently used folder names:
- "pdf" for thumbnails
- "xml_doc" for TEI-XML
- "html_body" for HTML TEI-XML
- "unknown" for generic
Default bucket is "sandcrawler-dev", other buckets via infra:
- "sandcrawler" for sandcrawler_grobid_bucket
- "sandcrawler" for sandcrawler_text_bucket
- "thumbnail" for sandcrawler_thumbnail_bucket
type PutBlobResponse ¶
PutBlobResponse wraps a blob put request response.
type WalkStats ¶ added in v0.3.16
func (*WalkStats) SuccessRatio ¶ added in v0.3.16
type Walker ¶ added in v0.3.16
type Walker struct {
	Dir               string
	NumWorkers        int
	KeepSpool         bool
	GrobidMaxFileSize int64
	Timeout           time.Duration
	Grobid            *grobidclient.Grobid
	S3                *WrapS3
	// contains filtered or unexported fields
}
Walker is a walker that runs postprocessing in parallel.
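The parallel processing that Walker describes can be pictured as a fixed pool of workers draining a queue of file paths. The following is only a sketch under assumptions: runParallel and rejectEmpty are hypothetical names, the per-file work is passed in as a plain function, and nothing here reflects how Walker is actually implemented.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// rejectEmpty is a hypothetical stand-in for per-file postprocessing.
func rejectEmpty(path string) error {
	if path == "" {
		return errors.New("empty path")
	}
	return nil
}

// runParallel is a hypothetical helper, not part of the package. It feeds
// paths to numWorkers goroutines and counts successes and failures.
func runParallel(paths []string, numWorkers int, process func(string) error) (ok, failed int64) {
	if numWorkers < 1 {
		numWorkers = 1
	}
	queue := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				if err := process(p); err != nil {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	for _, p := range paths {
		queue <- p
	}
	close(queue) // lets the workers leave their range loops
	wg.Wait()    // all counters are final after this point
	return ok, failed
}

func main() {
	ok, failed := runParallel([]string{"a.pdf", "", "b.pdf"}, 2, rejectEmpty)
	fmt.Println(ok, failed)
}
```

An unbuffered channel keeps the producer in step with the workers, so memory use stays flat however many files are queued.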
type WebSpoolService ¶ added in v0.3.5
WebSpoolService saves web payload to a configured directory. TODO: add limit in size (e.g. 80% of disk or absolute value)
func (*WebSpoolService) BlobHandler ¶ added in v0.3.5
func (svc *WebSpoolService) BlobHandler(w http.ResponseWriter, r *http.Request)
BlobHandler receives binary blobs and saves them on disk. This handler returns as soon as the file has been written into the spool directory of the service, using a sharded SHA1 as path.
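To illustrate what a sharded SHA1 path might look like, here is a minimal sketch. The helper name shardedPath and the two-level layout (first two hex pairs as directories) are assumptions made for illustration; the layout WebSpoolService actually uses may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// shardedPath is a hypothetical helper. It hashes the blob with SHA1 and
// uses the first two pairs of hex digits as nested directory names, so that
// no single directory accumulates too many files.
func shardedPath(root string, blob []byte) string {
	sum := sha1.Sum(blob)
	digest := hex.EncodeToString(sum[:])
	return filepath.Join(root, digest[0:2], digest[2:4], digest)
}

func main() {
	fmt.Println(shardedPath("spool", []byte("hello")))
}
```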
func (*WebSpoolService) SpoolListHandler ¶ added in v0.3.5
func (svc *WebSpoolService) SpoolListHandler(w http.ResponseWriter, r *http.Request)
SpoolListHandler returns a single, long jsonlines response with information about all files in the spool directory.
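A jsonlines listing of a directory could be produced roughly as follows. This is a sketch, not the handler's code: the spoolEntry record and its fields, writeSpoolList, and demoLines are all hypothetical, since the page does not say which information is reported per file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// spoolEntry is a hypothetical record; the real response fields are not
// documented here.
type spoolEntry struct {
	Path string `json:"path"`
	Size int64  `json:"size"`
}

// writeSpoolList walks dir and writes one JSON object per regular entry.
// json.Encoder.Encode ends each value with a newline, which yields jsonlines.
func writeSpoolList(w io.Writer, dir string) error {
	enc := json.NewEncoder(w)
	return filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		return enc.Encode(spoolEntry{Path: path, Size: info.Size()})
	})
}

// demoLines lists a temporary directory holding two files and returns the
// output split into lines.
func demoLines() ([]string, error) {
	dir, err := os.MkdirTemp("", "spool")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"a.pdf", "b.pdf"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("x"), 0o644); err != nil {
			return nil, err
		}
	}
	var sb strings.Builder
	if err := writeSpoolList(&sb, dir); err != nil {
		return nil, err
	}
	return strings.Split(strings.TrimSpace(sb.String()), "\n"), nil
}

func main() {
	lines, err := demoLines()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(lines), "entries")
}
```

Inside an HTTP handler the same function could write straight to the http.ResponseWriter, so the listing is streamed rather than built up in memory.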
func (*WebSpoolService) SpoolStatusHandler ¶ added in v0.3.5
func (svc *WebSpoolService) SpoolStatusHandler(w http.ResponseWriter, r *http.Request)
SpoolStatusHandler returns HTTP 200 if the given file is in the spool directory and HTTP 404 if it is not.
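The 200/404 behaviour can be sketched with a plain existence check. How the real handler identifies the file is not stated on this page, so the "name" query parameter, the flat directory layout, and the names spoolStatusHandler and probe are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// spoolStatusHandler is a hypothetical stand-in for SpoolStatusHandler. It
// reads the file name from an assumed "name" query parameter and reports
// whether that file exists directly under dir.
func spoolStatusHandler(dir string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// filepath.Base drops directory components from the input.
		name := filepath.Base(r.URL.Query().Get("name"))
		fi, err := os.Stat(filepath.Join(dir, name))
		if err != nil || fi.IsDir() {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe runs the handler against a temporary spool directory and returns
// the status codes for a present and an absent file.
func probe() (int, int, error) {
	dir, err := os.MkdirTemp("", "spool")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "a.pdf"), []byte("x"), 0o644); err != nil {
		return 0, 0, err
	}
	h := spoolStatusHandler(dir)
	status := func(name string) int {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodGet, "/status?name="+name, nil)
		h(rec, req)
		return rec.Code
	}
	return status("a.pdf"), status("b.pdf"), nil
}

func main() {
	found, missing, err := probe()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(found, missing)
}
```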
type WrapS3 ¶ added in v0.3.5
type WrapS3 struct {
Client *minio.Client
}
WrapS3 slightly wraps I/O around our S3 store with convenience methods.
func NewWrapS3 ¶ added in v0.3.5
func NewWrapS3(endpoint string, opts *WrapS3Options) (*WrapS3, error)
NewWrapS3 creates a new, slim wrapper around S3.
func (*WrapS3) PutBlob ¶ added in v0.3.7
func (wrap *WrapS3) PutBlob(ctx context.Context, req *BlobRequestOptions) (*PutBlobResponse, error)
PutBlob puts data into S3 with a key derived from the given options. If the options do not contain the SHA1 of the content, it gets computed here. If no bucket name is given, a default bucket name is used. If the bucket does not exist, it gets created.
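The defaulting rules described for PutBlob (compute the SHA1 when absent, fall back to a default bucket) can be sketched as below. The blobRequest type is a local, hypothetical mirror of BlobRequestOptions and fillDefaults is an illustrative helper; the key layout and the bucket creation step are left out because the page does not specify them.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobRequest is a local, hypothetical mirror of BlobRequestOptions.
type blobRequest struct {
	Folder  string
	Blob    []byte
	SHA1Hex string
	Ext     string
	Prefix  string
	Bucket  string
}

// fillDefaults is a hypothetical helper. It computes the SHA1 of the blob
// when the request carries none and applies the default bucket name when
// the request names no bucket.
func fillDefaults(req *blobRequest, defaultBucket string) {
	if req.SHA1Hex == "" {
		sum := sha1.Sum(req.Blob)
		req.SHA1Hex = hex.EncodeToString(sum[:])
	}
	if req.Bucket == "" {
		req.Bucket = defaultBucket
	}
}

func main() {
	req := &blobRequest{Folder: "pdf", Blob: []byte("hello"), Ext: ".jpg"}
	fillDefaults(req, "sandcrawler")
	fmt.Println(req.Bucket, req.SHA1Hex)
}
```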