Documentation
Overview
Package pipeline provides composable data processing stages and export writers for the Foxhound scraping framework.
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Chain
type Chain struct {
	// contains filtered or unexported fields
}
Chain combines multiple Pipeline stages into a single Pipeline. Stages are executed in the order they were provided to NewChain. If any stage returns nil (drop), the item is dropped and subsequent stages are skipped. If any stage returns an error, processing stops and the error is returned immediately.
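The drop and error rules above can be sketched with local stand-in types. This is a minimal illustration, not the package's implementation: `Item`, `Stage`, and `runChain` are hypothetical names, and the single-argument stage signature is an assumption, since the actual Pipeline interface is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Item is a hypothetical stand-in for foxhound.Item.
type Item struct {
	Fields map[string]any
}

// Stage is an assumed stage shape: it returns the item to pass on,
// nil to drop the item, or an error.
type Stage func(item *Item) (*Item, error)

// runChain applies stages in order, stopping at the first drop or error.
func runChain(item *Item, stages ...Stage) (*Item, error) {
	for _, stage := range stages {
		out, err := stage(item)
		if err != nil {
			return nil, err // stop immediately on error
		}
		if out == nil {
			return nil, nil // dropped: later stages are skipped
		}
		item = out
	}
	return item, nil
}

func main() {
	calls := 0
	count := func(it *Item) (*Item, error) { calls++; return it, nil }
	drop := func(it *Item) (*Item, error) { return nil, nil }
	fail := func(it *Item) (*Item, error) { return nil, errors.New("stage failed") }

	// The stage after drop never runs, so calls stays at 1.
	out, err := runChain(&Item{}, count, drop, count)
	fmt.Println(out == nil, err, calls)

	// The stage after fail never runs either; the error is returned as-is.
	out, err = runChain(&Item{}, count, fail, count)
	fmt.Println(out == nil, err, calls)
}
```

In both runs the trailing stage is never invoked, which is the behaviour the description attributes to Chain.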
type Clean
type Clean struct {
	// TrimWhitespace calls strings.TrimSpace on every string field value.
	TrimWhitespace bool
	// StripHTML removes HTML tags (matching <…>) from every string field value.
	StripHTML bool
	// NormalizePrice converts currency strings like "$1,234.56" to float64.
	// The $, €, £ currency prefixes and comma separators are removed before parsing.
	// Non-parseable strings are left unchanged.
	NormalizePrice bool
	// NormalizeDate parses common date strings and rewrites them as "2006-01-02".
	// Unrecognised strings are left unchanged.
	NormalizeDate bool
}
Clean performs data cleaning on item fields. Enabled options are applied in a fixed order: TrimWhitespace → StripHTML → NormalizePrice → NormalizeDate.
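To make the ordering concrete, here is a rough standard-library sketch of the four steps applied to a single string value. The helper `cleanValue`, the `cleanOpts` struct, the tag regular expression, and the list of date layouts are all assumptions made for illustration; the package's actual patterns and accepted date formats may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// cleanOpts mirrors the four documented Clean options (hypothetical local type).
type cleanOpts struct {
	TrimWhitespace, StripHTML, NormalizePrice, NormalizeDate bool
}

// tagPattern is an assumed approximation of "matching <…>".
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// dateLayouts is an assumed set of "common" date formats.
var dateLayouts = []string{"2006-01-02", "01/02/2006", "January 2, 2006", "Jan 2, 2006", time.RFC3339}

// cleanValue applies the enabled steps in the documented order.
func cleanValue(s string, o cleanOpts) any {
	if o.TrimWhitespace {
		s = strings.TrimSpace(s)
	}
	if o.StripHTML {
		s = tagPattern.ReplaceAllString(s, "")
	}
	if o.NormalizePrice {
		// Drop leading currency symbols and comma separators, then try to parse.
		p := strings.ReplaceAll(strings.TrimLeft(s, "$€£"), ",", "")
		if f, err := strconv.ParseFloat(p, 64); err == nil {
			return f
		}
	}
	if o.NormalizeDate {
		for _, layout := range dateLayouts {
			if t, err := time.Parse(layout, s); err == nil {
				return t.Format("2006-01-02")
			}
		}
	}
	return s // unparseable values are left unchanged
}

func main() {
	all := cleanOpts{true, true, true, true}
	fmt.Println(cleanValue("  <b>$1,234.56</b> ", all)) // float64 1234.56
	fmt.Println(cleanValue("March 5, 2024", all))       // "2024-03-05"
	fmt.Println(cleanValue(" n/a ", all))               // "n/a"
}
```

One consequence of the fixed order in this sketch: whitespace that only becomes leading or trailing after tags are stripped is not trimmed again, so a value such as "<b> $5 </b>" would not parse as a price.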
type FieldTransform
type FieldTransform struct {
	// Field is the source field name.
	Field string
	// RegexFind is the pattern to match (empty = skip regex).
	RegexFind string
	// RegexReplace is the replacement string (supports $1, $2, etc).
	RegexReplace string
	// RenameTo renames the field (empty = keep original name).
	RenameTo string
	// CoerceTo converts the field value: "int", "float", "bool", "string".
	CoerceTo string
}
FieldTransform defines a transformation to apply to a single item field.
type FieldTransformPipeline
type FieldTransformPipeline struct {
	// contains filtered or unexported fields
}
FieldTransformPipeline applies a list of field transformations to each item.
func NewFieldTransformPipeline
func NewFieldTransformPipeline(transforms []FieldTransform) *FieldTransformPipeline
NewFieldTransformPipeline creates a pipeline from a list of transforms.
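As an illustration of how a single transform might act on an item's fields, the sketch below applies regex replacement, then coercion, then renaming to a plain map. The local `fieldTransform` type and `applyTransform` helper are hypothetical, and the order of the three steps, the handling of missing fields, and the error behaviour are assumptions that the documentation above does not specify.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// fieldTransform mirrors the documented FieldTransform fields (hypothetical local copy).
type fieldTransform struct {
	Field, RegexFind, RegexReplace, RenameTo, CoerceTo string
}

// applyTransform rewrites one field in place.
// Assumed order: regex replace, then coercion, then rename.
func applyTransform(fields map[string]any, t fieldTransform) error {
	val, ok := fields[t.Field]
	if !ok {
		return nil // assumption: a missing source field is a no-op
	}
	if s, isStr := val.(string); isStr && t.RegexFind != "" {
		re, err := regexp.Compile(t.RegexFind)
		if err != nil {
			return err
		}
		val = re.ReplaceAllString(s, t.RegexReplace) // $1, $2 expand to capture groups
	}
	if t.CoerceTo != "" {
		str := fmt.Sprint(val)
		var err error
		switch t.CoerceTo {
		case "int":
			val, err = strconv.Atoi(str)
		case "float":
			val, err = strconv.ParseFloat(str, 64)
		case "bool":
			val, err = strconv.ParseBool(str)
		case "string":
			val = str
		default:
			err = fmt.Errorf("unknown coercion %q", t.CoerceTo)
		}
		if err != nil {
			return err // fields is left untouched on failure
		}
	}
	name := t.Field
	if t.RenameTo != "" {
		delete(fields, t.Field)
		name = t.RenameTo
	}
	fields[name] = val
	return nil
}

func main() {
	fields := map[string]any{"price_text": "USD 42 each"}
	err := applyTransform(fields, fieldTransform{
		Field:        "price_text",
		RegexFind:    `^USD (\d+).*$`,
		RegexReplace: "$1",
		RenameTo:     "price",
		CoerceTo:     "int",
	})
	fmt.Println(fields, err)
}
```

Here the string "USD 42 each" is reduced to "42" by the capture group, coerced to an int, and stored under the new name "price".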
type ItemDedup
type ItemDedup struct {
	// KeyField is the item field used as the deduplication key.
	KeyField string
	// contains filtered or unexported fields
}
ItemDedup drops duplicate items based on a key field. The first item with a given key value passes through; subsequent items with the same key are dropped (Process returns nil, nil). Items that are missing the key field entirely are also dropped.
ItemDedup is safe for concurrent use.
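One common way to get this behaviour is a mutex-guarded set of seen keys. The sketch below is written under that assumption and does not reflect ItemDedup's actual internals: `Item`, `dedup`, and `newDedup` are hypothetical names, and stringifying the key with fmt.Sprint is an illustrative choice.

```go
package main

import (
	"fmt"
	"sync"
)

// Item is a hypothetical stand-in for foxhound.Item.
type Item struct{ Fields map[string]any }

// dedup remembers which key values have already been seen.
type dedup struct {
	keyField string
	mu       sync.Mutex
	seen     map[string]struct{}
}

func newDedup(keyField string) *dedup {
	return &dedup{keyField: keyField, seen: make(map[string]struct{})}
}

// Process returns the item on first sight of its key, and (nil, nil)
// for duplicates or for items that lack the key field.
func (d *dedup) Process(item *Item) (*Item, error) {
	val, ok := item.Fields[d.keyField]
	if !ok {
		return nil, nil // missing key: dropped
	}
	key := fmt.Sprint(val) // assumption: keys are compared by their string form

	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return nil, nil // duplicate: dropped
	}
	d.seen[key] = struct{}{}
	return item, nil
}

func main() {
	d := newDedup("url")
	items := []*Item{
		{Fields: map[string]any{"url": "https://example.com/a"}},
		{Fields: map[string]any{"url": "https://example.com/a"}},
		{Fields: map[string]any{"title": "no url"}},
	}
	for _, it := range items {
		out, _ := d.Process(it)
		fmt.Println(out != nil) // true, then false, then false
	}
}
```

Holding the lock across both the lookup and the insert is what guarantees that exactly one of several concurrent items with the same key passes through.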
func NewItemDedup
NewItemDedup returns an ItemDedup that deduplicates on keyField.
type Transform
type Transform struct {
	// Fn is the user-provided transformation function.
	Fn func(item *foxhound.Item) (*foxhound.Item, error)
}
Transform applies a user-defined function to each item. The function may return a modified item, nil to drop the item, or an error. If Fn is nil, the item is returned unchanged.
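The three possible outcomes of Fn, plus the nil-Fn passthrough, can be illustrated with a local wrapper. `Item`, `transform`, and `upperTitle` below are hypothetical stand-ins, and the single-argument Process signature is assumed rather than taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is a hypothetical stand-in for foxhound.Item.
type Item struct{ Fields map[string]any }

// transform mirrors the documented Transform struct with a local Item type.
type transform struct {
	Fn func(item *Item) (*Item, error)
}

// Process delegates to Fn, or passes the item through when Fn is nil.
func (t transform) Process(item *Item) (*Item, error) {
	if t.Fn == nil {
		return item, nil
	}
	return t.Fn(item)
}

// upperTitle upper-cases the "title" field and drops items without a string title.
func upperTitle(item *Item) (*Item, error) {
	title, ok := item.Fields["title"].(string)
	if !ok {
		return nil, nil // drop
	}
	item.Fields["title"] = strings.ToUpper(title)
	return item, nil
}

func main() {
	it := &Item{Fields: map[string]any{"title": "hello"}}

	out, _ := transform{Fn: upperTitle}.Process(it)
	fmt.Println(out.Fields["title"]) // HELLO

	same, _ := transform{}.Process(it)
	fmt.Println(same == it) // true: nil Fn returns the item unchanged
}
```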
type Validate
type Validate struct {
	// Required is the list of field names that must be present and non-empty.
	Required []string
}
Validate is a pipeline stage that drops items missing required fields. A field is considered missing if it is absent from item.Fields or if its value is an empty string.
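The "missing" rule can be expressed as a short check over the field map. This sketch assumes item.Fields is a map from field name to an arbitrary value; `Item` and `validate` are hypothetical local names, not the package's API.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for foxhound.Item.
type Item struct{ Fields map[string]any }

// validate returns the item if every required field is present and is not
// an empty string; otherwise it returns nil to signal a drop.
func validate(item *Item, required []string) *Item {
	for _, name := range required {
		val, ok := item.Fields[name]
		if !ok {
			return nil // field absent
		}
		if s, isStr := val.(string); isStr && s == "" {
			return nil // field present but empty
		}
	}
	return item
}

func main() {
	required := []string{"title", "url"}

	ok := &Item{Fields: map[string]any{"title": "A", "url": "https://example.com"}}
	empty := &Item{Fields: map[string]any{"title": "A", "url": ""}}
	absent := &Item{Fields: map[string]any{"title": "A"}}

	fmt.Println(validate(ok, required) != nil)     // true
	fmt.Println(validate(empty, required) != nil)  // false
	fmt.Println(validate(absent, required) != nil) // false
}
```

Under this reading, non-string values such as 0 or false count as present, since only the empty string is named as an empty value.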