pipeline

package
v0.0.23 Latest
Published: May 5, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package pipeline provides composable data processing stages and export writers for the Foxhound scraping framework.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chain

type Chain struct {
	// contains filtered or unexported fields
}

Chain combines multiple Pipeline stages into a single Pipeline. Stages are executed in the order they were provided to NewChain. If a stage returns a nil item (a drop), the item is dropped and the remaining stages are skipped. If a stage returns an error, processing stops and the error is returned immediately.

func NewChain

func NewChain(stages ...foxhound.Pipeline) *Chain

NewChain returns a Chain that runs the given stages in order.

func (*Chain) Process

func (c *Chain) Process(ctx context.Context, item *foxhound.Item) (*foxhound.Item, error)

Process runs item through each stage in order. It returns a nil item if any stage drops the item, or an error if any stage fails.
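The drop and error rules above can be sketched in a standalone program. This is only an illustration of the documented semantics, not the package's implementation: Item, Pipeline, StageFunc and runChain are local stand-ins invented for the example, and the real foxhound.Item and foxhound.Pipeline definitions may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Item and Pipeline are local stand-ins for foxhound.Item and
// foxhound.Pipeline (assumed shapes, declared so the sketch compiles alone).
type Item struct {
	Fields map[string]any
}

type Pipeline interface {
	Process(ctx context.Context, item *Item) (*Item, error)
}

// StageFunc is a hypothetical adapter that lets a function act as a stage.
type StageFunc func(ctx context.Context, item *Item) (*Item, error)

func (f StageFunc) Process(ctx context.Context, item *Item) (*Item, error) {
	return f(ctx, item)
}

// runChain mirrors the documented Chain behavior: stages run in order,
// a nil item ends processing as a drop, and an error ends it as a failure.
func runChain(ctx context.Context, item *Item, stages ...Pipeline) (*Item, error) {
	for _, s := range stages {
		out, err := s.Process(ctx, item)
		if err != nil {
			return nil, err
		}
		if out == nil {
			return nil, nil // dropped: later stages are skipped
		}
		item = out
	}
	return item, nil
}

var errBoom = errors.New("boom")

// demoChain runs a passing, a dropping and a failing chain, and counts how
// often the final stage was reached.
func demoChain() (kept, dropped, failed bool, tailCalls int) {
	ctx := context.Background()
	pass := StageFunc(func(_ context.Context, it *Item) (*Item, error) { return it, nil })
	drop := StageFunc(func(_ context.Context, _ *Item) (*Item, error) { return nil, nil })
	fail := StageFunc(func(_ context.Context, _ *Item) (*Item, error) { return nil, errBoom })
	tail := StageFunc(func(_ context.Context, it *Item) (*Item, error) {
		tailCalls++
		return it, nil
	})
	item := &Item{Fields: map[string]any{"title": "x"}}

	out, err := runChain(ctx, item, pass, tail)
	kept = out != nil && err == nil
	out, err = runChain(ctx, item, drop, tail)
	dropped = out == nil && err == nil
	out, err = runChain(ctx, item, fail, tail)
	failed = out == nil && errors.Is(err, errBoom)
	return
}

func main() {
	kept, dropped, failed, tailCalls := demoChain()
	fmt.Println(kept, dropped, failed, tailCalls)
}
```

In the sketch the final stage only runs for the chain whose earlier stage passes the item through, which is the short-circuit behavior the description implies.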

type Clean

type Clean struct {
	// TrimWhitespace calls strings.TrimSpace on every string field value.
	TrimWhitespace bool
	// StripHTML removes HTML tags (matching <…>) from every string field value.
	StripHTML bool
	// NormalizePrice converts currency strings like "$1,234.56" to float64.
	// The $, €, £ currency prefixes and comma separators are removed before parsing.
	// Non-parseable strings are left unchanged.
	NormalizePrice bool
	// NormalizeDate parses common date strings and rewrites them as "2006-01-02".
	// Unrecognised strings are left unchanged.
	NormalizeDate bool
}

Clean performs data cleaning on item fields. The enabled options are applied in a fixed order: TrimWhitespace → StripHTML → NormalizePrice → NormalizeDate.

func (*Clean) Process

func (c *Clean) Process(_ context.Context, item *foxhound.Item) (*foxhound.Item, error)

Process applies the enabled cleaning operations to each string field in item. It returns the modified item; it never drops an item or returns an error.
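The cleaning order can be illustrated with a small standalone sketch. Everything here is an assumption made for the example: cleanOptions is a local mirror of the four documented flags, the tag pattern and the list of date layouts are guesses (the documentation only says "common date strings"), and the real Clean stage may treat edge cases differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// cleanOptions mirrors the four documented Clean flags (local stand-in).
type cleanOptions struct {
	TrimWhitespace bool
	StripHTML      bool
	NormalizePrice bool
	NormalizeDate  bool
}

// tagPattern is an assumed approximation of "matching <…>".
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// dateLayouts is an assumed set of input layouts, not the package's list.
var dateLayouts = []string{"2006-01-02", "01/02/2006", "Jan 2, 2006", "January 2, 2006"}

// cleanValue applies the enabled steps to one string in the documented order.
func cleanValue(s string, o cleanOptions) any {
	if o.TrimWhitespace {
		s = strings.TrimSpace(s)
	}
	if o.StripHTML {
		s = tagPattern.ReplaceAllString(s, "")
	}
	if o.NormalizePrice {
		num := strings.ReplaceAll(strings.TrimLeft(s, "$€£"), ",", "")
		if f, err := strconv.ParseFloat(num, 64); err == nil {
			return f // now a float64, so the date step no longer applies
		}
	}
	if o.NormalizeDate {
		for _, layout := range dateLayouts {
			if t, err := time.Parse(layout, s); err == nil {
				return t.Format("2006-01-02")
			}
		}
	}
	return s // unparseable strings are left as they are
}

// cleanFields rewrites string values in place and skips non-string values.
func cleanFields(fields map[string]any, o cleanOptions) {
	for k, v := range fields {
		if s, ok := v.(string); ok {
			fields[k] = cleanValue(s, o)
		}
	}
}

// demoClean cleans a sample field map with every option enabled.
func demoClean() map[string]any {
	fields := map[string]any{
		"price": "  <b>$1,234.56</b> ",
		"date":  "March 5, 2024",
		"note":  " <i>n/a</i> ",
		"count": 3,
	}
	cleanFields(fields, cleanOptions{
		TrimWhitespace: true,
		StripHTML:      true,
		NormalizePrice: true,
		NormalizeDate:  true,
	})
	return fields
}

func main() {
	fmt.Println(demoClean())
}
```

Because trimming happens before tag stripping, whitespace that sits inside a tag pair would survive in this sketch and keep the price step from parsing the value.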

type FieldTransform

type FieldTransform struct {
	// Field is the source field name.
	Field string
	// RegexFind is the pattern to match (empty = skip regex).
	RegexFind string
	// RegexReplace is the replacement string (supports $1, $2, etc).
	RegexReplace string
	// RenameTo renames the field (empty = keep original name).
	RenameTo string
	// CoerceTo converts the field value: "int", "float", "bool", "string".
	CoerceTo string
}

FieldTransform defines a transformation to apply to a single item field.

type FieldTransformPipeline

type FieldTransformPipeline struct {
	// contains filtered or unexported fields
}

FieldTransformPipeline applies a list of field transformations to each item.

func NewFieldTransformPipeline

func NewFieldTransformPipeline(transforms []FieldTransform) *FieldTransformPipeline

NewFieldTransformPipeline creates a pipeline from a list of transforms.

func (*FieldTransformPipeline) Process

Process applies all transforms to the item.
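As a rough illustration of how the FieldTransform options might combine, the sketch below redeclares the struct locally and applies a list of transforms to a plain field map. The order used (regex, then coercion, then rename), the handling of missing fields, and the helper names coerce and applyTransform are assumptions for the example; the documentation above does not specify them.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// FieldTransform is redeclared here with the documented fields so the
// sketch compiles without the real package.
type FieldTransform struct {
	Field        string
	RegexFind    string
	RegexReplace string
	RenameTo     string
	CoerceTo     string
}

// coerce converts s to the named type (hypothetical helper).
func coerce(s, kind string) (any, error) {
	switch kind {
	case "int":
		return strconv.Atoi(s)
	case "float":
		return strconv.ParseFloat(s, 64)
	case "bool":
		return strconv.ParseBool(s)
	case "string":
		return s, nil
	}
	return nil, fmt.Errorf("unknown CoerceTo %q", kind)
}

// applyTransform applies one transform in an assumed order:
// regex replace, then coercion, then rename.
func applyTransform(fields map[string]any, t FieldTransform) error {
	v, ok := fields[t.Field]
	if !ok {
		return nil // assumption: a missing source field is skipped
	}
	if t.RegexFind != "" {
		re, err := regexp.Compile(t.RegexFind)
		if err != nil {
			return err
		}
		v = re.ReplaceAllString(fmt.Sprint(v), t.RegexReplace)
	}
	if t.CoerceTo != "" {
		c, err := coerce(fmt.Sprint(v), t.CoerceTo)
		if err != nil {
			return err
		}
		v = c
	}
	if t.RenameTo != "" && t.RenameTo != t.Field {
		delete(fields, t.Field)
		fields[t.RenameTo] = v
		return nil
	}
	fields[t.Field] = v
	return nil
}

// demoTransforms runs three transforms over a sample field map.
func demoTransforms() (map[string]any, error) {
	fields := map[string]any{"price": "$19.99", "stock": "12", "sku": "ab-1"}
	transforms := []FieldTransform{
		{Field: "price", RegexFind: `^\$(\d+\.\d+)$`, RegexReplace: "$1", CoerceTo: "float", RenameTo: "price_usd"},
		{Field: "stock", CoerceTo: "int"},
		{Field: "sku", RenameTo: "id"},
	}
	for _, t := range transforms {
		if err := applyTransform(fields, t); err != nil {
			return nil, err
		}
	}
	return fields, nil
}

func main() {
	fmt.Println(demoTransforms())
}
```

With the real package, a similar slice of transforms would presumably be passed to NewFieldTransformPipeline instead of being applied by hand.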

type ItemDedup

type ItemDedup struct {
	// KeyField is the item field used as the deduplication key.
	KeyField string
	// contains filtered or unexported fields
}

ItemDedup drops duplicate items based on a key field. The first item with a given key value passes through; subsequent items with the same key are dropped (Process returns nil, nil). Items that are missing the key field entirely are also dropped.

ItemDedup is safe for concurrent use.

func NewItemDedup

func NewItemDedup(keyField string) *ItemDedup

NewItemDedup returns an ItemDedup that deduplicates on keyField.

func (*ItemDedup) Process

func (d *ItemDedup) Process(_ context.Context, item *foxhound.Item) (*foxhound.Item, error)
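The first-wins rule and the concurrency guarantee can be sketched with a mutex-guarded set. This is a minimal illustration under stated assumptions, not the package's code: Item and dedup are local stand-ins, and comparing keys by their string form is a choice made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Item is a local stand-in for foxhound.Item (assumed shape).
type Item struct {
	Fields map[string]any
}

// dedup is a hypothetical stand-in for ItemDedup.
type dedup struct {
	keyField string
	mu       sync.Mutex
	seen     map[string]struct{}
}

func newDedup(keyField string) *dedup {
	return &dedup{keyField: keyField, seen: make(map[string]struct{})}
}

// process returns the item the first time its key is seen and nil for
// duplicates or for items that lack the key field.
func (d *dedup) process(item *Item) *Item {
	v, ok := item.Fields[d.keyField]
	if !ok {
		return nil
	}
	key := fmt.Sprint(v) // assumption: keys are compared by string form
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return nil
	}
	d.seen[key] = struct{}{}
	return item
}

// demoDedup feeds items sequentially, then 50 identical items concurrently.
func demoDedup() (sequentialKept, concurrentKept int) {
	d := newDedup("url")
	inputs := []*Item{
		{Fields: map[string]any{"url": "/a"}},
		{Fields: map[string]any{"url": "/b"}},
		{Fields: map[string]any{"url": "/a"}},       // duplicate key
		{Fields: map[string]any{"title": "no key"}}, // key field missing
	}
	for _, it := range inputs {
		if d.process(it) != nil {
			sequentialKept++
		}
	}

	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if d.process(&Item{Fields: map[string]any{"url": "/c"}}) != nil {
				mu.Lock()
				concurrentKept++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return
}

func main() {
	fmt.Println(demoDedup())
}
```

Holding the lock across both the lookup and the insert is what guarantees that only one of the concurrent callers sees the key as new.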

type Transform

type Transform struct {
	// Fn is the user-provided transformation function.
	Fn func(item *foxhound.Item) (*foxhound.Item, error)
}

Transform applies a user-defined function to each item. The function may return a modified item, nil to drop the item, or an error. If Fn is nil, the item is returned unchanged.

func (*Transform) Process

func (t *Transform) Process(_ context.Context, item *foxhound.Item) (*foxhound.Item, error)

Process calls t.Fn with the item and returns its result. If Fn is nil, the item is returned as-is.
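The three possible outcomes of Fn, plus the nil-Fn case, can be sketched as follows. Item and Transform are redeclared locally so the example compiles on its own; upperTitle and the "draft" and "title" field names are hypothetical and exist only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Item is a local stand-in for foxhound.Item (assumed shape).
type Item struct {
	Fields map[string]any
}

// Transform mirrors the documented struct using the stand-in Item.
type Transform struct {
	Fn func(item *Item) (*Item, error)
}

// Process follows the documented rule: a nil Fn passes the item through.
func (t *Transform) Process(_ context.Context, item *Item) (*Item, error) {
	if t.Fn == nil {
		return item, nil
	}
	return t.Fn(item)
}

var errNoTitle = errors.New("title is not a string")

// upperTitle is a hypothetical Fn: it drops drafts, fails on a missing or
// non-string title, and upper-cases the title otherwise.
func upperTitle(item *Item) (*Item, error) {
	if draft, _ := item.Fields["draft"].(bool); draft {
		return nil, nil
	}
	title, ok := item.Fields["title"].(string)
	if !ok {
		return nil, errNoTitle
	}
	item.Fields["title"] = strings.ToUpper(title)
	return item, nil
}

// demoTransform exercises the modify, drop, error and nil-Fn paths.
func demoTransform() (title string, dropped, failed, passthrough bool) {
	ctx := context.Background()
	t := &Transform{Fn: upperTitle}

	out, _ := t.Process(ctx, &Item{Fields: map[string]any{"title": "hello"}})
	if out != nil {
		title, _ = out.Fields["title"].(string)
	}
	out, err := t.Process(ctx, &Item{Fields: map[string]any{"title": "x", "draft": true}})
	dropped = out == nil && err == nil
	_, err = t.Process(ctx, &Item{Fields: map[string]any{"title": 42}})
	failed = errors.Is(err, errNoTitle)

	in := &Item{Fields: map[string]any{}}
	out, err = (&Transform{}).Process(ctx, in)
	passthrough = out == in && err == nil
	return
}

func main() {
	fmt.Println(demoTransform())
}
```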

type Validate

type Validate struct {
	// Required is the list of field names that must be present and non-empty.
	Required []string
}

Validate is a pipeline stage that drops items missing required fields. A field is considered missing if it is absent from item.Fields or if its value is an empty string.

func (*Validate) Process

func (v *Validate) Process(_ context.Context, item *foxhound.Item) (*foxhound.Item, error)

Process returns nil (dropping the item) if any required field is absent or has an empty string value. Otherwise it returns the item unchanged.
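The "absent or empty string" rule can be sketched as below. Item and Validate are local stand-ins that follow the documented description; treating a non-string zero value such as 0 as present is an inference from the wording, not something the documentation states outright.

```go
package main

import (
	"context"
	"fmt"
)

// Item is a local stand-in for foxhound.Item (assumed shape).
type Item struct {
	Fields map[string]any
}

// Validate mirrors the documented struct using the stand-in Item.
type Validate struct {
	Required []string
}

// Process drops the item when a required field is absent or holds an
// empty string; any other value counts as present in this sketch.
func (v *Validate) Process(_ context.Context, item *Item) (*Item, error) {
	for _, name := range v.Required {
		val, ok := item.Fields[name]
		if !ok {
			return nil, nil
		}
		if s, isStr := val.(string); isStr && s == "" {
			return nil, nil
		}
	}
	return item, nil
}

// demoValidate reports which of four sample items survive validation.
func demoValidate() []bool {
	v := &Validate{Required: []string{"title", "price"}}
	items := []*Item{
		{Fields: map[string]any{"title": "Widget", "price": "9.99"}}, // complete
		{Fields: map[string]any{"title": "Widget"}},                  // price absent
		{Fields: map[string]any{"title": "", "price": "9.99"}},       // empty title
		{Fields: map[string]any{"title": "Widget", "price": 0}},      // non-string zero
	}
	kept := make([]bool, len(items))
	for i, it := range items {
		out, _ := v.Process(context.Background(), it)
		kept[i] = out != nil
	}
	return kept
}

func main() {
	fmt.Println(demoValidate())
}
```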

Directories

Path Synopsis
Package export provides Writer implementations for exporting scraped items to various formats and destinations.
