documentloaders

package
v0.33.0
Published: Mar 6, 2026 License: MIT Imports: 23 Imported by: 0

Documentation

Overview

Package documentloaders provides document loading utilities for RAG applications. It includes loaders for git repositories, RSS feeds, and other document sources with support for streaming, batch processing, and memory protection.

Constants

This section is empty.

Variables

var (
	// ErrInvalidPath is returned when the repository path is invalid.
	ErrInvalidPath = errors.New("documentloaders: invalid repository path")
	// ErrNilRegistry is returned when the parser registry is nil.
	ErrNilRegistry = errors.New("documentloaders: parser registry is nil")
	// ErrPathNotExist is returned when the path does not exist.
	ErrPathNotExist = errors.New("documentloaders: path does not exist")
	// ErrMemoryLimitExceeded is returned when memory limit is exceeded during loading.
	ErrMemoryLimitExceeded = errors.New("documentloaders: memory limit exceeded")
)

Error variables for document loading operations.
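
Sentinel errors like these are normally matched with errors.Is rather than by comparing strings. The sketch below shows that pattern with a local stand-in for ErrPathNotExist, because the module's import path is not shown on this page. The openRepo helper is hypothetical, and whether the package wraps its sentinels with extra context is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errPathNotExist is a local stand-in for documentloaders.ErrPathNotExist.
var errPathNotExist = errors.New("documentloaders: path does not exist")

// openRepo is a hypothetical helper that wraps the sentinel with the
// offending path, so callers keep both the context and the identity.
func openRepo(path string) error {
	if path == "/missing" {
		return fmt.Errorf("open %q: %w", path, errPathNotExist)
	}
	return nil
}

func main() {
	err := openRepo("/missing")
	// errors.Is sees through the %w wrapping.
	if errors.Is(err, errPathNotExist) {
		fmt.Println("repository path not found:", err)
	}
}
```

With the real package, the same check would read errors.Is(err, documentloaders.ErrPathNotExist).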

var (
	// ErrInvalidFeedURL is returned when the feed URL is invalid.
	ErrInvalidFeedURL = errors.New("documentloaders: invalid feed URL")
	// ErrNoFeedURLs is returned when no feed URLs are provided.
	ErrNoFeedURLs = errors.New("documentloaders: no feed URLs provided")
	// ErrFeedFetchFailed is returned when a feed cannot be fetched.
	ErrFeedFetchFailed = errors.New("documentloaders: failed to fetch feed")
	// ErrTimeoutExceeded is returned when the timeout is exceeded.
	ErrTimeoutExceeded = errors.New("documentloaders: timeout exceeded")
	// ErrMaxRetriesExceeded is returned when max retries are exceeded.
	ErrMaxRetriesExceeded = errors.New("documentloaders: max retries exceeded")
)

Error variables for RSS loading operations.

Functions

This section is empty.

Types

type CLICommandLoader added in v0.2.0

type CLICommandLoader struct {
	Command string
	Args    []string
}

func NewCLICommandLoader added in v0.2.0

func NewCLICommandLoader(command string, args ...string) *CLICommandLoader

func (*CLICommandLoader) Load added in v0.2.0

type FileData added in v0.16.0

type FileData struct {
	// Path is the file path relative to the repository root.
	Path string
	// Content is the file content.
	Content string
	// FileInfo contains file metadata.
	FileInfo fs.FileInfo
}

FileData is an in-memory representation of a file to be processed.

type GitLoader

type GitLoader struct {
	// contains filtered or unexported fields
}

GitLoader loads and processes documents from a git repository on the local file system. It supports batch processing, parallel file processing, and memory protection.

func NewGit

func NewGit(path string, registry parsers.ParserRegistry, opts ...GitLoaderOption) (*GitLoader, error)

NewGit creates a new GitLoader for the specified repository path. It returns an error if the path is invalid, the registry is nil, or the path does not exist.

func (*GitLoader) Load

func (g *GitLoader) Load(ctx context.Context) ([]schema.Document, error)

func (*GitLoader) LoadAndProcessStream added in v0.10.0

func (g *GitLoader) LoadAndProcessStream(ctx context.Context, processFn func(ctx context.Context, docs []schema.Document) error) error

LoadAndProcessStream loads documents in batches through a pipeline with controlled memory usage and passes each batch to processFn.

type GitLoaderOption added in v0.2.0

type GitLoaderOption func(*gitLoaderOptions)

GitLoaderOption configures a GitLoader.
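
GitLoaderOption follows the functional options pattern: each With... function returns a closure that mutates an unexported options struct. The sketch below is a self-contained illustration of that pattern, not the package's source. The struct fields and the default values (100 and 4) are assumptions made for the example.

```go
package main

import "fmt"

// gitLoaderOptions stands in for the package's unexported options struct.
// The field names are assumptions.
type gitLoaderOptions struct {
	batchSize   int
	workerCount int
	excludeDirs []string
}

// GitLoaderOption has the same shape as the documented type.
type GitLoaderOption func(*gitLoaderOptions)

func WithBatchSize(size int) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.batchSize = size }
}

func WithWorkerCount(count int) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.workerCount = count }
}

func WithExcludeDirs(dirs []string) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.excludeDirs = dirs }
}

// applyOptions starts from illustrative defaults and applies each option
// in order, so later options override earlier ones.
func applyOptions(opts ...GitLoaderOption) gitLoaderOptions {
	o := gitLoaderOptions{batchSize: 100, workerCount: 4}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := applyOptions(
		WithBatchSize(50),
		WithExcludeDirs([]string{"vendor", "node_modules"}),
	)
	fmt.Printf("%+v\n", o)
}
```

A constructor such as NewGit would typically run a loop like applyOptions over its variadic opts argument before building the loader.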

func WithBatchSize added in v0.16.0

func WithBatchSize(size int) GitLoaderOption

WithBatchSize sets the batch size for document processing.

func WithExcludeDirs added in v0.4.0

func WithExcludeDirs(dirs []string) GitLoaderOption

WithExcludeDirs sets directory names to exclude from loading.

func WithExcludeExts added in v0.4.0

func WithExcludeExts(exts []string) GitLoaderOption

WithExcludeExts sets file extensions to exclude from loading. Extensions can be provided with or without the leading dot.
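
Accepting extensions with or without the leading dot implies some normalization before comparison. A minimal sketch of one way to do that follows; normalizeExt and isExcluded are hypothetical helpers, and the case-insensitive comparison is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// normalizeExt lowercases an extension and gives it exactly one leading dot,
// so "go", ".go" and ".GO" all become ".go".
func normalizeExt(ext string) string {
	ext = strings.ToLower(strings.TrimSpace(ext))
	if ext == "" {
		return ""
	}
	return "." + strings.TrimLeft(ext, ".")
}

// isExcluded reports whether path ends in one of the given extensions.
func isExcluded(path string, exts []string) bool {
	fileExt := strings.ToLower(filepath.Ext(path))
	for _, e := range exts {
		if n := normalizeExt(e); n != "" && n == fileExt {
			return true
		}
	}
	return false
}

func main() {
	exts := []string{"png", ".lock"}
	fmt.Println(isExcluded("assets/logo.png", exts))
	fmt.Println(isExcluded("cmd/main.go", exts))
}
```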

func WithGeneratedCodeDetection added in v0.20.0

func WithGeneratedCodeDetection(enable bool) GitLoaderOption

WithGeneratedCodeDetection enables or disables detection of auto-generated code. When enabled, files detected as generated will be skipped.

func WithIncludeExts added in v0.2.0

func WithIncludeExts(exts []string) GitLoaderOption

WithIncludeExts sets file extensions to include in loading. If set, only files with these extensions will be loaded.

func WithLogger added in v0.2.0

func WithLogger(logger *slog.Logger) GitLoaderOption

WithLogger sets the logger for the loader.

func WithMaxMemoryBuffer added in v0.16.0

func WithMaxMemoryBuffer(bytes int64) GitLoaderOption

WithMaxMemoryBuffer sets the maximum memory buffer in bytes. Processing will pause when memory usage exceeds this limit.

func WithWorkerCount added in v0.16.0

func WithWorkerCount(count int) GitLoaderOption

WithWorkerCount sets the number of parallel workers for file processing.

type Loader

type Loader interface {
	// Load loads all documents from the source.
	Load(ctx context.Context) ([]schema.Document, error)
	// LoadAndProcessStream loads documents in batches and processes them.
	LoadAndProcessStream(ctx context.Context, processFn func(ctx context.Context, docs []schema.Document) error) error
}

Loader is the interface for document loaders.
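
To show how the two methods relate, here is a small in-memory type with the same method shapes as Loader. It is only a sketch: Document stands in for schema.Document (its field is an assumption), and sliceLoader and batchSizes are hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"fmt"
)

// Document is a stand-in for schema.Document.
type Document struct {
	PageContent string
}

// sliceLoader is a hypothetical loader backed by a slice.
type sliceLoader struct {
	docs      []Document
	batchSize int
}

// Load returns everything at once.
func (l *sliceLoader) Load(ctx context.Context) ([]Document, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return l.docs, nil
}

// LoadAndProcessStream hands documents to processFn one batch at a time and
// stops on the first error or when the context is cancelled.
func (l *sliceLoader) LoadAndProcessStream(ctx context.Context, processFn func(ctx context.Context, docs []Document) error) error {
	size := l.batchSize
	if size <= 0 {
		size = 1
	}
	for start := 0; start < len(l.docs); start += size {
		if err := ctx.Err(); err != nil {
			return err
		}
		end := start + size
		if end > len(l.docs) {
			end = len(l.docs)
		}
		if err := processFn(ctx, l.docs[start:end]); err != nil {
			return fmt.Errorf("process batch starting at %d: %w", start, err)
		}
	}
	return nil
}

// batchSizes streams n placeholder documents and records each batch length.
func batchSizes(n, batchSize int) ([]int, error) {
	loader := &sliceLoader{docs: make([]Document, n), batchSize: batchSize}
	var sizes []int
	err := loader.LoadAndProcessStream(context.Background(), func(ctx context.Context, docs []Document) error {
		sizes = append(sizes, len(docs))
		return nil
	})
	return sizes, err
}

func main() {
	sizes, err := batchSizes(5, 2)
	fmt.Println(sizes, err) // prints "[2 2 1] <nil>"
}
```

The streaming form keeps at most one batch in the callback at a time, which is why it suits sources too large for Load.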

type NormalizationConfig added in v0.33.0

type NormalizationConfig struct {
	// StripHTML removes all HTML tags. If false, HTML is sanitized instead.
	StripHTML bool
	// MaxContentLength is the maximum length for content (0 = unlimited).
	MaxContentLength int
	// MinContentLength is the minimum length for content (items below this are skipped).
	MinContentLength int
	// RemoveTracking removes tracking parameters from URLs (UTM, fbclid, etc.).
	RemoveTracking bool
	// NormalizeURLs removes fragments and normalizes URL structure.
	NormalizeURLs bool
	// DefaultTimezone is used for parsing dates without timezone info.
	DefaultTimezone *time.Location
	// DateFormats are custom date formats to try when parsing.
	DateFormats []string
	// MinTitleLength is the minimum title length.
	MinTitleLength int
	// FallbackToURL uses URL path as title if title is too short.
	FallbackToURL bool
	// NormalizeAuthors cleans author names (removes emails, quotes).
	NormalizeAuthors bool
	// DeduplicationField specifies which field to use for deduplication ("guid" or "link").
	DeduplicationField string
}

NormalizationConfig configures how RSS content is normalized and cleaned.

type RSSFeedData added in v0.33.0

type RSSFeedData struct {
	URL      string
	Feed     *gofeed.Feed
	Metadata map[string]any
	Error    error
}

RSSFeedData represents a fetched RSS feed with its metadata.

type RSSLoader added in v0.33.0

type RSSLoader struct {
	// contains filtered or unexported fields
}

RSSLoader loads and processes documents from RSS/Atom feeds. It supports batch processing, parallel feed fetching, and content normalization.

func NewRSS added in v0.33.0

func NewRSS(feedURLs []string, registry parsers.ParserRegistry, opts ...RSSLoaderOption) (*RSSLoader, error)

NewRSS creates a new RSSLoader for the specified feed URLs. Returns an error if no URLs are provided or the registry is nil.

func (*RSSLoader) Load added in v0.33.0

func (r *RSSLoader) Load(ctx context.Context) ([]schema.Document, error)

Load fetches all feeds and returns all documents. Warning: This method loads all documents into memory. For large feeds, use LoadAndProcessStream instead for better memory efficiency.

func (*RSSLoader) LoadAndProcessStream added in v0.33.0

func (r *RSSLoader) LoadAndProcessStream(ctx context.Context, processFn func(ctx context.Context, docs []schema.Document) error) error

type RSSLoaderOption added in v0.33.0

type RSSLoaderOption func(*rssLoaderOptions)

RSSLoaderOption configures an RSSLoader.

func WithHTMLParser added in v0.33.0

func WithHTMLParser(htmlParser schema.ParserPlugin) RSSLoaderOption

WithHTMLParser sets an HTML parser for transforming HTML content to Markdown. When set, the HTML parser will:

  • Remove boilerplate (nav, footer, scripts)
  • Normalize links (relative → absolute)
  • Extract metadata (author, date, title)
  • Convert to clean Markdown

This is useful for RSS feeds with rich HTML content that needs to be optimized for LLM consumption before being stored in vector databases.

func WithRSSBatchSize added in v0.33.0

func WithRSSBatchSize(size int) RSSLoaderOption

WithRSSBatchSize sets the batch size for document processing.

func WithRSSHTTPClient added in v0.33.0

func WithRSSHTTPClient(client *http.Client) RSSLoaderOption

WithRSSHTTPClient sets a custom HTTP client for feed requests.

func WithRSSLogger added in v0.33.0

func WithRSSLogger(logger *slog.Logger) RSSLoaderOption

WithRSSLogger sets a custom logger for the loader.

func WithRSSMaxItems added in v0.33.0

func WithRSSMaxItems(count int) RSSLoaderOption

WithRSSMaxItems sets the maximum number of items to fetch per feed.

func WithRSSMaxRetries added in v0.33.0

func WithRSSMaxRetries(retries int) RSSLoaderOption

WithRSSMaxRetries sets the number of retry attempts for failed requests.

func WithRSSNormalization added in v0.33.0

func WithRSSNormalization(config NormalizationConfig) RSSLoaderOption

WithRSSNormalization sets the content normalization configuration.

func WithRSSRateLimit added in v0.33.0

func WithRSSRateLimit(requestsPerSecond int) RSSLoaderOption

WithRSSRateLimit sets the maximum number of requests per second. This helps prevent overwhelming RSS servers with too many concurrent requests. Default is 10 requests per second.

func WithRSSSeenItems added in v0.33.0

func WithRSSSeenItems(seen map[string]bool) RSSLoaderOption

WithRSSSeenItems provides a pre-populated map of seen item GUIDs for deduplication.

func WithRSSSkipDuplicates added in v0.33.0

func WithRSSSkipDuplicates(skip bool) RSSLoaderOption

WithRSSSkipDuplicates enables deduplication of feed items by GUID/link.
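
Deduplication with a seen map can be sketched as below. The feedItem type and both helpers are hypothetical. Preferring the GUID and falling back to the link when the GUID is empty is an assumption about how "GUID/link" is resolved, as is keeping items that have neither.

```go
package main

import "fmt"

// feedItem is a hypothetical, trimmed-down feed entry.
type feedItem struct {
	GUID  string
	Link  string
	Title string
}

// dedupKey prefers the GUID and falls back to the link.
func dedupKey(it feedItem) string {
	if it.GUID != "" {
		return it.GUID
	}
	return it.Link
}

// skipDuplicates drops items whose key is already in seen and records the
// keys of the items it keeps. Items with no key at all are kept.
func skipDuplicates(items []feedItem, seen map[string]bool) []feedItem {
	if seen == nil {
		seen = make(map[string]bool)
	}
	out := make([]feedItem, 0, len(items))
	for _, it := range items {
		key := dedupKey(it)
		if key != "" {
			if seen[key] {
				continue
			}
			seen[key] = true
		}
		out = append(out, it)
	}
	return out
}

func main() {
	// A pre-populated map, similar in spirit to what WithRSSSeenItems accepts.
	seen := map[string]bool{"guid-1": true}
	items := []feedItem{
		{GUID: "guid-1", Title: "already stored"},
		{GUID: "guid-2", Title: "new"},
		{GUID: "guid-2", Title: "repeat within the same fetch"},
		{Link: "https://example.com/post-3", Title: "no GUID"},
	}
	for _, it := range skipDuplicates(items, seen) {
		fmt.Println(it.Title)
	}
}
```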

func WithRSSTimeout added in v0.33.0

func WithRSSTimeout(timeout time.Duration) RSSLoaderOption

WithRSSTimeout sets the HTTP timeout for feed requests.

func WithRSSUserAgent added in v0.33.0

func WithRSSUserAgent(userAgent string) RSSLoaderOption

WithRSSUserAgent sets a custom User-Agent header for HTTP requests.

func WithRSSWorkerCount added in v0.33.0

func WithRSSWorkerCount(count int) RSSLoaderOption

WithRSSWorkerCount sets the number of parallel workers for feed fetching.

type RSSNormalizer added in v0.33.0

type RSSNormalizer struct {
	// contains filtered or unexported fields
}

RSSNormalizer handles content normalization and sanitization for RSS feeds. It provides HTML sanitization, URL cleaning, date parsing, and metadata normalization.

func NewRSSNormalizer added in v0.33.0

func NewRSSNormalizer(config NormalizationConfig) *RSSNormalizer

NewRSSNormalizer creates a new RSS normalizer with the given configuration. If config values are zero, sensible defaults are applied.

func (*RSSNormalizer) NormalizeAuthor added in v0.33.0

func (n *RSSNormalizer) NormalizeAuthor(author string) string

NormalizeAuthor cleans author names by removing email addresses and quotes. Example: "John Doe <john@example.com>" becomes "John Doe".
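
The documented example can be reproduced with a few regular expressions. This is a rough sketch of the idea, not the package's implementation: normalizeAuthor and both patterns are assumptions, and the handling of the "email (Name)" form is an extra guess at a common feed convention.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// angleEmail matches an address in angle brackets, e.g. "<john@example.com>".
	angleEmail = regexp.MustCompile(`<[^<>]*>`)
	// bareEmail is a deliberately loose match for an unbracketed address.
	bareEmail = regexp.MustCompile(`[^\s<>()"]+@[^\s<>()"]+`)
)

// normalizeAuthor removes email addresses and double quotes, collapses
// whitespace, and trims parentheses left around the remaining name.
func normalizeAuthor(author string) string {
	author = angleEmail.ReplaceAllString(author, "")
	author = bareEmail.ReplaceAllString(author, "")
	author = strings.ReplaceAll(author, `"`, "")
	author = strings.Join(strings.Fields(author), " ")
	return strings.TrimSpace(strings.Trim(author, "()"))
}

func main() {
	fmt.Println(normalizeAuthor("John Doe <john@example.com>"))
	fmt.Println(normalizeAuthor(`"Jane Roe" <jane@example.com>`))
	fmt.Println(normalizeAuthor("john@example.com (John Doe)"))
}
```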

func (*RSSNormalizer) NormalizeCategories added in v0.33.0

func (n *RSSNormalizer) NormalizeCategories(categories []string) []string

NormalizeCategories normalizes and deduplicates category strings. Converts to lowercase and removes empty categories.
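
The behavior described above (lowercase, drop empties, deduplicate) fits in a short loop. The sketch below is an illustration under two assumptions that the page does not confirm: surrounding whitespace is trimmed, and first-seen order is preserved.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCategories lowercases and trims each category, skips empty
// entries, and keeps only the first occurrence of each value.
func normalizeCategories(categories []string) []string {
	seen := make(map[string]struct{}, len(categories))
	out := make([]string, 0, len(categories))
	for _, c := range categories {
		c = strings.ToLower(strings.TrimSpace(c))
		if c == "" {
			continue
		}
		if _, dup := seen[c]; dup {
			continue
		}
		seen[c] = struct{}{}
		out = append(out, c)
	}
	return out
}

func main() {
	fmt.Println(normalizeCategories([]string{"Go", " go ", "", "RAG"})) // prints "[go rag]"
}
```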

func (*RSSNormalizer) NormalizeContent added in v0.33.0

func (n *RSSNormalizer) NormalizeContent(content string) string

NormalizeContent normalizes content by sanitizing/stripping HTML and truncating. It removes excess whitespace and enforces length limits.

func (*RSSNormalizer) NormalizeTitle added in v0.33.0

func (n *RSSNormalizer) NormalizeTitle(title string, fallbackURL string) string

NormalizeTitle normalizes the title or generates one from the URL if too short. Extracts the last path segment as a fallback title.
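
A fallback of this kind can be sketched with net/url and path. The explicit minLen parameter below replaces the MinTitleLength setting that the real method reads from its configuration. Counting length in runes and returning the path segment unmodified are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// normalizeTitle collapses whitespace in title. If the result is shorter
// than minLen runes, it falls back to the last path segment of fallbackURL.
func normalizeTitle(title, fallbackURL string, minLen int) string {
	title = strings.Join(strings.Fields(title), " ")
	if len([]rune(title)) >= minLen {
		return title
	}
	u, err := url.Parse(fallbackURL)
	if err != nil {
		return title
	}
	seg := path.Base(u.Path)
	// path.Base returns "." for an empty path and "/" for the root.
	if seg == "." || seg == "/" {
		return title
	}
	return seg
}

func main() {
	fmt.Println(normalizeTitle("Hi", "https://example.com/posts/hello-world", 5))
	fmt.Println(normalizeTitle("  A   proper title ", "https://example.com/posts/x", 5))
}
```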

func (*RSSNormalizer) NormalizeURL added in v0.33.0

func (n *RSSNormalizer) NormalizeURL(rawURL string) string

NormalizeURL cleans a URL by removing tracking parameters and fragments. It can remove UTM parameters, fbclid, gclid, and other tracking identifiers.
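
Removing tracking parameters and fragments needs only the standard net/url package. The sketch below is one possible approach, not the package's code. The exact parameter list (utm_*, fbclid, gclid) comes from the description above, while the case-insensitive match and returning the input unchanged on a parse error are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL drops utm_*, fbclid and gclid query parameters and the
// fragment. Remaining parameters are re-encoded in sorted key order.
func normalizeURL(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return rawURL
	}
	q := u.Query()
	for key := range q {
		k := strings.ToLower(key)
		if strings.HasPrefix(k, "utm_") || k == "fbclid" || k == "gclid" {
			q.Del(key)
		}
	}
	u.RawQuery = q.Encode()
	u.Fragment = ""
	u.RawFragment = ""
	return u.String()
}

func main() {
	fmt.Println(normalizeURL("https://example.com/post?id=7&utm_source=rss&fbclid=abc#comments"))
	// prints "https://example.com/post?id=7"
}
```

Stripping these parameters makes links to the same article compare equal, which matters when the link is used as the deduplication field.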

func (*RSSNormalizer) ParseDate added in v0.33.0

func (n *RSSNormalizer) ParseDate(dateStr string) time.Time

ParseDate attempts to parse a date string using multiple common formats. Supports RFC1123, RFC1123Z, RFC822, RFC822Z, RFC3339, ISO8601, and custom formats. Returns a zero time if parsing fails.
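
Trying several layouts in turn is a common way to cope with inconsistent feed dates. A minimal sketch using the layout constants from the standard time package is shown below. The order of the layouts and the variadic customLayouts parameter are assumptions, since the real method takes its custom formats from NormalizationConfig.DateFormats.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// defaultLayouts lists the standard layouts tried before any custom ones.
var defaultLayouts = []string{
	time.RFC1123Z,
	time.RFC1123,
	time.RFC822Z,
	time.RFC822,
	time.RFC3339,
}

// parseDate returns the first successful parse, or the zero time if no
// layout matches.
func parseDate(dateStr string, customLayouts ...string) time.Time {
	dateStr = strings.TrimSpace(dateStr)
	layouts := append(append([]string{}, defaultLayouts...), customLayouts...)
	for _, layout := range layouts {
		if t, err := time.Parse(layout, dateStr); err == nil {
			return t
		}
	}
	return time.Time{}
}

func main() {
	fmt.Println(parseDate("Mon, 02 Jan 2006 15:04:05 -0700").Year())
	fmt.Println(parseDate("06/03/2026", "02/01/2006").Month())
	fmt.Println(parseDate("not a date").IsZero())
}
```

Because failure is signalled by the zero time rather than an error, callers should check IsZero before using the result.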

func (*RSSNormalizer) ResolveURL added in v0.33.0

func (n *RSSNormalizer) ResolveURL(baseURL, relativeURL string) string

ResolveURL resolves a relative URL against a base URL. If the URL is already absolute, it's returned as-is.
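
The standard library already implements reference resolution, so a sketch of this behavior is short. The resolveURL function below is a hypothetical equivalent built on url.ResolveReference. Returning the input unchanged when either URL fails to parse is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves relativeURL against baseURL. Absolute URLs are
// returned as-is.
func resolveURL(baseURL, relativeURL string) string {
	ref, err := url.Parse(relativeURL)
	if err != nil || ref.IsAbs() {
		return relativeURL
	}
	base, err := url.Parse(baseURL)
	if err != nil {
		return relativeURL
	}
	return base.ResolveReference(ref).String()
}

func main() {
	base := "https://example.com/blog/feed.xml"
	fmt.Println(resolveURL(base, "/posts/1"))
	fmt.Println(resolveURL(base, "posts/1"))
	fmt.Println(resolveURL(base, "https://other.example/x"))
}
```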

func (*RSSNormalizer) SanitizeHTML added in v0.33.0

func (n *RSSNormalizer) SanitizeHTML(html string) string

SanitizeHTML sanitizes HTML content using a safe whitelist policy. It removes dangerous elements (script, iframe) and attributes (onclick, etc.). Adds rel="nofollow noopener" and target="_blank" to external links.

func (*RSSNormalizer) ShouldSkipItem added in v0.33.0

func (n *RSSNormalizer) ShouldSkipItem(title, content string) bool

ShouldSkipItem determines if an item should be skipped based on content quality. Returns true if title or content is too short.

func (*RSSNormalizer) StripHTMLTags added in v0.33.0

func (n *RSSNormalizer) StripHTMLTags(html string) string

StripHTMLTags removes all HTML tags, returning plain text.

type RemoteGitRepoLoader added in v0.2.0

type RemoteGitRepoLoader struct {
	RepoURL        string
	ParserRegistry parsers.ParserRegistry
	Logger         *slog.Logger
}

func NewRemoteGitRepoLoader added in v0.2.0

func NewRemoteGitRepoLoader(repoURL string, registry parsers.ParserRegistry, logger *slog.Logger) *RemoteGitRepoLoader

func (*RemoteGitRepoLoader) Load added in v0.2.0
