Documentation ¶
Overview ¶
Package documentloaders provides document loading utilities for RAG applications. It includes loaders for git repositories, RSS feeds, and other document sources with support for streaming, batch processing, and memory protection.
Index ¶
- Variables
- type CLICommandLoader
- type FileData
- type GitLoader
- type GitLoaderOption
- func WithBatchSize(size int) GitLoaderOption
- func WithExcludeDirs(dirs []string) GitLoaderOption
- func WithExcludeExts(exts []string) GitLoaderOption
- func WithGeneratedCodeDetection(enable bool) GitLoaderOption
- func WithIncludeExts(exts []string) GitLoaderOption
- func WithLogger(logger *slog.Logger) GitLoaderOption
- func WithMaxMemoryBuffer(bytes int64) GitLoaderOption
- func WithWorkerCount(count int) GitLoaderOption
- type Loader
- type NormalizationConfig
- type RSSFeedData
- type RSSLoader
- type RSSLoaderOption
- func WithHTMLParser(htmlParser schema.ParserPlugin) RSSLoaderOption
- func WithRSSBatchSize(size int) RSSLoaderOption
- func WithRSSHTTPClient(client *http.Client) RSSLoaderOption
- func WithRSSLogger(logger *slog.Logger) RSSLoaderOption
- func WithRSSMaxItems(count int) RSSLoaderOption
- func WithRSSMaxRetries(retries int) RSSLoaderOption
- func WithRSSNormalization(config NormalizationConfig) RSSLoaderOption
- func WithRSSRateLimit(requestsPerSecond int) RSSLoaderOption
- func WithRSSSeenItems(seen map[string]bool) RSSLoaderOption
- func WithRSSSkipDuplicates(skip bool) RSSLoaderOption
- func WithRSSTimeout(timeout time.Duration) RSSLoaderOption
- func WithRSSUserAgent(userAgent string) RSSLoaderOption
- func WithRSSWorkerCount(count int) RSSLoaderOption
- type RSSNormalizer
- func (n *RSSNormalizer) NormalizeAuthor(author string) string
- func (n *RSSNormalizer) NormalizeCategories(categories []string) []string
- func (n *RSSNormalizer) NormalizeContent(content string) string
- func (n *RSSNormalizer) NormalizeTitle(title string, fallbackURL string) string
- func (n *RSSNormalizer) NormalizeURL(rawURL string) string
- func (n *RSSNormalizer) ParseDate(dateStr string) time.Time
- func (n *RSSNormalizer) ResolveURL(baseURL, relativeURL string) string
- func (n *RSSNormalizer) SanitizeHTML(html string) string
- func (n *RSSNormalizer) ShouldSkipItem(title, content string) bool
- func (n *RSSNormalizer) StripHTMLTags(html string) string
- type RemoteGitRepoLoader
Constants ¶
This section is empty.
Variables ¶
var (
// ErrInvalidPath is returned when the repository path is invalid.
ErrInvalidPath = errors.New("documentloaders: invalid repository path")
// ErrNilRegistry is returned when the parser registry is nil.
ErrNilRegistry = errors.New("documentloaders: parser registry is nil")
// ErrPathNotExist is returned when the path does not exist.
ErrPathNotExist = errors.New("documentloaders: path does not exist")
// ErrMemoryLimitExceeded is returned when memory limit is exceeded during loading.
ErrMemoryLimitExceeded = errors.New("documentloaders: memory limit exceeded")
)
Error variables for document loading operations.
var (
// ErrInvalidFeedURL is returned when the feed URL is invalid.
ErrInvalidFeedURL = errors.New("documentloaders: invalid feed URL")
// ErrNoFeedURLs is returned when no feed URLs are provided.
ErrNoFeedURLs = errors.New("documentloaders: no feed URLs provided")
// ErrFeedFetchFailed is returned when a feed cannot be fetched.
ErrFeedFetchFailed = errors.New("documentloaders: failed to fetch feed")
// ErrTimeoutExceeded is returned when the timeout is exceeded.
ErrTimeoutExceeded = errors.New("documentloaders: timeout exceeded")
// ErrMaxRetriesExceeded is returned when max retries are exceeded.
ErrMaxRetriesExceeded = errors.New("documentloaders: max retries exceeded")
)
Error variables for RSS loading operations.
Functions ¶
This section is empty.
Types ¶
type CLICommandLoader ¶ added in v0.2.0
func NewCLICommandLoader ¶ added in v0.2.0
func NewCLICommandLoader(command string, args ...string) *CLICommandLoader
type FileData ¶ added in v0.16.0
type FileData struct {
// Path is the file path relative to the repository root.
Path string
// Content is the file content.
Content string
// FileInfo contains file metadata.
FileInfo fs.FileInfo
}
FileData is an in-memory representation of a file to be processed.
type GitLoader ¶
type GitLoader struct {
// contains filtered or unexported fields
}
GitLoader loads and processes documents from a git repository on the local file system. It supports batch processing, parallel file processing, and memory protection.
func NewGit ¶
func NewGit(path string, registry parsers.ParserRegistry, opts ...GitLoaderOption) (*GitLoader, error)
NewGit creates a new GitLoader for the specified repository path. It returns an error if the path is invalid, the registry is nil, or the path does not exist.
type GitLoaderOption ¶ added in v0.2.0
type GitLoaderOption func(*gitLoaderOptions)
GitLoaderOption configures a GitLoader.
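The signature above follows the functional options pattern. A minimal, self-contained sketch of how such options might compose is shown below. The fields of gitLoaderOptions are unexported and not documented here, so the field names, the default values, and the applyOptions helper are assumptions made for illustration only.

```go
package main

import "fmt"

// gitLoaderOptions is a stand-in; the real struct's fields are not documented.
type gitLoaderOptions struct {
	batchSize   int
	workerCount int
	excludeDirs []string
}

// GitLoaderOption mutates the options struct, as in the signature above.
type GitLoaderOption func(*gitLoaderOptions)

func WithBatchSize(size int) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.batchSize = size }
}

func WithWorkerCount(count int) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.workerCount = count }
}

func WithExcludeDirs(dirs []string) GitLoaderOption {
	return func(o *gitLoaderOptions) { o.excludeDirs = dirs }
}

// applyOptions is a hypothetical helper: it starts from assumed defaults
// and applies each option in order, so later options win.
func applyOptions(opts ...GitLoaderOption) gitLoaderOptions {
	o := gitLoaderOptions{batchSize: 100, workerCount: 4} // assumed defaults
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := applyOptions(
		WithBatchSize(50),
		WithExcludeDirs([]string{"vendor", ".git"}),
	)
	fmt.Println(o.batchSize, o.workerCount, o.excludeDirs)
}
```

The pattern keeps the constructor signature stable while new settings are added, which matches how the With... functions in this package were introduced across versions.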
func WithBatchSize ¶ added in v0.16.0
func WithBatchSize(size int) GitLoaderOption
WithBatchSize sets the batch size for document processing.
func WithExcludeDirs ¶ added in v0.4.0
func WithExcludeDirs(dirs []string) GitLoaderOption
WithExcludeDirs sets directory names to exclude from loading.
func WithExcludeExts ¶ added in v0.4.0
func WithExcludeExts(exts []string) GitLoaderOption
WithExcludeExts sets file extensions to exclude from loading. Extensions can be provided with or without the leading dot.
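Accepting extensions with or without the leading dot implies some normalization before comparison. One way this could work is sketched below. The helper names normalizeExt and isExcluded are hypothetical, and the package's actual matching rules (for example case sensitivity) are not documented here.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// normalizeExt returns the extension with exactly one leading dot,
// or "" for blank input. Hypothetical helper.
func normalizeExt(ext string) string {
	ext = strings.TrimSpace(ext)
	if ext == "" {
		return ""
	}
	if !strings.HasPrefix(ext, ".") {
		return "." + ext
	}
	return ext
}

// isExcluded reports whether the file's extension matches any excluded one.
func isExcluded(file string, excludeExts []string) bool {
	fileExt := filepath.Ext(file)
	for _, e := range excludeExts {
		if n := normalizeExt(e); n != "" && n == fileExt {
			return true
		}
	}
	return false
}

func main() {
	exts := []string{"png", ".lock"}
	fmt.Println(isExcluded("assets/logo.png", exts))
	fmt.Println(isExcluded("main.go", exts))
}
```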
func WithGeneratedCodeDetection ¶ added in v0.20.0
func WithGeneratedCodeDetection(enable bool) GitLoaderOption
WithGeneratedCodeDetection enables or disables detection of auto-generated code. When enabled, files detected as generated will be skipped.
func WithIncludeExts ¶ added in v0.2.0
func WithIncludeExts(exts []string) GitLoaderOption
WithIncludeExts sets file extensions to include in loading. If set, only files with these extensions will be loaded.
func WithLogger ¶ added in v0.2.0
func WithLogger(logger *slog.Logger) GitLoaderOption
WithLogger sets the logger for the loader.
func WithMaxMemoryBuffer ¶ added in v0.16.0
func WithMaxMemoryBuffer(bytes int64) GitLoaderOption
WithMaxMemoryBuffer sets the maximum memory buffer in bytes. Processing will pause when memory usage exceeds this limit.
func WithWorkerCount ¶ added in v0.16.0
func WithWorkerCount(count int) GitLoaderOption
WithWorkerCount sets the number of parallel workers for file processing.
type Loader ¶
type Loader interface {
// Load loads all documents from the source.
Load(ctx context.Context) ([]schema.Document, error)
// LoadAndProcessStream loads documents in batches and processes them.
LoadAndProcessStream(ctx context.Context, processFn func(ctx context.Context, docs []schema.Document) error) error
}
Loader is the interface for document loaders.
type NormalizationConfig ¶ added in v0.33.0
type NormalizationConfig struct {
// StripHTML removes all HTML tags. If false, HTML is sanitized instead.
StripHTML bool
// MaxContentLength is the maximum length for content (0 = unlimited).
MaxContentLength int
// MinContentLength is the minimum length for content (items below this are skipped).
MinContentLength int
// RemoveTracking removes tracking parameters from URLs (UTM, fbclid, etc.).
RemoveTracking bool
// NormalizeURLs removes fragments and normalizes URL structure.
NormalizeURLs bool
// DefaultTimezone is used for parsing dates without timezone info.
DefaultTimezone *time.Location
// DateFormats are custom date formats to try when parsing.
DateFormats []string
// MinTitleLength is the minimum title length.
MinTitleLength int
// FallbackToURL uses URL path as title if title is too short.
FallbackToURL bool
// NormalizeAuthors cleans author names (removes emails, quotes).
NormalizeAuthors bool
// DeduplicationField specifies which field to use for deduplication ("guid" or "link").
DeduplicationField string
}
NormalizationConfig configures how RSS content is normalized and cleaned.
type RSSFeedData ¶ added in v0.33.0
RSSFeedData represents a fetched RSS feed with its metadata.
type RSSLoader ¶ added in v0.33.0
type RSSLoader struct {
// contains filtered or unexported fields
}
RSSLoader loads and processes documents from RSS/Atom feeds. It supports batch processing, parallel feed fetching, and content normalization.
func NewRSS ¶ added in v0.33.0
func NewRSS(feedURLs []string, registry parsers.ParserRegistry, opts ...RSSLoaderOption) (*RSSLoader, error)
NewRSS creates a new RSSLoader for the specified feed URLs. Returns an error if no URLs are provided or the registry is nil.
type RSSLoaderOption ¶ added in v0.33.0
type RSSLoaderOption func(*rssLoaderOptions)
RSSLoaderOption configures an RSSLoader.
func WithHTMLParser ¶ added in v0.33.0
func WithHTMLParser(htmlParser schema.ParserPlugin) RSSLoaderOption
WithHTMLParser sets an HTML parser for transforming HTML content to Markdown. When set, the HTML parser will:
- Remove boilerplate (nav, footer, scripts)
- Normalize links (relative → absolute)
- Extract metadata (author, date, title)
- Convert to clean Markdown
This is useful for RSS feeds with rich HTML content that needs to be optimized for LLM consumption before being stored in vector databases.
func WithRSSBatchSize ¶ added in v0.33.0
func WithRSSBatchSize(size int) RSSLoaderOption
WithRSSBatchSize sets the batch size for document processing.
func WithRSSHTTPClient ¶ added in v0.33.0
func WithRSSHTTPClient(client *http.Client) RSSLoaderOption
WithRSSHTTPClient sets a custom HTTP client for feed requests.
func WithRSSLogger ¶ added in v0.33.0
func WithRSSLogger(logger *slog.Logger) RSSLoaderOption
WithRSSLogger sets a custom logger for the loader.
func WithRSSMaxItems ¶ added in v0.33.0
func WithRSSMaxItems(count int) RSSLoaderOption
WithRSSMaxItems sets the maximum number of items to fetch per feed.
func WithRSSMaxRetries ¶ added in v0.33.0
func WithRSSMaxRetries(retries int) RSSLoaderOption
WithRSSMaxRetries sets the number of retry attempts for failed requests.
func WithRSSNormalization ¶ added in v0.33.0
func WithRSSNormalization(config NormalizationConfig) RSSLoaderOption
WithRSSNormalization sets the content normalization configuration.
func WithRSSRateLimit ¶ added in v0.33.0
func WithRSSRateLimit(requestsPerSecond int) RSSLoaderOption
WithRSSRateLimit sets the maximum number of requests per second. This helps prevent overwhelming RSS servers with too many concurrent requests. Default is 10 requests per second.
func WithRSSSeenItems ¶ added in v0.33.0
func WithRSSSeenItems(seen map[string]bool) RSSLoaderOption
WithRSSSeenItems provides a pre-populated map of seen item GUIDs for deduplication.
func WithRSSSkipDuplicates ¶ added in v0.33.0
func WithRSSSkipDuplicates(skip bool) RSSLoaderOption
WithRSSSkipDuplicates enables deduplication of feed items by GUID/link.
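Deduplication against a seen map can be pictured roughly as follows. This is a sketch under assumptions: the item type, the dedupKey and filterUnseen helpers, and the fallback from GUID to link are all hypothetical, and the loader's real bookkeeping is not documented here.

```go
package main

import "fmt"

// item is a stand-in for a parsed feed entry.
type item struct {
	GUID string
	Link string
}

// dedupKey picks the identifier used for deduplication. Falling back to
// the link when the GUID is empty is an assumption.
func dedupKey(it item, field string) string {
	if field == "link" {
		return it.Link
	}
	if it.GUID != "" {
		return it.GUID
	}
	return it.Link
}

// filterUnseen returns items whose key is not yet in seen and records
// the new keys. Items without any key are kept, since they cannot be compared.
func filterUnseen(items []item, seen map[string]bool, field string) []item {
	var out []item
	for _, it := range items {
		if k := dedupKey(it, field); k != "" {
			if seen[k] {
				continue
			}
			seen[k] = true
		}
		out = append(out, it)
	}
	return out
}

func main() {
	seen := map[string]bool{"post-1": true} // e.g. restored from a previous run
	items := []item{{GUID: "post-1"}, {GUID: "post-2"}, {GUID: "post-2"}}
	fmt.Println(len(filterUnseen(items, seen, "guid")))
}
```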
func WithRSSTimeout ¶ added in v0.33.0
func WithRSSTimeout(timeout time.Duration) RSSLoaderOption
WithRSSTimeout sets the HTTP timeout for feed requests.
func WithRSSUserAgent ¶ added in v0.33.0
func WithRSSUserAgent(userAgent string) RSSLoaderOption
WithRSSUserAgent sets a custom User-Agent header for HTTP requests.
func WithRSSWorkerCount ¶ added in v0.33.0
func WithRSSWorkerCount(count int) RSSLoaderOption
WithRSSWorkerCount sets the number of parallel workers for feed fetching.
type RSSNormalizer ¶ added in v0.33.0
type RSSNormalizer struct {
// contains filtered or unexported fields
}
RSSNormalizer handles content normalization and sanitization for RSS feeds. It provides HTML sanitization, URL cleaning, date parsing, and metadata normalization.
func NewRSSNormalizer ¶ added in v0.33.0
func NewRSSNormalizer(config NormalizationConfig) *RSSNormalizer
NewRSSNormalizer creates a new RSS normalizer with the given configuration. If config values are zero, sensible defaults are applied.
func (*RSSNormalizer) NormalizeAuthor ¶ added in v0.33.0
func (n *RSSNormalizer) NormalizeAuthor(author string) string
NormalizeAuthor cleans author names by removing email addresses and quotes. Example: "John Doe <john@example.com>" becomes "John Doe".
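The transformation in the example above can be approximated with two regular expressions. This is an illustrative sketch, not the package's implementation: normalizeAuthor and both patterns are hypothetical, and trimming surrounding parentheses is an added assumption to handle the "email (Name)" form found in some feeds.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// angleEmail matches an address wrapped in angle brackets.
	angleEmail = regexp.MustCompile(`<[^<>]*@[^<>]*>`)
	// bareEmail matches a loose, unbracketed address.
	bareEmail = regexp.MustCompile(`[^\s<>()]+@[^\s<>()]+`)
)

// normalizeAuthor removes email addresses and double quotes, then
// collapses whitespace. Hypothetical helper.
func normalizeAuthor(author string) string {
	s := angleEmail.ReplaceAllString(author, "")
	s = bareEmail.ReplaceAllString(s, "")
	s = strings.ReplaceAll(s, `"`, "")
	s = strings.Join(strings.Fields(s), " ")
	return strings.Trim(s, "() ")
}

func main() {
	fmt.Println(normalizeAuthor("John Doe <john@example.com>"))
}
```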
func (*RSSNormalizer) NormalizeCategories ¶ added in v0.33.0
func (n *RSSNormalizer) NormalizeCategories(categories []string) []string
NormalizeCategories normalizes and deduplicates category strings. Converts to lowercase and removes empty categories.
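A minimal sketch of the described behavior (lowercasing, dropping empties, deduplicating) is shown below. The normalizeCategories function is a hypothetical stand-in. Trimming whitespace and preserving first-seen order are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCategories lowercases and trims each category, drops empty
// entries, and removes duplicates while keeping first-seen order.
func normalizeCategories(categories []string) []string {
	seen := make(map[string]bool, len(categories))
	out := make([]string, 0, len(categories))
	for _, c := range categories {
		c = strings.ToLower(strings.TrimSpace(c))
		if c == "" || seen[c] {
			continue
		}
		seen[c] = true
		out = append(out, c)
	}
	return out
}

func main() {
	fmt.Println(normalizeCategories([]string{"Go", " go ", "", "RAG"}))
}
```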
func (*RSSNormalizer) NormalizeContent ¶ added in v0.33.0
func (n *RSSNormalizer) NormalizeContent(content string) string
NormalizeContent normalizes content by sanitizing/stripping HTML and truncating. It removes excess whitespace and enforces length limits.
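The whitespace and length handling can be sketched as follows. HTML sanitizing or stripping is left out here. The collapseAndTruncate helper is hypothetical, and measuring the limit in runes rather than bytes is an assumption. A limit of 0 is treated as unlimited, in line with the MaxContentLength field comment.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseAndTruncate collapses runs of whitespace into single spaces and
// cuts the result to at most maxLen runes (maxLen <= 0 means no limit).
func collapseAndTruncate(s string, maxLen int) string {
	s = strings.Join(strings.Fields(s), " ")
	if maxLen <= 0 {
		return s
	}
	r := []rune(s)
	if len(r) > maxLen {
		return string(r[:maxLen])
	}
	return s
}

func main() {
	fmt.Printf("%q\n", collapseAndTruncate("  Hello \n\t world,  feed  ", 11))
}
```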
func (*RSSNormalizer) NormalizeTitle ¶ added in v0.33.0
func (n *RSSNormalizer) NormalizeTitle(title string, fallbackURL string) string
NormalizeTitle normalizes the title or generates one from the URL if too short. Extracts the last path segment as a fallback title.
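The fallback to the last path segment could look roughly like this. The titleOrFallback function and its minLen parameter are hypothetical (the real method takes its threshold from NormalizationConfig), and the length check here counts bytes, which is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// titleOrFallback returns the trimmed title, or the last URL path segment
// when the title is shorter than minLen and the URL has a usable path.
func titleOrFallback(title, rawURL string, minLen int) string {
	title = strings.TrimSpace(title)
	if len(title) >= minLen {
		return title
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return title
	}
	p := strings.Trim(u.Path, "/")
	if p == "" {
		return title // nothing to fall back to
	}
	return path.Base(p)
}

func main() {
	fmt.Println(titleOrFallback("Hi", "https://example.com/blog/my-post/", 5))
}
```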
func (*RSSNormalizer) NormalizeURL ¶ added in v0.33.0
func (n *RSSNormalizer) NormalizeURL(rawURL string) string
NormalizeURL cleans a URL by removing tracking parameters and fragments. It can remove UTM parameters, fbclid, gclid, and other tracking identifiers.
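With the standard net/url package, this kind of cleanup can be sketched in a few lines. The cleanURL function is hypothetical, and the list of tracking keys is limited to the ones named above (utm_*, fbclid, gclid). The package may remove others.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// cleanURL drops common tracking parameters and the fragment.
// On a parse error the input is returned unchanged.
func cleanURL(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return rawURL
	}
	q := u.Query()
	for key := range q {
		lower := strings.ToLower(key)
		if strings.HasPrefix(lower, "utm_") || lower == "fbclid" || lower == "gclid" {
			q.Del(key)
		}
	}
	u.RawQuery = q.Encode() // note: Encode sorts the remaining keys
	u.Fragment = ""
	return u.String()
}

func main() {
	fmt.Println(cleanURL("https://example.com/post?id=7&utm_source=rss&fbclid=abc#comments"))
}
```

Stripping these parameters before deduplication means the same article linked from two campaigns resolves to one key.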
func (*RSSNormalizer) ParseDate ¶ added in v0.33.0
func (n *RSSNormalizer) ParseDate(dateStr string) time.Time
ParseDate attempts to parse a date string using multiple common formats. Supports RFC1123, RFC1123Z, RFC822, RFC822Z, RFC3339, ISO8601, and custom formats. Returns a zero time if parsing fails.
func (*RSSNormalizer) ResolveURL ¶ added in v0.33.0
func (n *RSSNormalizer) ResolveURL(baseURL, relativeURL string) string
ResolveURL resolves a relative URL against a base URL. If the URL is already absolute, it's returned as-is.
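The standard library already provides this resolution through url.URL.ResolveReference. A minimal sketch follows. The resolveURL function is hypothetical, and returning the input unchanged on parse errors is an assumption about error handling.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves ref against baseURL. Absolute references and
// unparsable inputs are returned unchanged.
func resolveURL(baseURL, ref string) string {
	r, err := url.Parse(ref)
	if err != nil || r.IsAbs() {
		return ref
	}
	b, err := url.Parse(baseURL)
	if err != nil {
		return ref
	}
	return b.ResolveReference(r).String()
}

func main() {
	base := "https://example.com/blog/feed.xml"
	fmt.Println(resolveURL(base, "/images/a.png"))
	fmt.Println(resolveURL(base, "post-1"))
}
```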
func (*RSSNormalizer) SanitizeHTML ¶ added in v0.33.0
func (n *RSSNormalizer) SanitizeHTML(html string) string
SanitizeHTML sanitizes HTML content using a safe whitelist policy. It removes dangerous elements (script, iframe) and attributes (onclick, etc.). Adds rel="nofollow noopener" and target="_blank" to external links.
func (*RSSNormalizer) ShouldSkipItem ¶ added in v0.33.0
func (n *RSSNormalizer) ShouldSkipItem(title, content string) bool
ShouldSkipItem determines if an item should be skipped based on content quality. Returns true if title or content is too short.
func (*RSSNormalizer) StripHTMLTags ¶ added in v0.33.0
func (n *RSSNormalizer) StripHTMLTags(html string) string
StripHTMLTags removes all HTML tags, returning plain text.
type RemoteGitRepoLoader ¶ added in v0.2.0
type RemoteGitRepoLoader struct {
RepoURL string
ParserRegistry parsers.ParserRegistry
Logger *slog.Logger
}
func NewRemoteGitRepoLoader ¶ added in v0.2.0
func NewRemoteGitRepoLoader(repoURL string, registry parsers.ParserRegistry, logger *slog.Logger) *RemoteGitRepoLoader