Documentation ¶
Overview ¶
Package defuddle provides web content extraction and demuddling capabilities.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrNotHTML is returned when the fetched content is not HTML.
	ErrNotHTML = errors.New("defuddle: content is not HTML")
	// ErrTooLarge is returned when the fetched content exceeds the size limit.
	ErrTooLarge = errors.New("defuddle: content exceeds size limit")
	// ErrTimeout is returned when a fetch operation times out.
	ErrTimeout = errors.New("defuddle: request timed out")
	// ErrNoContent is returned when no main content could be extracted.
	ErrNoContent = errors.New("defuddle: no content extracted")
)
Sentinel errors for caller-branching logic via errors.Is().
var Version = "dev"
Version is the library version, set at build time via -ldflags.
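A small sketch of how a string variable like this is overridden at link time. It declares its own Version in package main so that it is self-contained; the versionString helper and the version number 1.2.3 are assumptions for illustration. For the library itself, the -X flag would name the package's full import path instead of main.

```go
package main

import "fmt"

// Version mirrors the package-level variable described above. It has to be a
// string variable (not a constant) for the linker's -X flag to override it.
var Version = "dev"

// versionString is a hypothetical helper used only in this sketch.
func versionString() string {
	return "defuddle " + Version
}

func main() {
	// With a plain `go build`, Version keeps its default value "dev".
	// With `go build -ldflags "-X main.Version=1.2.3"`, the linker replaces
	// it, and the line below reports 1.2.3 instead.
	fmt.Println(versionString())
}
```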
Functions ¶
func BoolDefault ¶
BoolDefault returns the value pointed to by b, or defaultVal if b is nil.
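The behavior described above can be sketched as follows. The signature is not shown on this page, so the one used here (a *bool plus a fallback) is an assumption, and both boolDefault and ptrBool are local stand-ins rather than the package's own functions.

```go
package main

import "fmt"

// boolDefault is a local sketch of the documented behavior: dereference b,
// or fall back to defaultVal when b is nil. The signature is an assumption.
func boolDefault(b *bool, defaultVal bool) bool {
	if b == nil {
		return defaultVal
	}
	return *b
}

// ptrBool is a hypothetical helper that returns a pointer to v.
func ptrBool(v bool) *bool { return &v }

func main() {
	var unset *bool
	fmt.Println(boolDefault(unset, true))          // nil: falls back to the default
	fmt.Println(boolDefault(ptrBool(false), true)) // explicit false wins over the default
}
```

A pointer is what lets an option distinguish "not set" from "set to false", which a plain bool cannot do.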
Types ¶
type Defuddle ¶
type Defuddle struct {
	// contains filtered or unexported fields
}
Defuddle represents a document parser instance.
func NewDefuddle ¶
NewDefuddle creates a new Defuddle instance from HTML content.
JavaScript original code:
constructor(document: Document, options: DefuddleOptions = {}) {
  this.doc = document;
  this.options = options;
}
type ExtractedContent ¶
type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}
ExtractedContent represents content extracted by site-specific extractors.
JavaScript original code:
export interface ExtractedContent {
  title?: string;
  author?: string;
  published?: string;
  content?: string;
  contentHtml?: string;
  variables?: ExtractorVariables;
}
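The pointer fields with omitempty are how the optional (?) members of the TypeScript interface carry over to Go. A minimal sketch, using a local copy of three of the fields so that it runs without the package; the encode helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractedContent copies a subset of the fields listed above so the sketch
// is self-contained.
type extractedContent struct {
	Title   *string `json:"title,omitempty"`
	Author  *string `json:"author,omitempty"`
	Content *string `json:"content,omitempty"`
}

// encode marshals c and returns the JSON as a string.
func encode(c extractedContent) (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

func main() {
	title := "Hello"
	empty := ""
	// Author is nil and is omitted; Content points at "" and is kept.
	out, err := encode(extractedContent{Title: &title, Content: &empty})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"title":"Hello","content":""}
}
```

With plain string fields and omitempty, the empty content would have been dropped as well, so "absent" and "present but empty" could not be told apart.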
type ExtractorVariables ¶
ExtractorVariables represents variables extracted by site-specific extractors.
JavaScript original code:
export interface ExtractorVariables {
  [key: string]: string;
}
type MetaTag ¶
MetaTag represents a meta tag item from HTML. This is an alias for the internal metadata.MetaTag type.
type Metadata ¶
Metadata represents extracted metadata from a document. This is an alias for the internal metadata.Metadata type.
type Options ¶
type Options struct {
	// Enable debug logging
	Debug bool `json:"debug,omitempty"`
	// URL of the page being parsed
	URL string `json:"url,omitempty"`
	// Convert output to Markdown
	Markdown bool `json:"markdown,omitempty"`
	// Include Markdown in the response
	SeparateMarkdown bool `json:"separateMarkdown,omitempty"`
	// Whether to remove elements matching exact selectors like ads, social buttons, etc.
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveExactSelectors *bool `json:"removeExactSelectors,omitempty"`
	// Whether to remove elements matching partial selectors like ads, social buttons, etc.
	// nil = true (default). Use PtrBool(false) to disable.
	RemovePartialSelectors *bool `json:"removePartialSelectors,omitempty"`
	// Remove images from the extracted content
	// Defaults to false.
	RemoveImages bool `json:"removeImages,omitempty"`
	// Whether to remove hidden elements (display:none, Tailwind hidden classes).
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveHiddenElements *bool `json:"removeHiddenElements,omitempty"`
	// Whether to remove low-scoring non-content blocks.
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveLowScoring *bool `json:"removeLowScoring,omitempty"`
	// Whether to remove content patterns (boilerplate, breadcrumbs, etc.).
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveContentPatterns *bool `json:"removeContentPatterns,omitempty"`
	// CSS selector to use for content extraction instead of auto-detection.
	ContentSelector string `json:"contentSelector,omitempty"`
	// Element processing options
	ProcessCode      bool `json:"processCode,omitempty"`
	ProcessImages    bool `json:"processImages,omitempty"`
	ProcessHeadings  bool `json:"processHeadings,omitempty"`
	ProcessMath      bool `json:"processMath,omitempty"`
	ProcessFootnotes bool `json:"processFootnotes,omitempty"`
	ProcessRoles     bool `json:"processRoles,omitempty"`
	CodeOptions     *elements.CodeBlockProcessingOptions `json:"codeOptions,omitempty"`
	ImageOptions    *elements.ImageProcessingOptions     `json:"imageOptions,omitempty"`
	HeadingOptions  *elements.HeadingProcessingOptions   `json:"headingOptions,omitempty"`
	MathOptions     *elements.MathProcessingOptions      `json:"mathOptions,omitempty"`
	FootnoteOptions *elements.FootnoteProcessingOptions  `json:"footnoteOptions,omitempty"`
	RoleOptions     *elements.RoleProcessingOptions      `json:"roleOptions,omitempty"`
	// Client is a custom HTTP client for fetching URLs.
	// If nil, a default client with standard User-Agent and 30s timeout is created.
	Client *requests.Client `json:"-"`
	// MaxConcurrency limits parallel URL fetches in ParseFromURLs.
	// Defaults to 5 if zero.
	MaxConcurrency int `json:"maxConcurrency,omitempty"`
}
Options represents configuration options for Defuddle parsing.
JavaScript original code:
export interface DefuddleOptions {
  debug?: boolean;
  url?: string;
  markdown?: boolean;
  separateMarkdown?: boolean;
  removeExactSelectors?: boolean;
  removePartialSelectors?: boolean;
}
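The MaxConcurrency field above caps how many URLs are fetched in parallel and falls back to 5 when left at zero. The sketch below shows one common way to enforce such a cap, a buffered channel used as a semaphore. It is not taken from the library's source: effectiveConcurrency, processAll, and the work callback are hypothetical names, and treating negative values like zero is a choice made here.

```go
package main

import (
	"fmt"
	"sync"
)

// effectiveConcurrency applies the documented default: zero means 5.
// Negative values are treated the same way in this sketch.
func effectiveConcurrency(n int) int {
	if n <= 0 {
		return 5
	}
	return n
}

// processAll runs work on every URL with at most limit calls in flight.
// Results are returned in input order.
func processAll(urls []string, limit int, work func(string) string) []string {
	results := make([]string, len(urls))
	sem := make(chan struct{}, effectiveConcurrency(limit))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while the maximum number of workers is busy
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next URL
			results[i] = work(u)     // each goroutine writes its own index
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	out := processAll(urls, 0, func(u string) string { return "fetched " + u })
	fmt.Println(out)
}
```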
type Result ¶
type Result struct {
	Metadata
	Content         string            `json:"content"`
	ContentMarkdown *string           `json:"contentMarkdown,omitempty"`
	ExtractorType   *string           `json:"extractorType,omitempty"`
	Variables       map[string]string `json:"variables,omitempty"`
	MetaTags        []MetaTag         `json:"metaTags,omitempty"`
	DebugInfo       *debug.Info       `json:"debugInfo,omitempty"`
}
Result represents the complete response from Defuddle parsing.
JavaScript original code:
export interface DefuddleResponse extends DefuddleMetadata {
  content: string;
  contentMarkdown?: string;
  extractorType?: string;
  metaTags?: MetaTagItem[];
}
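Embedding Metadata in Result plays the role of "extends DefuddleMetadata" in the TypeScript interface: encoding/json promotes the embedded struct's exported fields to the top level of the JSON object. A minimal sketch with local stand-in types; the real Metadata fields are not listed on this page, so the single Title field is an assumption, as is the encode helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata is a local stand-in with one assumed field.
type Metadata struct {
	Title string `json:"title"`
}

// Result embeds Metadata the same way the documented Result type does.
type Result struct {
	Metadata
	Content string `json:"content"`
}

// encode marshals r and returns the JSON as a string.
func encode(r Result) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(Result{Metadata: Metadata{Title: "T"}, Content: "body"})
	if err != nil {
		panic(err)
	}
	// The embedded fields appear flat, not nested under a "Metadata" key.
	fmt.Println(out) // {"title":"T","content":"body"}
}
```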
func ParseFromString ¶
ParseFromString parses HTML content directly from a string. This is useful when you already have the HTML content (e.g., from browser automation).
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| cmd/defuddle (command) | Package main provides the defuddle CLI application. |
| examples/advanced (command) | Package main demonstrates advanced defuddle usage. |
| examples/basic (command) | Package main demonstrates basic defuddle usage. |
| examples/custom_extractor (command) | Package main demonstrates custom extractor usage. |
| examples/extractors (command) | Package main demonstrates extractors usage. |
| examples/markdown (command) | Package main demonstrates markdown conversion. |
| extractors | Package extractors provides site-specific content extraction functionality. |
| internal/constants | Package constants provides configuration constants and selectors for the defuddle content extraction system. |
| internal/debug | Package debug provides debugging functionality for the defuddle content extraction system. |
| internal/elements | Package elements provides enhanced element processing functionality. This module handles code block processing, including syntax highlighting, language detection, and code formatting. |
| internal/markdown | Package markdown provides HTML to Markdown conversion functionality. |
| internal/metadata | Package metadata provides functionality for extracting and processing document metadata. |
| internal/removals | Package removals provides content-pattern-based removal for the defuddle extraction pipeline. |
| internal/scoring | Package scoring provides content scoring functionality for the defuddle content extraction system. |
| internal/standardize | Package standardize provides content standardization functionality for the defuddle content extraction system. |
| internal/text | Package text provides text analysis utilities for content extraction. |
| internal/urlutil | Package urlutil provides URL resolution and sanitization for extracted content. |