defuddle

package module

v0.4.0 (latest) Published: Apr 22, 2026 License: MIT Imports: 28 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/dotcommander/defuddle

README ¶

Introduction

Defuddle Go is a port of the Defuddle TypeScript library. It extracts clean, readable content from any web page — stripping away navigation, ads, sidebars, and other clutter so you're left with just the article.

It is available as both a Go library and a drop-in CLI tool compatible with the original Defuddle CLI.

Installation

CLI

Download a pre-built binary from the releases page, or install with Go:

go install github.com/dotcommander/defuddle/cmd/defuddle@latest

Library

Add Defuddle Go to your module with go get:

go get github.com/dotcommander/defuddle

Requires Go 1.26 or higher.

Quick Start

Extract the main content from any web page in just a few lines:

d, err := defuddle.NewDefuddle(htmlString, nil)
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Title)
fmt.Println(result.Content)

Or fetch and parse a URL directly:

result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", nil)

Extracting Content

From HTML

Pass raw HTML and receive structured content with metadata:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    URL: "https://example.com/article",
})
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Title:       %s\n", result.Title)
fmt.Printf("Author:      %s\n", result.Author)
fmt.Printf("Published:   %s\n", result.Published)
fmt.Printf("Description: %s\n", result.Description)
fmt.Printf("Word Count:  %d\n", result.WordCount)
fmt.Printf("Language:    %s\n", result.Language)

From a URL

ParseFromURL handles HTTP fetching, encoding detection, and parsing in one call:

result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", &defuddle.Options{
    Markdown: true,
})

Markdown Output

Convert extracted content to Markdown for storage, indexing, or LLM consumption:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    Markdown: true,
})

result, err := d.Parse(ctx)

// When Markdown is enabled, Content is returned as Markdown
fmt.Println(result.Content)

To receive both HTML and Markdown in the same response:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    SeparateMarkdown: true,
})

result, err := d.Parse(ctx)

fmt.Println(result.Content) // HTML
if result.ContentMarkdown != nil {
    fmt.Println(*result.ContentMarkdown) // Markdown
}

Site-Specific Extractors

Defuddle automatically detects popular platforms and applies specialized extraction logic. No configuration needed — if the URL matches, the right extractor activates.

Platform	Content Type
ChatGPT	Conversations with role-separated messages
Claude	Conversations with human/assistant turns
Gemini	Google AI conversations
Grok	xAI conversations
GitHub	Issues and pull requests with comments
Hacker News	Posts and threaded comment discussions
Reddit	Posts with comment trees
Substack	Newsletter articles
Twitter / X	Tweets and threads
X Articles	Long-form articles (Draft.js)
YouTube	Video metadata and descriptions

Custom Extractors

Implement the BaseExtractor interface to add support for any site:

type MyExtractor struct {
    *extractors.ExtractorBase
}

func NewMyExtractor(doc *goquery.Document, url string, schema any) extractors.BaseExtractor {
    return &MyExtractor{ExtractorBase: extractors.NewExtractorBase(doc, url, schema)}
}

func (e *MyExtractor) Name() string     { return "MyExtractor" }
func (e *MyExtractor) CanExtract() bool { return true }

func (e *MyExtractor) Extract() *extractors.ExtractorResult {
    doc := e.GetDocument()
    content, _ := doc.Find(".article-body").Html()
    return &extractors.ExtractorResult{
        ContentHTML: content,
        Variables:   map[string]string{"site": "My Site"},
    }
}

Register it before parsing:

extractors.Register(extractors.ExtractorMapping{
    Patterns:  []any{"mysite.com"},
    Extractor: NewMyExtractor,
})
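
The Patterns field is typed []any, which suggests a pattern can be more than one kind of value. The sketch below is not the library's matching code. It assumes, purely for illustration, that a string pattern matches the host (exactly or as a parent domain) and that a *regexp.Regexp pattern matches the full URL. The function name matchesURL and the sample patterns are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// matchesURL reports whether rawURL matches any of the patterns.
// Assumption: strings are compared against the host, regular expressions
// against the whole URL. Other pattern types are ignored.
func matchesURL(rawURL string, patterns []any) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, p := range patterns {
		switch v := p.(type) {
		case string:
			domain := strings.ToLower(v)
			// Match the domain itself or any subdomain of it.
			if host == domain || strings.HasSuffix(host, "."+domain) {
				return true
			}
		case *regexp.Regexp:
			if v.MatchString(rawURL) {
				return true
			}
		}
	}
	return false
}

func main() {
	patterns := []any{
		"mysite.com",
		regexp.MustCompile(`^https://blog\.example\.org/posts/`),
	}
	fmt.Println(matchesURL("https://www.mysite.com/article", patterns))
	fmt.Println(matchesURL("https://blog.example.org/posts/1", patterns))
	fmt.Println(matchesURL("https://notmysite.com/", patterns))
}
```

Matching on a leading dot keeps "notmysite.com" from being mistaken for a subdomain of "mysite.com", which a plain substring check would get wrong.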

Configuration

Options

All options have sensible defaults. Pass nil for zero-config extraction.

opts := &defuddle.Options{
    // Output
    Markdown:         false, // Return content as Markdown
    SeparateMarkdown: false, // Return both HTML and Markdown

    // Content selection
    ContentSelector:  "",    // CSS selector override for main content
    URL:              "",    // Source URL (used for link resolution and domain detection)

    // Removal controls — pointer bools default to true when nil.
    // Use defuddle.PtrBool(false) to explicitly disable.
    RemoveExactSelectors:   nil, // Remove known clutter (ads, nav, social buttons)
    RemovePartialSelectors: nil, // Remove probable clutter (class/id pattern matching)
    RemoveHiddenElements:   nil, // Remove display:none and hidden elements
    RemoveContentPatterns:  nil, // Remove boilerplate (breadcrumbs, related posts, etc.)
    RemoveLowScoring:       nil, // Remove low-scoring non-content blocks
    RemoveImages:           false, // Strip all images from output

    // Element processing
    ProcessCode:      false, // Normalize code blocks with language detection
    ProcessImages:    false, // Optimize images (lazy-load resolution, srcset)
    ProcessHeadings:  false, // Clean heading hierarchy
    ProcessMath:      false, // Normalize MathJax/KaTeX formulas
    ProcessFootnotes: false, // Standardize footnote format
    ProcessRoles:     false, // Convert ARIA roles to semantic HTML

    // HTTP (for ParseFromURL and ParseFromURLs)
    Client:         nil, // Custom *requests.Client
    MaxConcurrency: 5,   // Parallel limit for ParseFromURLs

    // Diagnostics
    Debug: false, // Emit debug processing info
}

Content Selector

Override automatic content detection with a CSS selector:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    ContentSelector: "article.post-body",
})

The Extraction Pipeline

Defuddle processes content through a multi-stage pipeline:

HTML Input
 |
 v
1. Schema.org         -- Extract JSON-LD structured data
2. Site Detection      -- Match URL to specialized extractor
3. Shadow DOM          -- Flatten shadow roots and resolve React SSR
4. Selector Removal    -- Strip known clutter by CSS selector
5. Content Scoring     -- Score nodes and identify main content
6. Content Patterns    -- Remove boilerplate (breadcrumbs, related posts, newsletters)
7. Standardization     -- Normalize headings, footnotes, code blocks, images, math
8. Markdown            -- Convert to Markdown (if requested)
 |
 v
Result

The pipeline includes an automatic retry cascade: if the initial extraction yields fewer than 50 words, Defuddle progressively relaxes its removal filters to recover content from heavily decorated pages.
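
One way to picture the retry cascade is as a loop over progressively looser filter sets. The sketch below is illustrative only: the filters struct, the extract callback, and the order in which filters are switched off are assumptions, not the library's implementation. Only the 50-word threshold comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// minWords is the threshold named in the pipeline description.
const minWords = 50

// filters is a hypothetical stand-in for the removal options.
type filters struct {
	partialSelectors bool
	hiddenElements   bool
	lowScoring       bool
}

// wordCount counts whitespace-separated words.
func wordCount(s string) int { return len(strings.Fields(s)) }

// extractWithRetry runs extract with every filter on, then switches filters
// off one at a time (in an assumed order) until the output reaches minWords.
// It returns the longest output seen and the number of retries performed.
func extractWithRetry(extract func(filters) string) (string, int) {
	f := filters{partialSelectors: true, hiddenElements: true, lowScoring: true}
	relaxations := []func(*filters){
		func(f *filters) { f.partialSelectors = false },
		func(f *filters) { f.hiddenElements = false },
		func(f *filters) { f.lowScoring = false },
	}
	best := extract(f)
	retries := 0
	for _, relax := range relaxations {
		if wordCount(best) >= minWords {
			break
		}
		relax(&f)
		retries++
		if out := extract(f); wordCount(out) > wordCount(best) {
			best = out
		}
	}
	return best, retries
}

// fakeExtract simulates a page whose article body only survives once
// partial-selector removal is switched off.
func fakeExtract(f filters) string {
	if f.partialSelectors {
		return strings.Repeat("word ", 10)
	}
	return strings.Repeat("word ", 60)
}

func main() {
	content, retries := extractWithRetry(fakeExtract)
	fmt.Printf("words=%d retries=%d\n", wordCount(content), retries)
}
```

Keeping the longest output seen means a retry can never make the result worse than the first attempt.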

The Result Object

Field	Type	Description
`Title`	`string`	Article title
`Author`	`string`	Article author
`Description`	`string`	Article description or summary
`Domain`	`string`	Website domain
`Favicon`	`string`	Website favicon URL
`Image`	`string`	Main article image URL
`Published`	`string`	Publication date
`Language`	`string`	Content language (BCP 47)
`Site`	`string`	Website name
`Content`	`string`	Cleaned HTML (or Markdown if enabled)
`ContentMarkdown`	`*string`	Markdown version (with `SeparateMarkdown`)
`WordCount`	`int`	Word count of extracted content
`ParseTime`	`int64`	Parse duration in milliseconds
`SchemaOrgData`	`any`	Schema.org structured data
`Variables`	`map[string]string`	Extractor-specific variables
`MetaTags`	`[]MetaTag`	Document meta tags
`ExtractorType`	`*string`	Which extractor was used
`DebugInfo`	`*debug.Info`	Debug processing steps (with `Debug`)

CLI Usage

The defuddle command provides a fast interface for content extraction, fully compatible with the original TypeScript CLI.

Extracting Content

# From a URL
defuddle parse https://example.com/article

# From a local file
defuddle parse article.html

# As Markdown
defuddle parse https://example.com/article --markdown

# As JSON with all metadata
defuddle parse https://example.com/article --json

# Extract a single field
defuddle parse https://example.com/article --property title

Saving Output

defuddle parse https://example.com/article --markdown --output article.md

Authentication and Proxies

# Custom headers
defuddle parse https://example.com --header "Authorization: Bearer token123"

# Through a proxy
defuddle parse https://example.com --proxy http://localhost:8080

# Custom timeout
defuddle parse https://slow-site.com --timeout 120s

All CLI Options

Option	Short	Description
`--output`	`-o`	Output file path (default: stdout)
`--markdown`	`-m`	Convert content to Markdown
`--json`	`-j`	Output as JSON with metadata
`--property`	`-p`	Extract a specific property
`--header`	`-H`	Custom header (repeatable)
`--proxy`		Proxy URL
`--user-agent`		Custom user agent
`--timeout`		Request timeout (default: 30s)
`--debug`		Enable debug output

Examples

The examples/ directory contains ready-to-run programs:

go run ./examples/basic              # Simple extraction
go run ./examples/markdown           # HTML to Markdown
go run ./examples/advanced           # Full option usage
go run ./examples/extractors         # Site-specific extraction
go run ./examples/custom_extractor   # Building a custom extractor

Testing

# Run all tests
go test ./...

# With race detection
go test -race ./...

# Benchmarks
go test -bench=. -benchmem ./...

Credits

Defuddle by Steph Ango (@kepano) — the original TypeScript library
Defuddle CLI by Steph Ango — the original CLI tool
Inspired by Mozilla's Readability algorithm

License

Defuddle Go is open-source software licensed under the MIT license.

Documentation ¶

Overview ¶

Package defuddle provides web content extraction and demuddling capabilities.

Index ¶

Variables
func BoolDefault(b *bool, defaultVal bool) bool
func PtrBool(v bool) *bool
type Defuddle
- func NewDefuddle(html string, options *Options) (*Defuddle, error)
- func (d *Defuddle) Parse(ctx context.Context) (*Result, error)
type ExtractedContent
type ExtractorVariables
type MetaTag
type Metadata
type Options
type Result
- func ParseFromString(ctx context.Context, html string, options *Options) (*Result, error)
- func ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)
type URLResult
- func ParseFromURLs(ctx context.Context, urls []string, options *Options) []URLResult

Constants ¶

This section is empty.

Variables ¶

var (
	// ErrNotHTML is returned when the fetched content is not HTML.
	ErrNotHTML = errors.New("defuddle: content is not HTML")

	// ErrTooLarge is returned when the fetched content exceeds the size limit.
	ErrTooLarge = errors.New("defuddle: content exceeds size limit")

	// ErrTimeout is returned when a fetch operation times out.
	ErrTimeout = errors.New("defuddle: request timed out")

	// ErrNoContent is returned when no main content could be extracted.
	ErrNoContent = errors.New("defuddle: no content extracted")
)

Sentinel errors for caller-branching logic via errors.Is().
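
A minimal sketch of caller-side branching with errors.Is follows. To stay self-contained it declares local copies of two of the sentinels above; real code would compare against defuddle.ErrNotHTML and defuddle.ErrNoContent instead. The parse and classify functions are hypothetical, as is the choice of action for each error.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for two of the sentinel errors listed above, so the
// sketch compiles without the library.
var (
	ErrNotHTML   = errors.New("defuddle: content is not HTML")
	ErrNoContent = errors.New("defuddle: no content extracted")
)

// parse is a hypothetical function that wraps a sentinel with extra context.
func parse(contentType string) error {
	if contentType != "text/html" {
		return fmt.Errorf("parse %q: %w", contentType, ErrNotHTML)
	}
	return nil
}

// classify maps an error to a caller-side action. errors.Is walks the
// wrap chain, so it matches both bare and wrapped sentinels.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrNotHTML):
		return "skip"
	case errors.Is(err, ErrNoContent):
		return "retry"
	default:
		return "fail"
	}
}

func main() {
	fmt.Println(classify(parse("application/pdf")))
	fmt.Println(classify(parse("text/html")))
	fmt.Println(classify(ErrNoContent))
}
```

Comparing with errors.Is rather than == keeps the branch working if an error is wrapped with additional context on its way to the caller.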

var Version = "dev"

Version is the library version, set at build time via -ldflags.

Functions ¶

func BoolDefault ¶

func BoolDefault(b *bool, defaultVal bool) bool

BoolDefault returns the value pointed to by b, or defaultVal if b is nil.

func PtrBool ¶

func PtrBool(v bool) *bool

PtrBool returns a pointer to the given bool value. Use this to explicitly set *bool fields in Options (e.g., PtrBool(false) to disable defaults).
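
The two helpers together give each *bool option three states: unset (nil), explicitly true, and explicitly false. The sketch below re-implements them from the descriptions above rather than copying the library source, and the cut-down options struct is hypothetical.

```go
package main

import "fmt"

// PtrBool returns a pointer to a copy of v.
func PtrBool(v bool) *bool { return &v }

// BoolDefault returns *b, or defaultVal when b is nil.
func BoolDefault(b *bool, defaultVal bool) bool {
	if b == nil {
		return defaultVal
	}
	return *b
}

// options is a hypothetical struct with a single tri-state field.
type options struct {
	RemoveHiddenElements *bool
}

func main() {
	unset := options{}
	disabled := options{RemoveHiddenElements: PtrBool(false)}

	// A nil field falls back to the default; an explicit false overrides it.
	fmt.Println(BoolDefault(unset.RemoveHiddenElements, true))
	fmt.Println(BoolDefault(disabled.RemoveHiddenElements, true))
}
```

A plain bool cannot distinguish "not set" from "set to false", which is why options that default to true use a pointer.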

Types ¶

type Defuddle ¶

type Defuddle struct {
	// contains filtered or unexported fields
}

Defuddle represents a document parser instance.

func NewDefuddle ¶

func NewDefuddle(html string, options *Options) (*Defuddle, error)

NewDefuddle creates a new Defuddle instance from HTML content. JavaScript original code:

constructor(document: Document, options: DefuddleOptions = {}) {
  this.doc = document;
  this.options = options;
}

func (*Defuddle) Parse ¶

func (d *Defuddle) Parse(ctx context.Context) (*Result, error)

Parse parses the document and returns the extracted content.

type ExtractedContent ¶

type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}

ExtractedContent represents content extracted by site-specific extractors. JavaScript original code:

export interface ExtractedContent {
  title?: string;
  author?: string;
  published?: string;
  content?: string;
  contentHtml?: string;
  variables?: ExtractorVariables;
}

type ExtractorVariables ¶

type ExtractorVariables map[string]string

ExtractorVariables represents variables extracted by site-specific extractors. JavaScript original code:

export interface ExtractorVariables {
  [key: string]: string;
}

type MetaTag ¶

type MetaTag = metadata.MetaTag

MetaTag represents a meta tag item from HTML. This is an alias to the internal metadata.MetaTag type.

type Metadata ¶

type Metadata = metadata.Metadata

Metadata represents extracted metadata from a document. This is an alias to the internal metadata.Metadata type.

type Options ¶

type Options struct {
	// Enable debug logging
	Debug bool `json:"debug,omitempty"`

	// URL of the page being parsed
	URL string `json:"url,omitempty"`

	// Convert output to Markdown
	Markdown bool `json:"markdown,omitempty"`

	// Include Markdown in the response
	SeparateMarkdown bool `json:"separateMarkdown,omitempty"`

	// Whether to remove elements matching exact selectors like ads, social buttons, etc.
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveExactSelectors *bool `json:"removeExactSelectors,omitempty"`

	// Whether to remove elements matching partial selectors like ads, social buttons, etc.
	// nil = true (default). Use PtrBool(false) to disable.
	RemovePartialSelectors *bool `json:"removePartialSelectors,omitempty"`

	// Remove images from the extracted content
	// Defaults to false.
	RemoveImages bool `json:"removeImages,omitempty"`

	// Whether to remove hidden elements (display:none, Tailwind hidden classes).
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveHiddenElements *bool `json:"removeHiddenElements,omitempty"`

	// Whether to remove low-scoring non-content blocks.
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveLowScoring *bool `json:"removeLowScoring,omitempty"`

	// Whether to remove content patterns (boilerplate, breadcrumbs, etc.).
	// nil = true (default). Use PtrBool(false) to disable.
	RemoveContentPatterns *bool `json:"removeContentPatterns,omitempty"`

	// CSS selector to use for content extraction instead of auto-detection.
	ContentSelector string `json:"contentSelector,omitempty"`

	// Element processing options
	ProcessCode      bool                                 `json:"processCode,omitempty"`
	ProcessImages    bool                                 `json:"processImages,omitempty"`
	ProcessHeadings  bool                                 `json:"processHeadings,omitempty"`
	ProcessMath      bool                                 `json:"processMath,omitempty"`
	ProcessFootnotes bool                                 `json:"processFootnotes,omitempty"`
	ProcessRoles     bool                                 `json:"processRoles,omitempty"`
	CodeOptions      *elements.CodeBlockProcessingOptions `json:"codeOptions,omitempty"`
	ImageOptions     *elements.ImageProcessingOptions     `json:"imageOptions,omitempty"`
	HeadingOptions   *elements.HeadingProcessingOptions   `json:"headingOptions,omitempty"`
	MathOptions      *elements.MathProcessingOptions      `json:"mathOptions,omitempty"`
	FootnoteOptions  *elements.FootnoteProcessingOptions  `json:"footnoteOptions,omitempty"`
	RoleOptions      *elements.RoleProcessingOptions      `json:"roleOptions,omitempty"`

	// Client is a custom HTTP client for fetching URLs.
	// If nil, a default client with standard User-Agent and 30s timeout is created.
	Client *requests.Client `json:"-"`

	// MaxConcurrency limits parallel URL fetches in ParseFromURLs.
	// Defaults to 5 if zero.
	MaxConcurrency int `json:"maxConcurrency,omitempty"`
}

Options represents configuration options for Defuddle parsing. JavaScript original code:

export interface DefuddleOptions {
  debug?: boolean;
  url?: string;
  markdown?: boolean;
  separateMarkdown?: boolean;
  removeExactSelectors?: boolean;
  removePartialSelectors?: boolean;
}

type Result ¶

type Result struct {
	Metadata
	Content         string            `json:"content"`
	ContentMarkdown *string           `json:"contentMarkdown,omitempty"`
	ExtractorType   *string           `json:"extractorType,omitempty"`
	Variables       map[string]string `json:"variables,omitempty"`
	MetaTags        []MetaTag         `json:"metaTags,omitempty"`
	DebugInfo       *debug.Info       `json:"debugInfo,omitempty"`
}

Result represents the complete response from Defuddle parsing. JavaScript original code:

export interface DefuddleResponse extends DefuddleMetadata {
  content: string;
  contentMarkdown?: string;
  extractorType?: string;
  metaTags?: MetaTagItem[];
}
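
The JSON tags on Result make it possible to decode a serialized result in another program. The sketch below mirrors only the fields whose tags are shown above (the embedded Metadata tags are not listed here, so they are left out) and decodes a made-up sample payload. Whether the CLI's --json output uses exactly these keys is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result mirrors a subset of Result using the JSON tags shown above.
type result struct {
	Content         string            `json:"content"`
	ContentMarkdown *string           `json:"contentMarkdown,omitempty"`
	ExtractorType   *string           `json:"extractorType,omitempty"`
	Variables       map[string]string `json:"variables,omitempty"`
}

// decode unmarshals a serialized result. Unknown keys are ignored.
func decode(data []byte) (result, error) {
	var r result
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Illustrative payload, not real output.
	sample := []byte(`{"content":"<p>Hello</p>","variables":{"site":"My Site"}}`)
	r, err := decode(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(r.Content)
	fmt.Println(r.Variables["site"])
	// Optional fields stay nil when the key is absent.
	fmt.Println(r.ContentMarkdown == nil)
}
```

Pointer fields such as ContentMarkdown should be checked for nil before dereferencing, since omitempty drops them from the payload when unset.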

func ParseFromString ¶

func ParseFromString(ctx context.Context, html string, options *Options) (*Result, error)

ParseFromString parses HTML content directly from a string. This is useful when you already have the HTML content (e.g., from browser automation).

func ParseFromURL ¶

func ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)

ParseFromURL fetches content from a URL and parses it. This corresponds to Node.js usage: Defuddle(htmlOrDom, url?, options?)

type URLResult ¶

type URLResult struct {
	URL    string
	Result *Result
	Err    error
}

URLResult pairs a URL with its extraction result or error.

func ParseFromURLs ¶

func ParseFromURLs(ctx context.Context, urls []string, options *Options) []URLResult

ParseFromURLs fetches and parses multiple URLs concurrently. MaxConcurrency in options controls parallelism (default 5).
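
Bounded parallelism of this kind is commonly built from a buffered channel used as a semaphore. The sketch below shows that pattern with the standard library only; it is not the library's implementation. The urlResult type, fetchAll, and fakeFetch are hypothetical stand-ins, returning results in input order is an assumption, and context cancellation is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// urlResult is a local stand-in for URLResult with a string payload.
type urlResult struct {
	URL    string
	Result string
	Err    error
}

// fetchAll calls fetch for every URL with at most maxConcurrency calls in
// flight, and returns the results in the same order as urls.
func fetchAll(urls []string, maxConcurrency int, fetch func(string) (string, error)) []urlResult {
	if maxConcurrency <= 0 {
		maxConcurrency = 5 // mirrors the documented default
	}
	results := make([]urlResult, len(urls))
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrency calls are running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			res, err := fetch(u)
			// Each goroutine writes to its own index, so no lock is needed.
			results[i] = urlResult{URL: u, Result: res, Err: err}
		}(i, u)
	}
	wg.Wait()
	return results
}

// fakeFetch simulates a fetch without touching the network.
func fakeFetch(u string) (string, error) {
	if u == "" {
		return "", errors.New("empty URL")
	}
	return "body of " + u, nil
}

func main() {
	for _, r := range fetchAll([]string{"a", "", "c"}, 2, fakeFetch) {
		fmt.Printf("%q result=%q err=%v\n", r.URL, r.Result, r.Err)
	}
}
```

Pairing each URL with its own error, as URLResult does, lets one failed fetch be reported without discarding the successful ones.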

Directories ¶

Path	Synopsis
cmd/defuddle (command)	Package main provides the defuddle CLI application.
examples/advanced (command)	Package main demonstrates advanced defuddle usage.
examples/basic (command)	Package main demonstrates basic defuddle usage.
examples/custom_extractor (command)	Package main demonstrates custom extractor usage.
examples/extractors (command)	Package main demonstrates extractors usage.
examples/markdown (command)	Package main demonstrates markdown conversion.
extractors	Package extractors provides site-specific content extraction functionality.
internal/constants	Package constants provides configuration constants and selectors for the defuddle content extraction system.
internal/debug	Package debug provides debugging functionality for the defuddle content extraction system.
internal/elements	Package elements provides enhanced element processing functionality. This module handles code block processing including syntax highlighting, language detection, and code formatting.
internal/markdown	Package markdown provides HTML to Markdown conversion functionality.
internal/metadata	Package metadata provides functionality for extracting and processing document metadata.
internal/removals	Package removals provides content-pattern-based removal for the defuddle extraction pipeline.
internal/scoring	Package scoring provides content scoring functionality for the defuddle content extraction system.
internal/standardize	Package standardize provides content standardization functionality for the defuddle content extraction system.
internal/text	Package text provides text analysis utilities for content extraction.
internal/urlutil	Package urlutil provides URL resolution and sanitization for extracted content.