htmldistill

command module
v0.0.0-...-47096ee Latest
Warning

This package is not in the latest version of its module.

Published: Jun 10, 2025 | License: MIT | Imports: 10 | Imported by: 0

Documentation

Overview

htmldistill is a command-line tool that extracts and distills the main content from HTML documents. It processes input from URLs, files, or standard input, removing clutter such as navigation, ads, and other non-essential elements.

Usage:

htmldistill <url1> [url2] [url3] ...

htmldistill accepts one or more URLs as arguments. For each URL, it fetches the content, processes it using the go-domdistiller library, and outputs the extracted main content as HTML to stdout.

The tool can also process local files, and it reads from stdin when '-' is passed as an argument. When reading from stdin, an optional base URL can be provided to resolve relative links.
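The input handling described above can be illustrated with a small standard-library sketch. It is not htmldistill's actual source: the names inputKind, classifyArg, and resolveLink are hypothetical, and the rule used to tell a URL from a file path (an http:// or https:// prefix) is an assumption. The sketch shows one plausible way to sort arguments into stdin, URL, and file inputs, and how a base URL lets a relative link be turned into an absolute one.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// inputKind is a hypothetical label for where an argument's content comes from.
type inputKind string

const (
	kindStdin inputKind = "stdin"
	kindURL   inputKind = "url"
	kindFile  inputKind = "file"
)

// classifyArg guesses the input kind of one command-line argument.
// Assumption: "-" means stdin, an http(s) prefix means a URL,
// and anything else is treated as a local file path.
func classifyArg(arg string) inputKind {
	switch {
	case arg == "-":
		return kindStdin
	case strings.HasPrefix(arg, "http://"), strings.HasPrefix(arg, "https://"):
		return kindURL
	default:
		return kindFile
	}
}

// resolveLink resolves href against base following the usual
// relative-reference rules implemented by net/url.
func resolveLink(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	for _, arg := range []string{"-", "https://example.com/post", "page.html"} {
		fmt.Printf("%q -> %s\n", arg, classifyArg(arg))
	}

	// With HTML read from stdin there is no document URL, so a
	// caller-supplied base URL is what makes relative links resolvable.
	abs, err := resolveLink("https://example.com/blog/post.html", "../img/fig.png")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs)
}
```

Without a base URL, a link such as ../img/fig.png in piped-in HTML has nothing to resolve against, which is presumably why the option exists for stdin input.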

htmldistill is useful for cleaning up web content for further processing, improving readability, or preparing data for natural language processing tasks.
