Documentation
Overview
htmldistill is a command-line tool that extracts and distills the main content from HTML documents. It processes input from URLs, files, or standard input, removing clutter such as navigation, ads, and other non-essential elements.
Usage:
htmldistill <url1> [url2] [url3] ...
htmldistill accepts one or more URLs as arguments. For each URL, it fetches the content, processes it using the go-domdistiller library, and outputs the extracted main content as HTML to stdout.
The tool can also process local files, or read from stdin when '-' is given as an argument. When reading from stdin, an optional base URL can be provided to resolve relative links.
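To illustrate why a base URL matters for stdin input, here is a minimal sketch of relative link resolution using only Go's standard net/url package. It is not taken from htmldistill or go-domdistiller: the helper name resolveLink and the example.com addresses are assumptions made for illustration, and the tool may resolve links differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper (not part of htmldistill) that turns
// a possibly relative link into an absolute one, using base as the address
// of the document the link was found in.
func resolveLink(base, ref string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	refURL, err := url.Parse(ref)
	if err != nil {
		return "", fmt.Errorf("parse link: %w", err)
	}
	// ResolveReference applies the RFC 3986 reference resolution rules.
	return baseURL.ResolveReference(refURL).String(), nil
}

func main() {
	// Assumed base URL, standing in for the one supplied alongside stdin input.
	base := "https://example.com/articles/post.html"

	for _, ref := range []string{"../images/a.png", "/about", "https://other.example/x"} {
		abs, err := resolveLink(base, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ref, "->", abs)
	}
}
```

Without a base URL, a link such as "../images/a.png" read from stdin has no document location to resolve against, so it can only be passed through unchanged.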
htmldistill is useful for cleaning up web content for further processing, improving readability, or preparing data for natural language processing tasks.