html

package

v0.176.0 Published: Jun 30, 2026 License: AGPL-3.0 Imports: 13 Imported by: 0

Repository

github.com/immanent-tech/foragd

Documentation ¶

Index ¶

Variables
func CleanRedditHTML(content string) string
func DiscoverFeedURL(sourceURL *url.URL, content []byte) (string, error)
func ExtractImageFromHTML(content string) (string, string, error)
func FindAllHTMLNodes(n *html.Node, tag string) []*html.Node
func FindHTMLNode(n *html.Node, tag string) *html.Node
func FindMainImage(page []byte, rawURL string) (string, error)
func IsHTML(s string) bool
func IsHTMLElement(str, tag string) bool
func SanitizeHTMLString(rawStr string) (string, error)
func ToPlainText(s string) string
type Favicon
- func FindFavicon(page []byte, pageURL string) ([]byte, string, Favicon, error)

Constants ¶

This section is empty.

Variables ¶

var (
	ErrNotFound  = errors.New("not found")
	ErrParseURL  = errors.New("could not parse URL")
	ErrParseHTML = errors.New("could not parse HTML")
)
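
Sentinel errors like these are normally matched with errors.Is rather than by comparing message strings. The sketch below uses local stand-ins for two of the variables, and a hypothetical lookup helper that wraps one of them. Whether the package itself wraps these errors with extra context is an assumption; errors.Is works either way.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins that mirror two of the package's exported sentinel errors.
var (
	ErrNotFound = errors.New("not found")
	ErrParseURL = errors.New("could not parse URL")
)

// lookup is a hypothetical helper that wraps a sentinel error with context.
func lookup(key string) error {
	return fmt.Errorf("lookup %q: %w", key, ErrNotFound)
}

// classify maps an error to a short label by walking its wrap chain.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrNotFound):
		return "not found"
	case errors.Is(err, ErrParseURL):
		return "bad url"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(lookup("favicon")))
}
```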

Functions ¶

func CleanRedditHTML ¶ added in v0.160.0

func CleanRedditHTML(content string) string

CleanRedditHTML removes the table-based wrapper markup that some Reddit posts are contained within.

func DiscoverFeedURL ¶ added in v0.159.0

func DiscoverFeedURL(sourceURL *url.URL, content []byte) (string, error)

DiscoverFeedURL attempts to find a feed URL within an HTML page.

It checks two conventional locations for the feed URL. First, following RSS autodiscovery, it looks for a link element with rel="alternate" and type="application/rss+xml". Second, it looks for a link element whose URL ends with feed, rss or atom, which suggests a feed URL.
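
The two heuristics can be sketched with the standard library alone. This is an illustration, not the package's implementation: discoverFeed, looksLikeFeed and the regular expressions are hypothetical names, the sketch takes the base URL as a string, it only inspects `<link>` tags with double-quoted attributes, and a real implementation would more likely walk a parsed HTML tree.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

var errNoFeed = errors.New("no feed link found")

var (
	linkTag  = regexp.MustCompile(`(?is)<link\b[^>]*>`)
	attrPair = regexp.MustCompile(`(?is)([a-z-]+)\s*=\s*"([^"]*)"`)
)

// attrs returns the double-quoted attributes of a single tag, keyed by lower-case name.
func attrs(tag string) map[string]string {
	out := map[string]string{}
	for _, m := range attrPair.FindAllStringSubmatch(tag, -1) {
		out[strings.ToLower(m[1])] = m[2]
	}
	return out
}

// looksLikeFeed reports whether the URL path ends with feed, rss or atom.
func looksLikeFeed(href string) bool {
	u, err := url.Parse(href)
	if err != nil {
		return false
	}
	p := strings.ToLower(strings.TrimRight(u.Path, "/"))
	return strings.HasSuffix(p, "feed") || strings.HasSuffix(p, "rss") || strings.HasSuffix(p, "atom")
}

// resolve turns a possibly relative href into an absolute URL.
func resolve(base *url.URL, href string) (string, error) {
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// discoverFeed prefers an autodiscovery link and falls back to a feed-like URL.
func discoverFeed(baseURL, page string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	fallback := ""
	for _, tag := range linkTag.FindAllString(page, -1) {
		a := attrs(tag)
		href := a["href"]
		if href == "" {
			continue
		}
		// Heuristic 1: rel="alternate" with the RSS media type.
		if strings.EqualFold(a["rel"], "alternate") && strings.EqualFold(a["type"], "application/rss+xml") {
			return resolve(base, href)
		}
		// Heuristic 2: remember the first URL that ends with feed, rss or atom.
		if fallback == "" && looksLikeFeed(href) {
			fallback = href
		}
	}
	if fallback == "" {
		return "", errNoFeed
	}
	return resolve(base, fallback)
}

func main() {
	page := `<head><link rel="alternate" type="application/rss+xml" href="/index.xml"></head>`
	fmt.Println(discoverFeed("https://example.com/blog/", page))
}
```

Resolving the href against the page URL matters because feed links are often relative.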

func ExtractImageFromHTML ¶ added in v0.160.0

func ExtractImageFromHTML(content string) (string, string, error)

func FindAllHTMLNodes ¶

func FindAllHTMLNodes(n *html.Node, tag string) []*html.Node

FindAllHTMLNodes returns all nodes matching the tag within n.

func FindHTMLNode ¶

func FindHTMLNode(n *html.Node, tag string) *html.Node

FindHTMLNode does a depth-first search for the first node matching the tag.
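
The traversal behind FindHTMLNode and FindAllHTMLNodes can be illustrated as a pre-order depth-first search. The sketch below uses a hypothetical stand-in node type instead of *html.Node so that it runs with the standard library only. Whether the package's functions also test the starting node itself is an assumption.

```go
package main

import "fmt"

// node is a simplified stand-in for an HTML element node (hypothetical type).
type node struct {
	Tag      string
	ID       string
	Children []*node
}

// findFirst returns the first node with the given tag in pre-order, or nil.
func findFirst(n *node, tag string) *node {
	if n == nil {
		return nil
	}
	if n.Tag == tag {
		return n
	}
	for _, c := range n.Children {
		if found := findFirst(c, tag); found != nil {
			return found
		}
	}
	return nil
}

// findAll returns every node with the given tag, in pre-order.
func findAll(n *node, tag string) []*node {
	if n == nil {
		return nil
	}
	var out []*node
	if n.Tag == tag {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, findAll(c, tag)...)
	}
	return out
}

// sampleTree builds: body > (div > img#a), img#b
func sampleTree() *node {
	return &node{Tag: "body", Children: []*node{
		{Tag: "div", Children: []*node{{Tag: "img", ID: "a"}}},
		{Tag: "img", ID: "b"},
	}}
}

func main() {
	tree := sampleTree()
	fmt.Println(findFirst(tree, "img").ID, len(findAll(tree, "img")))
}
```

Because the search descends before moving to siblings, the nested img#a is found before its shallower sibling img#b.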

func FindMainImage ¶

func FindMainImage(page []byte, rawURL string) (string, error)

FindMainImage tries to find a "main" image for the page, using the readability parser.

func IsHTML ¶

func IsHTML(s string) bool

func IsHTMLElement ¶

func IsHTMLElement(str, tag string) bool

IsHTMLElement reports whether the given string is an HTML element with the given tag.

func SanitizeHTMLString ¶ added in v0.83.0

func SanitizeHTMLString(rawStr string) (string, error)

SanitizeHTMLString parses and re-renders the given string containing HTML. The intent is that the output is sanitized, well-formed HTML.

func ToPlainText ¶ added in v0.155.0

func ToPlainText(s string) string

ToPlainText converts an HTML-encoded string to plain text.
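
One rough way to approximate this kind of conversion with the standard library is to strip tags, decode entities, and collapse whitespace. The toPlain function and tagPattern below are hypothetical names, and the package may well do this differently, for example by walking a parsed HTML tree.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like an HTML tag.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// toPlain strips tags, decodes entities, and collapses runs of whitespace.
func toPlain(s string) string {
	s = tagPattern.ReplaceAllString(s, " ") // tags become word separators
	s = html.UnescapeString(s)              // &amp; becomes &, and so on
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(toPlain("<p>Fish &amp; <b>chips</b></p>"))
}
```

Replacing tags with a space keeps adjacent block elements from running together, at the cost of splitting words that contain inline tags.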

Types ¶

type Favicon ¶

type Favicon struct {
	// contains filtered or unexported fields
}

Favicon is a favicon link found in <head>.

func FindFavicon ¶

func FindFavicon(
	page []byte,
	pageURL string,
) ([]byte, string, Favicon, error)

FindFavicon tries each candidate in order and returns the first one that responds with a 2xx status and a non-empty body.
