htmlutil

package
v0.9.22
Published: Dec 19, 2025 | License: Apache-2.0 | Imports: 4 | Imported by: 0

Documentation

Overview

Package htmlutil provides HTML processing utilities for social media scraping.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContactLinks

func ContactLinks(htmlContent, baseURL string) []string

ContactLinks extracts contact/about page URLs from HTML content. These pages often contain additional social media links.
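
The following is a minimal sketch of how contact/about links could be found and resolved against a base URL, using only the standard library. It is not the package's implementation: the function name contactLinksSketch, the keyword list ("contact", "about"), and the restriction to double-quoted href attributes are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// hrefRe captures double-quoted href values only (an assumption of this
// sketch); single-quoted or unquoted attributes are ignored.
var hrefRe = regexp.MustCompile(`(?i)href="([^"]+)"`)

// contactLinksSketch is a hypothetical stand-in for ContactLinks.
func contactLinksSketch(htmlContent, baseURL string) []string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil
	}
	var out []string
	for _, m := range hrefRe.FindAllStringSubmatch(htmlContent, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		path := strings.ToLower(ref.Path)
		// Assumed keywords; the real package may use a different list.
		if strings.Contains(path, "contact") || strings.Contains(path, "about") {
			// ResolveReference turns relative links into absolute URLs.
			out = append(out, base.ResolveReference(ref).String())
		}
	}
	return out
}

func main() {
	page := `<a href="/about">About</a> <a href="contact.html">Contact</a> <a href="/blog">Blog</a>`
	for _, link := range contactLinksSketch(page, "https://example.com/team/") {
		fmt.Println(link)
	}
	// Prints:
	// https://example.com/about
	// https://example.com/team/contact.html
}
```

Resolving against the base URL matters because contact pages are usually linked with relative paths.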

func DecodeHTMLEntities added in v0.9.22

func DecodeHTMLEntities(s string) string

DecodeHTMLEntities decodes HTML entities in a string.
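
For comparison, entity decoding with the standard library looks roughly like this. Whether DecodeHTMLEntities wraps html.UnescapeString or uses its own table is not stated in the documentation, so decodeEntitiesSketch is a hypothetical equivalent only.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntitiesSketch is a hypothetical stand-in for DecodeHTMLEntities,
// built on the standard library's html.UnescapeString.
func decodeEntitiesSketch(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(decodeEntitiesSketch("Tom &amp; Jerry &lt;3 &#39;Go&#39;"))
	// Prints: Tom & Jerry <3 'Go'
}
```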

func Description

func Description(htmlContent string) string

Description extracts the meta description from HTML content.

func EmailAddresses

func EmailAddresses(htmlContent string) []string

EmailAddresses extracts email addresses from HTML content. It filters out common false positives such as noreply@ and example@ addresses.
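
A rough sketch of this kind of extraction with filtering is shown below. The regular expression, the list of ignored local parts, and the name emailsSketch are assumptions for illustration; the package's actual pattern and filter list are not documented here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is a deliberately simple pattern, not a full RFC 5322 matcher.
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// ignoredLocalParts is an assumed list of false-positive local parts.
var ignoredLocalParts = map[string]bool{"noreply": true, "no-reply": true, "example": true}

// emailsSketch is a hypothetical stand-in for EmailAddresses.
func emailsSketch(htmlContent string) []string {
	seen := map[string]bool{}
	var out []string
	for _, addr := range emailRe.FindAllString(htmlContent, -1) {
		addr = strings.ToLower(addr)
		local, _, _ := strings.Cut(addr, "@")
		if ignoredLocalParts[local] || seen[addr] {
			continue // skip false positives and duplicates
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	page := `Contact <a href="mailto:jane@acme.io">jane@acme.io</a> or noreply@acme.io`
	fmt.Println(emailsSketch(page))
	// Prints: [jane@acme.io]
}
```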

func ExtractDiscordUsername added in v0.9.1

func ExtractDiscordUsername(s string) string

ExtractDiscordUsername extracts Discord usernames from text content. Returns the username in the format "username#1234" for old format or "username" for new format. Returns empty string if no Discord username is found. Requires "discord" to be mentioned in the content to avoid false positives.

func ExtractEmailFromURL

func ExtractEmailFromURL(urlStr string) (string, bool)

ExtractEmailFromURL extracts an email address from URLs like "http://user@gmail.com". Only recognizes emails at known email providers to avoid confusing HTTP basic auth URLs (like https://user@domain.com) with misformatted email addresses. Returns the email address and true if found, empty string and false otherwise.
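
The provider check described above can be sketched with net/url as follows. The provider list, the rejection of URLs that carry a password, and the name emailFromURLSketch are assumptions; the package's real list of known providers is not given in this documentation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownProviders is an assumed, illustrative list of email hosts.
var knownProviders = map[string]bool{
	"gmail.com":   true,
	"outlook.com": true,
	"yahoo.com":   true,
}

// emailFromURLSketch is a hypothetical stand-in for ExtractEmailFromURL.
func emailFromURLSketch(urlStr string) (string, bool) {
	u, err := url.Parse(urlStr)
	if err != nil || u.User == nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	// A password strongly suggests real HTTP basic auth, not an email.
	if _, hasPassword := u.User.Password(); hasPassword {
		return "", false
	}
	host := strings.ToLower(u.Hostname())
	user := u.User.Username()
	if user == "" || !knownProviders[host] {
		return "", false
	}
	return user + "@" + host, true
}

func main() {
	fmt.Println(emailFromURLSketch("http://user@gmail.com"))   // user@gmail.com true
	fmt.Println(emailFromURLSketch("https://user@domain.com")) // (empty string) false
}
```

Restricting matches to known providers trades recall for precision: addresses at custom domains are missed, but basic-auth URLs are not misread as emails.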

func ExtractJSONLD added in v0.9.22

func ExtractJSONLD(htmlContent string) string

ExtractJSONLD extracts JSON-LD structured data from HTML as a JSON string.
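
As an illustration, JSON-LD can be pulled out of a page with a regular expression and checked with encoding/json. This sketch assumes a double-quoted type attribute, returns only the first block, and uses the hypothetical name jsonLDSketch; how the package handles multiple blocks or invalid JSON is not documented here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// jsonLDRe matches the body of the first JSON-LD script element.
var jsonLDRe = regexp.MustCompile(`(?is)<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>`)

// jsonLDSketch is a hypothetical stand-in for ExtractJSONLD.
func jsonLDSketch(htmlContent string) string {
	m := jsonLDRe.FindStringSubmatch(htmlContent)
	if m == nil {
		return ""
	}
	body := strings.TrimSpace(m[1])
	// Assumption: invalid JSON is treated the same as no JSON-LD at all.
	if !json.Valid([]byte(body)) {
		return ""
	}
	return body
}

func main() {
	page := `<script type="application/ld+json"> {"@type":"Person","name":"Alice"} </script>`
	fmt.Println(jsonLDSketch(page))
	// Prints: {"@type":"Person","name":"Alice"}
}
```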

func ExtractMetaTag added in v0.9.22

func ExtractMetaTag(htmlContent, nameOrProperty string) string

ExtractMetaTag extracts a meta tag value by name or property.

func ExtractRedirectURL added in v0.7.9

func ExtractRedirectURL(htmlContent string) string

ExtractRedirectURL checks HTML content for meta refresh or JavaScript redirects. Returns the redirect URL if found, empty string otherwise.
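
The two redirect styles mentioned above might be detected along these lines. Both patterns and the name redirectURLSketch are assumptions; real pages use many more JavaScript redirect forms than the single assignment pattern shown.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// metaRefreshRe matches <meta http-equiv="refresh" content="N; url=...">.
	metaRefreshRe = regexp.MustCompile(`(?i)<meta[^>]+http-equiv=["']?refresh["']?[^>]+content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)`)
	// jsRedirectRe matches assignments such as window.location.href = "...".
	jsRedirectRe = regexp.MustCompile(`(?i)(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']`)
)

// redirectURLSketch is a hypothetical stand-in for ExtractRedirectURL.
// It prefers a meta refresh over a JavaScript redirect.
func redirectURLSketch(htmlContent string) string {
	if m := metaRefreshRe.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	if m := jsRedirectRe.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(redirectURLSketch(`<meta http-equiv="refresh" content="0; url=https://example.com/new">`))
	// Prints: https://example.com/new
	fmt.Println(redirectURLSketch(`<script>window.location.href = "https://example.com/js";</script>`))
	// Prints: https://example.com/js
}
```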

func IsEmailURL

func IsEmailURL(urlStr string) bool

IsEmailURL returns true if the URL is a mailto: link or an email address with an http:// or https:// prefix.

func IsGenericBio added in v0.9.22

func IsGenericBio(bio string) bool

IsGenericBio returns true if the bio looks like a generic site description rather than a specific user bio.

func IsGenericTitle added in v0.9.22

func IsGenericTitle(title string) bool

IsGenericTitle returns true if the title looks like a generic site title rather than a specific user profile title.

func IsNotFound added in v0.9.22

func IsNotFound(text string) bool

IsNotFound detects common "404 Not Found" or "Page not found" patterns in HTML content.

func OGImage added in v0.9.13

func OGImage(htmlContent string) string

OGImage extracts an image URL from HTML meta tags. Priority: og:image > twitter:image > banner/hero image in srcset.

func OGTag added in v0.9.13

func OGTag(htmlContent, property string) string

OGTag extracts an Open Graph meta tag value by property name. Handles both property="og:xxx" content="value" and content="value" property="og:xxx" orders.
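
Handling both attribute orders can be sketched with two patterns tried in turn. This is an illustration under assumptions, not the package's code: it only handles double-quoted attributes, and ogTagSketch is a hypothetical name.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// ogTagSketch is a hypothetical stand-in for OGTag. It assumes
// double-quoted attribute values.
func ogTagSketch(htmlContent, property string) string {
	p := regexp.QuoteMeta(property)
	patterns := []string{
		// property first, then content
		`(?i)<meta[^>]+property="` + p + `"[^>]+content="([^"]*)"`,
		// content first, then property
		`(?i)<meta[^>]+content="([^"]*)"[^>]+property="` + p + `"`,
	}
	for _, pat := range patterns {
		re, err := regexp.Compile(pat)
		if err != nil {
			continue
		}
		if m := re.FindStringSubmatch(htmlContent); m != nil {
			return html.UnescapeString(m[1])
		}
	}
	return ""
}

func main() {
	fmt.Println(ogTagSketch(`<meta property="og:title" content="Tom &amp; Jerry">`, "og:title"))
	// Prints: Tom & Jerry
	fmt.Println(ogTagSketch(`<meta content="Reversed" property="og:title">`, "og:title"))
	// Prints: Reversed
}
```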

func OGTitle added in v0.9.22

func OGTitle(htmlContent string) string

OGTitle extracts the og:title from HTML content.

func PhoneNumbers added in v0.9.13

func PhoneNumbers(htmlContent string) []string

PhoneNumbers extracts phone numbers from HTML content. Supports various formats: (555) 123-4567, 555-123-4567, +1-555-123-4567, etc.
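
A single regular expression covering the three formats listed above could look like the sketch below. The pattern is an assumption limited to North American style numbers, and phoneNumbersSketch is a hypothetical name; the package may accept more formats and may normalize its results.

```go
package main

import (
	"fmt"
	"regexp"
)

// phoneRe matches an optional +1 prefix, an area code either in
// parentheses or followed by a separator, then 3 and 4 digit groups.
var phoneRe = regexp.MustCompile(`(?:\+1[-.\s]?)?(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}`)

// phoneNumbersSketch is a hypothetical stand-in for PhoneNumbers.
func phoneNumbersSketch(htmlContent string) []string {
	return phoneRe.FindAllString(htmlContent, -1)
}

func main() {
	text := "Call (555) 123-4567, 555-123-4567 or +1-555-123-4567."
	for _, n := range phoneNumbersSketch(text) {
		fmt.Println(n)
	}
	// Prints:
	// (555) 123-4567
	// 555-123-4567
	// +1-555-123-4567
}
```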

func RelMeLinks

func RelMeLinks(htmlContent string) []string

RelMeLinks extracts only links with rel="me" attribute from HTML content. These links indicate the page owner's profiles on other platforms. This is the preferred method for personal websites to avoid picking up links to collaborators, co-authors, or other people mentioned on the page.
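
To make the rel="me" filtering concrete, here is a small sketch that keeps only anchors whose rel attribute contains the token "me". It assumes double-quoted attributes and uses the hypothetical name relMeLinksSketch; a production implementation would more likely use a real HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchorRe = regexp.MustCompile(`(?i)<a\s[^>]*>`)
	relRe    = regexp.MustCompile(`(?i)\srel="([^"]*)"`)
	hrefRe   = regexp.MustCompile(`(?i)\shref="([^"]*)"`)
)

// relMeLinksSketch is a hypothetical stand-in for RelMeLinks.
func relMeLinksSketch(htmlContent string) []string {
	var out []string
	for _, tag := range anchorRe.FindAllString(htmlContent, -1) {
		rel := relRe.FindStringSubmatch(tag)
		href := hrefRe.FindStringSubmatch(tag)
		if rel == nil || href == nil {
			continue
		}
		// rel is a space-separated token list, e.g. "me noopener".
		for _, tok := range strings.Fields(strings.ToLower(rel[1])) {
			if tok == "me" {
				out = append(out, href[1])
				break
			}
		}
	}
	return out
}

func main() {
	page := `<a rel="me" href="https://github.com/alice">GitHub</a>
<a href="https://github.com/bob">Bob, a collaborator</a>
<a href="https://mastodon.example/@alice" rel="me noopener">Mastodon</a>`
	fmt.Println(relMeLinksSketch(page))
	// Prints: [https://github.com/alice https://mastodon.example/@alice]
}
```

The link to the collaborator is skipped because its anchor carries no rel="me" token.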

func SocialLinks

func SocialLinks(htmlContent string) []string

SocialLinks extracts social media URLs from HTML content. WARNING: This extracts ALL social media URLs, including links to other people. For personal websites, use RelMeLinks instead to get only the page owner's profiles.

func StripHTML added in v0.9.22

func StripHTML(htmlContent string) string

StripHTML is an alias for StripTags for backward compatibility.

func StripTags added in v0.9.2

func StripTags(htmlContent string) string

StripTags removes HTML tags and returns plain text.
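
A simple approximation of tag stripping is shown below. It is a sketch under assumptions: tags are replaced by spaces, entities are decoded, and whitespace is collapsed. Whether StripTags does any of these beyond removing tags is not documented, and stripTagsSketch is a hypothetical name.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagRe matches anything between angle brackets, including across newlines.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTagsSketch is a hypothetical stand-in for StripTags.
func stripTagsSketch(htmlContent string) string {
	// Replace tags with a space so adjacent blocks do not run together.
	text := tagRe.ReplaceAllString(htmlContent, " ")
	text = html.UnescapeString(text)
	// Collapse runs of whitespace into single spaces.
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Println(stripTagsSketch("<p>Hello</p><p>Tom &amp; Jerry</p>"))
	// Prints: Hello Tom & Jerry
}
```

A regular expression does not skip the contents of script or style elements; an HTML tokenizer is the safer choice when that matters.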

func Title

func Title(htmlContent string) string

Title extracts the title from HTML content.
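
Title extraction can be approximated with a single lazy match on the title element. The sketch below assumes the first title element is the one wanted and trims and decodes the result; titleSketch is a hypothetical name, and the package may fall back to other sources that this sketch ignores.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRe lazily matches the body of the first <title> element.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// titleSketch is a hypothetical stand-in for Title.
func titleSketch(htmlContent string) string {
	m := titleRe.FindStringSubmatch(htmlContent)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(html.UnescapeString(m[1]))
}

func main() {
	fmt.Println(titleSketch("<html><head><title> My Page </title></head></html>"))
	// Prints: My Page
}
```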

Types

This section is empty.
