Documentation
Overview ¶
Package htmlutil provides HTML processing utilities for social media scraping.
Index ¶
- func ContactLinks(htmlContent, baseURL string) []string
- func Description(htmlContent string) string
- func EmailAddresses(htmlContent string) []string
- func ExtractDiscordUsername(s string) string
- func ExtractEmailFromURL(urlStr string) (string, bool)
- func ExtractRedirectURL(htmlContent string) string
- func IsEmailURL(urlStr string) bool
- func OGImage(htmlContent string) string
- func OGTag(htmlContent, property string) string
- func PhoneNumbers(htmlContent string) []string
- func RelMeLinks(htmlContent string) []string
- func SocialLinks(htmlContent string) []string
- func StripTags(htmlContent string) string
- func Title(htmlContent string) string
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ContactLinks ¶
ContactLinks extracts contact/about page URLs from HTML content. These pages often contain additional social media links.
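The baseURL parameter suggests that relative links are resolved against the page's URL. The following is a minimal sketch of that idea using only the standard library; it is not the package's implementation. The helper name `contactLinks`, the regular expression, and the "contact"/"about" keyword check are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// hrefPattern is a simplified anchor matcher (an assumption, not the
// package's own pattern). It only handles quoted href values.
var hrefPattern = regexp.MustCompile(`(?i)<a\s[^>]*?href=["']([^"']+)["']`)

// contactLinks is a hypothetical helper: it keeps links whose path mentions
// "contact" or "about" and resolves them against baseURL.
func contactLinks(htmlContent, baseURL string) []string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil
	}
	seen := make(map[string]bool)
	var out []string
	for _, m := range hrefPattern.FindAllStringSubmatch(htmlContent, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		path := strings.ToLower(ref.Path)
		if !strings.Contains(path, "contact") && !strings.Contains(path, "about") {
			continue
		}
		abs := base.ResolveReference(ref).String()
		if !seen[abs] {
			seen[abs] = true
			out = append(out, abs)
		}
	}
	return out
}

func main() {
	page := `<a href="/about">About</a> <a href="contact.html">Contact</a> <a href="/posts/1">Post</a>`
	for _, link := range contactLinks(page, "https://example.com/blog/") {
		fmt.Println(link)
	}
}
```

Resolving with `url.ResolveReference` means both root-relative ("/about") and document-relative ("contact.html") links come back as absolute URLs.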
func Description ¶
Description extracts the meta description from HTML content.
func EmailAddresses ¶
EmailAddresses extracts email addresses from HTML content. It filters out common false positives such as noreply@ and example@ addresses.
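As a rough illustration of extraction followed by false-positive filtering, here is a standalone sketch. It does not reproduce the package's code: the helper `findEmails`, the simplified address pattern, and the contents of the deny list are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailPattern is a deliberately simple matcher, not a full RFC 5322 parser.
var emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// ignoredLocalParts is a hypothetical deny list; the package's real list
// is not documented here.
var ignoredLocalParts = map[string]bool{
	"noreply":  true,
	"no-reply": true,
	"example":  true,
}

// findEmails returns lower-cased, de-duplicated addresses, skipping any
// whose local part is on the deny list.
func findEmails(htmlContent string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range emailPattern.FindAllString(htmlContent, -1) {
		addr := strings.ToLower(m)
		local := addr[:strings.Index(addr, "@")]
		if ignoredLocalParts[local] || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	page := `<a href="mailto:Jane@Corp.io">Jane</a> noreply@corp.io example@corp.io jane@corp.io`
	fmt.Println(findEmails(page)) // prints [jane@corp.io]
}
```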
func ExtractDiscordUsername ¶ added in v0.9.1
ExtractDiscordUsername extracts Discord usernames from text content. Returns the username in the format "username#1234" for old format or "username" for new format. Returns empty string if no Discord username is found. Requires "discord" to be mentioned in the content to avoid false positives.
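The behavior described above (a "discord" mention as a guard, then the old or new username format) might look roughly like the sketch below. This is an assumption-laden illustration rather than the package's logic: the helper `discordUsername`, both patterns, and the "discord: name" label convention for new-format names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// legacyTag matches the old "username#1234" form (simplified character set).
	legacyTag = regexp.MustCompile(`[A-Za-z0-9_.]{2,32}#\d{4}`)
	// labelled matches a new-format name introduced by a "discord:" label.
	labelled = regexp.MustCompile(`(?i)discord\s*[:\-]\s*@?([a-z0-9_.]{2,32})`)
)

// discordUsername is a hypothetical helper mirroring the documented contract:
// it returns "" unless the text mentions "discord".
func discordUsername(s string) string {
	if !strings.Contains(strings.ToLower(s), "discord") {
		return ""
	}
	if m := legacyTag.FindString(s); m != "" {
		return m
	}
	if m := labelled.FindStringSubmatch(s); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(discordUsername("Discord: coolcat#1234")) // prints coolcat#1234
	fmt.Println(discordUsername("discord: coolcat"))      // prints coolcat
	fmt.Println(discordUsername("ticket coolcat#1234"))   // prints an empty line
}
```

The guard in the first branch is what keeps strings such as issue references ("ticket coolcat#1234") from being reported as usernames.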
func ExtractEmailFromURL ¶
ExtractEmailFromURL extracts an email address from URLs like "http://user@gmail.com". It only recognizes addresses at known email providers, so that HTTP basic auth URLs (like https://user@domain.com) are not mistaken for malformed email addresses. Returns the email address and true if found, or an empty string and false otherwise.
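One way to tell a mistyped email address from a basic auth URL is to parse the URL and compare its host against a provider allow list. The sketch below shows that approach under stated assumptions: the helper `emailFromURL` and the three-entry provider list are hypothetical, and the package's actual list is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownProviders is a hypothetical allow list used only for this sketch.
var knownProviders = map[string]bool{
	"gmail.com":   true,
	"outlook.com": true,
	"yahoo.com":   true,
}

// emailFromURL treats http(s)://user@provider as an email address when the
// host is a known provider and no password is present.
func emailFromURL(raw string) (string, bool) {
	u, err := url.Parse(raw)
	if err != nil || u.User == nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	// A password strongly suggests real basic auth credentials.
	if _, hasPassword := u.User.Password(); hasPassword {
		return "", false
	}
	user := u.User.Username()
	host := strings.ToLower(u.Hostname())
	if user == "" || !knownProviders[host] {
		return "", false
	}
	return user + "@" + host, true
}

func main() {
	fmt.Println(emailFromURL("http://user@gmail.com"))   // prints user@gmail.com true
	fmt.Println(emailFromURL("https://user@domain.com")) // prints an empty string and false
}
```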
func ExtractRedirectURL ¶ added in v0.7.9
ExtractRedirectURL checks HTML content for meta refresh or JavaScript redirects. Returns the redirect URL if found, empty string otherwise.
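To make the two redirect styles concrete, here is a small regex-based sketch. It is a heuristic written for illustration, not the package's implementation: the helper `redirectTarget` and both patterns are assumptions, and they cover only a meta refresh with `http-equiv` before `content` and a plain `location = "..."` assignment.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// metaRefresh captures the target of <meta http-equiv="refresh" content="N; url=...">.
	metaRefresh = regexp.MustCompile(`(?i)<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']\s*\d+\s*;\s*url=\s*['"]?([^"'>\s]+)`)
	// jsRedirect captures assignments to location or window.location(.href).
	jsRedirect = regexp.MustCompile(`(?i)(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']`)
)

// redirectTarget is a hypothetical helper: meta refresh is checked first,
// then a JavaScript location assignment; "" means no redirect was found.
func redirectTarget(htmlContent string) string {
	if m := metaRefresh.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	if m := jsRedirect.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(redirectTarget(`<meta http-equiv="refresh" content="0; url=https://example.com/new">`))
	fmt.Println(redirectTarget(`<script>window.location.href = "/login";</script>`))
}
```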
func IsEmailURL ¶
IsEmailURL returns true if the URL is a mailto: link or an email address with http(s):// prefix.
func OGImage ¶ added in v0.9.13
OGImage extracts an image URL from HTML meta tags. Priority: og:image > twitter:image > banner/hero image in srcset.
func OGTag ¶ added in v0.9.13
OGTag extracts an Open Graph meta tag value by property name. Handles both property="og:xxx" content="value" and content="value" property="og:xxx" orders.
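Go's regexp package has no backreferences or lookarounds, so one plausible way to accept both attribute orders is to try two patterns in turn. The sketch below illustrates this; the helper `ogTag` and its patterns are assumptions and may differ from the package's approach. A known limitation of this simplified version is that a content value containing a quote character is cut off at that quote.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// ogTag is a hypothetical helper that looks up a meta tag by property name,
// accepting property-then-content and content-then-property orders.
func ogTag(htmlContent, property string) string {
	p := regexp.QuoteMeta(property)
	patterns := []string{
		`(?i)<meta[^>]+property=["']` + p + `["'][^>]*content=["']([^"']*)["']`,
		`(?i)<meta[^>]+content=["']([^"']*)["'][^>]*property=["']` + p + `["']`,
	}
	for _, pat := range patterns {
		re, err := regexp.Compile(pat)
		if err != nil {
			continue
		}
		if m := re.FindStringSubmatch(htmlContent); m != nil {
			// Decode entities such as &amp; in the attribute value.
			return html.UnescapeString(m[1])
		}
	}
	return ""
}

func main() {
	page := `<meta property="og:title" content="Hello &amp; welcome">
<meta content="Example Site" property="og:site_name">`
	fmt.Println(ogTag(page, "og:title"))     // prints Hello & welcome
	fmt.Println(ogTag(page, "og:site_name")) // prints Example Site
}
```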
func PhoneNumbers ¶ added in v0.9.13
PhoneNumbers extracts phone numbers from HTML content. Supports various formats: (555) 123-4567, 555-123-4567, +1-555-123-4567, etc.
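The three formats listed above can be covered by a single pattern, as in the sketch below. It is an illustration only: the helper `findPhoneNumbers` and the North American pattern are assumptions, and the package may recognize more formats than this.

```go
package main

import (
	"fmt"
	"regexp"
)

// phonePattern covers an optional +1 prefix, an area code with or without
// parentheses, and -, ., or whitespace as separators. The trailing \b stops
// a match from ending in the middle of a longer digit run.
var phonePattern = regexp.MustCompile(`(?:\+?1[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}\b`)

// findPhoneNumbers is a hypothetical helper returning matches in page order.
func findPhoneNumbers(htmlContent string) []string {
	return phonePattern.FindAllString(htmlContent, -1)
}

func main() {
	text := `Call (555) 123-4567 or 555-123-4567, intl +1-555-123-4567.`
	for _, n := range findPhoneNumbers(text) {
		fmt.Println(n)
	}
}
```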
func RelMeLinks ¶ added in v0.9.11
RelMeLinks extracts only links with rel="me" attribute from HTML content. These links indicate the page owner's profiles on other platforms. This is the preferred method for personal websites to avoid picking up links to collaborators, co-authors, or other people mentioned on the page.
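The rel attribute is a space-separated token list, so a link such as rel="me noopener" still counts as rel="me". The sketch below shows one way to filter on that token; it is not the package's code, and the helper `relMeLinks` and its regular expressions are assumptions that handle only quoted attribute values.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchorTag = regexp.MustCompile(`(?i)<a\s[^>]*>`)
	hrefAttr  = regexp.MustCompile(`(?i)\shref=["']([^"']+)["']`)
	relAttr   = regexp.MustCompile(`(?i)\srel=["']([^"']+)["']`)
)

// relMeLinks is a hypothetical helper returning the href of every anchor
// whose rel attribute contains the token "me".
func relMeLinks(htmlContent string) []string {
	var links []string
	for _, tag := range anchorTag.FindAllString(htmlContent, -1) {
		rel := relAttr.FindStringSubmatch(tag)
		href := hrefAttr.FindStringSubmatch(tag)
		if rel == nil || href == nil {
			continue
		}
		for _, token := range strings.Fields(strings.ToLower(rel[1])) {
			if token == "me" {
				links = append(links, href[1])
				break
			}
		}
	}
	return links
}

func main() {
	page := `<a rel="me noopener" href="https://mastodon.example/@jane">Me</a>
<a href="https://github.com/coauthor">My co-author</a>`
	fmt.Println(relMeLinks(page)) // prints [https://mastodon.example/@jane]
}
```

In this example the co-author's link is skipped because it carries no rel="me" token, which is the distinction the description above draws.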
func SocialLinks ¶
SocialLinks extracts social media URLs from HTML content. WARNING: This extracts ALL social media URLs, including links to other people. For personal websites, use RelMeLinks instead to get only the page owner's profiles.
Types ¶
This section is empty.