Documentation ¶
Overview ¶
Package htmlutil provides HTML processing utilities for social media scraping.
Index ¶
- func ContactLinks(htmlContent, baseURL string) []string
- func DecodeHTMLEntities(s string) string
- func Description(htmlContent string) string
- func EmailAddresses(htmlContent string) []string
- func ExtractDiscordUsername(s string) string
- func ExtractEmailFromURL(urlStr string) (string, bool)
- func ExtractJSONLD(htmlContent string) string
- func ExtractMetaTag(htmlContent, nameOrProperty string) string
- func ExtractRedirectURL(htmlContent string) string
- func IsEmailURL(urlStr string) bool
- func IsGenericBio(bio string) bool
- func IsGenericTitle(title string) bool
- func IsNotFound(text string) bool
- func OGImage(htmlContent string) string
- func OGTag(htmlContent, property string) string
- func OGTitle(htmlContent string) string
- func PhoneNumbers(htmlContent string) []string
- func RelMeLinks(htmlContent string) []string
- func SocialLinks(htmlContent string) []string
- func StripHTML(htmlContent string) string
- func StripTags(htmlContent string) string
- func Title(htmlContent string) string
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ContactLinks ¶
ContactLinks extracts contact/about page URLs from HTML content. These pages often contain additional social media links.
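The role of the baseURL parameter can be illustrated with a minimal sketch. This is not the package's implementation: the helper name contactLinks, the regular expression, and the "contact"/"about" path heuristic are all assumptions made for illustration. It shows relative hrefs being resolved against the base URL before they are filtered.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// anchorHrefRe matches double-quoted href values inside <a> tags.
// Assumption: a regex is good enough for a sketch; real HTML may need a parser.
var anchorHrefRe = regexp.MustCompile(`(?i)<a\s[^>]*href="([^"]+)"`)

// contactLinks is a hypothetical stand-in for ContactLinks. It resolves each
// href against baseURL and keeps URLs whose path mentions "contact" or "about".
func contactLinks(htmlContent, baseURL string) []string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil
	}
	var out []string
	seen := map[string]bool{}
	for _, m := range anchorHrefRe.FindAllStringSubmatch(htmlContent, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		abs := base.ResolveReference(ref)
		p := strings.ToLower(abs.Path)
		if !strings.Contains(p, "contact") && !strings.Contains(p, "about") {
			continue
		}
		if s := abs.String(); !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	page := `<a href="/about">About</a> <a href="contact.html">Contact</a> <a href="/posts/1">Post</a>`
	for _, link := range contactLinks(page, "https://example.com/blog/") {
		fmt.Println(link)
	}
}
```

Resolving before filtering means callers receive absolute URLs they can fetch directly, whichever form the href took in the page.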
func DecodeHTMLEntities ¶ added in v0.9.22
DecodeHTMLEntities decodes HTML entities in a string.
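For orientation, entity decoding of this kind is available in the Go standard library. The sketch below uses html.UnescapeString. Whether DecodeHTMLEntities wraps that call or handles additional cases is not stated here, so the helper decodeEntities is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities is a hypothetical stand-in for DecodeHTMLEntities.
// Assumption: standard-library unescaping approximates the documented behavior.
func decodeEntities(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(decodeEntities("Tom &amp; Jerry &lt;3 &#39;cartoons&#39;"))
}
```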
func Description ¶
Description extracts the meta description from HTML content.
func EmailAddresses ¶
EmailAddresses extracts email addresses from HTML content. Filters out common false positives such as noreply@ and example@ addresses.
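The filtering step can be sketched as follows. The package's real pattern and blocklist are not documented here, so the regular expression, the blockedLocalParts set, and the helper name emailAddresses are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is a deliberately simple email pattern (assumption, not the package's).
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// blockedLocalParts lists local parts treated as false positives.
// Assumption: the real list is longer; only two entries come from the docs.
var blockedLocalParts = map[string]bool{
	"noreply": true,
	"example": true,
}

// emailAddresses is a hypothetical stand-in for EmailAddresses. It returns
// unique matches in order of first appearance, skipping blocked local parts.
func emailAddresses(htmlContent string) []string {
	var out []string
	seen := map[string]bool{}
	for _, addr := range emailRe.FindAllString(htmlContent, -1) {
		addr = strings.ToLower(addr)
		local := addr[:strings.Index(addr, "@")]
		if blockedLocalParts[local] || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	page := `<p>Write to alice@site.org, not noreply@site.org. Again: Alice@site.org</p>`
	fmt.Println(emailAddresses(page))
}
```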
func ExtractDiscordUsername ¶ added in v0.9.1
ExtractDiscordUsername extracts a Discord username from text content. Returns "username#1234" for the old format (with discriminator) or "username" for the new format. Returns an empty string if no Discord username is found. The content must mention "discord"; this requirement avoids false positives.
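The "discord" guard and the two username formats can be illustrated with a small sketch. It is not the package's implementation: the patterns, the assumption that a new-format name follows a "discord:" label, and the helper name discordUsername are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// legacyRe matches the old "name#1234" form (assumed character set).
	legacyRe = regexp.MustCompile(`[A-Za-z0-9_.]{2,32}#\d{4}\b`)
	// labeledRe matches a new-format name after a "discord:" or "discord -" label.
	// Assumption: new-format names are only trusted when labeled this way.
	labeledRe = regexp.MustCompile(`(?i)discord\s*[:-]\s*@?([a-z0-9_.]{2,32})`)
)

// discordUsername is a hypothetical stand-in for ExtractDiscordUsername.
func discordUsername(text string) string {
	// Guard: without a mention of "discord", patterns like "issue#1234"
	// would produce false positives.
	if !strings.Contains(strings.ToLower(text), "discord") {
		return ""
	}
	if m := legacyRe.FindString(text); m != "" {
		return m
	}
	if m := labeledRe.FindStringSubmatch(text); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(discordUsername("Find me on Discord: coolcat#1234")) // coolcat#1234
	fmt.Println(discordUsername("discord: newname"))                 // newname
	fmt.Println(discordUsername("ticket bob#1234") == "")            // true
}
```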
func ExtractEmailFromURL ¶
ExtractEmailFromURL extracts an email address from URLs like "http://user@gmail.com". Only recognizes emails at known email providers to avoid confusing HTTP basic auth URLs (like https://user@domain.com) with misformatted email addresses. Returns the email address and true if found, empty string and false otherwise.
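The provider check described above can be sketched with net/url. The provider list, the password check, and the helper name emailFromURL are assumptions for illustration; the set of providers the package actually recognizes is not listed here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownProviders is an assumed, illustrative list of email provider domains.
var knownProviders = map[string]bool{
	"gmail.com":      true,
	"outlook.com":    true,
	"yahoo.com":      true,
	"protonmail.com": true,
}

// emailFromURL is a hypothetical stand-in for ExtractEmailFromURL.
func emailFromURL(urlStr string) (string, bool) {
	u, err := url.Parse(urlStr)
	if err != nil || u.User == nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	// A password component suggests real basic auth, not a mistyped email.
	if _, hasPassword := u.User.Password(); hasPassword {
		return "", false
	}
	user := u.User.Username()
	host := strings.ToLower(u.Hostname())
	if user == "" || !knownProviders[host] {
		return "", false
	}
	return user + "@" + host, true
}

func main() {
	fmt.Println(emailFromURL("http://user@gmail.com"))   // user@gmail.com true
	fmt.Println(emailFromURL("https://user@domain.com")) // (empty) false
}
```

Restricting matches to an allowlist trades recall for precision: addresses at unlisted domains are missed, but URLs carrying basic auth credentials are not misreported as emails.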
func ExtractJSONLD ¶ added in v0.9.22
ExtractJSONLD extracts JSON-LD structured data from HTML as a JSON string.
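As a rough illustration, JSON-LD lives in script elements of type application/ld+json. The sketch below returns the first block that parses as valid JSON. The regular expression, the validity check, and the helper name jsonLD are assumptions; how the package handles multiple or malformed blocks is not documented here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// jsonLDRe captures the body of <script type="application/ld+json"> elements.
// Assumption: the type attribute is double-quoted.
var jsonLDRe = regexp.MustCompile(`(?is)<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>`)

// jsonLD is a hypothetical stand-in for ExtractJSONLD. It returns the first
// script body that is valid JSON, or an empty string if none is found.
func jsonLD(htmlContent string) string {
	for _, m := range jsonLDRe.FindAllStringSubmatch(htmlContent, -1) {
		body := strings.TrimSpace(m[1])
		if json.Valid([]byte(body)) {
			return body
		}
	}
	return ""
}

func main() {
	page := `<head><script type="application/ld+json"> {"@type":"Person","name":"Alice"} </script></head>`
	fmt.Println(jsonLD(page))
}
```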
func ExtractMetaTag ¶ added in v0.9.22
ExtractMetaTag extracts a meta tag value by name or property.
func ExtractRedirectURL ¶ added in v0.7.9
ExtractRedirectURL checks HTML content for meta refresh or JavaScript redirects. Returns the redirect URL if found, empty string otherwise.
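The two redirect styles mentioned above can be sketched with a pair of regular expressions. This is an illustration under assumptions (double-quoted meta attributes, a simple location assignment), not the package's actual patterns; redirectURL is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// metaRefreshRe matches <meta http-equiv="refresh" content="N; url=...">.
	metaRefreshRe = regexp.MustCompile(`(?i)<meta[^>]+http-equiv="refresh"[^>]*content="[^"]*url=([^"\s]+)"`)
	// jsRedirectRe matches assignments such as window.location.href = "...".
	jsRedirectRe = regexp.MustCompile(`(?i)(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']`)
)

// redirectURL is a hypothetical stand-in for ExtractRedirectURL.
// It prefers a meta refresh target over a JavaScript one.
func redirectURL(htmlContent string) string {
	if m := metaRefreshRe.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	if m := jsRedirectRe.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(redirectURL(`<meta http-equiv="refresh" content="0; url=https://example.com/new">`))
	fmt.Println(redirectURL(`<script>window.location.href = "https://example.com/js";</script>`))
}
```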
func IsEmailURL ¶
IsEmailURL returns true if the URL is a mailto: link or an email address with an http(s):// prefix.
func IsGenericBio ¶ added in v0.9.22
IsGenericBio returns true if the bio looks like a generic site description rather than a specific user bio.
func IsGenericTitle ¶ added in v0.9.22
IsGenericTitle returns true if the title looks like a generic site title rather than a specific user profile title.
func IsNotFound ¶ added in v0.9.22
IsNotFound detects common "404 Not Found" or "Page not found" patterns in HTML content.
func OGImage ¶ added in v0.9.13
OGImage extracts an image URL from HTML meta tags. Priority: og:image > twitter:image > banner/hero image in srcset.
func OGTag ¶ added in v0.9.13
OGTag extracts an Open Graph meta tag value by property name. Handles both property="og:xxx" content="value" and content="value" property="og:xxx" orders.
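Handling both attribute orders can be sketched with two patterns tried in turn. The sketch assumes double-quoted attributes and that the caller passes the full property name (for example "og:title"). Neither assumption is confirmed by the documentation above, and ogTag is a hypothetical name.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// ogTag is a hypothetical stand-in for OGTag. It tries property-then-content
// first, then content-then-property, and unescapes the captured value.
func ogTag(htmlContent, property string) string {
	p := regexp.QuoteMeta(property)
	patterns := []string{
		`(?i)<meta[^>]+property="` + p + `"[^>]*content="([^"]*)"`,
		`(?i)<meta[^>]+content="([^"]*)"[^>]*property="` + p + `"`,
	}
	for _, pat := range patterns {
		re := regexp.MustCompile(pat) // safe: property is escaped with QuoteMeta
		if m := re.FindStringSubmatch(htmlContent); m != nil {
			return html.UnescapeString(m[1])
		}
	}
	return ""
}

func main() {
	fmt.Println(ogTag(`<meta property="og:title" content="Hello">`, "og:title")) // Hello
	fmt.Println(ogTag(`<meta content="World" property="og:title">`, "og:title")) // World
}
```

Because `[^>]` never crosses a tag boundary, each pattern can only pair a property and a content attribute that sit inside the same meta element.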
func PhoneNumbers ¶ added in v0.9.13
PhoneNumbers extracts phone numbers from HTML content. Supports various formats: (555) 123-4567, 555-123-4567, +1-555-123-4567, etc.
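The three listed formats can be matched by a single pattern, sketched below. The pattern covers only those North American shapes and is an assumption for illustration; the package's real pattern and its handling of other formats are not shown here. phoneNumbers is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// phoneRe matches an optional "+1" prefix, an area code written as "(555) "
// or "555-", then "123-4567" (dot and space separators are also accepted).
var phoneRe = regexp.MustCompile(`(?:\+1[-. ]?)?(?:\(\d{3}\)\s?|\d{3}[-. ])\d{3}[-. ]\d{4}`)

// phoneNumbers is a hypothetical stand-in for PhoneNumbers.
// It returns matches in document order, or nil if there are none.
func phoneNumbers(htmlContent string) []string {
	return phoneRe.FindAllString(htmlContent, -1)
}

func main() {
	page := `<p>Call (555) 123-4567 or 555-123-4567 or +1-555-123-4567.</p>`
	for _, n := range phoneNumbers(page) {
		fmt.Println(n)
	}
}
```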
func RelMeLinks ¶ added in v0.9.11
RelMeLinks extracts only links with a rel="me" attribute from HTML content. These links indicate the page owner's profiles on other platforms. This is the preferred method for personal websites because it avoids picking up links to collaborators, co-authors, or other people mentioned on the page.
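The rel="me" filter can be sketched as follows. The regular expressions assume double-quoted attributes, and relMeLinks is a hypothetical name; this is not the package's implementation. Note that rel is a space-separated token list, so the sketch compares whole tokens rather than substrings.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchorTagRe = regexp.MustCompile(`(?i)<a\s[^>]*>`)
	relAttrRe   = regexp.MustCompile(`(?i)\srel="([^"]*)"`)
	hrefAttrRe  = regexp.MustCompile(`(?i)\shref="([^"]*)"`)
)

// relMeLinks is a hypothetical stand-in for RelMeLinks. It returns the href
// of every <a> tag whose rel attribute contains the token "me".
func relMeLinks(htmlContent string) []string {
	var links []string
	for _, tag := range anchorTagRe.FindAllString(htmlContent, -1) {
		rel := relAttrRe.FindStringSubmatch(tag)
		href := hrefAttrRe.FindStringSubmatch(tag)
		if rel == nil || href == nil {
			continue
		}
		// Token comparison keeps rel="home" from matching on the substring "me".
		for _, token := range strings.Fields(strings.ToLower(rel[1])) {
			if token == "me" {
				links = append(links, href[1])
				break
			}
		}
	}
	return links
}

func main() {
	page := `<a href="https://mastodon.social/@alice" rel="me">Me</a>
<a href="https://twitter.com/bob">Bob</a>
<a rel="me noopener" href="https://github.com/alice">GitHub</a>
<a rel="home" href="https://example.com/">Home</a>`
	for _, link := range relMeLinks(page) {
		fmt.Println(link)
	}
}
```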
func SocialLinks ¶
SocialLinks extracts social media URLs from HTML content. WARNING: This extracts ALL social media URLs, including links to other people. For personal websites, use RelMeLinks instead to get only the page owner's profiles.
Types ¶
This section is empty.