htmlutil

package
v0.9.15
Published: Dec 17, 2025 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package htmlutil provides HTML processing utilities for social media scraping.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContactLinks

func ContactLinks(htmlContent, baseURL string) []string

ContactLinks extracts contact/about page URLs from HTML content. These pages often contain additional social media links.

func Description

func Description(htmlContent string) string

Description extracts the meta description from HTML content.

func EmailAddresses

func EmailAddresses(htmlContent string) []string

EmailAddresses extracts email addresses from HTML content. It filters out common false positives such as noreply@ and example@ addresses.
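The filtering idea can be illustrated with a minimal standard-library sketch. It is not the package's implementation: sketchEmailAddresses, emailPattern, and the ignoredLocalParts list are hypothetical names and values chosen for illustration, and the pattern is deliberately simpler than a full address grammar.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailPattern is a deliberately simple matcher, not a full RFC 5322 parser.
var emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// ignoredLocalParts is an assumed list; the package's real filter list is not
// documented on this page.
var ignoredLocalParts = []string{"noreply", "no-reply", "example"}

// isIgnored reports whether the local part is on the assumed ignore list.
func isIgnored(local string) bool {
	for _, p := range ignoredLocalParts {
		if local == p {
			return true
		}
	}
	return false
}

// sketchEmailAddresses returns unique, lowercased addresses found in
// htmlContent, skipping addresses whose local part is ignored.
func sketchEmailAddresses(htmlContent string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range emailPattern.FindAllString(htmlContent, -1) {
		addr := strings.ToLower(m)
		local := addr[:strings.Index(addr, "@")]
		if seen[addr] || isIgnored(local) {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	page := `<a href="mailto:jane@corp.io">Mail</a> noreply@corp.io example@example.com Jane@Corp.io`
	fmt.Println(sketchEmailAddresses(page)) // prints [jane@corp.io]
}
```

Lowercasing before deduplication is a design choice of this sketch; whether the package normalizes case is not stated here.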

func ExtractDiscordUsername added in v0.9.1

func ExtractDiscordUsername(s string) string

ExtractDiscordUsername extracts Discord usernames from text content. Returns the username in the format "username#1234" for old format or "username" for new format. Returns empty string if no Discord username is found. Requires "discord" to be mentioned in the content to avoid false positives.
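A rough sketch of the "discord must be mentioned" rule and the two username formats follows. It assumes the handle appears after a "discord:" label, which is an illustrative assumption and not the package's documented matching rule; sketchDiscordUsername and labelledHandle are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelledHandle assumes the handle follows a "discord:" label. It accepts the
// old "name#1234" format and the new discriminator-free format.
var labelledHandle = regexp.MustCompile(`(?i)discord\s*:\s*@?([a-z0-9_.]{2,32}(?:#\d{4})?)`)

// sketchDiscordUsername returns the first labelled handle, or "" when the text
// never mentions Discord or no handle follows the label.
func sketchDiscordUsername(s string) string {
	// Gate on the keyword first so unrelated "name#1234" strings are ignored.
	if !strings.Contains(strings.ToLower(s), "discord") {
		return ""
	}
	if m := labelledHandle.FindStringSubmatch(s); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(sketchDiscordUsername("Discord: jane_doe#1234")) // old format
	fmt.Println(sketchDiscordUsername("My discord: janedoe"))    // new format
	fmt.Println(sketchDiscordUsername("Reach me at jane#1234") == "")
}
```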

func ExtractEmailFromURL

func ExtractEmailFromURL(urlStr string) (string, bool)

ExtractEmailFromURL extracts an email address from URLs like "http://user@gmail.com". Only recognizes emails at known email providers to avoid confusing HTTP basic auth URLs (like https://user@domain.com) with misformatted email addresses. Returns the email address and true if found, empty string and false otherwise.
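The provider check described above might look roughly like the sketch below, which relies on net/url to separate the userinfo from the host. The knownProviders allowlist and the sketchEmailFromURL name are assumptions for illustration; the package's actual provider list is not given on this page.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownProviders is an assumed allowlist used only for this illustration.
var knownProviders = map[string]bool{
	"gmail.com":   true,
	"outlook.com": true,
	"yahoo.com":   true,
}

// sketchEmailFromURL treats "http(s)://user@provider" as a misformatted email
// address, but only when the host is a known email provider.
func sketchEmailFromURL(urlStr string) (string, bool) {
	u, err := url.Parse(urlStr)
	if err != nil || u.User == nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	// A password suggests genuine basic-auth credentials, not an email.
	if _, hasPassword := u.User.Password(); hasPassword {
		return "", false
	}
	host := strings.ToLower(u.Hostname())
	if !knownProviders[host] {
		return "", false
	}
	return u.User.Username() + "@" + host, true
}

func main() {
	fmt.Println(sketchEmailFromURL("http://user@gmail.com"))   // prints user@gmail.com true
	fmt.Println(sketchEmailFromURL("https://user@domain.com")) // host is not an allowlisted provider
}
```

Rejecting URLs that carry a password is an extra guard added by this sketch, not a behavior the documentation describes.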

func ExtractRedirectURL added in v0.7.9

func ExtractRedirectURL(htmlContent string) string

ExtractRedirectURL checks HTML content for meta refresh or JavaScript redirects. Returns the redirect URL if found, empty string otherwise.
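As a hedged illustration of the two redirect styles, the sketch below checks for a meta refresh first and then for a simple JavaScript location assignment. The regular expressions and the sketchRedirectURL name are assumptions; real pages use more variants (quoted URLs inside content, location.replace calls) than this covers.

```go
package main

import (
	"fmt"
	"regexp"
)

// metaRefresh matches <meta http-equiv="refresh" content="N; url=...">.
var metaRefresh = regexp.MustCompile(`(?i)<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)`)

// jsRedirect matches assignments such as window.location.href = "...".
var jsRedirect = regexp.MustCompile(`(?i)(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']`)

// sketchRedirectURL returns the first redirect target found, or "" if none.
func sketchRedirectURL(htmlContent string) string {
	if m := metaRefresh.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	if m := jsRedirect.FindStringSubmatch(htmlContent); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	meta := `<meta http-equiv="refresh" content="0; url=https://example.com/new">`
	js := `<script>window.location.href = "https://example.com/js";</script>`
	fmt.Println(sketchRedirectURL(meta)) // prints https://example.com/new
	fmt.Println(sketchRedirectURL(js))   // prints https://example.com/js
}
```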

func IsEmailURL

func IsEmailURL(urlStr string) bool

IsEmailURL returns true if the URL is a mailto: link or an email address with http(s):// prefix.

func OGImage added in v0.9.13

func OGImage(htmlContent string) string

OGImage extracts an image URL from HTML meta tags. Priority: og:image > twitter:image > banner/hero image in srcset.

func OGTag added in v0.9.13

func OGTag(htmlContent, property string) string

OGTag extracts an Open Graph meta tag value by property name. Handles both property="og:xxx" content="value" and content="value" property="og:xxx" orders.
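Handling both attribute orders can be sketched with two patterns tried in turn. This is an illustration under assumptions, not the package's code: sketchOGTag is a hypothetical name, and the patterns only handle quoted attribute values that contain no quote characters.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// sketchOGTag looks for a <meta> tag carrying the given property and returns
// its content value, trying property-first and then content-first order.
func sketchOGTag(htmlContent, property string) string {
	p := regexp.QuoteMeta(property)
	patterns := []string{
		`(?i)<meta[^>]+property=["']` + p + `["'][^>]*content=["']([^"']*)["']`,
		`(?i)<meta[^>]+content=["']([^"']*)["'][^>]*property=["']` + p + `["']`,
	}
	for _, pat := range patterns {
		re, err := regexp.Compile(pat)
		if err != nil {
			continue
		}
		if m := re.FindStringSubmatch(htmlContent); m != nil {
			// Decode entities such as &amp; in the attribute value.
			return html.UnescapeString(m[1])
		}
	}
	return ""
}

func main() {
	page := `<meta property="og:title" content="Hello &amp; welcome">` +
		`<meta content="https://example.com/a.png" property="og:image">`
	fmt.Println(sketchOGTag(page, "og:title")) // prints Hello & welcome
	fmt.Println(sketchOGTag(page, "og:image")) // prints https://example.com/a.png
}
```

Because `[^>]` cannot cross a tag boundary, each pattern stays within a single meta element, which keeps one tag's content from being paired with another tag's property.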

func PhoneNumbers added in v0.9.13

func PhoneNumbers(htmlContent string) []string

PhoneNumbers extracts phone numbers from HTML content. Supports various formats: (555) 123-4567, 555-123-4567, +1-555-123-4567, etc.
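The three formats listed above can be matched by one pattern, sketched below. The pattern and the sketchPhoneNumbers name are assumptions for illustration; it covers only North American style numbers with an optional +1 prefix and does not strip HTML tags first.

```go
package main

import (
	"fmt"
	"regexp"
)

// phonePattern accepts an optional "+1" prefix, an area code written either as
// "(555)" or "555" plus a separator, and then "123-4567" style digits.
var phonePattern = regexp.MustCompile(`(?:\+1[-.\s]?)?(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}`)

// sketchPhoneNumbers returns every match in document order.
func sketchPhoneNumbers(htmlContent string) []string {
	return phonePattern.FindAllString(htmlContent, -1)
}

func main() {
	text := `Call (555) 123-4567 or 555-123-4567, intl +1-555-123-4567.`
	for _, n := range sketchPhoneNumbers(text) {
		fmt.Println(n)
	}
}
```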

func RelMeLinks

func RelMeLinks(htmlContent string) []string

RelMeLinks extracts only links that carry a rel="me" attribute from HTML content. These links indicate the page owner's profiles on other platforms. This is the preferred method for personal websites because it avoids picking up links to collaborators, co-authors, or other people mentioned on the page.
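To show why rel="me" filtering excludes other people's profiles, here is a small sketch that keeps only anchors whose rel attribute contains the token "me". It uses regular expressions for brevity and is not the package's implementation; sketchRelMeLinks and the pattern variables are hypothetical names, and only quoted attribute values are handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchorTag = regexp.MustCompile(`(?i)<a\s[^>]*>`)
	relAttr   = regexp.MustCompile(`(?i)\brel=["']([^"']*)["']`)
	hrefAttr  = regexp.MustCompile(`(?i)\bhref=["']([^"']*)["']`)
)

// sketchRelMeLinks returns the href of every <a> tag whose rel attribute
// includes the space-separated token "me".
func sketchRelMeLinks(htmlContent string) []string {
	var links []string
	for _, tag := range anchorTag.FindAllString(htmlContent, -1) {
		rel := relAttr.FindStringSubmatch(tag)
		href := hrefAttr.FindStringSubmatch(tag)
		if rel == nil || href == nil {
			continue
		}
		// rel is a token list, so "me noopener" still qualifies.
		for _, token := range strings.Fields(strings.ToLower(rel[1])) {
			if token == "me" {
				links = append(links, href[1])
				break
			}
		}
	}
	return links
}

func main() {
	page := `<a rel="me" href="https://mastodon.social/@jane">Me</a>
<a href="https://github.com/coauthor">Co-author</a>
<a href="https://github.com/jane" rel="me noopener">GitHub</a>`
	fmt.Println(sketchRelMeLinks(page))
}
```

The co-author link has no rel="me" token, so it is skipped even though it points at a social platform.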

func SocialLinks

func SocialLinks(htmlContent string) []string

SocialLinks extracts social media URLs from HTML content. WARNING: This extracts ALL social media URLs, including links to other people. For personal websites, use RelMeLinks instead to get only the page owner's profiles.

func StripTags added in v0.9.2

func StripTags(htmlContent string) string

StripTags removes HTML tags and returns plain text.

func Title

func Title(htmlContent string) string

Title extracts the title from HTML content.

Types

This section is empty.
