gtmlp

package module
v0.0.0-...-85e6ade
Published: Jan 17, 2026 License: MIT Imports: 9 Imported by: 0

README

GTMLP - Go HTML Parsing Library

A powerful and easy-to-use Go library for parsing HTML documents with XPath selectors and pattern-based extraction.

Why GTMLP?

  • Clean Chainable API - Fluent syntax for all operations including pattern-based extraction
  • Pattern-Based Extraction - Define what to extract once, apply everywhere
  • Data Transformation Pipes - Built-in pipes for cleaning, normalizing, and transforming data
  • Production Ready - Random user-agents, retries, timeouts, error handling built-in
  • Minimal Dependencies - Pure Go implementation built on a small set of external packages (see Dependencies)

Features

  • XPath Selectors - Query HTML documents using XPath 1.0 expressions
  • Pattern-Based Extraction - Container patterns for structured data extraction
  • Data Transformation Pipes - Built-in pipes (trim, decode, replace, case conversion, etc.)
  • Chainable Fluent API - Clean, composable syntax for all operations
  • HTML to JSON - Convert HTML documents to structured JSON format
  • HTTP Client - Built-in client for fetching and parsing remote URLs
  • Random User-Agents - Realistic rotating browser user-agents for anti-detection
  • XPath Validation - Validate XPath patterns before scraping
  • Alternative Patterns - Fallback patterns for robust extraction
  • URL Health Check - Check URL availability concurrently
  • XML Parsing - Parse XML documents (sitemaps, RSS feeds, SOAP responses)
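
The "Alternative Patterns" feature suggests that a field can list several XPath expressions and fall back from one to the next. The sketch below illustrates that idea with the standard library only. `firstMatch` and the `evaluate` callback are hypothetical names standing in for a real XPath engine; they are not part of gtmlp, and the library's own matching rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// firstMatch tries each pattern in order and returns the first non-empty
// result together with the pattern that produced it.
// evaluate is a placeholder for a real XPath evaluation.
func firstMatch(patterns []string, evaluate func(string) string) (value, used string, ok bool) {
	for _, p := range patterns {
		if v := strings.TrimSpace(evaluate(p)); v != "" {
			return v, p, true
		}
	}
	return "", "", false
}

func main() {
	// Fake document: only the second pattern yields a value.
	doc := map[string]string{
		".//span[@class='price']/text()": " $19.99 ",
	}
	evaluate := func(p string) string { return doc[p] }

	patterns := []string{
		".//div[@class='price']/text()",  // preferred, missing in this document
		".//span[@class='price']/text()", // fallback
	}
	value, used, ok := firstMatch(patterns, evaluate)
	fmt.Println(value, used, ok)
}
```

Listing the most specific pattern first keeps extraction stable when a site serves more than one page layout.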

Installation

go get github.com/Hanivan/gtmlp

Quick Start

Basic HTML Parsing
package main

import (
    "fmt"
    "log"
    "github.com/Hanivan/gtmlp"
)

func main() {
    html := `<html><body><h1>Hello, World!</h1></body></html>`

    parser, err := gtmlp.Parse(html)
    if err != nil {
        log.Fatal(err)
    }

    // Check the XPath error instead of discarding it.
    h1, err := parser.XPath("//h1")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(h1.Text()) // Output: Hello, World!
}
Pattern-Based Extraction
package main

import (
    "fmt"
    "log"
    "github.com/Hanivan/gtmlp"
)

func main() {
    // Define what to extract
    patterns := []gtmlp.PatternField{
        gtmlp.NewContainerPattern("container", "//div[@class='product']"),
        {
            Key:      "name",
            Patterns: []string{".//h2/text()"},
            Pipes:    []gtmlp.Pipe{gtmlp.NewTrimPipe()},
        },
        {
            Key:      "price",
            Patterns: []string{".//span[@class='price']/text()"},
            Pipes:    []gtmlp.Pipe{gtmlp.NewTrimPipe()},
        },
    }

    // Extract with chainable API
    results, err := gtmlp.FromURL("https://example.com/products").
        WithPatterns(patterns).
        Extract()

    if err != nil {
        log.Fatal(err)
    }

    // Use the extracted data
    for _, product := range results {
        fmt.Printf("%s: %s\n", product["name"], product["price"])
    }
}
Fluent API
// Fetch and parse URL
parser, err := gtmlp.ParseURL("https://example.com",
    gtmlp.WithTimeout(10*time.Second),
    gtmlp.WithUserAgent("MyBot/1.0"),
)

// Chain operations
text, _ := gtmlp.FromHTML(html).
    XPath("//p[@class='content']").
    Text()

// Convert to JSON
json, _ := gtmlp.New().
    FromURL("https://example.com").
    ToJSON()

Documentation

Running Examples

# Basic examples
go run examples/basic/*.go -type=all

# Advanced examples
go run examples/advanced/*.go -type=all

# Specific example
make example-chainable

See docs/EXAMPLES.md for all available examples and commands.

Anti-Detection Features

Random User-Agents are enabled by default. All HTTP requests use realistic, rotating user-agent strings from a comprehensive database of real browsers (Chrome, Firefox, Safari, Edge, etc.).

// Random UA is enabled by default (recommended)
parser, err := gtmlp.ParseURL("https://example.com")

// Or set a custom user-agent instead (plain assignment, since parser and err are already declared)
parser, err = gtmlp.ParseURL("https://example.com",
    gtmlp.WithUserAgent("CustomBot/1.0"),
)

Benefits:

  • Avoid bot detection - Requests appear to come from real browsers
  • Reduced blocking - Websites less likely to block your scraper
  • Better success rate - More consistent scraping results
  • No extra configuration - Works out of the box
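
For readers curious how user-agent rotation can work in principle, here is a minimal standard-library sketch. It is not gtmlp's implementation: the `userAgents` pool and the `newRequest` helper are assumptions made for illustration, and the README states that the library draws its strings from a much larger database.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
)

// userAgents is a tiny illustrative pool; a real pool would be far larger.
var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
	"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
}

// newRequest builds a GET request. An empty customUA selects a random
// entry from the pool; otherwise customUA is used unchanged.
func newRequest(url, customUA string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	ua := customUA
	if ua == "" {
		ua = userAgents[rand.Intn(len(userAgents))]
	}
	req.Header.Set("User-Agent", ua)
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com", "")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(req.Header.Get("User-Agent"))
}
```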

Project Structure

gtmlp/
├── internal/        # Internal packages
│   ├── parser/      # Core parsing functionality
│   ├── httpclient/  # HTTP client
│   ├── builder/     # Fluent API builder
│   ├── pipe/        # Data transformation pipes
│   ├── pattern/     # Pattern-based extraction
│   ├── validator/   # XPath validation
│   └── health/      # URL health checking
├── examples/
│   ├── basic/       # Basic examples (7 files)
│   └── advanced/    # Advanced examples (9 files)
├── docs/            # Documentation
├── types.go         # Public types and errors
├── options.go       # Configuration options
└── gtmlp.go         # Root package (public API)

Dependencies

  • github.com/antchfx/htmlquery - HTML parsing with XPath
  • github.com/antchfx/xpath - XPath 1.0 implementation
  • github.com/lib4u/fake-useragent - Realistic user-agent strings
  • golang.org/x/net - Additional networking libraries

License

MIT License

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Write tests for new features
  4. Commit your changes
  5. Push to the branch
  6. Create a Pull Request

Documentation

Constants

const (
	// ReturnTypeText returns plain text content.
	ReturnTypeText = parser.ReturnTypeText
	// ReturnTypeHTML returns HTML content.
	ReturnTypeHTML = parser.ReturnTypeHTML
)
const (
	// MultipleNone returns only the first match.
	MultipleNone = pattern.MultipleNone
	// MultipleArray returns all matches as an array.
	MultipleArray = pattern.MultipleArray
	// MultipleSpace returns all matches joined with spaces.
	MultipleSpace = pattern.MultipleSpace
	// MultipleComma returns all matches joined with commas.
	MultipleComma = pattern.MultipleComma
)

Multiple type constants.
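
The four modes describe how several matches for one field are combined. The following standalone sketch mirrors the documented behaviour. The type and constant names are hypothetical, and the exact comma separator and the empty-match result are assumptions, since this reference does not specify them.

```go
package main

import (
	"fmt"
	"strings"
)

// multipleMode is a hypothetical stand-in for the library's MultipleType.
type multipleMode int

const (
	modeNone  multipleMode = iota // first match only
	modeArray                     // all matches as a slice
	modeSpace                     // all matches joined with spaces
	modeComma                     // all matches joined with commas
)

// combine merges the matches for one field according to mode.
func combine(matches []string, mode multipleMode) any {
	switch mode {
	case modeArray:
		return matches
	case modeSpace:
		return strings.Join(matches, " ")
	case modeComma:
		return strings.Join(matches, ",") // separator without a space is an assumption
	default:
		if len(matches) == 0 {
			return "" // assumption: no match yields an empty string
		}
		return matches[0]
	}
}

func main() {
	tags := []string{"go", "html", "xpath"}
	fmt.Println(combine(tags, modeNone))  // go
	fmt.Println(combine(tags, modeArray)) // [go html xpath]
	fmt.Println(combine(tags, modeSpace)) // go html xpath
	fmt.Println(combine(tags, modeComma)) // go,html,xpath
}
```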

Functions

func ExtractSingle

func ExtractSingle(p *Parser, field PatternField) (any, error)

ExtractSingle extracts a single field using a pattern.

func ExtractWithPatterns

func ExtractWithPatterns(p *Parser, patterns []PatternField) ([]map[string]any, error)

ExtractWithPatterns extracts data using pattern fields.

func FromHTML

func FromHTML(html string) *builder.Builder

FromHTML creates a Builder from HTML content.

func FromURL

func FromURL(url string) *builder.Builder

FromURL creates a Builder from a URL.

func ListPipes

func ListPipes() []string

ListPipes returns all registered pipe names.

func New

func New() *builder.Builder

New creates a new Builder for fluent API usage.

func RegisterPipe

func RegisterPipe(name string, factory func() Pipe)

RegisterPipe registers a custom pipe factory.
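
RegisterPipe, CreatePipe, and ListPipes together describe a name-to-factory registry. A minimal sketch of that pattern follows. The `Pipe` interface here is a stand-in with an assumed `Apply` method, because the method set of the real gtmlp.Pipe is not shown in this reference, and the lowercase helper names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pipe is a stand-in interface; the Apply method is an assumption.
type Pipe interface {
	Apply(input string) string
}

// trimPipe removes leading and trailing whitespace.
type trimPipe struct{}

func (trimPipe) Apply(s string) string { return strings.TrimSpace(s) }

// registry maps pipe names to factories. It is not safe for concurrent use.
var registry = map[string]func() Pipe{}

// registerPipe stores a factory under name, replacing any earlier entry.
func registerPipe(name string, factory func() Pipe) { registry[name] = factory }

// createPipe builds a fresh pipe by name or reports an unknown name.
func createPipe(name string) (Pipe, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown pipe %q", name)
	}
	return factory(), nil
}

// listPipes returns the registered names in sorted order.
func listPipes() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	registerPipe("trim", func() Pipe { return trimPipe{} })

	p, err := createPipe("trim")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", p.Apply("  hello  ")) // "hello"
	fmt.Println(listPipes())                 // [trim]
}
```

Registering a factory rather than a pipe value means every caller gets a fresh instance, which matters if a pipe carries per-use state.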

func ToJSON

func ToJSON(html string) ([]byte, error)

ToJSON converts HTML string to JSON.

func ToJSONWithOptions

func ToJSONWithOptions(html string, opts JSONOptions) ([]byte, error)

ToJSONWithOptions converts HTML string to JSON with custom options.

func URLToJSON

func URLToJSON(url string, parseOpts []Option, jsonOpts JSONOptions) ([]byte, error)

URLToJSON fetches HTML from a URL and converts it to JSON.

Types

type Extractor

type Extractor = pattern.Extractor

Extractor is an alias for internal pattern.Extractor.

func NewExtractor

func NewExtractor(p *Parser) *Extractor

NewExtractor creates a new Extractor from a Parser.

type HealthResult

type HealthResult = health.HealthResult

HealthResult is an alias for internal health.HealthResult.

func CheckURLHealth

func CheckURLHealth(urls []string, timeout time.Duration) []HealthResult

CheckURLHealth checks the health of URLs concurrently.
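
As a rough illustration of checking URLs concurrently, the sketch below fans out one goroutine per URL and collects results in input order. It is not gtmlp's code: the `result` struct, `checkAll`, and `headCheck` are hypothetical, the fields of HealthResult are not shown in this reference, and treating any status below 400 as healthy is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// result is a hypothetical shape for one health check outcome.
type result struct {
	URL     string
	Healthy bool
}

// checkAll runs check for every URL in its own goroutine and returns
// the results in the same order as the input.
func checkAll(urls []string, check func(url string) bool) []result {
	results := make([]result, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no mutex is needed.
			results[i] = result{URL: u, Healthy: check(u)}
		}(i, u)
	}
	wg.Wait()
	return results
}

// headCheck returns a check function that sends a HEAD request with a timeout.
func headCheck(timeout time.Duration) func(string) bool {
	client := &http.Client{Timeout: timeout}
	return func(url string) bool {
		resp, err := client.Head(url)
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		return resp.StatusCode < 400
	}
}

func main() {
	// A local test server keeps the example runnable without network access.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	urls := []string{srv.URL, srv.URL + "/missing"}
	for _, r := range checkAll(urls, headCheck(2*time.Second)) {
		fmt.Println(r.URL, r.Healthy)
	}
}
```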

func CheckURLHealthSequential

func CheckURLHealthSequential(urls []string, timeout time.Duration) []HealthResult

CheckURLHealthSequential checks URLs sequentially.

func CheckURLHealthWithGet

func CheckURLHealthWithGet(urls []string, timeout time.Duration) []HealthResult

CheckURLHealthWithGet checks URL health using GET requests.

type JSONOptions

type JSONOptions = parser.JSONOptions

JSONOptions is an alias for internal parser.JSONOptions.

func DefaultJSONOptions

func DefaultJSONOptions() JSONOptions

DefaultJSONOptions returns the default JSON conversion options.

type MultipleType

type MultipleType = pattern.MultipleType

MultipleType is an alias for internal pattern.MultipleType.

type Option

type Option func(*config)

Option is a function that configures parsing behavior.
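
The signature `func(*config)` is the functional options pattern. The standalone sketch below re-creates the pattern to show how such options are typically applied. The `config` fields, the default values, and `newConfig` are assumptions, since the library keeps its config type private.

```go
package main

import (
	"fmt"
	"time"
)

// config holds the settings; field names and defaults are assumptions.
type config struct {
	timeout    time.Duration
	userAgent  string
	maxRetries int
}

// Option mutates a config, mirroring the signature documented above.
type Option func(*config)

// WithTimeout sets the request timeout.
func WithTimeout(d time.Duration) Option { return func(c *config) { c.timeout = d } }

// WithUserAgent sets a fixed user-agent string.
func WithUserAgent(ua string) Option { return func(c *config) { c.userAgent = ua } }

// WithMaxRetries sets the retry limit.
func WithMaxRetries(n int) Option { return func(c *config) { c.maxRetries = n } }

// newConfig starts from defaults and applies options in order,
// so a later option overrides an earlier one.
func newConfig(opts ...Option) config {
	c := config{timeout: 30 * time.Second, maxRetries: 3}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	cfg := newConfig(
		WithTimeout(10*time.Second),
		WithUserAgent("MyBot/1.0"),
	)
	fmt.Println(cfg.timeout, cfg.userAgent, cfg.maxRetries) // 10s MyBot/1.0 3
}
```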

func WithDisableRandomUA

func WithDisableRandomUA() Option

WithDisableRandomUA disables random user agents and uses a static user agent instead.

func WithHeaders

func WithHeaders(h map[string]string) Option

WithHeaders sets custom HTTP headers.

func WithMaxRetries

func WithMaxRetries(maxRetries int) Option

WithMaxRetries sets the maximum number of retries for failed HTTP requests.
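
To make the retry limit concrete, here is a generic retry loop. It assumes that "retries" means additional attempts after the first and that a fixed delay separates attempts. Neither detail is confirmed by this reference, and `withRetries` and `errTemporary` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary simulates a transient failure.
var errTemporary = errors.New("temporary failure")

// withRetries calls fn once and then up to maxRetries more times,
// sleeping for delay between attempts. It returns the number of
// attempts made and the last error.
func withRetries(maxRetries int, delay time.Duration, fn func() error) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = fn(); err == nil || attempts > maxRetries {
			return attempts, err
		}
		time.Sleep(delay)
	}
}

func main() {
	calls := 0
	attempts, err := withRetries(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTemporary // fail the first two attempts
		}
		return nil
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```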

func WithProxy

func WithProxy(proxyURL string) Option

WithProxy sets a proxy URL for HTTP requests.

func WithRandomUserAgent

func WithRandomUserAgent() Option

WithRandomUserAgent explicitly enables random user agents (enabled by default).

func WithSuppressErrors

func WithSuppressErrors() Option

WithSuppressErrors enables error suppression for XPath queries. When enabled, XPath errors return nil instead of error values.

func WithTimeout

func WithTimeout(d time.Duration) Option

WithTimeout sets the HTTP request timeout.

func WithUserAgent

func WithUserAgent(ua string) Option

WithUserAgent sets the User-Agent header.

type ParseError

type ParseError struct {
	Message string
	Err     error
}

ParseError represents an error that occurred during parsing.

func NewParseError

func NewParseError(message string, err error) *ParseError

NewParseError creates a new ParseError.

func (*ParseError) Error

func (e *ParseError) Error() string

Error returns the error message.

func (*ParseError) Unwrap

func (e *ParseError) Unwrap() error

Unwrap returns the underlying error.
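
Because ParseError implements Unwrap, it works with errors.Is and errors.As from the standard library. The sketch below copies the struct shown above into a standalone program. The message format produced by `Error` and the `parse` helper are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// ParseError copies the struct documented above.
type ParseError struct {
	Message string
	Err     error
}

// Error joins the message and the wrapped error; this format is an assumption.
func (e *ParseError) Error() string {
	if e.Err != nil {
		return e.Message + ": " + e.Err.Error()
	}
	return e.Message
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *ParseError) Unwrap() error { return e.Err }

// parse is a hypothetical function that always fails.
func parse() error {
	return &ParseError{Message: "failed to parse document", Err: io.ErrUnexpectedEOF}
}

func main() {
	err := fmt.Errorf("fetch page: %w", parse())

	// errors.As finds the *ParseError anywhere in the chain.
	var pe *ParseError
	if errors.As(err, &pe) {
		fmt.Println("parse error:", pe.Message) // parse error: failed to parse document
	}

	// errors.Is follows Unwrap down to the root cause.
	fmt.Println(errors.Is(err, io.ErrUnexpectedEOF)) // true
}
```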

type Parser

type Parser = parser.Parser

Parser is an alias for internal parser.Parser.

func Parse

func Parse(html string) (*Parser, error)

Parse creates a new Parser from an HTML string.

func ParseURL

func ParseURL(url string, opts ...Option) (*Parser, error)

ParseURL fetches HTML from a URL and creates a Parser.

type PatternField

type PatternField = pattern.PatternField

PatternField is an alias for internal pattern.PatternField.

func NewContainerPattern

func NewContainerPattern(key string, xpath string) PatternField

NewContainerPattern creates a new container PatternField.

func NewPatternField

func NewPatternField(key string, xpath string) PatternField

NewPatternField creates a new PatternField.

func NewPatternFieldWithHTML

func NewPatternFieldWithHTML(key string, xpath string) PatternField

NewPatternFieldWithHTML creates a new PatternField that returns HTML content.

func NewPatternFieldWithMultiple

func NewPatternFieldWithMultiple(key string, xpath string, multiple MultipleType) PatternField

NewPatternFieldWithMultiple creates a new PatternField with multiple value handling.

type PatternMeta

type PatternMeta = pattern.PatternMeta

PatternMeta is an alias for internal pattern.PatternMeta.

func DefaultPatternMeta

func DefaultPatternMeta() *PatternMeta

DefaultPatternMeta returns default pattern metadata.

type Pipe

type Pipe = pipe.Pipe

Pipe is an alias for internal pipe.Pipe.
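
Conceptually, a list of pipes is applied to an extracted value in order, each one transforming the output of the previous. The sketch below models a pipe as a plain function to show the chaining. `pipeFunc`, `applyPipes`, and `replaceAll` are hypothetical and do not reflect the real interface.

```go
package main

import (
	"fmt"
	"strings"
)

// pipeFunc is a hypothetical stand-in for a pipe: one string in, one string out.
type pipeFunc func(string) string

// applyPipes runs value through each pipe from left to right.
func applyPipes(value string, pipes ...pipeFunc) string {
	for _, p := range pipes {
		value = p(value)
	}
	return value
}

// replaceAll returns a pipe that substitutes every occurrence of from with to.
func replaceAll(from, to string) pipeFunc {
	return func(s string) string { return strings.ReplaceAll(s, from, to) }
}

func main() {
	price := applyPipes("  usd 1,299.00 \n",
		strings.TrimSpace,   // drop surrounding whitespace
		replaceAll(",", ""), // remove thousands separators
		strings.ToUpper,     // normalize the currency code
	)
	fmt.Println(price) // USD 1299.00
}
```

Order matters: trimming before replacing keeps later pipes from having to deal with stray whitespace.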

func CreatePipe

func CreatePipe(name string) (Pipe, error)

CreatePipe creates a pipe by name.

func NewDateFormatPipe

func NewDateFormatPipe(format string) Pipe

NewDateFormatPipe creates a new DateFormatPipe.

func NewDecodePipe

func NewDecodePipe() Pipe

NewDecodePipe creates a new DecodePipe.

func NewExtractEmailPipe

func NewExtractEmailPipe() Pipe

NewExtractEmailPipe creates a new ExtractEmailPipe.

func NewLowerCasePipe

func NewLowerCasePipe() Pipe

NewLowerCasePipe creates a new LowerCasePipe.

func NewNumberNormalizePipe

func NewNumberNormalizePipe() Pipe

NewNumberNormalizePipe creates a new NumberNormalizePipe.

func NewReplacePipe

func NewReplacePipe(pattern, with string) Pipe

NewReplacePipe creates a new ReplacePipe.

func NewTrimPipe

func NewTrimPipe() Pipe

NewTrimPipe creates a new TrimPipe.

func NewURLResolvePipe

func NewURLResolvePipe(baseURL string) Pipe

NewURLResolvePipe creates a new URLResolvePipe.
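
Resolving a relative link against a base URL is what net/url's ResolveReference does, and a URL-resolving pipe plausibly wraps similar logic. That is an assumption; the pipe's implementation is not shown here. The `resolveURL` helper below is a hypothetical standalone sketch.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolveURL turns ref into an absolute URL relative to baseURL.
// Absolute references are returned unchanged.
func resolveURL(baseURL, ref string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(u).String(), nil
}

func main() {
	base := "https://example.com/shop/list.html"
	hrefs := []string{
		"/products/42",                 // https://example.com/products/42
		"../img/logo.png",              // https://example.com/img/logo.png
		"https://cdn.example.org/a.js", // https://cdn.example.org/a.js
	}
	for _, href := range hrefs {
		abs, err := resolveURL(base, href)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(abs)
	}
}
```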

func NewUpperCasePipe

func NewUpperCasePipe() Pipe

NewUpperCasePipe creates a new UpperCasePipe.

type ReturnType

type ReturnType = parser.ReturnType

ReturnType is an alias for internal parser.ReturnType.

type Selection

type Selection = parser.Selection

Selection is an alias for internal parser.Selection.

type ValidationResult

type ValidationResult = validator.ValidationResult

ValidationResult is an alias for internal validator.ValidationResult.

func ValidateXPath

func ValidateXPath(html string, xpaths []string, suppressErrors bool) []ValidationResult

ValidateXPath validates XPath expressions against HTML.

func ValidateXPathWithParser

func ValidateXPathWithParser(p *Parser, xpaths []string) []ValidationResult

ValidateXPathWithParser validates XPath using an existing Parser.

Directories

Path                Synopsis
examples/advanced   command
examples/basic      command
internal/
