gosoup

package module
v1.2.2 Latest
Published: Feb 24, 2026 License: MIT Imports: 7 Imported by: 0

README

GoSoup

A convenient Go library for parsing and querying HTML documents, inspired by BeautifulSoup4 for Python.

GoSoup provides a simple and intuitive API for navigating and searching HTML documents. It's built on top of the golang.org/x/net/html library and offers a more user-friendly interface for common HTML parsing tasks.

Installation

go get github.com/fokitto/gosoup

Quick Start

package main

import (
	"fmt"

	"github.com/fokitto/gosoup"
)

func main() {
	html := `
	<html>
		<body>
			<div class="container">
				<h1>Hello, World!</h1>
				<p>This is a paragraph.</p>
				<p>And this is a paragraph.</p>
			</div>
		</body>
	</html>
	`

	doc, err := gosoup.ParseString(html)
	if err != nil {
		panic(err)
	}

	// Get the root element
	root := doc.Root()

	// Find the first h1 tag
	h1 := root.Find(gosoup.HasName("h1"))
	fmt.Println(h1.Text()) // Output: Hello, World!

	// Find all paragraphs without class
	paragraphs := root.FindAll(
		gosoup.All(
			gosoup.HasName("p"),
			gosoup.HasNoClass(),
		),
	)
	for _, p := range paragraphs {
		fmt.Println(p.Text())
	}
}

API Overview

Parsing Functions
  • Parse(reader io.Reader) (*Document, error) - Parse HTML from an io.Reader
  • ParseBytes(content []byte) (*Document, error) - Parse HTML from a byte slice
  • ParseString(content string) (*Document, error) - Parse HTML from a string

Document Type

The Document struct represents a parsed HTML document and manages tag caching for efficient access.

  • Root() *Tag - Get the root HTML element of the document

Nodes

GoSoup provides a Node interface for working with different types of DOM nodes:

  • Tag - Represents an HTML element with:
    • Name - The tag name (e.g., "div", "p", "a")
    • Attrs - Map of attributes (key-value pairs)
  • NavigableString - Represents raw text content in the HTML document (similar to BeautifulSoup4's NavigableString)

Navigation Methods
  • Parent() *Tag - Get the parent tag
  • FirstChild() *Tag - Get the first child tag
  • Children() []*Tag - Get all direct child tags
  • ChildrenCount() int - Get the count of all direct child tags
  • Prev() *Tag - Get the previous sibling element
  • Next() *Tag - Get the next sibling element
  • Depth() int - Get the depth of the current tag in the document tree
  • IterNodes() iter.Seq[Node] - Iterate through all child nodes (both tags and text) using range loops

Content Methods
  • Text() string - Get the immediate text content of the tag
  • FullText(sep ...string) string - Get all text content recursively (with optional separator)
  • String() string - Render the tag and its children as HTML
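
The difference between immediate text and recursive text can be illustrated with a minimal sketch. The `node` type and the `immediateText`, `fullText`, and `sampleDiv` helpers are hypothetical stand-ins written for this example; they are not GoSoup types, and GoSoup's exact whitespace handling may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a parsed DOM node, not a GoSoup type.
// An empty Name marks a text node; otherwise the node is an element.
type node struct {
	Name     string
	Text     string
	Children []*node
}

// immediateText concatenates only the text nodes that are direct children.
func immediateText(n *node) string {
	var sb strings.Builder
	for _, c := range n.Children {
		if c.Name == "" {
			sb.WriteString(c.Text)
		}
	}
	return sb.String()
}

// fullText collects every text node in the subtree, in document order,
// and joins the pieces with sep.
func fullText(n *node, sep string) string {
	var parts []string
	var walk func(*node)
	walk = func(cur *node) {
		for _, c := range cur.Children {
			if c.Name == "" {
				parts = append(parts, c.Text)
			} else {
				walk(c)
			}
		}
	}
	walk(n)
	return strings.Join(parts, sep)
}

// sampleDiv builds the tree for
// <div>Hello <b>World</b> and <i>Goodbye</i></div>.
func sampleDiv() *node {
	text := func(s string) *node { return &node{Text: s} }
	return &node{Name: "div", Children: []*node{
		text("Hello "),
		{Name: "b", Children: []*node{text("World")}},
		text(" and "),
		{Name: "i", Children: []*node{text("Goodbye")}},
	}}
}

func main() {
	div := sampleDiv()
	fmt.Printf("%q\n", immediateText(div)) // "Hello  and "
	fmt.Printf("%q\n", fullText(div, ""))  // "Hello World and Goodbye"
	fmt.Printf("%q\n", fullText(div, "|")) // "Hello |World| and |Goodbye"
}
```

In this sketch the immediate text skips the contents of `<b>` and `<i>`, while the recursive variant visits them; the separator is placed between every collected text piece.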

Search Methods
  • Find(predicate Predicate) *Tag - Find the first element matching the predicate
  • FindAll(predicate Predicate) []*Tag - Find all elements matching the predicate
  • FindParent(predicate Predicate) *Tag - Find the first parent element matching the predicate
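
The search methods above can be pictured as tree walks. The following is a minimal sketch on a hypothetical `elem` type (not a GoSoup type); it assumes pre-order, document-order traversal for downward searches and a nearest-ancestor-first walk for the parent search, which the source does not spell out.

```go
package main

import "fmt"

// elem is a hypothetical stand-in for an element node, not a GoSoup type.
type elem struct {
	Name     string
	Parent   *elem
	Children []*elem
}

// add appends a new child element and wires up its parent pointer.
func (e *elem) add(name string) *elem {
	c := &elem{Name: name, Parent: e}
	e.Children = append(e.Children, c)
	return c
}

// find returns the first descendant in pre-order that satisfies match,
// or nil when nothing matches.
func find(e *elem, match func(*elem) bool) *elem {
	for _, c := range e.Children {
		if match(c) {
			return c
		}
		if hit := find(c, match); hit != nil {
			return hit
		}
	}
	return nil
}

// findAll returns every descendant that satisfies match, in pre-order.
func findAll(e *elem, match func(*elem) bool) []*elem {
	var out []*elem
	for _, c := range e.Children {
		if match(c) {
			out = append(out, c)
		}
		out = append(out, findAll(c, match)...)
	}
	return out
}

// findParent walks up the ancestor chain and returns the nearest ancestor
// that satisfies match, or nil when none does.
func findParent(e *elem, match func(*elem) bool) *elem {
	for p := e.Parent; p != nil; p = p.Parent {
		if match(p) {
			return p
		}
	}
	return nil
}

// named builds a matcher that compares element names.
func named(name string) func(*elem) bool {
	return func(e *elem) bool { return e.Name == name }
}

func main() {
	root := &elem{Name: "html"}
	body := root.add("body")
	div := body.add("div")
	div.add("p")
	div.add("p")

	p := find(root, named("p"))
	fmt.Println(p != nil, len(findAll(root, named("p")))) // true 2
	fmt.Println(findParent(p, named("body")).Name)        // body
	fmt.Println(find(root, named("table")) == nil)        // true
}
```

Because the search methods return pointers, checking the result for nil before calling methods on it is a reasonable precaution; the source does not state what they return when nothing matches.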

DOM Manipulation
  • Unwrap() - Remove the tag from its parent

Working with Nodes

The IterNodes() method allows you to iterate through all child nodes, including both tags and text content:

html := `<div>Hello <b>World</b> and <i>Goodbye</i></div>`
doc, _ := gosoup.ParseString(html)
root := doc.Root()
div := root.Find(gosoup.HasName("div"))

// Iterate through all nodes (text and tags)
for node := range div.IterNodes() {
	switch n := node.(type) {
	case *gosoup.Tag:
		fmt.Printf("Tag: %s (content: %s)\n", n.Name, n.Text())
	case gosoup.NavigableString:
		fmt.Printf("Text: %q\n", n.Text)
	}
}

// Output:
// Text: "Hello "
// Tag: b (content: World)
// Text: " and "
// Tag: i (content: Goodbye)

Predicate System

To offer flexibility similar to BeautifulSoup4's, GoSoup uses a predicate system built from composable search functions. Predicates let you express complex selection criteria by combining simple, focused functions.

Built-in Predicates
  • HasName(name string) Predicate - Match by tag name
  • HasAttr(attr string) Predicate - Check if an attribute exists
  • HasNoAttr(attr string) Predicate - Check if an attribute does not exist
  • HasClass(class string) Predicate - Check if element has a specific CSS class
  • HasNoClass() Predicate - Check if element has no class attribute
  • AttrEq(attr, value string) Predicate - Match attribute value exactly
  • AttrContains(attr, substr string) Predicate - Match attribute value contains substring
  • AttrMatch(attr string, pattern *regexp.Regexp) Predicate - Match attribute value against regex
  • All(predicates ...Predicate) Predicate - Combine predicates with AND logic
  • Any(predicates ...Predicate) Predicate - Combine predicates with OR logic

Combining Predicates

Predicates can be combined for more complex queries:

// Find all div tags with class "container"
divs := root.FindAll(gosoup.All(
	gosoup.HasName("div"),
	gosoup.HasClass("container"),
))

// Find elements that either have id="main-link" or have the class "nav"
links := root.FindAll(gosoup.Any(
	gosoup.AttrEq("id", "main-link"),
	gosoup.HasClass("nav"),
))
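
To make the AND/OR composition concrete, here is a minimal sketch of how such combinators can be written. `Tag` and `Predicate` are local stand-ins that mirror the documented shapes, and `all`, `anyOf`, `not`, `hasName`, and `hasClass` are hypothetical names; this is not GoSoup's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag and Predicate are local stand-ins mirroring the documented shapes
// (Name, Attrs, func(*Tag) bool). They are not the GoSoup types.
type Tag struct {
	Name  string
	Attrs map[string]string
}

type Predicate func(*Tag) bool

// all matches only when every predicate matches (AND).
// With no predicates it matches everything.
func all(preds ...Predicate) Predicate {
	return func(t *Tag) bool {
		for _, p := range preds {
			if !p(t) {
				return false
			}
		}
		return true
	}
}

// anyOf matches when at least one predicate matches (OR).
func anyOf(preds ...Predicate) Predicate {
	return func(t *Tag) bool {
		for _, p := range preds {
			if p(t) {
				return true
			}
		}
		return false
	}
}

// not inverts a predicate; a hypothetical helper built the same way.
func not(p Predicate) Predicate {
	return func(t *Tag) bool { return !p(t) }
}

// hasName matches tags with the given name.
func hasName(name string) Predicate {
	return func(t *Tag) bool { return t.Name == name }
}

// hasClass matches tags whose space-separated class list contains class.
func hasClass(class string) Predicate {
	return func(t *Tag) bool {
		for _, c := range strings.Fields(t.Attrs["class"]) {
			if c == class {
				return true
			}
		}
		return false
	}
}

func main() {
	tag := &Tag{Name: "div", Attrs: map[string]string{"class": "container wide"}}

	fmt.Println(all(hasName("div"), hasClass("container"))(tag)) // true
	fmt.Println(anyOf(hasName("span"), hasClass("wide"))(tag))   // true
	fmt.Println(all(hasName("div"), not(hasClass("wide")))(tag)) // false
}
```

Because every combinator returns another predicate, combinations nest freely, which is what makes the approach composable.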

Custom Predicates

You can create your own predicates for specific use cases. A predicate is simply a function that takes a *Tag and returns a boolean:

// Define a custom predicate to find external links
isExternalLink := func(tag *gosoup.Tag) bool {
	if tag.Name != "a" {
		return false
	}
	href, ok := tag.Attrs["href"]
	return ok && strings.HasPrefix(href, "http")
}

// Use the custom predicate
externalLinks := root.FindAll(isExternalLink)

// Combine custom predicates with built-in ones
links := root.FindAll(gosoup.All(
	isExternalLink,
	gosoup.HasClass("important"),
))
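
A custom predicate can also be produced by a factory function that captures parameters in a closure. The sketch below uses local stand-ins for `Tag` and `Predicate` so that it runs on its own; `externalTo` is a hypothetical helper that treats a link as external when its URL names a host other than the site's.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Tag and Predicate are local stand-ins mirroring the documented shapes.
// They are not the GoSoup types.
type Tag struct {
	Name  string
	Attrs map[string]string
}

type Predicate func(*Tag) bool

// externalTo returns a predicate that matches <a> tags whose href points
// at a host other than siteHost. Relative links carry no host and are
// therefore treated as internal.
func externalTo(siteHost string) Predicate {
	return func(t *Tag) bool {
		if t.Name != "a" {
			return false
		}
		href, ok := t.Attrs["href"]
		if !ok {
			return false
		}
		u, err := url.Parse(href)
		if err != nil {
			return false
		}
		host := u.Hostname() // strips any port
		return host != "" && !strings.EqualFold(host, siteHost)
	}
}

func main() {
	isExternal := externalTo("example.com")

	links := []*Tag{
		{Name: "a", Attrs: map[string]string{"href": "https://other.org/x"}},
		{Name: "a", Attrs: map[string]string{"href": "/about"}},
		{Name: "a", Attrs: map[string]string{"href": "https://example.com/a"}},
	}
	for _, l := range links {
		fmt.Println(l.Attrs["href"], isExternal(l))
	}
}
```

Parsing the URL instead of checking for an "http" prefix avoids flagging absolute links that point back at the same site.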

Testing

Run the test suite with:

go test ./...

For a coverage report:

go test -cover ./...

Notes

GoSoup uses the Parse function from golang.org/x/net/html internally. Please note the following limitations:

  • HTML that is nested deeper than 512 elements will be rejected
  • The input is assumed to be UTF-8 encoded
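
Given the UTF-8 assumption above, it can be worth checking input before parsing. The following is a minimal sketch using only the standard library; `prepareHTML` is a hypothetical helper name, and replacing invalid bytes with U+FFFD is one possible policy among several.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// prepareHTML is a hypothetical pre-flight helper. It returns the input
// unchanged when it is valid UTF-8; otherwise it replaces each run of
// invalid bytes with U+FFFD. The second result reports whether a repair
// was needed.
func prepareHTML(raw []byte) (string, bool) {
	if utf8.Valid(raw) {
		return string(raw), false
	}
	return strings.ToValidUTF8(string(raw), "\uFFFD"), true
}

func main() {
	clean, repaired := prepareHTML([]byte("<p>caf\xc3\xa9</p>")) // valid UTF-8
	fmt.Printf("%q %v\n", clean, repaired)

	clean, repaired = prepareHTML([]byte("<p>caf\xe9</p>")) // Latin-1 byte
	fmt.Printf("%q %v\n", clean, repaired)
}
```

The resulting string can then be handed to ParseString. For pages served in a legacy encoding, transcoding to UTF-8 (for example with golang.org/x/net/html/charset) is likely a better fit than replacement; that is outside the scope of this sketch.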

License

This library is open source and available under the MIT License.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Document added in v1.2.0

type Document struct {
	// contains filtered or unexported fields
}

Represents a parsed HTML document

func Parse

func Parse(reader io.Reader) (*Document, error)

Parse an HTML document from the given reader and return the parsed Document. Because Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func ParseBytes

func ParseBytes(content []byte) (*Document, error)

Parse the given HTML document bytes and return the parsed Document. Because Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func ParseString

func ParseString(content string) (*Document, error)

Parse the given HTML document string and return the parsed Document. Because Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func (*Document) Root added in v1.2.0

func (doc *Document) Root() *Tag

Return the root tag of the document

type NavigableString

type NavigableString struct {
	Text string
}

Raw string node in HTML document

type Node added in v1.1.0

type Node interface {
	// contains filtered or unexported methods
}

Corresponds to HTML node in the document (tag, raw string, etc.)

type Predicate

type Predicate func(*Tag) bool

func All

func All(predicates ...Predicate) Predicate

func Any

func Any(predicates ...Predicate) Predicate

func AttrContains

func AttrContains(attr string, substr string) Predicate

func AttrEq

func AttrEq(attr string, value string) Predicate

func AttrMatch

func AttrMatch(attr string, pattern *regexp.Regexp) Predicate

func HasAttr

func HasAttr(attr string) Predicate

func HasClass

func HasClass(class string) Predicate

func HasName

func HasName(name string) Predicate

func HasNoAttr

func HasNoAttr(attr string) Predicate

func HasNoClass

func HasNoClass() Predicate

type Tag

type Tag struct {
	Name  string
	Attrs map[string]string
	// contains filtered or unexported fields
}

Corresponds to HTML tag in the document

func (*Tag) Children

func (tag *Tag) Children() []*Tag

Get all child tags recursively

func (*Tag) ChildrenCount added in v1.2.1

func (tag *Tag) ChildrenCount() int

Returns count of all inner tags

func (*Tag) Depth added in v1.2.1

func (tag *Tag) Depth() int

Returns depth of current tag

func (*Tag) Find

func (tag *Tag) Find(predicate Predicate) *Tag

Find the first child tag matching the predicate

func (*Tag) FindAll

func (tag *Tag) FindAll(predicate Predicate) []*Tag

Find all child tags matching the predicate

func (*Tag) FindParent

func (tag *Tag) FindParent(predicate Predicate) *Tag

Find the first parent tag matching the predicate

func (*Tag) FirstChild

func (tag *Tag) FirstChild() *Tag

Get the first child of the tag

func (*Tag) FullText

func (tag *Tag) FullText(sep ...string) string

Get all human-readable text of the subtree rooted at the current tag

func (*Tag) IterNodes added in v1.1.0

func (tag *Tag) IterNodes() iter.Seq[Node]

Iterate through all child nodes of the current tag, including raw strings

func (*Tag) Next

func (tag *Tag) Next() *Tag

Get the next sibling of the tag

func (*Tag) Parent

func (tag *Tag) Parent() *Tag

Get the parent tag

func (*Tag) Prev

func (tag *Tag) Prev() *Tag

Get the previous sibling of the tag

func (*Tag) String

func (tag *Tag) String() string

Render the subtree rooted at the current tag as HTML

func (*Tag) Text

func (tag *Tag) Text() string

Get inner text of tag, without traversing inner tags

func (*Tag) Unwrap

func (tag *Tag) Unwrap()

Remove the current tag from the tree
