gosoup

package module

v1.2.2 (latest) | Published: Feb 24, 2026 | License: MIT | Imports: 7 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/fokitto/gosoup

README ¶

GoSoup

A convenient Go library for parsing and querying HTML documents, inspired by BeautifulSoup4 for Python.

GoSoup provides a simple and intuitive API for navigating and searching HTML documents. It's built on top of the golang.org/x/net/html library and offers a more user-friendly interface for common HTML parsing tasks.

Installation

go get github.com/fokitto/gosoup

Quick Start

package main

import (
	"fmt"

	"github.com/fokitto/gosoup"
)

func main() {
	html := `
	<html>
		<body>
			<div class="container">
				<h1>Hello, World!</h1>
				<p>This is a paragraph.</p>
				<p>And this is a paragraph.</p>
			</div>
		</body>
	</html>
	`

	doc, err := gosoup.ParseString(html)
	if err != nil {
		panic(err)
	}

	// Get the root element
	root := doc.Root()

	// Find the first h1 tag
	h1 := root.Find(gosoup.HasName("h1"))
	fmt.Println(h1.Text()) // Output: Hello, World!

	// Find all paragraphs without class
	paragraphs := root.FindAll(
		gosoup.All(
			gosoup.HasName("p"),
			gosoup.HasNoClass(),
		),
	)
	for _, p := range paragraphs {
		fmt.Println(p.Text())
	}
}

API Overview

Parsing Functions

Parse(reader io.Reader) (*Document, error) - Parse HTML from an io.Reader
ParseBytes(content []byte) (*Document, error) - Parse HTML from a byte slice
ParseString(content string) (*Document, error) - Parse HTML from a string

Document Type

The Document struct represents a parsed HTML document and manages tag caching for efficient access.

Root() *Tag - Get the root HTML element of the document

Nodes

GoSoup provides a Node interface for working with different types of DOM nodes:

Tag - Represents an HTML element with:
- Name - The tag name (e.g., "div", "p", "a")
- Attrs - Map of attributes (key-value pairs)
NavigableString - Represents raw text content in the HTML document (similar to BeautifulSoup4's NavigableString)

Navigation Methods

Parent() *Tag - Get the parent tag
FirstChild() *Tag - Get the first child tag
Children() []*Tag - Get all direct child tags
ChildrenCount() int - Get the count of all direct child tags
Prev() *Tag - Get the previous sibling element
Next() *Tag - Get the next sibling element
Depth() int - Get the depth of the current tag in the document tree
IterNodes() iter.Seq[Node] - Iterate through all child nodes (both tags and text) using range loops

Content Methods

Text() string - Get the immediate text content of the tag
FullText(sep ...string) string - Get all text content recursively (with optional separator)
String() string - Render the tag and its children as HTML
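
The difference between immediate and recursive text can be illustrated with a small standalone sketch. It does not use GoSoup: the node type and the immediateText and fullText functions are hypothetical stand-ins, and the way the separator is applied (joined between text fragments, defaulting to an empty string) is an assumption rather than confirmed library behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical tree node, not a GoSoup type.
// A text node has text set; an element has children.
type node struct {
	text     string
	children []*node
}

// immediateText joins only the text nodes that are direct children of n.
func immediateText(n *node) string {
	var b strings.Builder
	for _, c := range n.children {
		if len(c.children) == 0 {
			b.WriteString(c.text)
		}
	}
	return b.String()
}

// fullText collects every non-empty text node in the subtree and joins the
// fragments with an optional separator (assumed default: no separator).
func fullText(n *node, sep ...string) string {
	joiner := ""
	if len(sep) > 0 {
		joiner = sep[0]
	}
	var parts []string
	var walk func(*node)
	walk = func(cur *node) {
		if len(cur.children) == 0 {
			if cur.text != "" {
				parts = append(parts, cur.text)
			}
			return
		}
		for _, c := range cur.children {
			walk(c)
		}
	}
	walk(n)
	return strings.Join(parts, joiner)
}

func main() {
	// Models <div>Hello <b>World</b></div>
	div := &node{children: []*node{
		{text: "Hello "},
		{children: []*node{{text: "World"}}},
	}}
	fmt.Printf("%q\n", immediateText(div)) // "Hello "
	fmt.Printf("%q\n", fullText(div))      // "Hello World"
	fmt.Printf("%q\n", fullText(div, "|")) // "Hello |World"
}
```

The variadic sep parameter mirrors the documented FullText signature, which lets callers omit the separator entirely.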

Search Methods

Find(predicate Predicate) *Tag - Find the first element matching the predicate
FindAll(predicate Predicate) []*Tag - Find all elements matching the predicate
FindParent(predicate Predicate) *Tag - Find the first parent element matching the predicate
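
An upward search of this kind can be sketched as a walk along the parent chain. The snippet below is standalone and does not use GoSoup: the Tag type with its parent field and the findParent function are hypothetical. It assumes the search starts at the parent rather than at the tag itself, and that nil is returned when nothing matches; neither is stated in the documentation above.

```go
package main

import "fmt"

// Tag is a hypothetical stand-in for gosoup.Tag with an explicit parent link.
type Tag struct {
	Name   string
	Attrs  map[string]string
	parent *Tag
}

// Predicate mirrors the documented signature func(*Tag) bool.
type Predicate func(*Tag) bool

// findParent walks up from t's parent and returns the first ancestor that
// satisfies p, or nil if no ancestor matches (an assumption of this sketch).
func findParent(t *Tag, p Predicate) *Tag {
	for cur := t.parent; cur != nil; cur = cur.parent {
		if p(cur) {
			return cur
		}
	}
	return nil
}

func main() {
	body := &Tag{Name: "body"}
	div := &Tag{Name: "div", Attrs: map[string]string{"id": "main"}, parent: body}
	span := &Tag{Name: "span", parent: div}

	isDiv := func(t *Tag) bool { return t.Name == "div" }
	isTable := func(t *Tag) bool { return t.Name == "table" }

	fmt.Println(findParent(span, isDiv).Attrs["id"]) // main
	fmt.Println(findParent(span, isTable) == nil)    // true
}
```

If the search methods do return nil on a miss, it is safer to check the result before calling methods such as Text() on it.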

DOM Manipulation

Unwrap() - Remove the tag from the tree

Working with Nodes

The IterNodes() method allows you to iterate through all child nodes, including both tags and text content:

html := `<div>Hello <b>World</b> and <i>Goodbye</i></div>`
doc, _ := gosoup.ParseString(html)
root := doc.Root()
div := root.Find(gosoup.HasName("div"))

// Iterate through all nodes (text and tags)
for node := range div.IterNodes() {
	switch n := node.(type) {
	case *gosoup.Tag:
		fmt.Printf("Tag: %s (content: %s)\n", n.Name, n.Text())
	case gosoup.NavigableString:
		fmt.Printf("Text: %q\n", n.Text)
	}
}

// Output:
// Text: "Hello "
// Tag: b (content: World)
// Text: " and "
// Tag: i (content: Goodbye)

Predicate System

To provide the flexibility similar to BeautifulSoup4, GoSoup uses a predicate system based on composable search functions. Predicates allow you to express complex selection criteria by combining simple, focused functions.

Built-in Predicates

HasName(name string) Predicate - Match by tag name
HasAttr(attr string) Predicate - Check if an attribute exists
HasNoAttr(attr string) Predicate - Check if an attribute does not exist
HasClass(class string) Predicate - Check if element has a specific CSS class
HasNoClass() Predicate - Check if element has no class attribute
AttrEq(attr, value string) Predicate - Match attribute value exactly
AttrContains(attr, substr string) Predicate - Match attribute value contains substring
AttrMatch(attr string, pattern *regexp.Regexp) Predicate - Match attribute value against regex
All(predicates ...Predicate) Predicate - Combine predicates with AND logic
Any(predicates ...Predicate) Predicate - Combine predicates with OR logic
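
To make the composition model concrete, here is a minimal standalone sketch of how AND/OR combinators over func(*Tag) bool can work. It does not import GoSoup: the Tag and Predicate types are local stand-ins that copy the documented shapes, and the function bodies are assumptions, not the library's source. In particular, HasClass here assumes the class attribute is a whitespace-separated list.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag is a local stand-in with the two exported fields documented for gosoup.Tag.
type Tag struct {
	Name  string
	Attrs map[string]string
}

// Predicate mirrors the documented signature func(*Tag) bool.
type Predicate func(*Tag) bool

// HasName matches on the tag name.
func HasName(name string) Predicate {
	return func(t *Tag) bool { return t.Name == name }
}

// HasClass matches if the class attribute contains the given class token
// (assumption: classes are separated by whitespace).
func HasClass(class string) Predicate {
	return func(t *Tag) bool {
		for _, c := range strings.Fields(t.Attrs["class"]) {
			if c == class {
				return true
			}
		}
		return false
	}
}

// All matches only if every predicate matches (AND).
func All(ps ...Predicate) Predicate {
	return func(t *Tag) bool {
		for _, p := range ps {
			if !p(t) {
				return false
			}
		}
		return true
	}
}

// Any matches if at least one predicate matches (OR).
func Any(ps ...Predicate) Predicate {
	return func(t *Tag) bool {
		for _, p := range ps {
			if p(t) {
				return true
			}
		}
		return false
	}
}

func main() {
	div := &Tag{Name: "div", Attrs: map[string]string{"class": "container wide"}}
	both := All(HasName("div"), HasClass("container"))
	either := Any(HasName("span"), HasClass("wide"))
	fmt.Println(both(div), either(div)) // true true
}
```

Because each combinator returns a Predicate, the results nest freely, for example All(HasName("a"), Any(...)).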

Combining Predicates

Predicates can be combined for more complex queries:

// Find all div tags with class "container"
divs := root.FindAll(gosoup.All(
	gosoup.HasName("div"),
	gosoup.HasClass("container"),
))

// Find elements that have id="main-link" or the class "nav"
links := root.FindAll(gosoup.Any(
	gosoup.AttrEq("id", "main-link"),
	gosoup.HasClass("nav"),
))

Custom Predicates

You can create your own predicates for specific use cases. A predicate is simply a function that takes a *Tag and returns a boolean:

// Define a custom predicate to find external links
isExternalLink := func(tag *gosoup.Tag) bool {
	if tag.Name != "a" {
		return false
	}
	href, ok := tag.Attrs["href"]
	return ok && strings.HasPrefix(href, "http")
}

// Use the custom predicate
externalLinks := root.FindAll(isExternalLink)

// Combine custom predicates with built-in ones
links := root.FindAll(gosoup.All(
	isExternalLink,
	gosoup.HasClass("important"),
))
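
Custom predicates can also be produced by helper functions. The standalone sketch below shows two such helpers: a predicate factory parameterized by URL prefixes and a negating wrapper. Both LinkWithPrefix and Not are hypothetical names that are not part of GoSoup, and the Tag and Predicate types are local stand-ins for the documented ones.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag is a local stand-in with the two exported fields documented for gosoup.Tag.
type Tag struct {
	Name  string
	Attrs map[string]string
}

// Predicate mirrors the documented signature func(*Tag) bool.
type Predicate func(*Tag) bool

// Not is a hypothetical helper that inverts a predicate.
func Not(p Predicate) Predicate {
	return func(t *Tag) bool { return !p(t) }
}

// LinkWithPrefix is a hypothetical factory: it matches <a> tags whose href
// starts with any of the given prefixes.
func LinkWithPrefix(prefixes ...string) Predicate {
	return func(t *Tag) bool {
		if t.Name != "a" {
			return false
		}
		href, ok := t.Attrs["href"]
		if !ok {
			return false
		}
		for _, p := range prefixes {
			if strings.HasPrefix(href, p) {
				return true
			}
		}
		return false
	}
}

func main() {
	external := LinkWithPrefix("http://", "https://")
	notExternal := Not(external)

	abs := &Tag{Name: "a", Attrs: map[string]string{"href": "https://example.com"}}
	rel := &Tag{Name: "a", Attrs: map[string]string{"href": "/about"}}

	fmt.Println(external(abs), external(rel), notExternal(rel)) // true false true
}
```

Note that Not(external) also matches tags that are not links at all, so it usually needs to be combined with a name check.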

Testing

Run the test suite with:

go test ./...

For coverage report:

go test -cover ./...

Notes

GoSoup uses the Parse function from golang.org/x/net/html internally. Please note the following limitations:

HTML that is nested deeper than 512 elements will be rejected
The input is assumed to be UTF-8 encoded
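
Since the parser assumes UTF-8, one option is to validate the input before parsing. The sketch below uses only the standard library; checkUTF8 and ErrNotUTF8 are hypothetical names, and the call to the parser itself is left out. It only detects invalid input and does not convert other encodings.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// ErrNotUTF8 is a hypothetical sentinel error for this sketch.
var ErrNotUTF8 = errors.New("input is not valid UTF-8")

// checkUTF8 reports whether content can be handed to a parser that
// assumes UTF-8 input, such as ParseBytes.
func checkUTF8(content []byte) error {
	if !utf8.Valid(content) {
		return ErrNotUTF8
	}
	return nil
}

func main() {
	good := []byte("<p>caf\u00e9</p>")
	bad := []byte{'<', 'p', '>', 0xff, '<', '/', 'p', '>'}

	fmt.Println(checkUTF8(good)) // <nil>
	fmt.Println(checkUTF8(bad))  // input is not valid UTF-8
}
```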

License

This library is open source and available under the MIT License.

Documentation ¶

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Document ¶ added in v1.2.0

type Document struct {
	// contains filtered or unexported fields
}

Corresponds to HTML document

func Parse ¶

func Parse(reader io.Reader) (*Document, error)

Parse an HTML document from the given reader and return the resulting *Document. Since Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func ParseBytes ¶

func ParseBytes(content []byte) (*Document, error)

Parse the given HTML document bytes and return the resulting *Document. Since Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func ParseString ¶

func ParseString(content string) (*Document, error)

Parse the given HTML document string and return the resulting *Document. Since Parse() from the golang.org/x/net/html library is used internally, its rules also apply to this function:

* "Parse will reject HTML that is nested deeper than 512 elements."

* "The input is assumed to be UTF-8 encoded."

func (*Document) Root ¶ added in v1.2.0

func (doc *Document) Root() *Tag

Return root tag

type NavigableString ¶ added in v1.1.0

type NavigableString struct {
	Text string
}

Raw string node in HTML document

type Node ¶ added in v1.1.0

type Node interface {
	// contains filtered or unexported methods
}

Corresponds to HTML node in the document (tag, raw string, etc.)

type Predicate ¶

type Predicate func(*Tag) bool

func All ¶

func All(predicates ...Predicate) Predicate

func Any ¶

func Any(predicates ...Predicate) Predicate

func AttrContains ¶

func AttrContains(attr string, substr string) Predicate

func AttrEq ¶

func AttrEq(attr string, value string) Predicate

func AttrMatch ¶

func AttrMatch(attr string, pattern *regexp.Regexp) Predicate

func HasAttr ¶

func HasAttr(attr string) Predicate

func HasClass ¶

func HasClass(class string) Predicate

func HasName ¶

func HasName(name string) Predicate

func HasNoAttr ¶

func HasNoAttr(attr string) Predicate

func HasNoClass ¶

func HasNoClass() Predicate

type Tag ¶

type Tag struct {
	Name  string
	Attrs map[string]string
	// contains filtered or unexported fields
}

Corresponds to HTML tag in the document

func (*Tag) Children ¶

func (tag *Tag) Children() []*Tag

Get all children tags recursively

func (*Tag) ChildrenCount ¶ added in v1.2.1

func (tag *Tag) ChildrenCount() int

Returns count of all inner tags

func (*Tag) Depth ¶ added in v1.2.1

func (tag *Tag) Depth() int

Returns depth of current tag

func (*Tag) Find ¶

func (tag *Tag) Find(predicate Predicate) *Tag

Find child tag by predicate

func (*Tag) FindAll ¶

func (tag *Tag) FindAll(predicate Predicate) []*Tag

Find all children tags by predicate

func (*Tag) FindParent ¶

func (tag *Tag) FindParent(predicate Predicate) *Tag

Find parent tag by predicate

func (*Tag) FirstChild ¶

func (tag *Tag) FirstChild() *Tag

Get the first child of a tag

func (*Tag) FullText ¶

func (tag *Tag) FullText(sep ...string) string

Get all human-readable text of the current tree

func (*Tag) IterNodes ¶ added in v1.1.0

func (tag *Tag) IterNodes() iter.Seq[Node]

Iterate through all children nodes of current tag, including raw strings

func (*Tag) Next ¶

func (tag *Tag) Next() *Tag

Get next sibling of tag

func (*Tag) Parent ¶

func (tag *Tag) Parent() *Tag

Get a parent tag

func (*Tag) Prev ¶

func (tag *Tag) Prev() *Tag

Get previous sibling of tag

func (*Tag) String ¶

func (tag *Tag) String() string

Render a tree with the current tag as root

func (*Tag) Text ¶

func (tag *Tag) Text() string

Get inner text of tag, without traversing inner tags

func (*Tag) Unwrap ¶

func (tag *Tag) Unwrap()

Removes current tag from a tree
