GoHtml

package module
v0.2.3 Latest
Published: Aug 30, 2025 License: BSD-3-Clause Imports: 8 Imported by: 0

README

GoHTML

An HTML parser and serializer for Go. GoHTML keeps its semantics close to the JS DOM API while keeping the API simple by not forcing the JS DOM model onto Go. Because of this, GoHTML uses a node tree model. The GoHTML tokenizer uses the golang.org/x/net/html package for tokenizing in its underlying layer. It is therefore the user's responsibility to make sure that inputs to GoHTML are UTF-8 encoded. GoHTML allows direct access to the node tree.

Installation

Run the following command in your project directory to install GoHTML.

go get github.com/udan-jayanith/GoHTML

GoHTML can then be imported like this.

import (
	GoHtml "github.com/udan-jayanith/GoHTML"
)

Features

  • Parsing
  • Serialization
  • Node tree traversing
  • Querying

Example

Here is an example that fetches a website, parses it, and then uses the querying methods. Adapted from GoQuery.

	res, err := http.Get("https://www.metalsucks.net/")
	if err != nil {
		t.Fatal(err)
	}
	defer res.Body.Close()

	node, err := GoHtml.Decode(res.Body)
	if err != nil {
		t.Fatal(err)
	}

	nodeList := node.QuerySelectorAll(".left-content article .post-title")
	for node := range nodeList.IterNodeList() {
		println(node.GetInnerText())
	}

Documentation

Full documentation is available at pkg.go.dev.

Contributions

Contributions are welcome. Pull requests and issues will be reviewed by a maintainer.

Documentation

Constants

const (
	Area   string = "area"
	Base   string = "base"
	Br     string = "br"
	Col    string = "col"
	Embed  string = "embed"
	Hr     string = "hr"
	Img    string = "img"
	Input  string = "input"
	Link   string = "link"
	Meta   string = "meta"
	Param  string = "param"
	Source string = "source"
	Track  string = "track"
	Wbr    string = "wbr"
)

Void tags

const (
	//This is not a void el. but added it anyway.
	DOCTYPEDTD string = "!DOCTYPE"
)

A DTD defines the structure and the legal elements and attributes of an XML document.

Variables

var (
	SyntaxError error = fmt.Errorf("Syntax error")
)

Functions

func Encode

func Encode(w io.Writer, rootNode *Node)

Encode writes the HTML encoding of the node tree rooted at rootNode to w.

func IsVoidTag

func IsVoidTag(tagName string) bool

IsVoidTag reports whether tagName is a void tag or the DTD (DOCTYPE) marker.
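A check of this kind can be pictured as a set lookup over the tag names listed in the constants above. The sketch below is an illustration only: `voidTags` and `isVoid` are hypothetical names, the lookup is case-sensitive by assumption, and nothing here is taken from the library's actual implementation.

```go
package main

import "fmt"

// voidTags holds the void tag names from the constants above plus the
// DOCTYPE marker. Using a map as a set is an assumption of this sketch.
var voidTags = map[string]struct{}{
	"area": {}, "base": {}, "br": {}, "col": {}, "embed": {},
	"hr": {}, "img": {}, "input": {}, "link": {}, "meta": {},
	"param": {}, "source": {}, "track": {}, "wbr": {},
	"!DOCTYPE": {},
}

// isVoid is a hypothetical stand-in for a check like IsVoidTag.
func isVoid(tagName string) bool {
	_, ok := voidTags[tagName]
	return ok
}

func main() {
	fmt.Println(isVoid("br"), isVoid("div"), isVoid("!DOCTYPE"))
}
```

A serializer needs this distinction because void elements such as br have no closing tag and no children.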

func NodeTreeToHTML

func NodeTreeToHTML(rootNode *Node) string

NodeTreeToHTML returns the HTML encoding of the node tree as a string.

func QuerySearch added in v0.2.3

func QuerySearch(node *Node, selector string) iter.Seq[*Node]

QuerySearch returns an iterator that traverses the node tree from the given node and yields the nodes that match the given selector.

Example
package main

import (
	"fmt"
	"net/http"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	//Request the html
	res, err := http.Get("https://example.com/")
	if err != nil {
		return
	}
	//Close the body on every return path, including a non-200 status.
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return
	}

	//Decode the html
	rootNode, err := GoHtml.Decode(res.Body)
	if err != nil {
		return
	}

	//Iterate over every node that matches the query.
	for node := range GoHtml.QuerySearch(rootNode, ".event-columns .column .event-block h4 a") {
		//Convert the node and its child nodes to HTML text and print it.
		fmt.Println(GoHtml.NodeTreeToHTML(node))
	}
}

Types

type BasicSelector added in v0.2.3

type BasicSelector int
const (
	Id BasicSelector = iota
	Class
	Tag
)

type ClassList

type ClassList struct {
	// contains filtered or unexported fields
}

func NewClassList

func NewClassList() ClassList

NewClassList returns a new empty ClassList.

func (ClassList) AppendClass

func (classList ClassList) AppendClass(className string)

AppendClass appends className to classList. A className that contains multiple classes is also valid.

func (ClassList) Contains

func (classList ClassList) Contains(className string) bool

Contains reports whether className exists in classList.

Example
package main

import (
	"fmt"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	//Creates a div that has classes video-container and main-contents
	div := GoHtml.CreateNode("div")
	div.SetAttribute("class", "video-container main-contents")

	classList := GoHtml.NewClassList()
	//Add the classes in the div to the class list
	classList.DecodeFrom(div)

	//Checks whether the following classes exist in the classList
	fmt.Println(classList.Contains("container"))
	fmt.Println(classList.Contains("video-container"))

}
Output:

false
true

func (ClassList) DecodeFrom added in v0.0.1

func (classList ClassList) DecodeFrom(node *Node)

DecodeFrom appends the classes in the node to classList. If node is nil, DecodeFrom does nothing.

func (ClassList) DeleteClass

func (classList ClassList) DeleteClass(className string)

DeleteClass deletes the classes specified in className from classList.

func (ClassList) Encode

func (classList ClassList) Encode() string

Encode returns the full className.

Example
package main

import (
	"fmt"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	classList := GoHtml.NewClassList()

	//Add classes to the class list
	classList.AppendClass("container")
	classList.AppendClass("warper")
	classList.AppendClass("main-content")

	//This would output something like this "warper container main-content". Order of the output is not guaranteed.
	fmt.Println(classList.Encode())
}
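One plausible reason for the unordered output is a set backed by a Go map, whose iteration order is unspecified. The source does not confirm how ClassList is implemented, so the sketch below uses a hypothetical `classSet` type to show the effect and one way to get a stable order by sorting.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// classSet is a hypothetical map-backed set of class names.
type classSet map[string]struct{}

// add splits className on whitespace and stores every class in the set.
func (s classSet) add(className string) {
	for _, c := range strings.Fields(className) {
		s[c] = struct{}{}
	}
}

// encodeSorted joins the classes in sorted order so the result is stable.
func (s classSet) encodeSorted() string {
	classes := make([]string, 0, len(s))
	for c := range s {
		classes = append(classes, c)
	}
	sort.Strings(classes)
	return strings.Join(classes, " ")
}

func main() {
	s := classSet{}
	s.add("container warper")
	s.add("main-content")
	s.add("container") // duplicates are stored only once
	fmt.Println(s.encodeSorted())
}
```

When a deterministic string is needed from the real API, the same idea might be applied to the result of Encode by splitting it with strings.Fields and sorting the parts, assuming the classes are separated by whitespace as in the comment above.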

func (ClassList) EncodeTo

func (classList ClassList) EncodeTo(node *Node)

EncodeTo encodes the class names in classList onto the node. If node is nil, EncodeTo does nothing.

type Combinator added in v0.2.3

type Combinator int
const (
	Descendant Combinator = iota
	Child
	NextSibling
	SubsequentSibling
	//if no combinator
	NoneCombinator
)

type CombinatorEl added in v0.2.3

type CombinatorEl struct {
	Type      Combinator
	Selector1 Selector
	Selector2 Selector
}

CombinatorEl represents the two selectors on either side of a combinator.

func TokenizeSelectorsAndCombinators added in v0.2.3

func TokenizeSelectorsAndCombinators(selector string) []CombinatorEl

TokenizeSelectorsAndCombinators takes a selector string, which may be a single selector or several selectors joined by combinators, and returns a slice of CombinatorEl.
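To make the idea of pairing selectors around combinators concrete, here is a rough sketch written against plain strings. The `pair` type and `splitSelector` function are hypothetical and do not reproduce the library's tokenizer. The sketch treats whitespace as the descendant combinator, recognizes ">", "+" and "~", and ignores cases such as attribute selectors.

```go
package main

import (
	"fmt"
	"strings"
)

// pair is a hypothetical stand-in for CombinatorEl. combinator is " " for
// descendant, ">", "+" or "~", and "" when there is no combinator.
type pair struct {
	left, combinator, right string
}

// splitSelector breaks a selector such as "div > .a b" into pairs.
// A lone selector yields a single pair with an empty combinator.
func splitSelector(selector string) []pair {
	// Surround explicit combinators with spaces so Fields can separate them.
	r := strings.NewReplacer(">", " > ", "+", " + ", "~", " ~ ")
	fields := strings.Fields(r.Replace(selector))

	var pairs []pair
	prev, comb := "", " "
	for _, f := range fields {
		switch f {
		case ">", "+", "~":
			comb = f
			continue
		}
		if prev != "" {
			pairs = append(pairs, pair{prev, comb, f})
		}
		prev, comb = f, " " // whitespace alone means descendant
	}
	if len(pairs) == 0 && prev != "" {
		pairs = append(pairs, pair{left: prev})
	}
	return pairs
}

func main() {
	for _, p := range splitSelector("ul>li + .item a") {
		fmt.Printf("%q %q %q\n", p.left, p.combinator, p.right)
	}
}
```

In this sketch the four combinator strings play the role of Descendant, Child, NextSibling and SubsequentSibling, and the empty string plays the role of NoneCombinator.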

type Node

type Node struct {
	// contains filtered or unexported fields
}

Node is a struct that represents an HTML element. A node can have sibling nodes (a next node and a previous node) and a child node that represents its child elements. Text is also stored as a node, which can be checked with the IsTextNode method.

func CloneNode

func CloneNode(node *Node) *Node

CloneNode copies the node. The copy keeps one-way connections to the original's parent, next, and previous nodes. If node is nil, CloneNode returns nil.

func CreateNode

func CreateNode(tagName string) *Node

CreateNode returns a new, initialized node with the given tag name.

func CreateTextNode

func CreateTextNode(text string) *Node

CreateTextNode returns a new node that represents the given text. HTML tags in text get escaped.

func Decode

func Decode(r io.Reader) (*Node, error)

Decode reads from r and creates a node tree. On success it returns the root node and a nil error.

Example
package main

import (
	"fmt"
	"strings"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	r := strings.NewReader(`
	<!DOCTYPE html>
	<html lang="en">
		<head>
			<meta charset="UTF-8">
			<meta name="viewport" content="width=device-width, initial-scale=1.0">
			<title>User Profile</title>
		</head>
		<body>
			<h1 class="username">Udan</h1>
			<p class="email">udanjayanith@gmail.com</p>
			<p>Joined: 01/08/2024</p>
		</body>
	</html>
	`)

	rootNode, _ := GoHtml.Decode(r)

	titleNode := rootNode.QuerySelector("title")
	title := ""
	if titleNode != nil {
		title = titleNode.GetInnerText()
	}
	fmt.Println(title)
}
Output:

User Profile

func DeepCloneNode

func DeepCloneNode(node *Node) *Node

DeepCloneNode clones the node without keeping references to its original parent node, previous node, and next node. If node is nil, DeepCloneNode returns nil.
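The difference between a clone that keeps its outgoing links and one that drops them can be shown with a simplified node type. Everything in this sketch is hypothetical (`node`, `shallowClone`, `detachedClone`), and it does not address how child nodes are handled, which the descriptions above do not specify.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a tree node.
type node struct {
	tag                string
	parent, prev, next *node
}

// shallowClone copies the node and keeps its outgoing links, so the copy
// still points at the original's parent and siblings (one-way links).
func shallowClone(n *node) *node {
	if n == nil {
		return nil
	}
	c := *n
	return &c
}

// detachedClone copies the node and clears the links to the original's
// parent and siblings.
func detachedClone(n *node) *node {
	c := shallowClone(n)
	if c != nil {
		c.parent, c.prev, c.next = nil, nil, nil
	}
	return c
}

func main() {
	parent := &node{tag: "ul"}
	first := &node{tag: "li", parent: parent}
	second := &node{tag: "li", parent: parent, prev: first}
	first.next = second

	a := shallowClone(first)
	b := detachedClone(first)
	fmt.Println(a.parent == parent, a.next == second) // links kept
	fmt.Println(b.parent == nil, b.next == nil)       // links cleared
	fmt.Println(second.prev == first)                 // original chain untouched
}
```

The links are one-way because neither copy is referenced by the original parent or siblings.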

func HTMLToNodeTree

func HTMLToNodeTree(html string) (*Node, error)

HTMLToNodeTree parses html and returns it as a node tree. If an error occurs, it is SyntaxError.

func (*Node) Append

func (node *Node) Append(newNode *Node)

Append inserts newNode at the end of the node chain.

func (*Node) AppendChild

func (node *Node) AppendChild(childNode *Node)

AppendChild adds childNode to the end of the list of children of the node.

func (*Node) AppendText

func (node *Node) AppendText(text string)

AppendText appends text to the node.

func (*Node) Closest added in v0.2.3

func (node *Node) Closest(selector string) *Node

Closest traverses the node and its parents (heading toward the root node) until it finds a node that matches the selector, and returns that node. Adapted from the MDN Element: closest() method documentation (https://developer.mozilla.org/en-US/docs/Web/API/Element/closest).
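The upward walk can be sketched with a simplified node that only knows its parent. The `node` type and `closest` function here are hypothetical, the match is expressed as a predicate instead of a selector string, and returning nil when nothing matches is an assumption borrowed from the MDN method.

```go
package main

import "fmt"

// node is a hypothetical, simplified element with a tag, an id and a parent.
type node struct {
	tag, id string
	parent  *node
}

// closest returns the first node, starting at n and moving toward the root,
// for which match returns true. It returns nil if nothing matches.
func closest(n *node, match func(*node) bool) *node {
	for cur := n; cur != nil; cur = cur.parent {
		if match(cur) {
			return cur
		}
	}
	return nil
}

func main() {
	body := &node{tag: "body"}
	section := &node{tag: "section", id: "main", parent: body}
	span := &node{tag: "span", parent: section}

	isSection := func(n *node) bool { return n.tag == "section" }
	if found := closest(span, isSection); found != nil {
		fmt.Println(found.id)
	}
}
```

The starting node is tested first, so a node that matches is its own closest match.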

func (*Node) GetAttribute

func (node *Node) GetAttribute(attributeName string) (string, bool)

GetAttribute returns the value of the specified attribute from the node. If the specified attribute doesn't exist, GetAttribute returns an empty string and false.

func (*Node) GetChildNode

func (node *Node) GetChildNode() *Node

GetChildNode returns the first child node of this node.

func (*Node) GetElementByClassName added in v0.0.1

func (node *Node) GetElementByClassName(className string) *Node

GetElementByClassName returns the first node that matches the given className by advancing from the node.

func (*Node) GetElementByID added in v0.0.1

func (node *Node) GetElementByID(idName string) *Node

GetElementByID returns the first node that matches the given idName by advancing from the node.

func (*Node) GetElementByTagName added in v0.0.1

func (node *Node) GetElementByTagName(tagName string) *Node

GetElementByTagName returns the first node that matches the given tagName by advancing from the node.

func (*Node) GetElementsByClassName added in v0.0.1

func (node *Node) GetElementsByClassName(className string) NodeList

GetElementsByClassName returns a NodeList containing nodes that have the given className from the node.

func (*Node) GetElementsById added in v0.0.1

func (node *Node) GetElementsById(idName string) NodeList

GetElementsById returns a NodeList containing nodes that have the given idName from the node.

func (*Node) GetElementsByTagName added in v0.0.1

func (node *Node) GetElementsByTagName(tagName string) NodeList

GetElementsByTagName returns a NodeList containing nodes that have the given tagName from the node.

func (*Node) GetFirstNode

func (node *Node) GetFirstNode() *Node

GetFirstNode returns the first node of the node branch.

func (*Node) GetInnerText

func (node *Node) GetInnerText() string

GetInnerText returns all of the text inside the node.

func (*Node) GetLastNode

func (node *Node) GetLastNode() *Node

GetLastNode returns the last node in the node branch.

func (*Node) GetNextNode

func (node *Node) GetNextNode() *Node

GetNextNode returns the node next to the node.

func (*Node) GetParent

func (node *Node) GetParent() *Node

GetParent returns a pointer to the parent node.

func (*Node) GetPreviousNode

func (node *Node) GetPreviousNode() *Node

GetPreviousNode returns the previous node.

func (*Node) GetTagName

func (node *Node) GetTagName() string

GetTagName returns the tag name of the node as a string.

func (*Node) GetText

func (node *Node) GetText() string

GetText returns the text on the node itself. It does not return the text of its child nodes. To include the text of child nodes as well, use the GetInnerText method on the node. HTML tags in the returned value are escaped.
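The distinction between a node's own text and the text of its whole subtree can be illustrated with a toy tree. The `node` type below is hypothetical and stores children in a slice for brevity, which is not how the library's sibling-linked nodes are described. `innerText` is likewise only a sketch of the general idea.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified node: elements have a tag and
// children, text nodes carry only text.
type node struct {
	tag      string
	text     string
	children []*node
}

// innerText concatenates the text of n and of all its descendants,
// in document order.
func innerText(n *node) string {
	var sb strings.Builder
	var walk func(*node)
	walk = func(cur *node) {
		if cur == nil {
			return
		}
		sb.WriteString(cur.text)
		for _, c := range cur.children {
			walk(c)
		}
	}
	walk(n)
	return sb.String()
}

func main() {
	// Models <p>Hello, <b>world</b>!</p>
	p := &node{tag: "p", children: []*node{
		{text: "Hello, "},
		{tag: "b", children: []*node{{text: "world"}}},
		{text: "!"},
	}}
	fmt.Printf("%q\n", p.text)       // own text only
	fmt.Printf("%q\n", innerText(p)) // text of the whole subtree
}
```

In this toy model the p element has no text of its own, so only the subtree walk recovers the full sentence.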

func (*Node) IsTextNode

func (node *Node) IsTextNode() bool

IsTextNode reports whether the node is a text node.

func (*Node) IterateAttributes

func (node *Node) IterateAttributes(callback func(attribute, value string))

IterateAttributes calls callback for every attribute of the node, passing the attribute name and its value.

func (*Node) QuerySelector added in v0.0.2

func (node *Node) QuerySelector(selector string) *Node

QuerySelector returns the first node that matches the selector, searching from the node.

Example
package main

import (
	"fmt"
	"net/http"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	res, err := http.Get("https://example.com/")
	if err != nil {
		return
	}
	//Close the body once, on every return path.
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return
	}

	rootNode, err := GoHtml.Decode(res.Body)
	if err != nil {
		return
	}

	title := rootNode.QuerySelector("title")
	if title != nil {
		fmt.Println(title.GetInnerText())
		//Example Domain
	}
}

func (*Node) QuerySelectorAll added in v0.0.2

func (node *Node) QuerySelectorAll(selector string) NodeList

QuerySelectorAll returns a NodeList of the nodes that match the selector, searching from the node.

func (*Node) RemoveAttribute

func (node *Node) RemoveAttribute(attributeName string)

RemoveAttribute removes the specified attribute from the node.

func (*Node) RemoveNode

func (node *Node) RemoveNode()

RemoveNode removes the node from the branch safely by connecting sibling nodes.
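Connecting the sibling nodes around a removed node is the standard unlink step of a doubly linked list. The sketch below uses a hypothetical `node` type and `unlink` function and leaves out parent and child bookkeeping, which a real tree would also need to handle.

```go
package main

import "fmt"

// node is a hypothetical, simplified node in a doubly linked sibling chain.
type node struct {
	tag        string
	prev, next *node
}

// unlink removes n from its chain by connecting its neighbours to each
// other, then clears n's own links.
func unlink(n *node) {
	if n == nil {
		return
	}
	if n.prev != nil {
		n.prev.next = n.next
	}
	if n.next != nil {
		n.next.prev = n.prev
	}
	n.prev, n.next = nil, nil
}

func main() {
	a, b, c := &node{tag: "a"}, &node{tag: "b"}, &node{tag: "c"}
	a.next, b.prev = b, a
	b.next, c.prev = c, b

	unlink(b)
	fmt.Println(a.next.tag, c.prev.tag)
}
```

After the call, a and c point at each other and b no longer references the chain.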

func (*Node) SetAttribute

func (node *Node) SetAttribute(attribute, value string)

SetAttribute adds an attribute with the given value to the node.

func (*Node) SetNextNode

func (node *Node) SetNextNode(nextNode *Node)

SetNextNode sets the node's next node to nextNode.

func (*Node) SetPreviousNode

func (node *Node) SetPreviousNode(previousNode *Node)

SetPreviousNode sets the node's previous node to previousNode.

func (*Node) SetTagName

func (node *Node) SetTagName(tagName string)

SetTagName changes the node's HTML tag name to tagName.

func (*Node) SetText

func (node *Node) SetText(text string)

SetText adds text to the node. SetText unescapes entities, so "&lt;" becomes "<".

type NodeList added in v0.0.1

type NodeList struct {
	// contains filtered or unexported fields
}

NodeList stores nodes in the order they were appended. The list can be iterated over with the IterNodeList method.

Example
package main

import (
	"fmt"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	nodeList := GoHtml.NewNodeList()
	nodeList.Append(GoHtml.CreateNode("br"))
	nodeList.Append(GoHtml.CreateNode("hr"))
	nodeList.Append(GoHtml.CreateNode("div"))

	iter := nodeList.IterNodeList()
	for node := range iter {
		fmt.Println(node.GetTagName())
	}
}
Output:

br
hr
div

func NewNodeList added in v0.0.1

func NewNodeList() NodeList

NewNodeList returns an initialized node list.

func (*NodeList) Append added in v0.0.1

func (nl *NodeList) Append(node *Node)

Append appends a node to the back of the list.

func (*NodeList) Back added in v0.0.1

func (nl *NodeList) Back() *Node

Back returns the last node of list or nil if the list is empty.

func (*NodeList) Front added in v0.0.1

func (nl *NodeList) Front() *Node

Front returns the first node of list or nil if the list is empty.

func (*NodeList) IterNodeList added in v0.0.1

func (nl *NodeList) IterNodeList() iter.Seq[*Node]

IterNodeList returns an iterator over the node list.

func (*NodeList) Len added in v0.0.1

func (nl *NodeList) Len() int

Len returns the number of nodes in the list. The complexity is O(1).

func (*NodeList) Next added in v0.0.1

func (nl *NodeList) Next() *Node

Next advances to the next node and returns that node.

func (*NodeList) Previous added in v0.0.1

func (nl *NodeList) Previous() *Node

Previous advances to the previous node and returns that node.

type NodeTreeBuilder added in v0.2.3

type NodeTreeBuilder struct {
	// contains filtered or unexported fields
}

NodeTreeBuilder is used to build a node tree from nodes and their token types.

func NewNodeTreeBuilder added in v0.2.3

func NewNodeTreeBuilder() NodeTreeBuilder

NewNodeTreeBuilder returns a new NodeTreeBuilder.

func (*NodeTreeBuilder) GetRootNode added in v0.2.3

func (ntb *NodeTreeBuilder) GetRootNode() *Node

GetRootNode returns the root node of the accumulated node tree and resets the NodeTreeBuilder.

func (*NodeTreeBuilder) WriteNodeTree added in v0.2.3

func (ntb *NodeTreeBuilder) WriteNodeTree(node *Node, tt html.TokenType)

WriteNodeTree appends the node to the node tree according to the given html.TokenType.

type Selector added in v0.2.3

type Selector struct {
	// contains filtered or unexported fields
}

Selector represents a single CSS selector, for example .my-class, #video, or div.

func NewSelector added in v0.2.3

func NewSelector(selector string) Selector

NewSelector takes a single CSS selector and returns a Selector struct. The selector string must be a single basic selector.
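The three kinds of basic selector (id, class, tag) can be told apart by their leading character. The sketch below shows one way to do that. The `kind` type and `classify` function are hypothetical, mirror the BasicSelector constants only by analogy, and are not the library's parsing code.

```go
package main

import (
	"fmt"
	"strings"
)

// kind is a hypothetical stand-in for BasicSelector.
type kind int

const (
	idKind kind = iota
	classKind
	tagKind
)

// classify reports the kind of a basic selector and its name without the
// leading "#" or "." marker.
func classify(selector string) (kind, string) {
	selector = strings.TrimSpace(selector)
	switch {
	case strings.HasPrefix(selector, "#"):
		return idKind, selector[1:]
	case strings.HasPrefix(selector, "."):
		return classKind, selector[1:]
	default:
		return tagKind, selector
	}
}

func main() {
	for _, s := range []string{".my-class", "#video", "div"} {
		k, name := classify(s)
		fmt.Println(k, name)
	}
}
```

A selector that combines several parts, such as "div.card", is outside the scope of this sketch, in line with the requirement that the input be a single basic selector.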

type Tokenizer added in v0.2.3

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer contains a *html.Tokenizer.

Example
package main

import (
	"fmt"
	"net/http"

	GoHtml "github.com/udan-jayanith/GoHTML"
	"golang.org/x/net/html"
)

func main() {
	//Request the html
	res, err := http.Get("https://go.dev/")
	if err != nil {
		return
	}
	//Close the body on every return path, including a non-200 status.
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return
	}

	//NewTokenizer takes an io.Reader that supplies UTF-8 encoded HTML and returns a Tokenizer.
	t := GoHtml.NewTokenizer(res.Body)
	//NewNodeTreeBuilder returns a new NodeTreeBuilder that can be used to build a node tree.
	nodeTreeBuilder := GoHtml.NewNodeTreeBuilder()
	for {
		//Advanced scans the next token and returns its type.
		tt := t.Advanced()
		if tt == html.ErrorToken {
			break
		}

		//WriteNodeTree takes a node and a token type. The node can be nil, for example when the token type is EndTagToken.
		nodeTreeBuilder.WriteNodeTree(t.GetCurrentNode(), tt)
	}

	//Print the root node of the node tree held by the nodeTreeBuilder.
	fmt.Println(nodeTreeBuilder.GetRootNode())
}

func NewTokenizer added in v0.2.3

func NewTokenizer(r io.Reader) Tokenizer

NewTokenizer returns a new Tokenizer.

func (*Tokenizer) Advanced added in v0.2.3

func (t *Tokenizer) Advanced() html.TokenType

Advanced scans the next token and returns its type.

func (*Tokenizer) GetCurrentNode added in v0.2.3

func (t *Tokenizer) GetCurrentNode() *Node

GetCurrentNode returns the current node. The returned value can be nil regardless of token type.

type TraverseCondition

type TraverseCondition = bool
const (
	StopWalkthrough     TraverseCondition = false
	ContinueWalkthrough TraverseCondition = true
)

type Traverser

type Traverser struct {
	// contains filtered or unexported fields
}

func NewTraverser added in v0.0.1

func NewTraverser(startingNode *Node) Traverser

NewTraverser returns a new traverser that can be used to navigate the node tree.

func (*Traverser) GetCurrentNode

func (t *Traverser) GetCurrentNode() *Node

GetCurrentNode returns the current node.

func (*Traverser) Next

func (t *Traverser) Next() *Node

Next returns the node next to the current node and changes the current node to that node. Make sure the traverser's current node is not nil, otherwise the program will panic.

func (*Traverser) Previous

func (t *Traverser) Previous() *Node

Previous returns the previous node and changes the current node to that node. Make sure the traverser's current node is not nil, otherwise the program will panic.

func (*Traverser) SetCurrentNodeTo

func (t *Traverser) SetCurrentNodeTo(newNode *Node)

SetCurrentNodeTo changes the current node to the newNode.

func (*Traverser) Walkthrough

func (t *Traverser) Walkthrough(callback func(node *Node) TraverseCondition)

Walkthrough traverses the node tree from the current node to the end of the node tree, visiting every node. The traversal is iterative and similar to DFS, and it does not revisit nodes that were already visited. Walkthrough can be used as a range-over-func iterator or called directly with a callback that receives every node one by one.

Example
package main

import (
	"fmt"

	GoHtml "github.com/udan-jayanith/GoHTML"
)

func main() {
	//Creation of the node tree.
	body := GoHtml.CreateNode("body")
	h1 := GoHtml.CreateNode("h1")
	h1.AppendText("This is a heading")
	body.AppendChild(h1)
	p := GoHtml.CreateNode("p")
	p.AppendText("The HTML <p>tag is a fundamental element used for creating paragraphs in web development. It helps structure content, separating text into distinct blocks. When you wrap text within <p>... </p>tags, you tell browsers to treat the enclosed content as a paragraph.")
	body.AppendChild(p)

	traverser := GoHtml.NewTraverser(body)

	for node := range traverser.Walkthrough {
		fmt.Println(node)
	}
	//or
	traverser.Walkthrough(func(node *GoHtml.Node) GoHtml.TraverseCondition {
		fmt.Println(node)
		return true
	})

}
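An iterative depth-first walk with an early-stop callback can be sketched as follows. The `node` type and `walk` function are hypothetical and simplified: the sketch covers the start node, its descendants and its following siblings, and it is not the library's Walkthrough. The boolean returned by the callback plays the role of ContinueWalkthrough (true) and StopWalkthrough (false).

```go
package main

import "fmt"

// node is a hypothetical, simplified node using first-child and
// next-sibling links.
type node struct {
	tag         string
	child, next *node
}

// walk visits start and every node reachable from it in depth-first order
// using an explicit stack. It stops early when visit returns false.
func walk(start *node, visit func(*node) bool) {
	stack := []*node{start}
	for len(stack) > 0 {
		cur := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if cur == nil {
			continue
		}
		if !visit(cur) {
			return
		}
		// Push the sibling first so the child is popped and visited before it.
		stack = append(stack, cur.next, cur.child)
	}
}

func main() {
	// Models <body><h1></h1><p><b></b></p></body>
	b := &node{tag: "b"}
	p := &node{tag: "p", child: b}
	h1 := &node{tag: "h1", next: p}
	body := &node{tag: "body", child: h1}

	walk(body, func(n *node) bool {
		fmt.Println(n.tag)
		return n.tag != "p" // stop once <p> has been visited
	})
}
```

Because the callback has the shape of a yield function, a method with this signature can also be used in a range statement on Go versions that support range-over-func, which is how the example above uses Walkthrough.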
