parsing

package module

v1.0.7 (latest) Published: Jan 12, 2023 License: GPL-3.0 Imports: 6 Imported by: 0

Repository

github.com/iwdgo/htmlutils

README ¶

Exploring HTML structure

HTML is parsed using golang.org/x/net/html which produces a tree.

The module provides basic functionality to compare HTML tags or nodes and their trees. Searching for an HTML tag using an *html.Node value ignores pointers and always returns the first match. By ignoring some properties, tags like <button> are easy to count. The text value of a tag (title, error message, ...) can be checked.

Good to know

Parsing does not apply a complete HTML syntax check. For instance, tags like <p>, whose closing tag may be omitted, can cause a comparison to fail.

Siblings must always have the same order or comparison fails. Order of attributes is treated as irrelevant.

How to start

Detailed documentation includes examples.

Versions

v1.0.6 updates the golang.org/x/net package to remove CVE-2022-27664, which does not affect x/net/html.
v1.0.5 requires Go 1.16+ as use of the ioutil package is removed.
v1.0.4 requires Go 1.17+ which implements lazy loading of modules to avoid go.mod updates.
v1.0.0 was created on Go 1.12 which supports modules.

Documentation ¶

Overview ¶

Package parsing provides basic search and comparison of HTML documents. To limit storage of references, it uses the net/html package and its Node type to structure HTML.

Search a tag in a Node with options

searching for a tag by name, whatever its attributes; the node type is optional
searching for a tag by its non-pointer values: type, name, attributes and namespace
comparing tags, including their list of attributes, where order is irrelevant
comparing Node structures, optionally restricted to one node type

Three ways to print a node tree

select a type of node and the node value at which to stop.
select a type of node, or none.
print the complete tree with indentation.

Good to know

a non-matching closed tag is one element.
a non-closed tag is closed by the following opening tag. The elements that follow are discarded as the tag is closed by the parser.

Index ¶

func AttrIncluded(m, n *html.Node) bool
func Equal(m, n *html.Node) bool
func ExploreNode(n *html.Node, s string, t html.NodeType)
func FindNode(m *html.Node, n html.Node) *html.Node
func FindTag(n *html.Node, s string, t html.NodeType) *html.Node
func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)
func GetText(m *html.Node, b *bytes.Buffer)
func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node
func IncludedNode(m, n *html.Node) *html.Node
func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node
func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error
func IsTextTag(b io.ReadCloser, t, s string) error
func ParseFile(f string) (*html.Node, error)
func PrintData(n *html.Node) string
func PrintNodes(m, n *html.Node, t html.NodeType, d int)
func PrintTags(n *html.Node, s string, tagOnly bool)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func AttrIncluded ¶

func AttrIncluded(m, n *html.Node) bool

AttrIncluded returns true if list of attributes of n is included in reference node m whatever their order.
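
The order-insensitive inclusion described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the attr type is a hypothetical stand-in for html.Attribute, and attrIncluded is an assumed helper name.

```go
package main

import "fmt"

// attr is a hypothetical stand-in for html.Attribute.
type attr struct {
	Namespace, Key, Val string
}

// attrIncluded reports whether every attribute of sub also appears in ref,
// whatever the position of the attribute in either list.
func attrIncluded(ref, sub []attr) bool {
	seen := make(map[attr]bool, len(ref))
	for _, a := range ref {
		seen[a] = true
	}
	for _, a := range sub {
		if !seen[a] {
			return false
		}
	}
	return true
}

func main() {
	ref := []attr{{Key: "class", Val: "fixed"}, {Key: "id", Val: "t1"}}
	sub := []attr{{Key: "id", Val: "t1"}}
	fmt.Println(attrIncluded(ref, sub)) // true
	fmt.Println(attrIncluded(sub, ref)) // false
}
```

Using a set keyed on the whole attribute keeps the check independent of ordering, which matches the rule that attribute order is irrelevant.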

func Equal ¶

func Equal(m, n *html.Node) bool

Equal returns true if all fields of nodes m and n are equal, except pointers. reflect.DeepEqual(tag1, tag2) is unusable as pointers are checked too.

func ExploreNode ¶

func ExploreNode(n *html.Node, s string, t html.NodeType)

ExploreNode prints node tags with name s and type t. Without a name, all tags are printed. When the type is ErrorNode (iota == 0), tags of all types are printed.

Example (All) ¶

ExampleExploreNode_all prints the complete node tree.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.ExploreNode(o, "", html.ErrorNode)
}

Output:

(Document)
 html (Element)
 head (Element) body (Element)
 p (Element) [{ class ex1}]
 HTML Fragment to compare against  (Text) em (Element)
 others below (Text) to test  (Text) sub (Element)
 diffs (Text)

Example (Tags) ¶

ExampleExploreNode_tags only prints text.

package main

import (
	"bytes"
	"fmt"
	"log"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, err := html.Parse(b) // Only place where err of Parse is checked
	if err != nil {
		log.Fatalf("parsing error:%v\n", err)
	}
	parsing.ExploreNode(o, "", html.TextNode)
}

Output:

HTML Fragment to compare against  (Text)
 others below (Text) to test  (Text)
 diffs (Text)

func FindNode ¶

func FindNode(m *html.Node, n html.Node) *html.Node

FindNode finds the first occurrence of node n in the tree m. Pointer fields are ignored in the comparison.

func FindTag ¶

func FindTag(n *html.Node, s string, t html.NodeType) *html.Node

FindTag finds the first occurrence of a tag name (i.e. whatever its attributes). If ErrorNode is passed, any tag type will be searched.

func FindTags ¶

func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)

FindTags finds all occurrences of a tag name whatever their attributes. If ErrorNode is passed, any tag type will be searched.
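
The difference between a first-match search and an all-matches search can be sketched on a reduced tree. The node type, tree, findFirst and findAll are hypothetical names used for illustration; the sketch assumes a depth-first walk over FirstChild and NextSibling links, as html.Node provides, and does not reproduce the package's code.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node.
type node struct {
	Data                    string
	FirstChild, NextSibling *node
}

// tree builds a node named data and links children as ordered siblings.
func tree(data string, children ...*node) *node {
	n := &node{Data: data}
	for i := len(children) - 1; i >= 0; i-- {
		children[i].NextSibling = n.FirstChild
		n.FirstChild = children[i]
	}
	return n
}

// findFirst returns the first node named name in depth-first order, or nil.
func findFirst(n *node, name string) *node {
	if n == nil {
		return nil
	}
	if n.Data == name {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if f := findFirst(c, name); f != nil {
			return f
		}
	}
	return nil
}

// findAll returns every node named name in depth-first order.
func findAll(n *node, name string) []*node {
	if n == nil {
		return nil
	}
	var out []*node
	if n.Data == name {
		out = append(out, n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, findAll(c, name)...)
	}
	return out
}

func main() {
	doc := tree("body", tree("p", tree("button")), tree("button"))
	fmt.Println(findFirst(doc, "button") != nil) // true
	fmt.Println(len(findAll(doc, "button")))     // 2
}
```

Counting tags such as <button> then amounts to taking the length of the slice returned by the all-matches search.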

func GetText ¶

func GetText(m *html.Node, b *bytes.Buffer)

GetText writes the text content of a tree structure to b, like PrintNodes but without any formatting. TODO: check usage of the (*Tokenizer).Text equivalent in the net/html package.

Example ¶

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	_, _ = fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // Any parsing error would have occurred elsewhere
	w := new(bytes.Buffer)
	parsing.GetText(o, w)
	if s := fmt.Sprint(w); s != "HTML Fragment to compare against others below to test diffs" {
		fmt.Println("incorrect text")
	}
}

func IdenticalNodes ¶

func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node

IdenticalNodes fails if trees have different size

func IncludedNode ¶

func IncludedNode(m, n *html.Node) *html.Node

IncludedNode checks if n is included in m. Included means that the subtree is identical to m including order of siblings. If it is identical, nil is returned. Otherwise, the tag from which trees diverge is returned. If m has more tags than n, nil is returned as the search stops when one subtree exploration is exhausted.

Example ¶

ExampleIncludedNode uses the test files to demonstrate usage.

// f1 is the main table tag included in f2
toFind := html.Node{Type: html.ElementNode,
	Data: "table",
	Attr: []html.Attribute{{Namespace: "", Key: "class", Val: "fixed"}},
}
pm, _ := ParseFile(f1)
m := FindNode(pm, toFind) // searching <table> in f1
if m == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f1)
}

pn, _ := ParseFile(f2)
n := FindNode(pn, toFind) // searching <table> in f2
if n == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f2)
}
// Is m (from f1) included in n (from f2)
if f := IncludedNode(n, m); f != nil {
	fmt.Printf("nodes structures diverge from : %s\n", PrintData(f))
}
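
The inclusion check can be pictured as a parallel walk over both trees that stops at the first divergence, or when either tree is exhausted. The sketch below is an assumption about the general technique, not the package's code: node, tree and diverge are hypothetical names, only tag names are compared, and returning the node from the second tree is an arbitrary choice.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node.
type node struct {
	Data                    string
	FirstChild, NextSibling *node
}

// tree builds a node named data and links children as ordered siblings.
func tree(data string, children ...*node) *node {
	n := &node{Data: data}
	for i := len(children) - 1; i >= 0; i-- {
		children[i].NextSibling = n.FirstChild
		n.FirstChild = children[i]
	}
	return n
}

// diverge walks m and n in parallel and returns the first node of n whose
// name differs from its counterpart in m. It returns nil when no difference
// is seen before one of the two trees is exhausted.
func diverge(m, n *node) *node {
	for m != nil && n != nil {
		if m.Data != n.Data {
			return n
		}
		if d := diverge(m.FirstChild, n.FirstChild); d != nil {
			return d
		}
		m, n = m.NextSibling, n.NextSibling
	}
	return nil
}

func main() {
	m := tree("table", tree("tr", tree("td")), tree("tr"))
	same := tree("table", tree("tr", tree("td")))
	swapped := tree("table", tree("tr", tree("th")), tree("tr"))

	fmt.Println(diverge(m, same) == nil) // true: m only has extra tags
	fmt.Println(diverge(m, swapped).Data) // th
}
```

Because the walk pairs siblings by position, a change in sibling order shows up as a divergence, which is consistent with the rule that siblings must keep the same order.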

func IncludedNodeTyped ¶

func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node

IncludedNodeTyped is like IncludedNode, except that only tags of type t are compared.

func IsTextNode ¶

func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error

IsTextNode checks the presence of a node and its text value in a buffer. An error message is returned if the node is not found or if the text is not the expected one.

func IsTextTag ¶

func IsTextTag(b io.ReadCloser, t, s string) error

IsTextTag checks the presence of a tag and its text value in a buffer. An error message is returned if the tag is not found or if the text is not the expected one.

func ParseFile ¶

func ParseFile(f string) (*html.Node, error)

ParseFile returns a *Node containing the parsed file or an error (file or parsing)

func PrintData ¶

func PrintData(n *html.Node) string

PrintData returns a string with Node information (not its relationships). Passing nil will panic.

func PrintNodes ¶

func PrintNodes(m, n *html.Node, t html.NodeType, d int)

PrintNodes prints the tree structure of node m until a node equal to n is found. If nil is passed, the complete node is printed. Values are indented based on the recursion depth d, which is usually 0 when called. html.ErrorNode (iota) displays every tag except the error node.

Example (WSearch) ¶

ExamplePrintNodes_wSearch is the previous example stopping at a searched node.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)

	var tagToFind html.Node
	tagToFind.Type = html.ElementNode
	tagToFind.Data = "p"
	tagToFind.Attr = []html.Attribute{{Namespace: "", Key: "class", Val: "ex1"}}

	parsing.PrintNodes(o, &tagToFind, html.ErrorNode, 0)
}

Output:

html (Element)
. head (Element) body (Element)
.. p (Element) [{ class ex1}]
tag found: p (Element) [{ class ex1}]
... HTML Fragment to compare against  (Text) em (Element)
.... others below (Text) to test  (Text) sub (Element)
.... diffs (Text)

Example (WoSearch) ¶

ExamplePrintNodes_woSearch prints all nodes without using search.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintNodes(o, nil, html.ErrorNode, 0)
}

Output:

html (Element)
. head (Element) body (Element)
.. p (Element) [{ class ex1}]
... HTML Fragment to compare against  (Text) em (Element)
.... others below (Text) to test  (Text) sub (Element)
.... diffs (Text)

func PrintTags ¶

func PrintTags(n *html.Node, s string, tagOnly bool)

PrintTags prints the node structure until a tag name is found (whatever its attributes). Without a name, all tags are printed. tagOnly selects ElementNode; otherwise tags are printed whatever their type. If the node tree has no ErrorNode, i.e. exploreNode(n, "", html.ErrorNode) prints nothing, then both are equivalent.

Example (WSearch) ¶

ExamplePrintTags_wSearch is the previous example stopping at a searched tag

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // err ignored as failure is detected before
	parsing.PrintTags(o, "em", true)
}

Output:

html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
em (Element)
[em] found. Stopping exploration
sub (Element)

Example (WoSearch) ¶

ExamplePrintTags_woSearch is not using the search part.

package main

import (
	"bytes"
	"fmt"

	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintTags(o, "", false) // +1,6%
}

Output:

(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against  (Text)
em (Element)
others below (Text)
to test  (Text)
sub (Element)
diffs (Text)

Types ¶

This section is empty.

Source Files ¶

parsing.go