README

htmlquery


Overview

htmlquery is an XPath query package for HTML. It lets you extract data from HTML documents, or evaluate expressions against them, using XPath expressions.

htmlquery has a built-in, LRU-based query object cache that stores recently used XPath query strings. With query caching enabled, an XPath expression does not need to be recompiled on every query.

Installation

go get github.com/antchfx/htmlquery

Getting Started

Query, returning the matched elements or an error.
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}
Load HTML document from URL.
doc, err := htmlquery.LoadURL("http://example.com/")
Load HTML document from file.
filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)
Load HTML document from string.
s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))
Find all A elements.
list := htmlquery.Find(doc, "//a")
Find all A elements that have an href attribute.
list := htmlquery.Find(doc, "//a[@href]")
Find all A elements with an href attribute and return only the href value.
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.InnerText(n)) // output @href value without A element.
}
Find the third A element.
a := htmlquery.FindOne(doc, "//a[3]")
Evaluate the number of all IMG elements.
expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

FAQ

Find() vs QueryAll(), which is better?

Find and QueryAll do the same thing: both search for all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. The QuerySelector and QuerySelectorAll methods accept your query expression object.

Caching (or reusing) a query expression object avoids recompiling the XPath query expression and improves query performance.

XPath query object cache performance
goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op
How to disable caching?
htmlquery.DisableSelectorCache = true

Changelogs

2019-11-19

  • Added a built-in query object cache to avoid recompiling the same query string. #16
  • Added LoadDoc. #18

2019-10-05

  • Added new methods that return an error for an invalid XPath expression: QueryAll and Query.
  • Added QuerySelector and QuerySelectorAll methods, which support reusing your query object.

2019-02-04

  • #7 Removed deprecated FindEach() and FindEachWithBreak() methods.

2018-12-28

  • Avoid adding duplicate elements to list for Find() method. #6

Tutorial

package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		// FindOne returns nil when the item contains no A element.
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}

List of supported XPath query packages

Name        Description
htmlquery   XPath query package for the HTML document
xmlquery    XPath query package for the XML document
jsonquery   XPath query package for the JSON document

Questions

Please let me know if you have any questions.

Documentation

Overview

  Package htmlquery extracts data from HTML documents using XPath expressions.

Constants

  This section is empty.

Variables

var DisableSelectorCache = false

  DisableSelectorCache disables caching for the query selector if set to true.

var SelectorCacheMaxEntries = 50

  SelectorCacheMaxEntries sets how many selector objects can be cached. The default is 50. Caching is disabled if SelectorCacheMaxEntries <= 0.

Functions

func Find

func Find(top *html.Node, expr string) []*html.Node

  Find is like QueryAll but panics if the expression `expr` cannot be parsed.

  See the `QueryAll()` function.

func FindOne

func FindOne(top *html.Node, expr string) *html.Node

  FindOne is like Query but panics if the expression `expr` cannot be parsed. See the `Query()` function.

func InnerText

func InnerText(n *html.Node) string

  InnerText returns the text between the start and end tags of the object.

func LoadDoc

func LoadDoc(path string) (*html.Node, error)

  LoadDoc loads the HTML document from the specified file path.

func LoadURL

func LoadURL(url string) (*html.Node, error)

  LoadURL loads the HTML document from the specified URL.

func OutputHTML

func OutputHTML(n *html.Node, self bool) string

  OutputHTML returns the content of the node as HTML text, including tag names. If self is true, the output also includes the tag of n itself.

func Parse

func Parse(r io.Reader) (*html.Node, error)

  Parse returns the parse tree for the HTML from the given Reader.

func Query

func Query(top *html.Node, expr string) (*html.Node, error)

  Query searches for html.Node elements that match the specified XPath expr and returns the first match.

  Returns an error if the expression `expr` cannot be parsed.

func QueryAll

func QueryAll(top *html.Node, expr string) ([]*html.Node, error)

  QueryAll searches for all html.Node elements that match the specified XPath expr. Returns an error if the expression `expr` cannot be parsed.

func QuerySelector

func QuerySelector(top *html.Node, selector *xpath.Expr) *html.Node

  QuerySelector returns the first html.Node matched by the specified XPath selector.

func QuerySelectorAll

func QuerySelectorAll(top *html.Node, selector *xpath.Expr) []*html.Node

  QuerySelectorAll returns all html.Node elements that match the specified XPath selector.

func SelectAttr

func SelectAttr(n *html.Node, name string) (val string)

  SelectAttr returns the attribute value with the specified name.

Types

type NodeNavigator

type NodeNavigator struct {
	// contains filtered or unexported fields
}

func CreateXPathNavigator

func CreateXPathNavigator(top *html.Node) *NodeNavigator

  CreateXPathNavigator creates a new xpath.NodeNavigator for the specified html.Node.

func (*NodeNavigator) Copy

func (h *NodeNavigator) Copy() xpath.NodeNavigator

func (*NodeNavigator) Current

func (h *NodeNavigator) Current() *html.Node

func (*NodeNavigator) LocalName

func (h *NodeNavigator) LocalName() string

func (*NodeNavigator) MoveTo

func (h *NodeNavigator) MoveTo(other xpath.NodeNavigator) bool

func (*NodeNavigator) MoveToChild

func (h *NodeNavigator) MoveToChild() bool

func (*NodeNavigator) MoveToFirst

func (h *NodeNavigator) MoveToFirst() bool

func (*NodeNavigator) MoveToNext

func (h *NodeNavigator) MoveToNext() bool

func (*NodeNavigator) MoveToNextAttribute

func (h *NodeNavigator) MoveToNextAttribute() bool

func (*NodeNavigator) MoveToParent

func (h *NodeNavigator) MoveToParent() bool

func (*NodeNavigator) MoveToPrevious

func (h *NodeNavigator) MoveToPrevious() bool

func (*NodeNavigator) MoveToRoot

func (h *NodeNavigator) MoveToRoot()

func (*NodeNavigator) NodeType

func (h *NodeNavigator) NodeType() xpath.NodeType

func (*NodeNavigator) Prefix

func (*NodeNavigator) Prefix() string

func (*NodeNavigator) String

func (h *NodeNavigator) String() string

func (*NodeNavigator) Value

func (h *NodeNavigator) Value() string
