lolhtml

package module
v0.2.4
Published: Nov 3, 2020 License: BSD-3-Clause Imports: 6 Imported by: 0

README

go-lolhtml


Go bindings for the Rust crate cloudflare/lol-html, the Low Output Latency streaming HTML rewriter/parser with CSS-selector based API, talking via cgo.

Status:

Everything exposed by lol_html's c-api is available, except for custom user data in handlers. The original tests from the c-api package have also been ported to verify this binding's functionality.

The code is at an early stage and breaking changes might be introduced. If you have any ideas on how the public API could be better structured, feel free to open a PR or an issue.

Installation

For Linux/macOS/Windows x86_64 platform users, installation is as simple as a single go get command:

$ go get github.com/coolspring8/go-lolhtml

Installing Rust is not necessary. lol-html is prebuilt into static libraries that ship in the /build folder, so cgo can handle the rest of the compilation without intervention.

For other platforms, you will have to compile it yourself.

Features

  • Fast: A Go (cgo) wrapper built around the highly-optimized Rust HTML parsing crate lol_html.
  • Easy to use: lolhtml.Writer implements the io.Writer interface, so it fits Go's idiomatic I/O functions.

Getting Started

Now let's initialize a project and create main.go:

package main

import (
	"bytes"
	"io"
	"log"
	"os"

	"github.com/coolspring8/go-lolhtml"
)

func main() {
	chunk := []byte("Hello, <span>World</span>!")
	r := bytes.NewReader(chunk)
	w, err := lolhtml.NewWriter(
		// output to stdout
		os.Stdout,
		&lolhtml.Handlers{
			ElementContentHandler: []lolhtml.ElementContentHandler{
				{
					Selector: "span",
					ElementHandler: func(e *lolhtml.Element) lolhtml.RewriterDirective {
						err := e.SetInnerContentAsText("LOL-HTML")
						if err != nil {
							log.Fatal(err)
						}
						return lolhtml.Continue
					},
				},
			},
		},
	)
	if err != nil {
		log.Fatal(err)
	}

	// copy from the bytes reader to lolhtml writer
	_, err = io.Copy(w, r)
	if err != nil {
		log.Fatal(err)
	}

	// explicitly close the writer and flush the remaining content
	err = w.Close()
	if err != nil {
		log.Fatal(err)
	}
	// Output: Hello, <span>LOL-HTML</span>!
}

The program above creates a new Writer configured to replace the text inside every span element with "LOL-HTML". It takes the chunk Hello, <span>World</span>! as input and prints the result to standard output.

The result is Hello, <span>LOL-HTML</span>!

Examples

example_test.go contains two examples.

For more detailed examples, please visit the /examples subdirectory.

Documentation

Available at pkg.go.dev.

Other Bindings

Versioning

This package does not strictly follow Semantic Versioning. The current strategy is to track lol_html's major and minor version numbers and to reserve the patch version number for this binding's own updates, so that Go Modules can upgrade correctly.

Help Wanted!

There are a few interesting things in the Projects panel that I have considered but have not yet implemented. Other contributions and suggestions are also welcome!

License

BSD 3-Clause "New" or "Revised" License

Disclaimer

This is an unofficial binding.

Cloudflare is a registered trademark of Cloudflare, Inc. Cloudflare names used in this project are for identification purposes only. The project is not associated in any way with Cloudflare Inc.

Documentation

Overview

Package lolhtml provides the ability to parse and rewrite HTML on the fly, with a CSS-selector based API.

It is a binding for the Rust crate lol_html. https://github.com/cloudflare/lol-html

Please see /examples subdirectory for more detailed examples.

Index

Examples

Constants

This section is empty.

Variables

var ErrCannotGetErrorMessage = errors.New("cannot get error message from underlying lol_html lib")

ErrCannotGetErrorMessage indicates that lol_html returned an error code, but the concrete error message could not be retrieved.

Functions

func RewriteString added in v0.2.2

func RewriteString(s string, handlers *Handlers, config ...Config) (string, error)

RewriteString rewrites the given string with the provided Handlers and Config.

Example
output, err := lolhtml.RewriteString(
	`<div><a href="http://example.com"></a></div>`,
	&lolhtml.Handlers{
		ElementContentHandler: []lolhtml.ElementContentHandler{
			{
				Selector: "a[href]",
				ElementHandler: func(e *lolhtml.Element) lolhtml.RewriterDirective {
					href, err := e.AttributeValue("href")
					if err != nil {
						log.Fatal(err)
					}
					href = strings.ReplaceAll(href, "http:", "https:")

					err = e.SetAttribute("href", href)
					if err != nil {
						log.Fatal(err)
					}

					return lolhtml.Continue
				},
			},
		},
	},
)
if err != nil {
	log.Fatal(err)
}

fmt.Println(output)
Output:

<div><a href="https://example.com"></a></div>

Types

type Attribute

type Attribute C.lol_html_attribute_t

Attribute represents an HTML element attribute. Obtained by calling Next() on an AttributeIterator.

func (*Attribute) Name added in v0.2.1

func (a *Attribute) Name() string

Name returns the name of the attribute.

func (*Attribute) Value added in v0.2.1

func (a *Attribute) Value() string

Value returns the value of the attribute.

type AttributeIterator

type AttributeIterator C.lol_html_attributes_iterator_t

AttributeIterator can be used to iterate over all attributes of an element. The only way to get an AttributeIterator is by calling AttributeIterator() on an Element. Note that the "range" syntax is not applicable here; use AttributeIterator.Next() instead.

func (*AttributeIterator) Free

func (ai *AttributeIterator) Free()

Free frees the memory held by the AttributeIterator.

func (*AttributeIterator) Next

func (ai *AttributeIterator) Next() *Attribute

Next advances the iterator and returns the next attribute. Returns nil if the iterator has been exhausted.
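Since "range" cannot be used, iteration comes down to a plain for loop that calls Next until it returns nil. The sketch below shows that loop shape. To stay runnable without cgo it uses small local stand-in types (attribute, attributeIterator, collect), which are hypothetical and only mimic the Next/Name/Value methods documented here; with the real package the iterator would come from Element.AttributeIterator() inside an ElementHandler.

```go
package main

import "fmt"

// attribute and attributeIterator are local stand-ins (not lolhtml types)
// that only mimic the documented Name/Value/Next method shapes.
type attribute struct{ name, value string }

func (a *attribute) Name() string  { return a.name }
func (a *attribute) Value() string { return a.value }

type attributeIterator struct {
	attrs []attribute
	pos   int
}

// Next returns the next attribute, or nil once the iterator is exhausted.
func (it *attributeIterator) Next() *attribute {
	if it.pos >= len(it.attrs) {
		return nil
	}
	a := &it.attrs[it.pos]
	it.pos++
	return a
}

// collect drains the iterator into a name -> value map using the
// "call Next until nil" loop.
func collect(it *attributeIterator) map[string]string {
	out := make(map[string]string)
	for a := it.Next(); a != nil; a = it.Next() {
		out[a.Name()] = a.Value()
	}
	return out
}

func main() {
	it := &attributeIterator{attrs: []attribute{
		{"href", "https://example.com"},
		{"rel", "nofollow"},
	}}
	attrs := collect(it)
	fmt.Println(len(attrs), attrs["href"], attrs["rel"])
}
```

With the real AttributeIterator, the same loop would presumably be followed by a call to Free once iteration is finished.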

type Comment

type Comment C.lol_html_comment_t

Comment represents an HTML comment.

func (*Comment) InsertAfterAsHTML added in v0.2.4

func (c *Comment) InsertAfterAsHTML(content string) error

InsertAfterAsHTML inserts the given content after the comment. The content is inserted as is.

func (*Comment) InsertAfterAsText added in v0.2.2

func (c *Comment) InsertAfterAsText(content string) error

InsertAfterAsText inserts the given content after the comment.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`
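To make the difference between the ...AsText and ...AsHTML variants concrete, the three replacements listed above can be reproduced with the standard library. This is only a sketch of the documented rule, not the rewriter's actual implementation; escapeText and textEscaper are hypothetical names that do not exist in lolhtml.

```go
package main

import (
	"fmt"
	"strings"
)

// textEscaper is a hypothetical helper mirroring the three documented
// replacements. strings.Replacer scans the input once, so the "&" it
// emits as part of an entity is never escaped a second time.
var textEscaper = strings.NewReplacer(
	"&", "&amp;",
	"<", "&lt;",
	">", "&gt;",
)

// escapeText returns s with &, < and > replaced by their HTML entities.
func escapeText(s string) string {
	return textEscaper.Replace(s)
}

func main() {
	fmt.Println(escapeText("<b>Tom & Jerry</b>"))
	// Output: &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
}
```

Content passed to an ...AsHTML method would skip this step and reach the output unchanged.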

func (*Comment) InsertBeforeAsHTML added in v0.2.4

func (c *Comment) InsertBeforeAsHTML(content string) error

InsertBeforeAsHTML inserts the given content before the comment. The content is inserted as is.

func (*Comment) InsertBeforeAsText added in v0.2.2

func (c *Comment) InsertBeforeAsText(content string) error

InsertBeforeAsText inserts the given content before the comment.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Comment) IsRemoved

func (c *Comment) IsRemoved() bool

IsRemoved returns whether the comment is removed or not.

func (*Comment) Remove

func (c *Comment) Remove()

Remove removes the comment.

func (*Comment) ReplaceAsHTML added in v0.2.4

func (c *Comment) ReplaceAsHTML(content string) error

ReplaceAsHTML replaces the comment with the supplied content. The content is kept as is.

func (*Comment) ReplaceAsText added in v0.2.2

func (c *Comment) ReplaceAsText(content string) error

ReplaceAsText replaces the comment with the supplied content.

The rewriter will HTML-escape the content:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Comment) SetText

func (c *Comment) SetText(text string) error

SetText sets the comment's text and returns an error if there is one.

func (*Comment) Text added in v0.2.1

func (c *Comment) Text() string

Text returns the comment's text.

type CommentHandlerFunc added in v0.2.1

type CommentHandlerFunc func(*Comment) RewriterDirective

CommentHandlerFunc is a callback handler function to do something with a Comment. Expected to return a RewriterDirective as instruction to continue or stop.

type Config

type Config struct {
	// defaults to "utf-8".
	Encoding string
	// defaults to PreallocatedParsingBufferSize: 1024, MaxAllowedMemoryUsage: 1<<63 - 1.
	Memory *MemorySettings
	// defaults to func([]byte) {}. In other words, totally discard output.
	Sink OutputSink
	// defaults to true. If true, bail out for security reasons when ambiguous.
	Strict bool
}

Config defines settings for the rewriter.

type Doctype

type Doctype C.lol_html_doctype_t

Doctype represents the document's doctype.

func (*Doctype) Name added in v0.2.1

func (d *Doctype) Name() string

Name returns doctype name.

func (*Doctype) PublicID added in v0.2.4

func (d *Doctype) PublicID() string

PublicID returns doctype public ID.

func (*Doctype) SystemID added in v0.2.4

func (d *Doctype) SystemID() string

SystemID returns doctype system ID.

type DoctypeHandlerFunc added in v0.2.1

type DoctypeHandlerFunc func(*Doctype) RewriterDirective

DoctypeHandlerFunc is a callback handler function to do something with a Doctype.

type DocumentContentHandler added in v0.2.1

type DocumentContentHandler struct {
	DoctypeHandler     DoctypeHandlerFunc
	CommentHandler     CommentHandlerFunc
	TextChunkHandler   TextChunkHandlerFunc
	DocumentEndHandler DocumentEndHandlerFunc
}

DocumentContentHandler is a group of handlers that would be applied to the whole HTML document.

type DocumentEnd added in v0.2.1

type DocumentEnd C.lol_html_doc_end_t

DocumentEnd represents the end of the document.

func (*DocumentEnd) AppendAsHTML added in v0.2.4

func (d *DocumentEnd) AppendAsHTML(content string) error

AppendAsHTML appends the given content at the end of the document. The content is appended as is.

func (*DocumentEnd) AppendAsText added in v0.2.2

func (d *DocumentEnd) AppendAsText(content string) error

AppendAsText appends the given content at the end of the document.

The rewriter will HTML-escape the content before appending:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

type DocumentEndHandlerFunc added in v0.2.1

type DocumentEndHandlerFunc func(*DocumentEnd) RewriterDirective

DocumentEndHandlerFunc is a callback handler function to do something with a DocumentEnd.

type Element

type Element C.lol_html_element_t

Element represents an HTML element.

func (*Element) AttributeIterator added in v0.2.1

func (e *Element) AttributeIterator() *AttributeIterator

AttributeIterator returns a pointer to an AttributeIterator. Can be used to iterate over all attributes of the element.

func (*Element) AttributeValue added in v0.2.1

func (e *Element) AttributeValue(name string) (string, error)

AttributeValue returns the value of the attribute on this element.

func (*Element) HasAttribute

func (e *Element) HasAttribute(name string) (bool, error)

HasAttribute returns whether the element has the attribute of this name or not.

func (*Element) InsertAfterEndTagAsHTML added in v0.2.4

func (e *Element) InsertAfterEndTagAsHTML(content string) error

InsertAfterEndTagAsHTML inserts the given content after the element's end tag. The content is inserted as is.

func (*Element) InsertAfterEndTagAsText added in v0.2.2

func (e *Element) InsertAfterEndTagAsText(content string) error

InsertAfterEndTagAsText inserts the given content after the element's end tag.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) InsertAfterStartTagAsHTML added in v0.2.4

func (e *Element) InsertAfterStartTagAsHTML(content string) error

InsertAfterStartTagAsHTML inserts (prepends) the given content after the element's start tag. The content is inserted as is.

func (*Element) InsertAfterStartTagAsText added in v0.2.2

func (e *Element) InsertAfterStartTagAsText(content string) error

InsertAfterStartTagAsText inserts (prepends) the given content after the element's start tag.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) InsertBeforeEndTagAsHTML added in v0.2.4

func (e *Element) InsertBeforeEndTagAsHTML(content string) error

InsertBeforeEndTagAsHTML inserts (appends) the given content before the element's end tag. The content is inserted as is.

func (*Element) InsertBeforeEndTagAsText added in v0.2.2

func (e *Element) InsertBeforeEndTagAsText(content string) error

InsertBeforeEndTagAsText inserts (appends) the given content before the element's end tag.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) InsertBeforeStartTagAsHTML added in v0.2.4

func (e *Element) InsertBeforeStartTagAsHTML(content string) error

InsertBeforeStartTagAsHTML inserts the given content before the element's start tag. The content is inserted as is.

func (*Element) InsertBeforeStartTagAsText added in v0.2.2

func (e *Element) InsertBeforeStartTagAsText(content string) error

InsertBeforeStartTagAsText inserts the given content before the element's start tag.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) IsRemoved

func (e *Element) IsRemoved() bool

IsRemoved returns whether the element is removed or not.

func (*Element) NamespaceURI added in v0.2.4

func (e *Element) NamespaceURI() string

NamespaceURI gets the element's namespace URI.

func (*Element) Remove

func (e *Element) Remove()

Remove completely removes the element.

func (*Element) RemoveAndKeepContent

func (e *Element) RemoveAndKeepContent()

RemoveAndKeepContent removes the element but keeps the inner content.

func (*Element) RemoveAttribute

func (e *Element) RemoveAttribute(name string) error

RemoveAttribute removes the attribute with the name from the element.

func (*Element) ReplaceAsHTML added in v0.2.4

func (e *Element) ReplaceAsHTML(content string) error

ReplaceAsHTML replaces the whole element with the supplied content. The content is kept as is.

func (*Element) ReplaceAsText added in v0.2.2

func (e *Element) ReplaceAsText(content string) error

ReplaceAsText replaces the whole element with the supplied content.

The rewriter will HTML-escape the content:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) SetAttribute

func (e *Element) SetAttribute(name string, value string) error

SetAttribute updates or creates the attribute with name and value on the element.

func (*Element) SetInnerContentAsHTML added in v0.2.4

func (e *Element) SetInnerContentAsHTML(content string) error

SetInnerContentAsHTML overwrites the element's inner content. The content is kept as is.

func (*Element) SetInnerContentAsText added in v0.2.2

func (e *Element) SetInnerContentAsText(content string) error

SetInnerContentAsText overwrites the element's inner content.

The rewriter will HTML-escape the content:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*Element) SetTagName

func (e *Element) SetTagName(name string) error

SetTagName sets the element's tag name.

func (*Element) TagName added in v0.2.1

func (e *Element) TagName() string

TagName gets the element's tag name.

type ElementContentHandler added in v0.2.1

type ElementContentHandler struct {
	Selector         string
	ElementHandler   ElementHandlerFunc
	CommentHandler   CommentHandlerFunc
	TextChunkHandler TextChunkHandlerFunc
}

ElementContentHandler is a group of handlers that would be applied to the content matched by the given selector.

type ElementHandlerFunc added in v0.2.1

type ElementHandlerFunc func(*Element) RewriterDirective

ElementHandlerFunc is a callback handler function to do something with an Element.

type Handlers added in v0.2.1

type Handlers struct {
	DocumentContentHandler []DocumentContentHandler
	ElementContentHandler  []ElementContentHandler
}

Handlers contains DocumentContentHandlers and ElementContentHandlers. It can hold any number of them, including zero (nil slice).

type MemorySettings

type MemorySettings struct {
	PreallocatedParsingBufferSize int // defaults to 1024
	MaxAllowedMemoryUsage         int // defaults to 1<<63 -1
}

MemorySettings sets the memory limitations for the rewriter.

type OutputSink

type OutputSink func([]byte)

OutputSink is a callback function where output is written to. A byte slice is passed each time, representing a chunk of output.

Exported for special usages which require each output chunk to be identified and processed individually. For most common uses, NewWriter would be more convenient.

type RewriterDirective

type RewriterDirective int

RewriterDirective is a "status code" that should be returned by callback handlers, to inform the rewriter to continue or stop parsing.

const (
	// Continue lets the normal parsing process continue.
	Continue RewriterDirective = iota

	// Stop stops the rewriter immediately. Content currently buffered is discarded, and an error is returned.
	// After stopping, the Writer should not be used anymore except for Close().
	Stop
)
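The Continue/Stop contract can be pictured with a tiny driver loop: each handler call returns a directive, and the first Stop aborts processing with an error. The sketch below is a simplified model under that assumption, not the rewriter's real control flow; the types are redeclared locally, and run and errStopped are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// RewriterDirective and its constants are redeclared locally to keep the
// sketch self-contained; they mirror the declarations above.
type RewriterDirective int

const (
	Continue RewriterDirective = iota
	Stop
)

// errStopped is a hypothetical error standing in for whatever error the
// real rewriter reports after a handler returns Stop.
var errStopped = errors.New("rewriting stopped by handler")

// run feeds each tag name to handle and aborts on the first Stop. It
// returns how many tags were processed before stopping.
func run(tags []string, handle func(string) RewriterDirective) (int, error) {
	for i, tag := range tags {
		if handle(tag) == Stop {
			return i, errStopped
		}
	}
	return len(tags), nil
}

func main() {
	n, err := run([]string{"p", "a", "script", "div"}, func(tag string) RewriterDirective {
		if tag == "script" {
			return Stop // give up as soon as a script tag is seen
		}
		return Continue
	})
	fmt.Println(n, err)
}
```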

type TextChunk

type TextChunk C.lol_html_text_chunk_t

TextChunk represents a text chunk.

func (*TextChunk) Content added in v0.2.1

func (t *TextChunk) Content() string

Content returns the text chunk's content.

func (*TextChunk) InsertAfterAsHTML added in v0.2.4

func (t *TextChunk) InsertAfterAsHTML(content string) error

InsertAfterAsHTML inserts the given content after the text chunk. The content is inserted as is.

func (*TextChunk) InsertAfterAsText added in v0.2.2

func (t *TextChunk) InsertAfterAsText(content string) error

InsertAfterAsText inserts the given content after the text chunk.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*TextChunk) InsertBeforeAsHTML added in v0.2.4

func (t *TextChunk) InsertBeforeAsHTML(content string) error

InsertBeforeAsHTML inserts the given content before the text chunk. The content is inserted as is.

func (*TextChunk) InsertBeforeAsText added in v0.2.2

func (t *TextChunk) InsertBeforeAsText(content string) error

InsertBeforeAsText inserts the given content before the text chunk.

The rewriter will HTML-escape the content before insertion:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

func (*TextChunk) IsLastInTextNode

func (t *TextChunk) IsLastInTextNode() bool

IsLastInTextNode returns whether the text chunk is the last in the text node.

func (*TextChunk) IsRemoved

func (t *TextChunk) IsRemoved() bool

IsRemoved returns whether the text chunk is removed or not.

func (*TextChunk) Remove

func (t *TextChunk) Remove()

Remove removes the text chunk.

func (*TextChunk) ReplaceAsHTML added in v0.2.4

func (t *TextChunk) ReplaceAsHTML(content string) error

ReplaceAsHTML replaces the text chunk with the supplied content. The content is kept as is.

func (*TextChunk) ReplaceAsText added in v0.2.2

func (t *TextChunk) ReplaceAsText(content string) error

ReplaceAsText replaces the text chunk with the supplied content.

The rewriter will HTML-escape the content:

`<` will be replaced with `&lt;`

`>` will be replaced with `&gt;`

`&` will be replaced with `&amp;`

type TextChunkHandlerFunc added in v0.2.1

type TextChunkHandlerFunc func(*TextChunk) RewriterDirective

TextChunkHandlerFunc is a callback handler function to do something with a TextChunk.

type Writer added in v0.2.1

type Writer struct {
	// contains filtered or unexported fields
}

Writer takes data written to it and writes the rewritten form of that data to an underlying writer (see NewWriter).

func NewWriter added in v0.2.1

func NewWriter(w io.Writer, handlers *Handlers, config ...Config) (*Writer, error)

NewWriter returns a new Writer with Handlers and an optional Config configured. Writes to the returned Writer are rewritten and written to w.

It is the caller's responsibility to call Close on the Writer when done. Writes may be buffered and not flushed until Close. There is no Flush method, so before using the content written by w, it is necessary to call Close to ensure w has finished writing.

Example
chunk := []byte("Hello, <span>World</span>!")
r := bytes.NewReader(chunk)
w, err := lolhtml.NewWriter(
	// output to stdout
	os.Stdout,
	&lolhtml.Handlers{
		ElementContentHandler: []lolhtml.ElementContentHandler{
			{
				Selector: "span",
				ElementHandler: func(e *lolhtml.Element) lolhtml.RewriterDirective {
					err := e.SetInnerContentAsText("LOL-HTML")
					if err != nil {
						log.Fatal(err)
					}
					return lolhtml.Continue
				},
			},
		},
	},
)
if err != nil {
	log.Fatal(err)
}

// copy from the bytes reader to lolhtml writer
_, err = io.Copy(w, r)
if err != nil {
	log.Fatal(err)
}

// explicitly close the writer and flush the remaining content
err = w.Close()
if err != nil {
	log.Fatal(err)
}
Output:

Hello, <span>LOL-HTML</span>!

func (*Writer) Close added in v0.2.4

func (w *Writer) Close() error

Close closes the Writer, flushing any unwritten data to the underlying io.Writer, but does not close the underlying io.Writer. Subsequent calls to Close are no-ops.

func (*Writer) Write added in v0.2.1

func (w *Writer) Write(p []byte) (n int, err error)

func (*Writer) WriteString added in v0.2.1

func (w *Writer) WriteString(s string) (n int, err error)

WriteString writes a string to the Writer.

Directories

Path Synopsis
examples
defer-scripts command
Usage: curl -NL https://git.io/JeOSZ | go run main.go
mixed-content-rewriter command
Usage: curl -NL https://git.io/JeOSZ | go run main.go
web-scraper command
This is a ported Go version of https://web.scraper.workers.dev/, whose source code is available at https://github.com/adamschwartz/web.scraper.workers.dev licensed under MIT.
