Documentation ¶
Overview ¶
Package lolhtml provides the ability to parse and rewrite HTML on the fly, with a CSS-selector based API.
It is a binding for the Rust crate lol_html. https://github.com/cloudflare/lol-html
Please see the /examples subdirectory for more detailed examples.
Index ¶
- Variables
- func RewriteString(s string, handlers *Handlers, config ...Config) (string, error)
- type Attribute
- type AttributeIterator
- type Comment
- func (c *Comment) InsertAfterAsHTML(content string) error
- func (c *Comment) InsertAfterAsText(content string) error
- func (c *Comment) InsertBeforeAsHTML(content string) error
- func (c *Comment) InsertBeforeAsText(content string) error
- func (c *Comment) IsRemoved() bool
- func (c *Comment) Remove()
- func (c *Comment) ReplaceAsHTML(content string) error
- func (c *Comment) ReplaceAsText(content string) error
- func (c *Comment) SetText(text string) error
- func (c *Comment) Text() string
- type CommentHandlerFunc
- type Config
- type Doctype
- type DoctypeHandlerFunc
- type DocumentContentHandler
- type DocumentEnd
- type DocumentEndHandlerFunc
- type Element
- func (e *Element) AttributeIterator() *AttributeIterator
- func (e *Element) AttributeValue(name string) (string, error)
- func (e *Element) HasAttribute(name string) (bool, error)
- func (e *Element) InsertAfterEndTagAsHTML(content string) error
- func (e *Element) InsertAfterEndTagAsText(content string) error
- func (e *Element) InsertAfterStartTagAsHTML(content string) error
- func (e *Element) InsertAfterStartTagAsText(content string) error
- func (e *Element) InsertBeforeEndTagAsHTML(content string) error
- func (e *Element) InsertBeforeEndTagAsText(content string) error
- func (e *Element) InsertBeforeStartTagAsHTML(content string) error
- func (e *Element) InsertBeforeStartTagAsText(content string) error
- func (e *Element) IsRemoved() bool
- func (e *Element) NamespaceURI() string
- func (e *Element) Remove()
- func (e *Element) RemoveAndKeepContent()
- func (e *Element) RemoveAttribute(name string) error
- func (e *Element) ReplaceAsHTML(content string) error
- func (e *Element) ReplaceAsText(content string) error
- func (e *Element) SetAttribute(name string, value string) error
- func (e *Element) SetInnerContentAsHTML(content string) error
- func (e *Element) SetInnerContentAsText(content string) error
- func (e *Element) SetTagName(name string) error
- func (e *Element) TagName() string
- type ElementContentHandler
- type ElementHandlerFunc
- type Handlers
- type MemorySettings
- type OutputSink
- type RewriterDirective
- type TextChunk
- func (t *TextChunk) Content() string
- func (t *TextChunk) InsertAfterAsHTML(content string) error
- func (t *TextChunk) InsertAfterAsText(content string) error
- func (t *TextChunk) InsertBeforeAsHTML(content string) error
- func (t *TextChunk) InsertBeforeAsText(content string) error
- func (t *TextChunk) IsLastInTextNode() bool
- func (t *TextChunk) IsRemoved() bool
- func (t *TextChunk) Remove()
- func (t *TextChunk) ReplaceAsHTML(content string) error
- func (t *TextChunk) ReplaceAsText(content string) error
- type TextChunkHandlerFunc
- type Writer
Variables ¶
var ErrCannotGetErrorMessage = errors.New("cannot get error message from underlying lol_html lib")
ErrCannotGetErrorMessage indicates that lol_html returned an error code, but the concrete error message could not be retrieved.
Functions ¶
func RewriteString ¶ added in v0.2.2
RewriteString rewrites the given string with the provided Handlers and Config.
Example ¶
	output, err := lolhtml.RewriteString(
		`<div><a href="http://example.com"></a></div>`,
		&lolhtml.Handlers{
			ElementContentHandler: []lolhtml.ElementContentHandler{
				{
					Selector: "a[href]",
					ElementHandler: func(e *lolhtml.Element) lolhtml.RewriterDirective {
						href, err := e.AttributeValue("href")
						if err != nil {
							log.Fatal(err)
						}
						href = strings.ReplaceAll(href, "http:", "https:")

						err = e.SetAttribute("href", href)
						if err != nil {
							log.Fatal(err)
						}
						return lolhtml.Continue
					},
				},
			},
		},
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(output)
Output: <div><a href="https://example.com"></a></div>
Types ¶
type Attribute ¶
type Attribute C.lol_html_attribute_t
Attribute represents an HTML element attribute. Obtained by calling Next() on an AttributeIterator.
type AttributeIterator ¶
type AttributeIterator C.lol_html_attributes_iterator_t
AttributeIterator can be used to iterate over all attributes of an element. The only way to get an AttributeIterator is by calling AttributeIterator() on an Element. Note that the "range" syntax is not applicable here; use AttributeIterator.Next() instead.
func (*AttributeIterator) Free ¶
func (ai *AttributeIterator) Free()
Free frees the memory held by the AttributeIterator.
func (*AttributeIterator) Next ¶
func (ai *AttributeIterator) Next() *Attribute
Next advances the iterator and returns the next attribute. Returns nil if the iterator has been exhausted.
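The Next-until-nil loop described above can be sketched as follows. The `attribute` and `attributeIterator` types and the `collectNames` helper are hypothetical stand-ins defined only so the snippet compiles without the lolhtml package; they are not the library's types, and the real iterator is obtained from Element.AttributeIterator().

```go
package main

import "fmt"

// attribute and attributeIterator are hypothetical stand-ins for
// lolhtml.Attribute and lolhtml.AttributeIterator. They exist only to
// show the shape of the loop.
type attribute struct{ name, value string }

type attributeIterator struct {
	attrs []attribute
	pos   int
}

// Next returns the next attribute, or nil once the iterator is exhausted.
func (ai *attributeIterator) Next() *attribute {
	if ai.pos >= len(ai.attrs) {
		return nil
	}
	a := &ai.attrs[ai.pos]
	ai.pos++
	return a
}

// Free is a no-op in this stand-in; the documented Free releases the
// memory held by the real iterator.
func (ai *attributeIterator) Free() {}

// collectNames drains the iterator with the Next-until-nil pattern and
// frees it when done.
func collectNames(ai *attributeIterator) []string {
	defer ai.Free()
	var names []string
	for attr := ai.Next(); attr != nil; attr = ai.Next() {
		names = append(names, attr.name)
	}
	return names
}

func main() {
	ai := &attributeIterator{attrs: []attribute{{"href", "/"}, {"class", "nav"}}}
	fmt.Println(collectNames(ai)) // prints [href class]
}
```

Deferring Free right after obtaining the iterator keeps the release next to the acquisition, which is likely the simplest way to avoid leaking it on early returns.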
type Comment ¶
type Comment C.lol_html_comment_t
Comment represents an HTML comment.
func (*Comment) InsertAfterAsHTML ¶ added in v0.2.4
InsertAfterAsHTML inserts the given content after the comment. The content is inserted as is.
func (*Comment) InsertAfterAsText ¶ added in v0.2.2
InsertAfterAsText inserts the given content after the comment.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
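The escaping applied by the AsText methods (`<` to `&lt;`, `>` to `&gt;`, `&` to `&amp;`) can be illustrated with the standard library alone. This is a minimal sketch of the rule, not lol_html's own implementation; `textEscaper` and `escapeText` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// textEscaper applies the three documented substitutions. It is a
// stand-in for illustration, not the rewriter's actual code.
// strings.Replacer scans the input once, so the "&" produced by one
// substitution is never escaped a second time.
var textEscaper = strings.NewReplacer(
	"&", "&amp;",
	"<", "&lt;",
	">", "&gt;",
)

// escapeText returns content with &, < and > replaced by HTML entities.
func escapeText(content string) string {
	return textEscaper.Replace(content)
}

func main() {
	fmt.Println(escapeText(`<b>Tom & Jerry</b>`))
	// prints &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
}
```

The AsHTML variants skip this step and insert the content as is, so they should only receive markup that is already trusted.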
func (*Comment) InsertBeforeAsHTML ¶ added in v0.2.4
InsertBeforeAsHTML inserts the given content before the comment. The content is inserted as is.
func (*Comment) InsertBeforeAsText ¶ added in v0.2.2
InsertBeforeAsText inserts the given content before the comment.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Comment) ReplaceAsHTML ¶ added in v0.2.4
ReplaceAsHTML replaces the comment with the supplied content. The content is kept as is.
func (*Comment) ReplaceAsText ¶ added in v0.2.2
ReplaceAsText replaces the comment with the supplied content.
The rewriter will HTML-escape the content:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
type CommentHandlerFunc ¶ added in v0.2.1
type CommentHandlerFunc func(*Comment) RewriterDirective
CommentHandlerFunc is a callback handler function to do something with a Comment. Expected to return a RewriterDirective as instruction to continue or stop.
type Config ¶
type Config struct {
	// defaults to "utf-8".
	Encoding string
	// defaults to PreallocatedParsingBufferSize: 1024, MaxAllowedMemoryUsage: 1<<63 - 1.
	Memory *MemorySettings
	// defaults to func([]byte) {}. In other words, totally discard output.
	Sink OutputSink
	// defaults to true. If true, bail out for security reasons when ambiguous.
	Strict bool
}
Config defines settings for the rewriter.
type Doctype ¶
type Doctype C.lol_html_doctype_t
Doctype represents the document's doctype.
type DoctypeHandlerFunc ¶ added in v0.2.1
type DoctypeHandlerFunc func(*Doctype) RewriterDirective
DoctypeHandlerFunc is a callback handler function to do something with a Doctype.
type DocumentContentHandler ¶ added in v0.2.1
type DocumentContentHandler struct {
	DoctypeHandler     DoctypeHandlerFunc
	CommentHandler     CommentHandlerFunc
	TextChunkHandler   TextChunkHandlerFunc
	DocumentEndHandler DocumentEndHandlerFunc
}
DocumentContentHandler is a group of handlers that would be applied to the whole HTML document.
type DocumentEnd ¶ added in v0.2.1
type DocumentEnd C.lol_html_doc_end_t
DocumentEnd represents the end of the document.
func (*DocumentEnd) AppendAsHTML ¶ added in v0.2.4
func (d *DocumentEnd) AppendAsHTML(content string) error
AppendAsHTML appends the given content at the end of the document. The content is appended as is.
func (*DocumentEnd) AppendAsText ¶ added in v0.2.2
func (d *DocumentEnd) AppendAsText(content string) error
AppendAsText appends the given content at the end of the document.
The rewriter will HTML-escape the content before appending:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
type DocumentEndHandlerFunc ¶ added in v0.2.1
type DocumentEndHandlerFunc func(*DocumentEnd) RewriterDirective
DocumentEndHandlerFunc is a callback handler function to do something with a DocumentEnd.
type Element ¶
type Element C.lol_html_element_t
Element represents an HTML element.
func (*Element) AttributeIterator ¶ added in v0.2.1
func (e *Element) AttributeIterator() *AttributeIterator
AttributeIterator returns a pointer to an AttributeIterator. Can be used to iterate over all attributes of the element.
func (*Element) AttributeValue ¶ added in v0.2.1
AttributeValue returns the value of the attribute on this element.
func (*Element) HasAttribute ¶
HasAttribute returns whether the element has the attribute of this name or not.
func (*Element) InsertAfterEndTagAsHTML ¶ added in v0.2.4
InsertAfterEndTagAsHTML inserts the given content after the element's end tag. The content is inserted as is.
func (*Element) InsertAfterEndTagAsText ¶ added in v0.2.2
InsertAfterEndTagAsText inserts the given content after the element's end tag.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) InsertAfterStartTagAsHTML ¶ added in v0.2.4
InsertAfterStartTagAsHTML inserts (prepends) the given content after the element's start tag. The content is inserted as is.
func (*Element) InsertAfterStartTagAsText ¶ added in v0.2.2
InsertAfterStartTagAsText inserts (prepends) the given content after the element's start tag.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) InsertBeforeEndTagAsHTML ¶ added in v0.2.4
InsertBeforeEndTagAsHTML inserts (appends) the given content before the element's end tag. The content is inserted as is.
func (*Element) InsertBeforeEndTagAsText ¶ added in v0.2.2
InsertBeforeEndTagAsText inserts (appends) the given content before the element's end tag.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) InsertBeforeStartTagAsHTML ¶ added in v0.2.4
InsertBeforeStartTagAsHTML inserts the given content before the element's start tag. The content is inserted as is.
func (*Element) InsertBeforeStartTagAsText ¶ added in v0.2.2
InsertBeforeStartTagAsText inserts the given content before the element's start tag.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) NamespaceURI ¶ added in v0.2.4
NamespaceURI gets the element's namespace URI.
func (*Element) RemoveAndKeepContent ¶
func (e *Element) RemoveAndKeepContent()
RemoveAndKeepContent removes the element but keeps the inner content.
func (*Element) RemoveAttribute ¶
RemoveAttribute removes the attribute with the name from the element.
func (*Element) ReplaceAsHTML ¶ added in v0.2.4
ReplaceAsHTML replaces the whole element with the supplied content. The content is kept as is.
func (*Element) ReplaceAsText ¶ added in v0.2.2
ReplaceAsText replaces the whole element with the supplied content.
The rewriter will HTML-escape the content:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) SetAttribute ¶
SetAttribute updates or creates the attribute with name and value on the element.
func (*Element) SetInnerContentAsHTML ¶ added in v0.2.4
SetInnerContentAsHTML overwrites the element's inner content. The content is kept as is.
func (*Element) SetInnerContentAsText ¶ added in v0.2.2
SetInnerContentAsText overwrites the element's inner content.
The rewriter will HTML-escape the content:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*Element) SetTagName ¶
SetTagName sets the element's tag name.
type ElementContentHandler ¶ added in v0.2.1
type ElementContentHandler struct {
	Selector         string
	ElementHandler   ElementHandlerFunc
	CommentHandler   CommentHandlerFunc
	TextChunkHandler TextChunkHandlerFunc
}
ElementContentHandler is a group of handlers that would be applied to the content matched by the given selector.
type ElementHandlerFunc ¶ added in v0.2.1
type ElementHandlerFunc func(*Element) RewriterDirective
ElementHandlerFunc is a callback handler function to do something with an Element.
type Handlers ¶ added in v0.2.1
type Handlers struct {
	DocumentContentHandler []DocumentContentHandler
	ElementContentHandler  []ElementContentHandler
}
Handlers contain DocumentContentHandlers and ElementContentHandlers. Can contain arbitrary numbers of them, including zero (nil slice).
type MemorySettings ¶
type MemorySettings struct {
	PreallocatedParsingBufferSize int // defaults to 1024
	MaxAllowedMemoryUsage         int // defaults to 1<<63 - 1
}
MemorySettings sets the memory limitations for the rewriter.
type OutputSink ¶
type OutputSink func([]byte)
OutputSink is a callback function where output is written to. A byte slice is passed each time, representing a chunk of output.
Exported for special usages which require each output chunk to be identified and processed individually. For most common uses, NewWriter would be more convenient.
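A sink that handles each chunk individually might look like the sketch below. `OutputSink` is redeclared locally with the signature shown above so the snippet compiles without the lolhtml package; `chunkCollector` and its methods are hypothetical helpers, and the two direct calls in `main` only simulate the rewriter emitting output.

```go
package main

import (
	"bytes"
	"fmt"
)

// OutputSink mirrors the type declared above so that this sketch
// compiles on its own.
type OutputSink func([]byte)

// chunkCollector is a hypothetical helper that records every chunk it
// receives and counts how many chunks arrived.
type chunkCollector struct {
	buf   bytes.Buffer
	count int
}

// Sink returns an OutputSink that appends each chunk to the collector.
// bytes.Buffer.Write copies the bytes, so the chunk does not need to
// stay valid after the callback returns.
func (c *chunkCollector) Sink() OutputSink {
	return func(chunk []byte) {
		c.buf.Write(chunk)
		c.count++
	}
}

// String returns everything collected so far.
func (c *chunkCollector) String() string { return c.buf.String() }

// Count returns the number of chunks received.
func (c *chunkCollector) Count() int { return c.count }

func main() {
	c := &chunkCollector{}
	sink := c.Sink()

	// Simulate the rewriter delivering its output in two chunks.
	sink([]byte("Hello, "))
	sink([]byte("<span>LOL-HTML</span>!"))

	fmt.Printf("%d chunks: %s\n", c.Count(), c.String())
	// prints 2 chunks: Hello, <span>LOL-HTML</span>!
}
```

Such a function would be assigned to the Sink field of Config. Copying the bytes inside the callback is a defensive choice here, since the text above does not say whether a chunk remains valid after the call returns.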
type RewriterDirective ¶
type RewriterDirective int
RewriterDirective is a "status code" that should be returned by callback handlers to inform the rewriter whether to continue or stop parsing.
const (
	// Continue lets the normal parsing process continue.
	Continue RewriterDirective = iota

	// Stop stops the rewriter immediately. Content currently buffered is discarded, and an error is returned.
	// After stopping, the Writer should not be used anymore except for Close().
	Stop
)
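The Continue/Stop contract can be modelled in a few lines. In this sketch the `RewriterDirective` declarations are copied from above, while `process`, `limitHandler`, and `errStopped` are hypothetical stand-ins for the rewriter loop, a handler, and the returned error; real handlers receive library types such as *Element rather than strings.

```go
package main

import (
	"errors"
	"fmt"
)

// RewriterDirective, Continue and Stop mirror the declarations above.
type RewriterDirective int

const (
	Continue RewriterDirective = iota
	Stop
)

// errStopped is a hypothetical error used only by this sketch.
var errStopped = errors.New("rewriting stopped by handler")

// limitHandler returns a handler that lets the first max items through
// and returns Stop for any item after that.
func limitHandler(max int) func(item string) RewriterDirective {
	seen := 0
	return func(item string) RewriterDirective {
		seen++
		if seen > max {
			return Stop
		}
		return Continue
	}
}

// process is a stand-in for the rewriter loop: it calls the handler for
// each item and aborts with an error as soon as the handler returns Stop.
// It reports how many items were handled before stopping.
func process(items []string, handler func(string) RewriterDirective) (int, error) {
	for i, item := range items {
		if handler(item) == Stop {
			return i, errStopped
		}
	}
	return len(items), nil
}

func main() {
	n, err := process([]string{"a", "b", "c"}, limitHandler(2))
	fmt.Println(n, err) // prints 2 rewriting stopped by handler
}
```

Because Stop discards buffered content and surfaces as an error, it is probably best reserved for aborting a rewrite rather than for skipping a single node.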
type TextChunk ¶
type TextChunk C.lol_html_text_chunk_t
TextChunk represents a text chunk.
func (*TextChunk) InsertAfterAsHTML ¶ added in v0.2.4
InsertAfterAsHTML inserts the given content after the text chunk. The content is inserted as is.
func (*TextChunk) InsertAfterAsText ¶ added in v0.2.2
InsertAfterAsText inserts the given content after the text chunk.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*TextChunk) InsertBeforeAsHTML ¶ added in v0.2.4
InsertBeforeAsHTML inserts the given content before the text chunk. The content is inserted as is.
func (*TextChunk) InsertBeforeAsText ¶ added in v0.2.2
InsertBeforeAsText inserts the given content before the text chunk.
The rewriter will HTML-escape the content before insertion:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
func (*TextChunk) IsLastInTextNode ¶
IsLastInTextNode returns whether the text chunk is the last in the text node.
func (*TextChunk) ReplaceAsHTML ¶ added in v0.2.4
ReplaceAsHTML replaces the text chunk with the supplied content. The content is kept as is.
func (*TextChunk) ReplaceAsText ¶ added in v0.2.2
ReplaceAsText replaces the text chunk with the supplied content.
The rewriter will HTML-escape the content:
`<` will be replaced with `&lt;`
`>` will be replaced with `&gt;`
`&` will be replaced with `&amp;`
type TextChunkHandlerFunc ¶ added in v0.2.1
type TextChunkHandlerFunc func(*TextChunk) RewriterDirective
TextChunkHandlerFunc is a callback handler function to do something with a TextChunk.
type Writer ¶ added in v0.2.1
type Writer struct {
// contains filtered or unexported fields
}
Writer takes data written to it and writes the rewritten form of that data to an underlying writer (see NewWriter).
func NewWriter ¶ added in v0.2.1
NewWriter returns a new Writer with Handlers and an optional Config configured. Writes to the returned Writer are rewritten and written to w.
It is the caller's responsibility to call Close on the Writer when done. Writes may be buffered and not flushed until Close. There is no Flush method, so before using the content written by w, it is necessary to call Close to ensure w has finished writing.
Example ¶
	chunk := []byte("Hello, <span>World</span>!")
	r := bytes.NewReader(chunk)
	w, err := lolhtml.NewWriter(
		// output to stdout
		os.Stdout,
		&lolhtml.Handlers{
			ElementContentHandler: []lolhtml.ElementContentHandler{
				{
					Selector: "span",
					ElementHandler: func(e *lolhtml.Element) lolhtml.RewriterDirective {
						err := e.SetInnerContentAsText("LOL-HTML")
						if err != nil {
							log.Fatal(err)
						}
						return lolhtml.Continue
					},
				},
			},
		},
	)
	if err != nil {
		log.Fatal(err)
	}

	// copy from the bytes reader to lolhtml writer
	_, err = io.Copy(w, r)
	if err != nil {
		log.Fatal(err)
	}

	// explicitly close the writer and flush the remaining content
	err = w.Close()
	if err != nil {
		log.Fatal(err)
	}
Output: Hello, <span>LOL-HTML</span>!
Directories ¶
Path | Synopsis |
---|---|
examples/defer-scripts (command) | Usage: `curl -NL https://git.io/JeOSZ \| go run main.go` |
examples/mixed-content-rewriter (command) | Usage: `curl -NL https://git.io/JeOSZ \| go run main.go` |
examples/web-scraper (command) | This is a ported Go version of https://web.scraper.workers.dev/, whose source code is available at https://github.com/adamschwartz/web.scraper.workers.dev licensed under MIT. |