inspecthtml-go
Parse HTML and capture metadata about byte offsets.
- Reference byte and line+column offsets of tags, their attributes, and other node types.
This is implemented by monitoring the input stream, passing it to golang.org/x/net/html, and reconstructing node metadata based on the results.
Usage
Import the module and refer to the code's documentation (pkg.go.dev).
import "github.com/dpb587/inspecthtml-go/inspecthtml"
Some sample use cases and starter snippets can be found in the examples directory.
examples$ go run ./parse-dump <<<'<p class="headline"><strong>hello</strong><br data-example />world<!-- end-->'
<html>
<head>
</head>
<body>
// StartTagToken=L1C1:L1C21;0x0:0x14 OuterOffsets=L1C1:L2C1;0x0:0x4e InnerOffsets=L1C21:L2C1;0x14:0x4e
<p
// Attr KeyOffsets=L1C4:L1C9;0x3:0x8 ValueOffsets=L1C10:L1C20;0x9:0x13
class="headline"
>
// StartTagToken=L1C21:L1C29;0x14:0x1c OuterOffsets=L1C21:L1C43;0x14:0x2a InnerOffsets=L1C29:L1C34;0x1c:0x21
<strong>
// TextToken=L1C29:L1C34;0x1c:0x21
hello
// EndTagToken=L1C34:L1C43;0x21:0x2a
</strong>
// StartTagToken=L1C43:L1C62;0x2a:0x3d OuterOffsets=L1C43:L1C62;0x2a:0x3d
<br
// Attr KeyOffsets=L1C47:L1C59;0x2e:0x3a
data-example=""
>
</br>
// TextToken=L1C62:L1C67;0x3d:0x42
world
// CommentToken=L1C67:L1C78;0x42:0x4d
<!-- end-->
// TextToken=L1C78:L2C1;0x4d:0x4e
// EndTagToken=L2C1:L2C1;0x4e:0x4e
</p>
</body>
</html>
For more complex usage, see importers such as rdfkit-go.
Parser
Given an io.Reader, parse and return a standard *html.Node as well as the resulting metadata.
parsedNode, parsedMetadata, err := inspecthtml.Parse(os.Stdin)
For any node of interest, retrieve it from the metadata provider.
nodeMetadata, hasNodeMetadata := parsedMetadata.GetNodeMetadata(node)
Always check that node or attribute metadata is available before accessing it. Specifically, keep in mind the following:
- The DOM Processor may inject elements to create a compliant HTML5 DOM Tree. Injected elements will not have any metadata since they were not present in source.
- The DOM Processor will close unclosed elements. In this case, the metadata will use a logical end tag of zero length based on the relative position of the next element.
- A node attribute may not have an offset for its value if there was no value in the source.
- Although unlikely, it is technically possible for a node attribute to be missing metadata due to the implementation method. In this case, it will be nil (and a test case for fixing it would be helpful).
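The zero-length logical end tag is visible in the parse-dump output from the Usage section: the unclosed `<p>` is reported with `EndTagToken=L2C1:L2C1;0x4e:0x4e`. The sketch below decodes that textual notation and flags empty ranges. It works on the dump strings only, since this README does not show the field names of the real metadata types. The `span` type, `parseSpan`, and `isEmpty` are hypothetical helpers, not part of the module.

```go
package main

import "fmt"

// span mirrors one annotation from the dump, such as "L1C1:L1C21;0x0:0x14".
// The type and its field names are hypothetical, chosen only for this sketch.
type span struct {
	fromLine, fromCol   int // 1-based start position
	untilLine, untilCol int // 1-based end position
	fromByte, untilByte int // 0-based byte offsets
}

// parseSpan decodes the "L<line>C<col>:L<line>C<col>;0x<hex>:0x<hex>" notation.
func parseSpan(s string) (span, error) {
	var sp span
	_, err := fmt.Sscanf(s, "L%dC%d:L%dC%d;0x%x:0x%x",
		&sp.fromLine, &sp.fromCol,
		&sp.untilLine, &sp.untilCol,
		&sp.fromByte, &sp.untilByte,
	)
	return sp, err
}

// isEmpty reports whether the span covers zero bytes, as a logical end tag
// for an unclosed element does.
func (sp span) isEmpty() bool {
	return sp.fromByte == sp.untilByte
}

func main() {
	annotations := []string{
		"L1C1:L1C21;0x0:0x14", // StartTagToken of <p> in the dump
		"L2C1:L2C1;0x4e:0x4e", // EndTagToken of the unclosed <p>
	}
	for _, a := range annotations {
		sp, err := parseSpan(a)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes, empty=%t\n", a, sp.untilByte-sp.fromByte, sp.isEmpty())
	}
}
```

A consumer that slices the original source by these offsets would get an empty string for such an end tag, so it may be worth treating empty ranges as "not present in source" rather than as an error.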
License