idr

package
v0.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 16, 2020 License: MIT Imports: 16 Imported by: 7

README

IDR

IDR == Intermediate Data Representation or In-memory Data Representation

IDR is an intermediate data structure used by omniparser ingesters to store raw data read from various formats of inputs, including CSV/txt/XML/EDI/JSON/etc, and then used by schema handlers to perform transforms. It is flexible and versatile to represent all kinds of data formats supported (or to be supported) by omniparser.

Credit: The basic data structures and various operations and algorithms used by IDR are mostly inherited/adapted from, modified based on, and inspired by works done in https://github.com/antchfx/xmlquery and https://github.com/antchfx/xpath. Thank you very much!

The basic building block of an IDR is a Node and an IDR is in fact a Node tree. Each Node has two parts (see actual code here):

type Node struct {
	ID int64
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node
	Type NodeType
	Data string

	FormatSpecific interface{}
}

The first part of a Node contains the input format agnostic fields, such as ID, tree pointers (like Parent, FirstChild, etc), Type and Data, which we'll explain more in details later. The second part of a Node is format specific data blob. The blob not only offers a place to store format specific data it also gives IDR code and algorithms a hint on what input format the Node is about.

We'll go through each input format we support and show what its corresponding IDR looks like. But before we dive into format specific IDR, let's take a look what ID is about.

Node.ID

During ingester reading and processing, tons of Node will be allocated on heap causing lots of GCs. We introduced Node caching and recycling in this commit to alleviate GC pressure. Prior to this, sometimes, we use the memory address of a Node as a unique identifier but that trick was no longer valid, since now Node can be recycled and reused. This new int64 ID is added to help uniquely identify a Node's use, whether the Node is allocated brand new for the use or is recycled for this use. The ID value is monotonically increasing but it really doesn't matter - what it matters is each "acquisition" of a new use of a Node will be guaranteed to have a new and unique ID value.

XML

Since XML is the most complex input format we have for IDR, let's cover it first.

Here is a simple XML (from this sample):

<Root>
    <JustDate>2020/09/22</JustDate>
    <DateTimeWithNoTZ>09/22/2020 12:34:56</DateTimeWithNoTZ>
</Root>

This is a simple XML blob with no non-default namespaces and with no attributes. Its corresponding IDR looks like this (with ID, empty fields and tree pointers omitted for clarity):

Node(Type: DocumentNode, FormatSpecific: XMLSpecific())
    Node(Type: ElementNode, Data: "Root", FormatSpecific: XMLSpecific())
        Node(Type: TextNode, Data: "\n", FormatSpecific: XMLSpecific())
        Node(Type: ElementNode, Data: "JustData", FormatSpecific: XMLSpecific())
            Node(Type: TextNode, Data: "2020/09/22", FormatSpecific: XMLSpecific())
        Node(Type: TextNode, Data: "\n", FormatSpecific: XMLSpecific())
        Node(Type: ElementNode, Data: "DateTimeWithNoTZ", FormatSpecific: XMLSpecific())
            Node(Type: TextNode, Data: "09/22/2020 12:34:56", FormatSpecific: XMLSpecific())
        Node(Type: TextNode, Data: "\n", FormatSpecific: XMLSpecific())

Most of the IDR is quite self-explanatory, but what about those TextNode's with \n as Data? Turns out xml.Decoder treats anything in between two XML element nodes as text, as long as the two elements are not directly adjacent to each other. Since there is a newline '\n' after the XML element <Root> and before <JustDate>, the '\n' is captured as a TextNode.

Also note in this simple case, each of the Node has an empty but none-nil FormatSpecific, typed as XMLSpecific. XMLSpecific contains XML namespace information for each of the node, which we'll see in the next example:

<lb0:library xmlns:lb0="uri://something">
    <lb0:books>
        <book title="Harry Potter and the Philosopher's Stone">
            <author>J. K. Rowling</author>
        </book>
    </lb0:books>
</lb0:library>

In this example, we'll see how IDR deals with XML namespaces, as well as attributes.

The IDR for the example above looks like the following (note those "dummy" text nodes sprinkled in between element nodes are omitted here for clarity; also not including empty XMLSpecific):

Node(Type: DocumentNode)
    Node(Type: ElementNode, Data: "library", FormatSpecific: XMLSpecific(NamespacePrefix: "lb0", NamespaceURI: "uri://something"))
        Node(Type: ElementNode, Data: "books", FormatSpecific: XMLSpecific(NamespacePrefix: "lb0", NamespaceURI: "uri://something"))
            Node(Type: ElementNode, Data: "book")
                Node(Type: AttributeNode, Data: "title")
                    Node(Type: TextNode, Data: "Harry Potter and the Philosopher's Stone")
                Node(Type: ElementNode, Data: "author")
                    Node(Type: TextNode, Data: "J. K. Rowling")

Both Node's representing <lb0:library> and <lb0:books> include non-empty XMLSpecific's which contain their namespace prefixes and full URIs while their Node.Data contain the element names without the namespace prefixes.

Note XML attributes on elements are represented as Node's as well, with Type: AttributeNode specifically. If an attribute is namespace-prefixed, the AttributeNode typed Node will have a non-empty XMLSpecific set as well. An attribute's actual value is placed as a TextNode underneath its ElementNode. AttributeNode's are guaranteed to be placed before any other child nodes (TextNode, or ElementNode) by IDR's XML reader.

JSON

Here is a sample JSON (adapted from this sample):

{
    "order_id": "1234567",
    "items": [
        {
            "number_purchased": 5
        },
        {
            "item_price": 3.99
            "refundable": false
            "refund_id": null
        }
    ]
}

Its corresponding IDR looks like this (with ID, and tree pointers omitted for clarity):

Node(Type: DocumentNode, FormatSpecific: JSONRoot|JSONObj)
    Node(Type: ElementNode, Data: "order_id", FormatSpecific: JSONProp)
        Node(Type: TextNode, Data: "1234567", FormatSpecific: JSONValueStr)
    Node(Type: ElementNode, Data: "items", FormatSpecific: JSONProp|JSONArr)
        Node(Type: ElementNode, Data: "", FormatSpecific: JSONObj)
            Node(Type: ElementNode, Data: "number_purchased", FormatSpecific: JSONProp)
                Node(Type: TextNode, Data: "5", FormatSpecific: JSONValueNum)
        Node(Type: ElementNode, Data: "", FormatSpecific: JSONObj)
            Node(Type: ElementNode, Data: "item_price", FormatSpecific: JSONProp)
                Node(Type: TextNode, Data: "3.99", FormatSpecific: JSONValueNum)
            Node(Type: ElementNode, Data: "refundable", FormatSpecific: JSONProp)
                Node(Type: TextNode, Data: "false", FormatSpecific: JSONValueBool)
            Node(Type: ElementNode, Data: "refund_id", FormatSpecific: JSONProp)
                Node(Type: TextNode, Data: "", FormatSpecific: JSONValueNull)

For JSON IDR, the FormatSpecific field contains JSONType. JSONType are a bunch of bit-wise flags that can be combined: in this example above, the root flag is JSONRoot|JSONObj because it is the root at the same time, it is an object. If we have a JSON looks like this:

[
    ...
]

Then the root flag would be JSONRoot|JSONArr. Such combination can also be seen on items field: its flag is JSONProp|JSONArr, indicating items is a property of array value.

Similar to XML case, values (string/number/boolean/null) are added as TextNode and anchored below its corresponding ElementNode parent.

CSV (aka delimited)

Here is a sample CSV (adapted from this sample):

DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH|NOTE|LAT|LONG|UV INDEX
2019/01/31T12:34:56-0800|10.5|30.2|N|33|note 1|37.7749|122.4194|12/4/6

The omniparser builtin CSV reader will only return data rows as IDR trees, so for this example, the row 2 will be returned as:

Node(Type: DocumentNode)
    Node(Type: ElementNode, Data: "DATE")
        Node(Type: TextNode, Data: "2019/01/31T12:34:56-0800")
    Node(Type: ElementNode, Data: "HIGH_TEMP_C")
        Node(Type: TextNode, Data: "10.5")
    Node(Type: ElementNode, Data: "LOW_TEMP_F")
        Node(Type: TextNode, Data: "30.2")
    Node(Type: ElementNode, Data: "WIND_DIR")
        Node(Type: TextNode, Data: "N")
    Node(Type: ElementNode, Data: "WIND_SPEED_KMH")
        Node(Type: TextNode, Data: "33")
    Node(Type: ElementNode, Data: "NOTE")
        Node(Type: TextNode, Data: "note 1")
    Node(Type: ElementNode, Data: "LAT")
        Node(Type: TextNode, Data: "37.7749")
    Node(Type: ElementNode, Data: "LONG")
        Node(Type: TextNode, Data: "122.4194")
    Node(Type: ElementNode, Data: "UV_INDEX")
        Node(Type: TextNode, Data: "12/4/6")

Note some of the ElementNode.Data values are from its schema to avoid column name containing space, which isn't xpath query friendly.

EDI

Here is snippet from a sample EDI:

ISA*00*          *00*          *02*CPC            *ZZ*00602679321    *191103*1800*U*00401*000001644*0*P*>
GS*QM*CPC*00602679321*20191103*1800*000001644*X*004010
<omitted>
N4*HAMMER*AB*T0C1Z0*CA
<omitted>

Note EDI content, while itself is "flat", is inherently hierarchical (like XML) and its hierarchical structure needs to be described in an accompanying schema. For this example, here is a snippet of the related schema:

"segment_declarations": [
    {
        "name": "ISA",
        "child_segments": [
            {
                "name": "GS",
                "child_segments": [
                    {
                        "name": "scanInfo", "type": "segment_group", "min": 0, "max": -1, "is_target": true,
                        "child_segments": [
                            <omitted>
                            {
                                "name": "N4",
                                "elements": [
                                    { "name": "cityName", "index": 1 },
                                    { "name": "provinceCode", "index": 2 },
                                    { "name": "postalCode", "index": 3 },
                                    { "name": "countryCode", "index": 4 }
                                ]
                            },
                            <omitted>

The omniparser builtin EDI reader constructs the following IDR node(s) for each segment of data:

Node(Type: ElementNode, Data: "ISA")

This "ISA" IDR element node contains no children.

Node(Type: ElementNode, Data: "N4")
    Node(Type: ElementNode, Data: "cityName")
        Node(Type: TextNode, Data: "HAMMER")
    Node(Type: ElementNode, Data: "provinceCode")
        Node(Type: TextNode, Data: "AB")
    Node(Type: ElementNode, Data: "postalCode")
        Node(Type: TextNode, Data: "T0C1Z0")
    Node(Type: ElementNode, Data: "countryCode")
        Node(Type: TextNode, Data: "CA")

This "N4" IDR element node contains multiple child element nodes for each of the element appearances in the EDI content.

Fixed-Length (mostly TXT)

In fixed-length files, we have the concept of an 'envelope'. An envelope can contain a single line of text, or a fixed number of consecutive lines, or multiple lines grouped together by a header and a footer. An envelope is the basic unit of record which the fixed-length reader returns to the parser.

Here is a sample single-line-envelope TXT (adapted from this sample):

...
2019/01/31T12:34:56-0800 10.5 30.2  N 33  37.7749 122.4194
...

The omniparser builtin fixed-length reader returns a fixed-length envelope as a IDR tree:

Node(Type: ElementNode)
    Node(Type: ElementNode, Data: "DATE")
        Node(Type: TextNode, Data: "2019/01/31T12:34:56-0800")
    Node(Type: ElementNode, Data: "HIGH_TEMP_C")
        Node(Type: TextNode, Data: "10.5")
    Node(Type: ElementNode, Data: "LOW_TEMP_F")
        Node(Type: TextNode, Data: "30.2")
    Node(Type: ElementNode, Data: "WIND_DIR")
        Node(Type: TextNode, Data: "N")
    Node(Type: ElementNode, Data: "WIND_SPEED_KMH")
        Node(Type: TextNode, Data: "33")
    Node(Type: ElementNode, Data: "LAT")
        Node(Type: TextNode, Data: "37.7749")
    Node(Type: ElementNode, Data: "LONG")
        Node(Type: TextNode, Data: "122.4194")

Documentation

Index

Constants

View Source
const (
	// DisableXPathCache disables caching xpath compilation when MatchAll/MatchSingle
	// are called. Useful when caller knows the xpath string isn't cache-able (such as
	// containing unique IDs, timestamps, etc) which would otherwise cause the xpath
	// compilation cache grow unbounded.
	DisableXPathCache = uint(1) << iota
)

Variables

View Source
var (
	// ErrNoMatch is returned when not a single matched node can be found.
	ErrNoMatch = errors.New("no match")
	// ErrMoreThanExpected is returned when more than expected matched nodes are found.
	ErrMoreThanExpected = errors.New("more than expected matched")
)

Functions

func AddChild

func AddChild(parent, n *Node)

AddChild adds 'n' as the new last child to 'parent'.

func IsJSON

func IsJSON(n *Node) bool

IsJSON checks if a Node is of JSON.

func IsJSONArr

func IsJSONArr(n *Node) bool

IsJSONArr checks if a given Node is of JSONArr type.

func IsJSONObj

func IsJSONObj(n *Node) bool

IsJSONObj checks if a given Node is of JSONObj type.

func IsJSONProp

func IsJSONProp(n *Node) bool

IsJSONProp checks if a given Node is of JSONProp type.

func IsJSONRoot

func IsJSONRoot(n *Node) bool

IsJSONRoot checks if a given Node is of JSONRoot type.

func IsJSONValue

func IsJSONValue(n *Node) bool

IsJSONValue checks if a given Node is of any JSON value types.

func IsJSONValueBool

func IsJSONValueBool(n *Node) bool

IsJSONValueBool checks if a given Node is of JSONValueBool type.

func IsJSONValueNull

func IsJSONValueNull(n *Node) bool

IsJSONValueNull checks if a given Node is of JSONValueNull type.

func IsJSONValueNum

func IsJSONValueNum(n *Node) bool

IsJSONValueNum checks if a given Node is of JSONValueNum type.

func IsJSONValueStr

func IsJSONValueStr(n *Node) bool

IsJSONValueStr checks if a given Node is of JSONValueStr type.

func IsXML

func IsXML(n *Node) bool

IsXML checks if a Node is of XML.

func J2NodeToInterface

func J2NodeToInterface(n *Node, useJSONType bool) interface{}

J2NodeToInterface translate an *idr.Node and its subtree into a JSON-marshaling friendly interface{}.

func JSONify1

func JSONify1(n *Node) string

JSONify1 json marshals a *Node verbatim. Mostly used in test for snapshotting.

func JSONify2

func JSONify2(n *Node) string

JSONify2 JSON marshals a *Node into a minified JSON string.

func MatchAny

func MatchAny(n *Node, expr *xpath.Expr) bool

MatchAny returns true if the xpath query 'expr' against an IDR tree rooted at 'n' yields any result.

func QueryIter

func QueryIter(n *Node, expr *xpath.Expr) *xpath.NodeIterator

QueryIter initiates an xpath query specified by 'expr' against an IDR tree rooted at 'n'.

func RemoveAndReleaseTree

func RemoveAndReleaseTree(n *Node)

RemoveAndReleaseTree removes a node and its subtree from an IDR tree it is in and release the resources (Node allocation) associated with the node and its subtree.

Types

type JSONStreamReader

type JSONStreamReader struct {
	// contains filtered or unexported fields
}

JSONStreamReader is a streaming JSON to *Node reader.

func NewJSONStreamReader

func NewJSONStreamReader(r io.Reader, xpathStr string) (*JSONStreamReader, error)

NewJSONStreamReader creates a new instance of JSON streaming reader.

func (*JSONStreamReader) AtLine

func (sp *JSONStreamReader) AtLine() int

AtLine returns the **rough** line number of the current JSON decoder.

func (*JSONStreamReader) Read

func (sp *JSONStreamReader) Read() (*Node, error)

Read returns a *Node that matches the xpath streaming criteria.

func (*JSONStreamReader) Release

func (sp *JSONStreamReader) Release(n *Node)

Release releases the *Node (and its subtree) that Read() has previously returned. Note even if Release is not explicitly called, next Read() call will still release the current streaming target node.

type JSONType

type JSONType uint

JSONType is the type of a JSON-specific Node. Note multiple JSONType can be bit-wise OR'ed together.

const (
	// JSONRoot is the type for the root Node in a JSON IDR tree.
	JSONRoot JSONType = 1 << iota
	// JSONObj is the type for a Node in a JSON IDR tree whose value is an object.
	JSONObj
	// JSONArr is the type for a Node in a JSON IDR tree whose value is an array.
	JSONArr
	// JSONProp is the type for a Node in a JSON IDR tree who is a property.
	JSONProp
	// JSONValueStr is the type for a Node in a JSON IDR tree who is a string value.
	JSONValueStr
	// JSONValueNum is the type for a Node in a JSON IDR tree who is a numeric value.
	JSONValueNum
	// JSONValueBool is the type for a Node in a JSON IDR tree who is a boolean value.
	JSONValueBool
	// JSONValueNull is the type for a Node in a JSON IDR tree who is a null value.
	JSONValueNull
)

func JSONTypeOf

func JSONTypeOf(n *Node) JSONType

JSONTypeOf returns the JSONType of a Node. Note if the Node isn't of JSON, this function will panic.

func (JSONType) String

func (jt JSONType) String() string

String converts JSONType to a string.

type Node

type Node struct {
	// ID uniquely identifies a Node, whether it's newly created or recycled and reused from
	// the node allocation cache. Previously we sometimes used a *Node's pointer address as a
	// unique ID which isn't sufficiently unique anymore given the introduction of using
	// sync.Pool for node allocation caching.
	ID int64

	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type NodeType
	Data string

	FormatSpecific interface{}
}

Node represents a node of element/data in an IDR (intermediate data representation) ingested and created by the omniparser. Credit: this is by and large a copy and some adaptation from https://github.com/antchfx/xmlquery/blob/master/node.go. The reasons we want to have our own struct:

  • more stability
  • one struct to represent XML/JSON/EDI/CSV/txt/etc. Vs antchfx's work have one struct (in each repo) for each format.
  • Node allocation recycling.

func CreateJSONNode

func CreateJSONNode(ntype NodeType, data string, jtype JSONType) *Node

CreateJSONNode creates a JSON Node.

func CreateNode

func CreateNode(ntype NodeType, data string) *Node

CreateNode creates a generic *Node.

func CreateXMLNode

func CreateXMLNode(ntype NodeType, data string, xmlSpecific XMLSpecific) *Node

CreateXMLNode creates an XML Node.

func MatchAll

func MatchAll(n *Node, exprStr string, flags ...uint) ([]*Node, error)

MatchAll returns all the matched nodes by an xpath query 'exprStr' against an IDR tree rooted at 'n'.

func MatchSingle

func MatchSingle(n *Node, exprStr string, flags ...uint) (*Node, error)

MatchSingle returns one and only one matched node by an xpath query 'exprStr' against an IDR tree rooted at 'n'. If no matching node is found, ErrNoMatch is returned; if more than one matching nodes are found, ErrMoreThanExpected is returned.

func (*Node) InnerText

func (n *Node) InnerText() string

InnerText returns a Node's children's texts concatenated. Note (in an XML IDR tree) none of the AttributeNode's text will be included.

type NodeType

type NodeType uint

NodeType is the type of a Node in an IDR.

const (
	// DocumentNode is the type of the root Node in an IDR tree.
	DocumentNode NodeType = iota
	// ElementNode is the type of an element Node in an IDR tree.
	ElementNode
	// TextNode is the type of an text/data Node in an IDR tree.
	TextNode
	// AttributeNode is the type of an attribute Node in an IDR tree.
	AttributeNode
)

func (NodeType) String

func (nt NodeType) String() string

String converts NodeType to a string.

type XMLSpecific

type XMLSpecific struct {
	NamespacePrefix string
	NamespaceURI    string
}

XMLSpecific contains XML IDR Node specific information such as namespace.

func XMLSpecificOf

func XMLSpecificOf(n *Node) XMLSpecific

XMLSpecificOf returns the XMLSpecific field of a Node. Note if the Node isn't of XML, this function will panic.

type XMLStreamReader

type XMLStreamReader struct {
	// contains filtered or unexported fields
}

XMLStreamReader is a streaming XML to *Node reader.

func NewXMLStreamReader

func NewXMLStreamReader(r io.Reader, xpathStr string) (*XMLStreamReader, error)

NewXMLStreamReader creates a new instance of XML streaming reader.

func (*XMLStreamReader) AtLine

func (sp *XMLStreamReader) AtLine() int

AtLine returns the **rough** line number of the current XML decoder.

func (*XMLStreamReader) Read

func (sp *XMLStreamReader) Read() (n *Node, err error)

Read returns a *Node that matches the xpath streaming criteria.

func (*XMLStreamReader) Release

func (sp *XMLStreamReader) Release(n *Node)

Release releases the *Node (and its subtree) that Read() has previously returned. Note even if Release is not explicitly called, next Read() call will still release the current streaming target node.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL