xmlcutty

v0.1.6
Published: Aug 1, 2019 License: GPL-3.0 Imports: 1 Imported by: 0

README

The game ain't in me no more. None of it.

xmlcutty is a simple tool for carving out elements from large XML files, fast. Since it works in a streaming fashion, it uses almost no memory and can process around 1 GB of XML per minute.

Why? Background.

Install

Use a deb or rpm release. It's in AUR, too.

Or install with the go tool:

$ go get github.com/miku/xmlcutty/cmd/xmlcutty

Usage

$ cat fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>

Options:

$ xmlcutty -h
Usage of xmlcutty:
  -path string
        select path (default "/")
  -rename string
        rename wrapper element to this name
  -root string
        synthetic root element
  -v    show version

It looks a bit like XPath, but it really is only a simple matcher.

$ xmlcutty -path /a fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>

You specify a path, e.g. /a/b and all elements matching this path are printed:

$ xmlcutty -path /a/b fixtures/sample.xml
<b>
    <c></c>
</b>
<b>
    <c></c>
</b>
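
To illustrate what a simple matcher of this kind can look like, here is a minimal sketch in Go. It is not xmlcutty's actual implementation. It assumes the standard library's encoding/xml token stream and keeps a stack of open element names. The names carve and carveString are hypothetical, and the sketch returns only the inner XML of each match rather than the whole element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// carve streams through r and returns the inner XML of every element whose
// path (built from local names only) equals path, e.g. "/a/b".
// Hypothetical helper, not part of xmlcutty.
func carve(r io.Reader, path string) ([]string, error) {
	dec := xml.NewDecoder(r)
	var stack []string // local names of the currently open elements
	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, t.Name.Local)
			if "/"+strings.Join(stack, "/") != path {
				continue
			}
			var el struct {
				Inner string `xml:",innerxml"`
			}
			if err := dec.DecodeElement(&el, &t); err != nil {
				return out, err
			}
			out = append(out, el.Inner)
			// DecodeElement consumed the matching end tag, so pop here.
			stack = stack[:len(stack)-1]
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		}
	}
}

// carveString is a convenience wrapper for in-memory documents.
func carveString(doc, path string) ([]string, error) {
	return carve(strings.NewReader(doc), path)
}

func main() {
	const doc = `<a><b><c></c></b><b><c></c></b></a>`
	matches, err := carveString(doc, "/a/b")
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		fmt.Println(m)
	}
}
```

Only one element is decoded at a time, so memory use depends on the size of the largest match, not on the size of the file.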

You can end up with an XML document without a root. To make tools like xmllint happy, you can add a synthetic root element on the fly:

$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hello>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</hello>

Rename the wrapper element, that is, the last element of the matching path:

$ xmlcutty -rename beee -path /a/b fixtures/sample.xml
<beee>
    <c></c>
</beee>
<beee>
    <c></c>
</beee>

All options combined, with a synthetic root element and a renamed path element:

$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hi>
    <ceee/>
    <ceee/>
</hi>

It will parse XML files without a root element just fine.

$ head fixtures/oai.xml
<record>
    <header>
        <identifier>oai:arXiv.org:0704.0004</identifier>
        <datestamp>2007-05-23</datestamp>
        <setSpec>math</setSpec>
    </header>
    <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"... >
            <dc:title>A determinant of Stirling cycle numbers counts ...
            <dc:type>text</dc:type>
            <dc:identifier>http://arxiv.org/abs/0704.0004</dc:identifier>
...

This is an example XML response from a web service. We can slice out the identifier elements. Note that namespace prefixes (here oai_dc and dc) are completely ignored for the sake of simplicity:

$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \
                       | xmllint --format -
<?xml version="1.0"?>
<x>
    <identifier>http://arxiv.org/abs/0704.0004</identifier>
    <identifier>http://arxiv.org/abs/0704.0010</identifier>
    <identifier>http://arxiv.org/abs/0704.0012</identifier>
</x>
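
One way to get this prefix-free matching in Go is to compare only the local part of each element name. The sketch below assumes the standard library's encoding/xml, which splits each name into Name.Space and Name.Local. The function name localPaths, the sample document, and its namespace URI are made up for illustration. It is not taken from xmlcutty's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// localPaths returns the slash-separated path of every start element in doc,
// built from local names only, so a prefix such as "p:" drops out.
// Hypothetical helper, not part of xmlcutty.
func localPaths(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var stack, paths []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return paths, nil
		}
		if err != nil {
			return paths, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// t.Name.Space carries the namespace; only Local is used here.
			stack = append(stack, t.Name.Local)
			paths = append(paths, "/"+strings.Join(stack, "/"))
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		}
	}
}

func main() {
	// A made-up document; the namespace URI is a placeholder.
	const doc = `<metadata><p:dc xmlns:p="urn:example"><p:identifier>x</p:identifier></p:dc></metadata>`
	paths, err := localPaths(doc)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The trade-off is that two elements with the same local name but different namespaces cannot be told apart by such a matcher.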

We can go a bit further and extract the text of an element, which is like a poor man's text() in XPath terms. By using a newline as the argument to rename, we effectively get rid of the enclosing XML tag:

$ cat fixtures/oai.xml | xmlcutty -rename '\n' -path /record/metadata/dc/identifier \
                       | grep -v "^$"
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012

This last feature is nice to quickly extract text from large XML files.
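
For comparison, the same kind of text extraction can be sketched directly in Go. This is an illustrative alternative built on the standard library's encoding/xml, not a description of how xmlcutty implements -rename. The names texts and textsFromString are hypothetical, and the sample record is shortened from the fixture shown above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// texts streams through r and returns the character data of every element
// whose local-name path equals path. Hypothetical helper, not part of xmlcutty.
func texts(r io.Reader, path string) ([]string, error) {
	dec := xml.NewDecoder(r)
	var stack, out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, t.Name.Local)
			if "/"+strings.Join(stack, "/") != path {
				continue
			}
			// Decoding into a string collects the element's character data.
			var s string
			if err := dec.DecodeElement(&s, &t); err != nil {
				return out, err
			}
			out = append(out, s)
			// DecodeElement consumed the matching end tag, so pop here.
			stack = stack[:len(stack)-1]
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		}
	}
}

// textsFromString is a convenience wrapper for in-memory documents.
func textsFromString(doc, path string) ([]string, error) {
	return texts(strings.NewReader(doc), path)
}

func main() {
	const doc = `<record><metadata><dc><identifier>http://arxiv.org/abs/0704.0004</identifier></dc></metadata></record>`
	ids, err := textsFromString(doc, "/record/metadata/dc/identifier")
	if err != nil {
		panic(err)
	}
	for _, id := range ids {
		fmt.Println(id)
	}
}
```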

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de

The Finc Authors, http://finc.info
Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

Package xmlcutty implements support for the xmlcutty command line tool.

Types

type StringStack

type StringStack struct {
	// contains filtered or unexported fields
}

StringStack implements a LIFO stack of strings. It is not safe for concurrent use.

func (*StringStack) Pop

func (q *StringStack) Pop() string

Pop removes the last added element from the stack and returns it. Panics on an empty stack.

func (*StringStack) Push

func (q *StringStack) Push(s string)

Push adds an element to the stack.

func (*StringStack) String

func (q *StringStack) String() string

String formats the stack in a path-like manner.

func (*StringStack) Top

func (q *StringStack) Top() string

Top retrieves the last added element. Panics on an empty stack.
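
Since the fields of StringStack are unexported, the following is only a sketch of how a stack with the same method set could behave. The type pathStack is a hypothetical stand-in, and the "/a/b" output format of String is an assumption based on the phrase "path-like manner" and on the paths used by the command line tool.

```go
package main

import (
	"fmt"
	"strings"
)

// pathStack is a hypothetical stand-in for a LIFO of element names.
// It is not the package's StringStack and is not safe for concurrent use.
type pathStack struct {
	items []string
}

// Push adds an element to the stack.
func (s *pathStack) Push(v string) {
	s.items = append(s.items, v)
}

// Pop removes and returns the last added element.
// Indexing an empty slice panics, which mirrors the documented behavior.
func (s *pathStack) Pop() string {
	v := s.items[len(s.items)-1]
	s.items = s.items[:len(s.items)-1]
	return v
}

// Top returns the last added element without removing it.
func (s *pathStack) Top() string {
	return s.items[len(s.items)-1]
}

// String joins the elements into a path; the exact format is an assumption.
func (s *pathStack) String() string {
	return "/" + strings.Join(s.items, "/")
}

func main() {
	var s pathStack
	s.Push("a")
	s.Push("b")
	fmt.Println(s.String()) // /a/b
	fmt.Println(s.Pop())    // b
	fmt.Println(s.Top())    // a
}
```

A stack like this lets a streaming parser track the current element path by pushing on each start tag and popping on each end tag.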

Directories

Path Synopsis
cmd
