dataset

package module
v0.0.32-dev
Published: Mar 8, 2018 License: BSD-3-Clause Imports: 52 Imported by: 1

README

dataset

dataset is a command line tool for working with JSON (object) documents stored as collections. This supports basic storage actions (e.g. CRUD operations, filtering and extraction) as well as indexing and searching. A project goal of dataset is to "play nice" with shell scripts and other Unix tools (e.g. it respects standard in, out and error with minimal side effects). This means it is easily scriptable via Bash, Posix shell or interpreted languages like R.

dataset includes an implementation as a Python3 module. The same functionality as in the command line tool is replicated for Python3.

Finally, dataset is a Go package for managing JSON documents and their attachments on disc or in cloud storage (e.g. Amazon S3, Google Cloud Storage). The command line utilities exercise this package extensively.

The inspiration for creating dataset was the desire to process metadata as JSON document collections using Unix shell utilities and pipelines. While it has grown in capabilities, that remains a core use case.

dataset organizes JSON documents by unique names in collections. Collections are represented as an index into a series of buckets. The buckets are subdirectories (or paths under cloud storage services). Buckets hold individual JSON documents and their attachments. The JSON document is assigned automatically to a bucket (and the bucket generated if necessary) when it is added to a collection. Assigning documents to buckets avoids having too many documents assigned to a single path (e.g. on some Unix systems there is a limit to how many documents are held in a single directory). In addition to using the dataset command you can list and manipulate the JSON documents directly with common Unix commands like ls, find, grep or their cloud counterparts.
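As a sketch of the bucket-assignment idea, any deterministic mapping from key to bucket will do. The helper below is purely illustrative (the dataset package itself records each key's bucket in collection.json, and its actual assignment scheme may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBucket deterministically maps a document key to one of the
// collection's bucket names. NOTE: this is an illustrative sketch,
// not the dataset package's algorithm; the package records each
// key's bucket assignment in collection.json.
func pickBucket(key string, buckets []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return buckets[int(h.Sum32())%len(buckets)]
}

func main() {
	buckets := []string{"aa", "ab", "ba", "bb"}
	// The same key always resolves to the same bucket.
	fmt.Println(pickBucket("freda", buckets))
}
```

Because the mapping is deterministic, a key resolves to the same bucket every time; dataset additionally keeps the keymap on disc so lookups don't depend on recomputing anything.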

See getting-started-with-dataset.md for a tour of functionality.

Limitations of dataset

dataset has many limitations; some are listed below:

  • it is not a multi-process, multi-user data store (it's just files on disc)
  • it is not a repository management system
  • it is not a general purpose multiuser database system

Operations

The basic operations supported by dataset are listed below, organized by collection and JSON document level.

Collection Level
  • init creates a collection
  • import-csv imports JSON documents from rows of a CSV file
  • import-gsheet imports JSON documents from rows of a Google Sheet
  • export-csv exports JSON documents from a collection into a CSV file
  • export-gsheet exports JSON documents from a collection into a Google Sheet
  • keys list keys of JSON documents in a collection, supports filtering and sorting
  • haskey returns true if key is found in collection, false otherwise
  • count returns the number of documents in a collection, supports filtering for subsets
  • extract lists unique JSON attribute values from a collection
JSON Document level
  • create a JSON document in a collection
  • read back a JSON document in a collection
  • update a JSON document in a collection
  • delete a JSON document in a collection
  • join a JSON document with a document in a collection
  • list lists JSON records as an array for the supplied keys
  • path list the file path for a JSON document in a collection
JSON Document Attachments
  • attach a file to a JSON document in a collection
  • attachments lists the files attached to a JSON document in a collection
  • detach retrieve an attached file associated with a JSON document in a collection
  • prune delete one or more attached files of a JSON document in a collection
  • indexer indexes JSON documents in a collection for searching with find
  • deindexer de-indexes (removes) JSON documents from an index
  • find provides an index based full text search interface for collections

Example

Common operations using the dataset command line tool

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document
    # Create a collection "mystuff.ds", the ".ds" lets the bin/dataset command know that's the collection to use. 
    bin/dataset mystuff.ds init
    # if successful then you should see an OK otherwise an error message

    # Create a JSON document 
    bin/dataset mystuff.ds create freda '{"name":"freda","email":"freda@inverness.example.org"}'
    # If successful then you should see an OK otherwise an error message

    # Read a JSON document
    bin/dataset mystuff.ds read freda
    
    # Path to JSON document
    bin/dataset mystuff.ds path freda

    # Update a JSON document
    bin/dataset mystuff.ds update freda '{"name":"freda","email":"freda@zbs.example.org", "count": 2}'
    # If successful then you should see an OK or an error message

    # List the keys in the collection
    bin/dataset mystuff.ds keys

    # Get keys filtered for the name "freda"
    bin/dataset mystuff.ds keys '(eq .name "freda")'

    # Join freda-profile.json with "freda" adding unique key/value pairs
    bin/dataset mystuff.ds join append freda freda-profile.json

    # Join freda-profile.json overwriting common key/values and adding unique key/value pairs
    # from freda-profile.json
    bin/dataset mystuff.ds join overwrite freda freda-profile.json

    # Delete a JSON document
    bin/dataset mystuff.ds delete freda

    # Import data from a CSV file using column 1 as key
    bin/dataset -quiet -nl=false mystuff.ds import-csv my-data.csv 1

    # To remove the collection just use the Unix shell command
    rm -fR mystuff.ds

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrel, <tmorrell@library.caltech.edu>

Copyright (c) 2018, Caltech All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Package dataset provides a common approach for storing JSON object documents on local disc or on S3 and Google Cloud Storage. It is intended as a single-user system for intermediate processing of JSON content for analysis or batch processing. It is not a database management system (if you need a JSON database system I would suggest looking at CouchDB, MongoDB or Redis as a starting point).

The approach dataset takes is to maintain a JSON document (collection.json) that maps keys (document names) to bucket assignments. JSON documents (and possibly their attachments) are then stored based on that assignment. Conversely, the collection.json document is used to find and retrieve documents from the collection. The layout of the metadata is as follows:

+ Collection

  • Collection/collection.json - metadata for retrieval
  • Collection/[Buckets] - usually an "aa" to "zz" list of buckets
  • Collection/[Bucket]/[Document]

A key feature of dataset is to be Posix shell friendly. This has led to storing the JSON documents in a directory structure that standard Posix tooling can traverse. It has also meant that the JSON documents themselves remain on "disc" as plain text. This has facilitated integration with many other applications, programming languages and systems.

Attachments are non-JSON documents explicitly "attached" to a JSON document; they share the same basename but are placed in a tar ball (e.g. document Jane.Doe.json's attachments would be stored in Jane.Doe.tar).
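As an illustration of this naming scheme, the sketch below derives the tar name from a document name and writes a single attachment into an in-memory tar archive using only the standard library. attachTarName and writeAttachment are hypothetical helpers, not part of the dataset package:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"strings"
)

// attachTarName derives the tarball name that would hold a JSON
// document's attachments, e.g. "Jane.Doe.json" -> "Jane.Doe.tar".
func attachTarName(docName string) string {
	return strings.TrimSuffix(docName, ".json") + ".tar"
}

// writeAttachment writes a single named file into an in-memory tar
// archive. The real package writes the tar file next to the JSON
// document (or to cloud storage).
func writeAttachment(name string, body []byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(body); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	fmt.Println(attachTarName("Jane.Doe.json"))
	data, err := writeAttachment("helloworld.txt", []byte("Hello World!!!!"))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data) > 0)
}
```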

Additional operations beside storing and reading JSON documents are also supported. These include creating lists (arrays) of JSON documents from a list of keys, listing keys in the collection, counting documents in the collection, indexing and searching by indexes.

The primary use case driving the development of dataset is harvesting API content for library systems (e.g. EPrints, Invenio, ArchivesSpace, ORCID, CrossRef, OCLC). The harvesting needed to be done in such a way as to leverage existing Posix tooling (e.g. grep, sed, etc) for processing and analysis.

Initial use case:

Caltech Library has many repository, catalog and record management systems (e.g. EPrints, Invenio, ArchivesSpace, Islandora). It is common practice to harvest data from these systems for analysis or processing. Harvested records typically come in XML or JSON format. JSON has proven a flexible way of working with the data and, in our more modern tools, is the common format we use to move data around. We needed a way to standardize how we stored these JSON records for intermediate processing to allow us to use the growing ecosystem of JSON related tooling available under Posix/Unix compatible systems.

Approach to file system layout

+ /dataset (directory on file system)

  • collection (directory on file system)
  • collection.json - metadata about collection
  • maps the filename of the JSON blob stored to a bucket in the collection
  • e.g. file "mydocs.json" stored in bucket "aa" would have a map of {"mydocs.json": "aa"}
  • keys.json - a list of keys in the collection (it is the default select list)
  • BUCKETS - a sequence of alphabet names for buckets holding JSON documents and their attachments
  • Buckets keep common commands like ls, tree, etc. usable when the document count is high
  • SELECT_LIST.json - a JSON document holding an array of keys
  • the default select list is "keys", it is not mutable by Push, Pop, Shift and Unshift
  • select lists cannot be named "keys" or "collection"

BUCKET names carry no meaning and normally use alphabetic characters. A dataset defined with four buckets might look like aa, ab, ba, bb. These directories contain JSON documents and a tar file if the document has attachments.
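The bucket name list can be derived from an alphabet by simple combination. The sketch below mirrors what the defaults imply (26 letters, names of length 2, hence the 676 entries of DefaultBucketNames); generateBucketNames is an illustrative function, not the package's API:

```go
package main

import "fmt"

// generateBucketNames builds every name of the given length from the
// alphabet, e.g. length 2 over a-z yields "aa" through "zz"
// (26*26 = 676 names, matching the size of DefaultBucketNames).
func generateBucketNames(alphabet string, length int) []string {
	names := []string{""}
	for i := 0; i < length; i++ {
		var next []string
		for _, prefix := range names {
			for _, c := range alphabet {
				next = append(next, prefix+string(c))
			}
		}
		names = next
	}
	return names
}

func main() {
	names := generateBucketNames("abcdefghijklmnopqrstuvwxyz", 2)
	fmt.Println(len(names), names[0], names[len(names)-1]) // 676 aa zz
}
```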

Operations

+ Collection level

  • InitCollection (collection) - creates or opens collection structure on disc, creates collection.json and keys.json if new
  • Open (collection) - opens an existing collection and reads collection.json into memory
  • Close (collection) - writes changes to collection.json to disc if dirty
  • Keys (collection) - list of keys in the collection

+ JSON document level

  • Create (JSON document) - saves a new JSON blob or overwrites an existing one on disc with the given blob name, updates keys.json if needed
  • Read (JSON document) - finds the JSON document in the buckets and returns the JSON document contents
  • Update (JSON document) - updates an existing blob on disc (record must already exist)
  • Delete (JSON document) - removes a JSON blob from disc
  • Path (JSON document) - returns the path to the JSON document

+ Select list level

  • Count (select list) - returns the number of keys in a select list

Example

Common operations using the *dataset* command line tool

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document

Example Bash script usage

# Create a collection "mystuff.ds" inside the directory called demo
dataset init mystuff.ds
# if successful an expression to export the collection name is shown
export DATASET="mystuff.ds"

# Create a JSON document
dataset create freda.json '{"name":"freda","email":"freda@inverness.example.org"}'
# If successful then you should see an OK or an error message

# Read a JSON document
dataset read freda.json

# Path to JSON document
dataset path freda.json

# Update a JSON document
dataset update freda.json '{"name":"freda","email":"freda@zbs.example.org"}'
# If successful then you should see an OK or an error message

# List the keys in the collection
dataset keys

# Delete a JSON document
dataset delete freda.json

# To remove the collection just use the Unix shell command
# /bin/rm -fR mystuff.ds

Common operations shown in Golang

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document

Example Go code

// Create a collection "mystuff" inside the directory called demo
collection, err := dataset.InitCollection("mystuff.ds")
if err != nil {
    log.Fatalf("%s", err)
}
defer collection.Close()
// Create a JSON document
docName := "freda.json"
document := map[string]interface{}{"name": "freda", "email": "freda@inverness.example.org"}
if err := collection.Create(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Attach an image file to freda.json in the collection
if buf, err := ioutil.ReadFile("images/freda.png"); err != nil {
    log.Fatalf("%s", err)
} else if err := collection.Attach(docName, "images/freda.png", buf); err != nil {
    log.Fatalf("%s", err)
}
// Read a JSON document
if err := collection.Read(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Update a JSON document
document["email"] = "freda@zbs.example.org"
if err := collection.Update(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Delete a JSON document
if err := collection.Delete(docName); err != nil {
    log.Fatalf("%s", err)
}

Working with attachments in Go

    collection, err := dataset.Open("dataset/mystuff")
    if err != nil {
        log.Fatalf("%s", err)
    }
    defer collection.Close()

	// Add a helloworld.txt file to freda.json record as an attachment.
    if err := collection.Attach("freda", "docs/helloworld.txt", []byte("Hello World!!!!")); err != nil {
        log.Fatalf("%s", err)
    }

	// Attach additional files from the filesystem by their relative file paths
	if err := collection.AttachFiles("freda", "docs/presentation-article.pdf", "docs/charts-and-figures.zip", "docs/transcript.fdx"); err != nil {
        log.Fatalf("%s", err)
	}

	// List the attached files for freda.json
	if filenames, err := collection.Attachments("freda"); err != nil {
        log.Fatalf("%s", err)
	} else {
		fmt.Printf("%s\n", strings.Join(filenames, "\n"))
	}

	// Get an array of attachments (reads in content into memory as an array of Attachment Structs)
	allAttachments, err := collection.GetAttached("freda")
	if err != nil {
        log.Fatalf("%s", err)
	}
	fmt.Printf("all attachments: %+v\n", allAttachments)

	// Get two attachments docs/transcript.fdx, docs/helloworld.txt
	twoAttachments, _ := collection.GetAttached("freda", "docs/transcript.fdx", "docs/helloworld.txt")
	fmt.Printf("two attachments: %+v\n", twoAttachments)

    // Get attached files writing them out to disc relative to your working directory
	if err := collection.GetAttachedFiles("freda"); err != nil {
        log.Fatalf("%s", err)
	}

	// Get two selected attached files, writing them out to disc relative to your working directory
	if err := collection.GetAttachedFiles("freda", "docs/transcript.fdx", "docs/helloworld.txt"); err != nil {
        log.Fatalf("%s", err)
	}

    // Remove docs/transcript.fdx and docs/helloworld.txt from freda.json attachments
	if err := collection.Detach("freda", "docs/transcript.fdx", "docs/helloworld.txt"); err != nil {
        log.Fatalf("%s", err)
	}

	// Remove all attached files from freda.json
	if err := collection.Detach("freda"); err != nil {
        log.Fatalf("%s", err)
	}

Index

Constants

View Source
const (
	// Version of the dataset package
	Version = `v0.0.32-dev`

	// License is a formatted from for dataset package based command line tools
	License = `` /* 1530-byte string literal not displayed */

	DefaultAlphabet = `abcdefghijklmnopqrstuvwxyz`

	ASC  = iota
	DESC = iota
)

Variables

View Source
var DefaultBucketNames = []string{}/* 676 elements not displayed */

DefaultBucketNames provides an A-Z list of bucket names of length 2

Functions

func Analyzer added in v0.0.3

func Analyzer(collectionName string) error

Analyzer checks a collection for problems

  • checks if collection.json exists and is valid
  • checks the version of the collection against the version of the dataset tool running
  • checks if all collection.buckets exist
  • checks for unaccounted for buckets
  • checks if all keys in collection.keymap exist
  • checks for unaccounted for keys in buckets
  • checks for keys in multiple buckets and reports duplicate record modified times

func CSVFormatter added in v0.0.3

func CSVFormatter(out io.Writer, results *bleve.SearchResult, colNames []string, skipHeaderRow bool) error

CSVFormatter writes out CSV representation using encoding/csv

func Deindexer added in v0.0.33

func Deindexer(idxName string, keys []string, batchSize int) error

Deindexer deletes the keys from an index. Returns an error if a problem occurs.

func Delete

func Delete(name string) error

Delete an entire collection

func Find

func Find(idxAlias bleve.IndexAlias, queryString string, options map[string]string) (*bleve.SearchResult, error)

Find takes a Bleve index alias and a query string and returns the search results. The function returns an error if there are problems.

func Formatter added in v0.0.3

func Formatter(out io.Writer, results *bleve.SearchResult, tmpl *template.Template, tName string, pageData map[string]string) error

Formatter writes out a format based on the specified template name merging any additional pageData provided

func JSONFormatter added in v0.0.3

func JSONFormatter(out io.Writer, results *bleve.SearchResult, prettyPrint bool) error

JSONFormatter writes out JSON representation using encoding/json

func Repair added in v0.0.3

func Repair(collectionName string) error

Repair will take a collection name and attempt to recreate a valid collection.json from the content discovered in buckets and attached documents

Types

type Attachment

type Attachment struct {
	// Name is the filename and path to be used inside the generated tar file
	Name string `json:"name"`
	// Body is a byte array for storing the content associated with Name
	Body []byte `json:"-"`
	// Size
	Size int64 `json:"size"`
}

Attachment is a structure for holding non-JSON content you wish to store alongside a JSON document in a collection

type Collection

type Collection struct {
	// Version of collection being stored
	Version string `json:"version"`
	// Name of collection
	Name string `json:"name"`
	// Buckets is a list of bucket names used by collection
	Buckets []string `json:"buckets"`
	// KeyMap holds the document name to bucket map for the collection
	KeyMap map[string]string `json:"keymap"`
	// Store holds the storage system information (e.g. local disc, S3, GS)
	// and related methods for interacting with it
	Store *storage.Store `json:"-"`
	// FullPath is the fully qualified path on disc or URI to S3 or GS bucket
	FullPath string `json:"-"`
}

Collection is the container holding buckets which in turn hold JSON docs

func InitCollection added in v0.0.8

func InitCollection(name string) (*Collection, error)

InitCollection - creates a new collection with default alphabet and names of length 2.

func Open

func Open(name string) (*Collection, error)

Open reads in a collection's metadata and returns a new collection structure and an error value

func (*Collection) AttachFile added in v0.0.33

func (c *Collection) AttachFile(keyName, fName string, buf io.Reader) error

AttachFile is for attaching a single non-JSON document to a dataset record. It will replace ANY existing attached content (i.e. it creates a new tarball holding only this single document). This is a limitation of our storage package supporting a minimal set of operations across all storage environments (e.g. S3/Google Cloud Storage do not support appending to a file, only replacement). It takes the document key, a name and an io.Reader, writes the content into the tar file and updates the internal _Attributes metadata as needed.

func (*Collection) AttachFiles

func (c *Collection) AttachFiles(name string, fileNames ...string) error

AttachFiles attaches non-JSON documents to a JSON document in the collection. Attachments are stored in a tar file; if the tar file exists then the attachment(s) are appended to it.

func (*Collection) Attachments

func (c *Collection) Attachments(name string) ([]string, error)

Attachments returns a list of files in the attached tarball for a given name in the collection

func (*Collection) Close

func (c *Collection) Close() error

Close closes a collection, writing the updated keys to disc

func (*Collection) Create

func (c *Collection) Create(name string, data map[string]interface{}) error

Create adds a JSON doc from a map[string]interface{} to a collection, returning an error if there is a problem. The name must be unique and the document must be a JSON object (not an array).

func (*Collection) CreateJSON added in v0.0.33

func (c *Collection) CreateJSON(key string, src []byte) error

CreateJSON adds a JSON doc to a collection, if a problem occurs it returns an error

func (*Collection) Deindexer added in v0.0.33

func (c *Collection) Deindexer(idxName string, keys []string, batchSize int) error

Deindexer removes items from an index on a collection.

func (*Collection) Delete

func (c *Collection) Delete(name string) error

Delete removes a JSON doc from a collection

func (*Collection) DocPath

func (c *Collection) DocPath(name string) (string, error)

DocPath returns a full path to a key or an error if not found

func (*Collection) ExportCSV added in v0.0.3

func (c *Collection) ExportCSV(fp io.Writer, eout io.Writer, filterExpr string, dotExpr []string, colNames []string, verboseLog bool) (int, error)

ExportCSV iterates over the filtered records in the collection and exports them as CSV rows to the writer provided

func (*Collection) Extract added in v0.0.3

func (c *Collection) Extract(filterExpr string, dotExpr string) ([]string, error)

Extract takes a collection, a filter and a dot path and returns a list of unique values. E.g. in a collection of article records, extracting the ORCID ids which appear as values in the authors field.
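A stdlib-only sketch of the dot-path idea, assuming records have already been decoded into map[string]interface{}. The package's own dot-path syntax and filtering are richer; extractUnique is a hypothetical helper:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// extractUnique collects the unique string values found at a dot path
// (e.g. ".author.orcid") across a set of decoded JSON records. It is
// an illustrative stand-in for what Extract does over a collection.
func extractUnique(records []map[string]interface{}, dotPath string) []string {
	parts := strings.Split(strings.TrimPrefix(dotPath, "."), ".")
	seen := map[string]bool{}
	for _, rec := range records {
		var cur interface{} = rec
		ok := true
		for _, p := range parts {
			m, isMap := cur.(map[string]interface{})
			if !isMap {
				ok = false
				break
			}
			cur, ok = m[p]
			if !ok {
				break
			}
		}
		if s, isStr := cur.(string); ok && isStr {
			seen[s] = true
		}
	}
	var out []string
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	records := []map[string]interface{}{
		{"author": map[string]interface{}{"orcid": "0000-0001-2345-6789"}},
		{"author": map[string]interface{}{"orcid": "0000-0001-2345-6789"}},
		{"author": map[string]interface{}{"orcid": "0000-0002-9999-0000"}},
	}
	// Duplicates collapse to a sorted unique list.
	fmt.Println(extractUnique(records, ".author.orcid"))
}
```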

func (*Collection) GetAttachedFiles

func (c *Collection) GetAttachedFiles(name string, filterNames ...string) error

GetAttachedFiles writes the attached file(s) to the destination directory as a side effect, returning an error if one is encountered. If no filterNames are provided then all attachments are written out.

func (*Collection) HasKey added in v0.0.3

func (c *Collection) HasKey(key string) bool

HasKey returns true if key is in collection's KeyMap, false otherwise

func (*Collection) ImportCSV added in v0.0.3

func (c *Collection) ImportCSV(buf io.Reader, skipHeaderRow bool, idCol int, useUUID bool, verboseLog bool) (int, error)

ImportCSV takes a reader and iterates over the rows, importing them as JSON records into the dataset collection.

func (*Collection) ImportTable added in v0.0.4

func (c *Collection) ImportTable(table [][]string, skipHeaderRow bool, idCol int, useUUID, overwrite, verboseLog bool) (int, error)

ImportTable takes a [][]string and iterates over the rows, importing them as JSON records into the dataset collection.
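The row-to-record mapping that ImportCSV and ImportTable describe can be sketched as follows. This stand-in pairs header names with cell values and uses the id column as the record key; it is a simplification of the real methods, which also handle UUIDs, overwrites and storage.

```go
package main

import "fmt"

// tableToRecords converts a [][]string with a header row into JSON-style
// records keyed by the value in idCol, a simplified sketch of the mapping
// ImportTable performs before storing each record.
func tableToRecords(table [][]string, idCol int) map[string]map[string]interface{} {
	records := map[string]map[string]interface{}{}
	if len(table) < 2 {
		return records // nothing beyond the header row
	}
	header := table[0]
	for _, row := range table[1:] {
		rec := map[string]interface{}{}
		for i, cell := range row {
			if i < len(header) {
				rec[header[i]] = cell
			}
		}
		if idCol < len(row) {
			records[row[idCol]] = rec
		}
	}
	return records
}

func main() {
	table := [][]string{
		{"id", "title"},
		{"r1", "First record"},
		{"r2", "Second record"},
	}
	records := tableToRecords(table, 0)
	fmt.Println(records["r2"]["title"]) // prints "Second record"
}
```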

func (*Collection) Indexer

func (c *Collection) Indexer(idxName string, idxMapName string, keys []string, batchSize int) error

Indexer generates or updates a Bleve index based on the index map filename provided, a list of keys and a batch size.

func (*Collection) KeyFilter added in v0.0.33

func (c *Collection) KeyFilter(keyList []string, filterExpr string) ([]string, error)

KeyFilter takes a list of keys and a filter expression and returns the keys that pass the filter, or an error.

func (*Collection) KeySortByExpression added in v0.0.33

func (c *Collection) KeySortByExpression(keys []string, expr string) ([]string, error)

KeySortByExpression takes an array of keys and a sort expression and returns a sorted list of keys.

func (*Collection) Keys

func (c *Collection) Keys() []string

Keys returns a list of keys in a collection

func (*Collection) Length added in v0.0.6

func (c *Collection) Length() int

Length returns the number of keys in a collection

func (*Collection) Prune added in v0.0.33

func (c *Collection) Prune(name string, filterNames ...string) error

Prune removes a non-JSON document (attachment) from a JSON document in the collection.

func (*Collection) Read

func (c *Collection) Read(name string, data map[string]interface{}) error

Read finds the record in a collection and populates the provided data map; name must exist or an error is returned.

func (*Collection) ReadJSON added in v0.0.33

func (c *Collection) ReadJSON(name string) ([]byte, error)

ReadJSON finds the record in the collection and returns the JSON source.

func (*Collection) Update

func (c *Collection) Update(name string, data map[string]interface{}) error

Update replaces a JSON doc in a collection with the provided data map (note: the JSON doc must exist or an error is returned).

func (*Collection) UpdateJSON added in v0.0.33

func (c *Collection) UpdateJSON(name string, src []byte) error

UpdateJSON updates a JSON doc in a collection; it returns an error if there is a problem.

type IndexList added in v0.0.33

type IndexList struct {
	Names   []string
	Fields  []string
	Alias   bleve.IndexAlias
	Indexes []bleve.Index
}

func OpenIndexes added in v0.0.3

func OpenIndexes(indexNames []string) (*IndexList, []string, error)

OpenIndexes opens a list of index names and returns an index alias, a combined list of fields and an error if one occurs.

func (*IndexList) Close added in v0.0.33

func (idxList *IndexList) Close() error

Close removes all the indexes associated with idxList.Alias via idxList.Alias.Remove(idxList.Indexes), then closes the related indexes, returning an error if one occurs.

type KeyValue added in v0.0.7

type KeyValue struct {
	// JSON Record ID in collection
	ID string
	// The value of the field to be sorted from record
	Value interface{}
}

type KeyValues added in v0.0.7

type KeyValues []KeyValue

func (KeyValues) Len added in v0.0.7

func (a KeyValues) Len() int

func (KeyValues) Less added in v0.0.7

func (a KeyValues) Less(i, j int) bool

func (KeyValues) Swap added in v0.0.7

func (a KeyValues) Swap(i, j int)

Directories

Path Synopsis
analyzers
cmd
dataset
dataset is a command line utility to manage content stored in a dataset collection.
dsws
dsws.go - A web server/service for hosting dataset search and related static pages.
gsheets.go is a part of the dataset package written to allow import/export of records to/from dataset collections.
