The highest tagged major version is v2.

dataset

Version: v0.0.51 (latest). Published: Nov 29, 2018. License: BSD-3-Clause. Imports: 56. Imported by: 1.

Repository

github.com/caltechlibrary/dataset

README

dataset

dataset is a command line tool, Go package, and an experimental C shared library for working with JSON objects as collections. Collections can be stored on disc or in Cloud Storage. JSON objects are stored in collections as plain UTF-8 text. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The dataset command line tool supports common data management operations such as initializing collections and creating, reading, updating and deleting JSON objects in a collection. Its enhanced features include the ability to generate data frames and to import and export JSON objects to and from CSV files and Google Sheets.
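To make the CSV export idea concrete, here is a minimal sketch in Go that uses only the standard library. It does not use the dataset package's own API; the function name toCSV and the sample objects are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// toCSV is a hypothetical helper, not part of the dataset package.
// It writes one header row plus one row per object. The columns
// argument fixes the column order; a field missing from an object
// becomes an empty cell.
func toCSV(objects []map[string]interface{}, columns []string) (string, error) {
	buf := new(bytes.Buffer)
	w := csv.NewWriter(buf)
	if err := w.Write(columns); err != nil {
		return "", err
	}
	for _, obj := range objects {
		row := make([]string, len(columns))
		for i, col := range columns {
			if val, ok := obj[col]; ok {
				row[i] = fmt.Sprint(val)
			}
		}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	// Sample objects invented for this example.
	objects := []map[string]interface{}{
		{"id": "a1", "title": "Hello, world", "year": 2018},
		{"id": "b2", "title": "Untitled"},
	}
	out, err := toCSV(objects, []string{"id", "title", "year"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The sketch only handles top-level fields. Nested values would need to be flattened first, for example by addressing them with a path into the object.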

dataset is written in the Go programming language. It can be used as a Go package by other Go-based software. Go supports generating C shared libraries, so by compiling the Go source you can create a libdataset C shared library. The DLD Group at Caltech Library is currently using the C shared library experimentally from Python 3. This approach looks promising for supporting other languages (e.g. Julia can easily use dataset via its ccall function, while R, Octave and NodeJS would probably need some C++ wrapping code).

See getting-started-with-dataset.md for a tour and tutorial.

Design choices

dataset isn't a database or a replacement for repository systems. It is guided by the idea that you should be able to work with text files, the JSON object documents, using standard Unix text utilities. It is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds would create a new collection called 'mycollection.ds'). It is built around a few abstractions: dataset stores JSON objects in collections, a collection is one or more folders containing the JSON object documents and any attachments, and a collections.json file describes the mapping of keys to folder locations. dataset takes minimal system resources and keeps all content, except JSON object attachments, in plain UTF-8 text. Attachments are stored using the venerable "tar" archive format.
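The abstractions described above (a folder of JSON documents plus a file mapping keys to locations) can be sketched with the Go standard library alone. This is an illustration of the idea, not dataset's actual on-disk layout or API: the file name keymap.json, the objects subfolder and every function name below are assumptions. It needs Go 1.16 or later for os.WriteFile and os.MkdirTemp.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical names; dataset's real layout may differ.
const (
	keymapName = "keymap.json" // maps keys to document locations
	objectsDir = "objects"     // subfolder holding the JSON documents
)

// writeJSON saves v as indented UTF-8 JSON text.
func writeJSON(name string, v interface{}) error {
	src, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(name, src, 0o664)
}

// readJSON decodes the JSON document stored in name into v.
func readJSON(name string, v interface{}) error {
	src, err := os.ReadFile(name)
	if err != nil {
		return err
	}
	return json.Unmarshal(src, v)
}

// initCollection creates the folders and an empty key map.
func initCollection(dir string) error {
	if err := os.MkdirAll(filepath.Join(dir, objectsDir), 0o775); err != nil {
		return err
	}
	return writeJSON(filepath.Join(dir, keymapName), map[string]string{})
}

// createObject stores obj as objects/<key>.json and records the
// location in the key map. Keys are assumed to be safe file names.
func createObject(dir, key string, obj map[string]interface{}) error {
	keymap := map[string]string{}
	if err := readJSON(filepath.Join(dir, keymapName), &keymap); err != nil {
		return err
	}
	if _, exists := keymap[key]; exists {
		return fmt.Errorf("key %q already exists", key)
	}
	keymap[key] = filepath.Join(objectsDir, key+".json")
	if err := writeJSON(filepath.Join(dir, keymap[key]), obj); err != nil {
		return err
	}
	return writeJSON(filepath.Join(dir, keymapName), keymap)
}

// readObject finds the key in the key map and decodes its document.
func readObject(dir, key string) (map[string]interface{}, error) {
	keymap := map[string]string{}
	if err := readJSON(filepath.Join(dir, keymapName), &keymap); err != nil {
		return nil, err
	}
	rel, ok := keymap[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	obj := map[string]interface{}{}
	err := readJSON(filepath.Join(dir, rel), &obj)
	return obj, err
}

// demo runs a create/read round trip in a throwaway temporary folder.
func demo() (map[string]interface{}, error) {
	dir, err := os.MkdirTemp("", "mycollection.ds")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := initCollection(dir); err != nil {
		return nil, err
	}
	obj := map[string]interface{}{"name": "Freda", "year": 2018}
	if err := createObject(dir, "freda", obj); err != nil {
		return nil, err
	}
	return readObject(dir, "freda")
}

func main() {
	obj, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(obj["name"])
}
```

Because every document is an ordinary text file, tools such as cat or grep can read the stored objects directly, which is the property the design aims for. Like dataset itself, the sketch does no locking, so it is only safe for a single process.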

The choice of plain UTF-8 and tarballs is intended to help future-proof the reading of dataset collections. Care has been taken to keep dataset simple and lightweight enough that it will run on a machine as small as a Raspberry Pi while being equally comfortable on a more resource-rich server or desktop environment. It should be easy to write alternative implementations in any language that has good string handling, JSON support and memory management.

Workflows

A typical library processing pattern is to write a "harvester" which stores its results in a dataset collection, then write something that transforms or aggregates the harvested objects, and finally write a rendering program to prepare the data for the web. The harvesters are typically written in Python or as simple Bash scripts storing their results in dataset. Depending on performance needs, our transform and aggregate stages are written in either Python or Go, and our final rendering stages are typically written in Python or as simple Bash scripts.
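The three stages can be sketched as plain Go functions. This is a toy illustration of the harvest, aggregate and render pattern with in-memory data; the record type, the sample titles and all function names are invented for the example. A real pipeline would store and read the intermediate results in a dataset collection.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// record is a hypothetical harvested item.
type record struct {
	Title string
	Year  int
}

// harvest stands in for a harvester. A real one would fetch records
// from a remote source and store them in a collection.
func harvest() []record {
	return []record{
		{"On Rivers", 2017},
		{"On Lakes", 2018},
		{"On Seas", 2018},
	}
}

// aggregate groups the harvested titles by year.
func aggregate(records []record) map[int][]string {
	byYear := map[int][]string{}
	for _, r := range records {
		byYear[r.Year] = append(byYear[r.Year], r.Title)
	}
	return byYear
}

// render produces a plain text listing, newest year first. A real
// rendering stage would more likely emit HTML for the web.
func render(byYear map[int][]string) string {
	years := make([]int, 0, len(byYear))
	for y := range byYear {
		years = append(years, y)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(years)))
	var b strings.Builder
	for _, y := range years {
		fmt.Fprintf(&b, "%d: %s\n", y, strings.Join(byYear[y], "; "))
	}
	return b.String()
}

func main() {
	fmt.Print(render(aggregate(harvest())))
}
```

Keeping each stage a separate step means any one of them can be rewritten in another language, as the workflow above describes, as long as the intermediate results stay in a shared collection.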

Features

dataset supports

Basic storage actions (create, read, update and delete)
Listing of collection keys (including filtering and sorting)
Import/export of CSV files and Google Sheets
An experimental full text search interface based on Blevesearch
The ability to reshape data by performing simple object joins
The ability to create data grids and frames from collections based on key lists and dot paths into the stored JSON objects

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.7 using a Python package. dataset is useful for general data science applications which need intermediate JSON object management but not a full-blown database.
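As a rough illustration of how a grid can be built from a key list and dot paths, here is a standard-library sketch in Go. It is not the dataset package's implementation, and dataset's own dot path syntax may differ; the function names dotPath and grid are assumptions, and this version only walks nested objects (no array indexes).

```go
package main

import (
	"fmt"
	"strings"
)

// dotPath is a hypothetical helper that walks nested JSON objects
// following a path such as ".author.family". A leading dot is
// optional. It reports false when any step of the path is missing.
func dotPath(obj map[string]interface{}, path string) (interface{}, bool) {
	var cur interface{} = obj
	for _, name := range strings.Split(strings.TrimPrefix(path, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		if cur, ok = m[name]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// grid builds one row per key and one column per dot path. Cells
// stay nil when the key or the path is missing.
func grid(collection map[string]map[string]interface{}, keys, paths []string) [][]interface{} {
	rows := make([][]interface{}, 0, len(keys))
	for _, key := range keys {
		row := make([]interface{}, len(paths))
		if obj, ok := collection[key]; ok {
			for i, p := range paths {
				if val, found := dotPath(obj, p); found {
					row[i] = val
				}
			}
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	// An in-memory stand-in for a collection, invented for this example.
	collection := map[string]map[string]interface{}{
		"k1": {"title": "First", "author": map[string]interface{}{"family": "Doe"}},
		"k2": {"title": "Second"},
	}
	paths := []string{".title", ".author.family"}
	for _, row := range grid(collection, []string{"k1", "k2"}, paths) {
		fmt.Println(row...)
	}
}
```

Passing the key list explicitly keeps the row order under the caller's control, which matters because Go maps have no fixed iteration order.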

Limitations of dataset

dataset has many limitations. Some are listed below:

it is not a multi-process, multi-user data store (it's files on "disc" without locking)
it is not a replacement for a repository management system
it is not a general purpose database system
it does not supply version control on collections or objects

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors: R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrell, <tmorrell@library.caltech.edu>

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

bucket.go is part of the dataset package. It includes the operations needed for processing collections of JSON documents and their attachments using the bucket layout.