dataset

Version: v0.0.51
Published: Nov 29, 2018
License: BSD-3-Clause
Imports: 56
Imported by: 1

README

dataset

dataset is a command line tool, Go package, and an experimental C shared library for working with JSON objects as collections. Collections can be stored on disc or in Cloud Storage. JSON objects are stored in collections as plain UTF-8 text. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The dataset command line tool supports common data management operations such as initializing collections and creating, reading, updating and deleting JSON objects in a collection. Its enhanced features include the ability to generate data frames and to import and export JSON objects to and from CSV files and Google Sheets.

dataset is written in the Go programming language. It can be used as a Go package by other Go based software. Go supports generating C shared libraries. By compiling the Go source you can create a libdataset C shared library. The C shared library is currently being used by the DLD Group in Caltech Library experimentally from Python 3. This approach looks promising to support other languages (e.g. Julia can easily use dataset via its ccall function, while R, Octave and NodeJS would probably need some C++ wrapping code).

See getting-started-with-dataset.md for a tour and tutorial.

Design choices

dataset isn't a database or a replacement for repository systems. It is guided by the idea that you should be able to work with text files, the JSON object documents, with standard Unix text utilities. It is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds would create a new collection called 'mycollection.ds'). It is built around a few abstractions: dataset stores JSON objects in collections, a collection is one or more folders containing the JSON object documents and any attachments, and a collection.json file describes the mapping of keys to folder locations. dataset takes minimal system resources and keeps all content, except JSON object attachments, in plain UTF-8 text. Attachments are stored using the venerable "tar" archive format.

The choice of plain UTF-8 and tar balls is intended to help future-proof the reading of dataset collections. Care has been taken to keep dataset simple and lightweight enough that it will run on a machine as small as a Raspberry Pi while being equally comfortable on a more resource-rich server or desktop environment. It should be easy to write alternative implementations in any language that has good string handling, JSON support and memory management.

Workflows

A typical library processing pattern is to write a "harvester" which stores its results in a dataset collection, then write something that transforms or aggregates the harvested objects, and then write a final rendering program to prepare the data for the web. The harvesters are typically written in Python or as simple Bash scripts storing their results in dataset. Depending on performance needs, our transform and aggregate stages are written in either Python or Go, and our final rendering stages are typically written in Python or as simple Bash scripts.

Features

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.7 using a Python package. dataset is useful for general data science applications which need intermediate JSON object management but not a full-blown database.

Limitations of dataset

dataset has many limitations. Some are listed below:

  • it is not a multi-process, multi-user data store (it's files on "disc" without locking)
  • it is not a replacement for a repository management system
  • it is not a general purpose database system
  • it does not supply version control on collections or objects

Explore dataset through A Shell Example, Getting Started with Dataset, How To guides, topics and Documentation.

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors: R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrell, <tmorrell@library.caltech.edu>

Copyright (c) 2018, Caltech. All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

bucket.go is part of the dataset package. It includes the operations needed for processing collections of JSON documents and their attachments using the bucket layout.

Package dataset provides a common approach for storing JSON object documents on local disc, on S3 and on Google Cloud Storage. It is intended as a single user system for intermediate processing of JSON content for analysis or batch processing. It is not a database management system (if you need a JSON database system I would suggest looking at CouchDB, MongoDB and Redis as a starting point).

The approach dataset takes to storing buckets is to maintain a JSON document with keys (document names) and bucket assignments. JSON documents (and possibly their attachments) are then stored based on that assignment. Conversely, the collection.json document is used to find and retrieve documents from the collection. The layout of the metadata is as follows:

+ Collection

  • Collection/collection.json - metadata for retrieval
  • Collection/[Buckets|Pairtree]

A key feature of dataset is to be Posix shell friendly. This has led to storing the JSON documents in a directory structure that standard Posix tooling can traverse. It has also meant that the JSON documents themselves remain on "disc" as plain text. This has facilitated integration with many other applications, programming languages and systems.

Attachments are non-JSON documents explicitly "attached" to a JSON document. They share the document's basename but are placed in a tar ball (e.g. the attachments for document Jane.Doe.json would be stored in Jane.Doe.tar).
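To make the tar ball idea concrete, here is a small sketch using Go's archive/tar package. It is an illustration under assumptions, not the package's code: the attachment type and the writeTar and listTar helpers are hypothetical, and the archive is kept in memory rather than written next to a JSON document.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// attachment is a hypothetical name/body pair for a file attached to a
// JSON document.
type attachment struct {
	Name string
	Body []byte
}

// writeTar packs the attachments into an in-memory tar archive.
func writeTar(items []attachment) ([]byte, error) {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	for _, item := range items {
		hdr := &tar.Header{
			Typeflag: tar.TypeReg,
			Name:     item.Name,
			Mode:     0o664,
			Size:     int64(len(item.Body)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(item.Body); err != nil {
			return nil, err
		}
	}
	// Close flushes the archive trailer; without it the tar is incomplete.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// listTar returns the file names stored in a tar archive.
func listTar(src []byte) ([]string, error) {
	names := []string{}
	tr := tar.NewReader(bytes.NewReader(src))
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
	return names, nil
}

func main() {
	src, err := writeTar([]attachment{
		{Name: "notes.txt", Body: []byte("hello")},
		{Name: "caption.txt", Body: []byte("Jane at work")},
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	names, err := listTar(src)
	fmt.Println(names, err)
}
```

An archive written this way can also be inspected from the shell with the standard tar utility, which fits the Posix-friendly goal described above.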

Additional operations besides storing and reading JSON documents are also supported. These include creating lists (arrays) of JSON documents from a list of keys, listing keys in the collection, counting documents in the collection, indexing and searching by indexes.

The primary use case driving the development of dataset is harvesting API content for library systems (e.g. EPrints, Invenio, ArchivesSpace, ORCID, CrossRef, OCLC). The harvesting needed to be done in such a way as to leverage existing Posix tooling (e.g. grep, sed, etc) for processing and analysis.

Initial use case:

Caltech Library has many repository, catalog and record management systems (e.g. EPrints, Invenio, ArchivesSpace, Islandora). It is common practice to harvest data from these systems for analysis or processing. Harvested records typically come in XML or JSON format. JSON has proven a flexible way of working with the data, and in our more modern tools it is the common format we use to move data around. We needed a way to standardize how we stored these JSON records for intermediate processing to allow us to use the growing ecosystem of JSON related tooling available under Posix/Unix compatible systems.

Index

Constants

const (
	// Version of the dataset package
	Version = `v0.0.51`

	// License is a formatted form for dataset package based command line tools
	License = `` /* 1530-byte string literal not displayed */

	// Sort directions
	ASC  = iota
	DESC = iota

	// Supported file layout types
	// Assume an unknown layout is zero, then add consts in order of adoption
	UNKNOWN_LAYOUT = iota

	// Buckets is the first file layout implemented when dataset started
	BUCKETS_LAYOUT

	// Pairtree is the preferred file layout moving forward
	PAIRTREE_LAYOUT
)
const (
	DefaultAlphabet = `abcdefghijklmnopqrstuvwxyz`
)
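A note on the numeric values in the first constant block: in Go, iota counts constant specifications from the top of the enclosing const block, not from the first line that mentions it. The sketch below mirrors the shape of that block with hypothetical lower-case names to show what values result, assuming the declaration is laid out exactly as displayed here. It suggests comparing against the named constants rather than literal numbers.

```go
package main

import "fmt"

// A stand-in with the same shape as the displayed block. The names are
// hypothetical lower-case copies; only the ordering matters.
const (
	version = "v0.0.51" // iota is 0 on this line
	license = ""        // iota is 1

	asc  = iota // 2
	desc = iota // 3

	unknownLayout  = iota // 4
	bucketsLayout         // 5 (repeats "= iota" implicitly)
	pairtreeLayout        // 6
)

func main() {
	fmt.Println(asc, desc, unknownLayout, bucketsLayout, pairtreeLayout)
	// prints: 2 3 4 5 6
}
```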

Variables

var DefaultBucketNames = []string{}/* 676 elements not displayed */

DefaultBucketNames provides the default list of bucket names, each two letters long and drawn from DefaultAlphabet (676 names in total).
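The figure of 676 elements matches 26 × 26, the number of two-letter combinations of a 26-letter alphabet. A minimal sketch of how such a list could be generated follows; the twoLetterNames helper and the ordering it produces are assumptions, since the element values themselves are not displayed.

```go
package main

import "fmt"

// alphabet has the same value as the displayed DefaultAlphabet constant.
const alphabet = "abcdefghijklmnopqrstuvwxyz"

// twoLetterNames returns every two-character combination of the letters
// in alphabet, in alphabetical order.
func twoLetterNames(alphabet string) []string {
	names := make([]string, 0, len(alphabet)*len(alphabet))
	for _, a := range alphabet {
		for _, b := range alphabet {
			names = append(names, string(a)+string(b))
		}
	}
	return names
}

func main() {
	names := twoLetterNames(alphabet)
	fmt.Println(len(names), names[0], names[len(names)-1])
	// prints: 676 aa zz
}
```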

Functions

func Analyzer added in v0.0.3

func Analyzer(collectionName string) error

Analyzer checks the collection version and either calls bucketAnalyzer or pairtreeAnalyzer as appropriate.

func CSVFormatter added in v0.0.3

func CSVFormatter(out io.Writer, results *bleve.SearchResult, colNames []string, skipHeaderRow bool) error

CSVFormatter writes out CSV representation using encoding/csv

func CollectionLayout added in v0.0.45

func CollectionLayout(p string) int

CollectionLayout returns the numeric layout type associated with the collection (e.g. BUCKETS_LAYOUT, PAIRTREE_LAYOUT).

func Deindexer added in v0.0.33

func Deindexer(idxName string, keys []string, batchSize int) error

Deindexer deletes the keys from an index. Returns an error if a problem occurs.

func Delete

func Delete(name string) error

Delete an entire collection

func Find

func Find(idxAlias bleve.IndexAlias, queryString string, options map[string]string) (*bleve.SearchResult, error)

Find takes a Bleve index alias, a query string and a map of options, runs the query against the index and returns the search results. It returns an error if there are problems.

func Formatter added in v0.0.3

func Formatter(out io.Writer, results *bleve.SearchResult, tmpl *template.Template, tName string, pageData map[string]string) error

Formatter writes out a format based on the specified template name merging any additional pageData provided

func IsCollection added in v0.0.45

func IsCollection(p string) bool

IsCollection checks to see if a given path contains a collection.json file

func JSONFormatter added in v0.0.3

func JSONFormatter(out io.Writer, results *bleve.SearchResult, prettyPrint bool) error

JSONFormatter writes out JSON representation using encoding/json

func Migrate added in v0.0.45

func Migrate(collectionName string, newLayout int) error

func Repair added in v0.0.3

func Repair(collectionName string) error

Repair takes a collection name and calls either bucketRepair or pairtreeRepair as appropriate.

Types

type Attachment

type Attachment struct {
	// Name is the filename and path to be used inside the generated tar file
	Name string `json:"name"`
	// Body is a byte array for storing the content associated with Name
	Body []byte `json:"-"`
	// Size
	Size int64 `json:"size"`
}

Attachment is a structure for holding non-JSON content you wish to store alongside a JSON document in a collection

type Collection

type Collection struct {
	// DatasetVersion of the collection
	DatasetVersion string `json:"dataset_version"`

	// Name of collection
	Name string `json:"name"`

	// Layout allows for transitioning from bucket layout to pairtree layout for collections.
	Layout int `json:"layout"`

	// Buckets is a list of bucket names used by collection (deprecated, will be removed after migration to pairtree)
	Buckets []string `json:"buckets,omitempty"`

	// KeyMap holds the document key to path in the collection
	KeyMap map[string]string `json:"keymap"`

	// Store holds the storage system information (e.g. local disc, S3, GS)
	// and related methods for interacting with it
	Store *storage.Store `json:"-"`

	// FrameMap maps frame names to the relative path of each frame defined in the collection
	FrameMap map[string]string `json:"frames"`

	// Who - creator, owner, maintainer name(s)
	Who []string `json:"who,omitempty"`
	// What - description of collection
	What string `json:"what,omitempty"`
	// When - date associated with collection (e.g. 2018, 2018-10, 2018-10-02)
	When string `json:"when,omitempty"`
	// Where - location (e.g. URL, address) of collection
	Where string `json:"where,omitempty"`
	// Version of collection being stored in semver notation
	Version string `json:"version,omitempty"`
	// Contact info
	Contact string `json:"contact,omitempty"`
	// contains filtered or unexported fields
}

Collection is the container holding buckets which in turn hold JSON docs

func InitCollection added in v0.0.8

func InitCollection(name string, layoutType int) (*Collection, error)

InitCollection - creates a new collection with default alphabet and names of length 2. NOTE: layoutType is provided to allow for future changes in the file layout of a collection.

func Open

func Open(name string) (*Collection, error)

Open reads in a collection's metadata and returns a new collection structure and an error

func (*Collection) AttachFile added in v0.0.33

func (c *Collection) AttachFile(keyName, fName string, buf io.Reader) error

AttachFile is for attaching a single non-JSON document to a dataset record. It will replace ANY existing attached content (i.e. it creates a new tarball holding only this single document). This is a limitation of our storage package, which supports a minimal set of operations across all storage environments (e.g. S3 and Google Cloud Storage do not support appending to a file, only replacement). It takes the document key, the file name and an io.Reader, reads the content in, writes it to the tar file and updates the internal _Attributes metadata as needed.

func (*Collection) AttachFiles

func (c *Collection) AttachFiles(name string, fileNames ...string) error

AttachFiles attaches non-JSON documents to a JSON document in the collection. Attachments are stored in a tar file. If the tar file exists then the attachment(s) are appended to it.

func (*Collection) Attachments

func (c *Collection) Attachments(name string) ([]string, error)

Attachments returns a list of files in the attached tarball for a given name in the collection

func (*Collection) Clone added in v0.0.39

func (c *Collection) Clone(cloneName string, keys []string, verbose bool) error

Clone copies the current collection records into a newly initialized collection given a list of keys and new collection name. Returns an error value if there is a problem. Clone does NOT copy attachments, only the JSON records.

func (*Collection) CloneSample added in v0.0.39

func (c *Collection) CloneSample(trainingCollectionName string, testCollectionName string, keys []string, sampleSize int, verbose bool) error

CloneSample takes the current collection, a sample size, a training collection name and a test collection name. The training collection will be created and receive a random sample of the records from the current collection based on the sample size provided. Sample size must be greater than zero and less than the total number of records in the current collection.

If the test collection name is not an empty string it will be created and any records not in the training collection will be cloned from the current collection into the test collection.

func (*Collection) Close

func (c *Collection) Close() error

Close closes a collection, writing the updated keys to disc

func (*Collection) Create

func (c *Collection) Create(name string, data map[string]interface{}) error

Create makes a JSON doc from a map[string]interface{} and adds it to a collection. If there is a problem it returns an error. The name must be unique. The document must be a JSON object (not an array).

func (*Collection) CreateJSON added in v0.0.33

func (c *Collection) CreateJSON(key string, src []byte) error

CreateJSON adds a JSON doc to a collection, if a problem occurs it returns an error

func (*Collection) Deindexer added in v0.0.33

func (c *Collection) Deindexer(idxName string, keys []string, batchSize int) error

Deindexer removes items from an index on a collection.

func (*Collection) Delete

func (c *Collection) Delete(name string) error

Delete removes a JSON doc from a collection

func (*Collection) DeleteFrame added in v0.0.41

func (c *Collection) DeleteFrame(name string) error

DeleteFrame removes a frame from a collection, returns an error if frame can't be deleted.

func (*Collection) DocPath

func (c *Collection) DocPath(name string) (string, error)

DocPath returns a full path to a key or an error if not found

func (*Collection) ExportCSV added in v0.0.3

func (c *Collection) ExportCSV(fp io.Writer, eout io.Writer, f *DataFrame, verboseLog bool) (int, error)

ExportCSV takes a writer and a frame, iterates over the objects generating rows, and exports them as a CSV file

func (*Collection) ExportTable added in v0.0.47

func (c *Collection) ExportTable(eout io.Writer, f *DataFrame, verboseLog bool) (int, [][]interface{}, error)

ExportTable takes a frame, iterates over the objects generating rows, and returns them as a table ([][]interface{})

func (*Collection) Frame added in v0.0.41

func (c *Collection) Frame(name string, keys []string, dotPaths []string, verbose bool) (*DataFrame, error)

Frame takes a set of collection keys and dot paths, builds a grid, and assembles the grid and metadata, returning a new DataFrame and an error. Frames are associated with the collection and can be re-generated.

func (*Collection) FrameLabels added in v0.0.41

func (c *Collection) FrameLabels(name string, labels []string) error

FrameLabels sets the labels for a frame, the number of labels must match the number of dot paths (columns) in the frame.

func (*Collection) Frames added in v0.0.41

func (c *Collection) Frames() []string

Frames retrieves a list of available frames associated with a collection

func (*Collection) GetAttachedFiles

func (c *Collection) GetAttachedFiles(name string, filterNames ...string) error

GetAttachedFiles returns an error if one is encountered; its side effect is to write the attached files to the destination directory. If no filterNames are provided then all attachments are written.

func (*Collection) Grid added in v0.0.41

func (c *Collection) Grid(keys []string, dotPaths []string, verbose bool) ([][]interface{}, error)

Grid takes a set of collection keys and builds a grid (a 2D array of cells) from the array of keys and dot paths provided
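The idea of a grid, one row per key and one cell per dot path, can be sketched over in-memory objects. The example below is an illustration only: resolve and buildGrid are hypothetical helpers, and the dot path handling is a simplified assumption (nested object properties such as ".author.name" only), not the dot path syntax implemented by dataset.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve follows a simplified dot path such as ".author.name" through
// nested maps. Arrays and any richer dot path syntax are not handled.
func resolve(obj map[string]interface{}, dotPath string) interface{} {
	var cur interface{} = obj
	for _, part := range strings.Split(strings.TrimPrefix(dotPath, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil
		}
		cur, ok = m[part]
		if !ok {
			return nil
		}
	}
	return cur
}

// buildGrid returns one row per key and one cell per dot path. Keys that
// are not present in docs produce a row of nil cells.
func buildGrid(docs map[string]map[string]interface{}, keys, dotPaths []string) [][]interface{} {
	grid := make([][]interface{}, 0, len(keys))
	for _, key := range keys {
		row := make([]interface{}, len(dotPaths))
		for i, p := range dotPaths {
			row[i] = resolve(docs[key], p)
		}
		grid = append(grid, row)
	}
	return grid
}

func main() {
	docs := map[string]map[string]interface{}{
		"k1": {"title": "Moby Dick", "author": map[string]interface{}{"name": "Melville"}},
		"k2": {"title": "Walden"},
	}
	grid := buildGrid(docs, []string{"k1", "k2"}, []string{".title", ".author.name"})
	for _, row := range grid {
		fmt.Println(row...)
	}
}
```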

func (*Collection) HasFrame added in v0.0.47

func (c *Collection) HasFrame(name string) bool

HasFrame checks to see if a frame is already defined.

func (*Collection) HasKey added in v0.0.3

func (c *Collection) HasKey(key string) bool

HasKey returns true if key is in collection's KeyMap, false otherwise

func (*Collection) ImportCSV added in v0.0.3

func (c *Collection) ImportCSV(buf io.Reader, idCol int, skipHeaderRow bool, overwrite bool, verboseLog bool) (int, error)

ImportCSV takes a reader, iterates over the rows, and imports them as JSON records into dataset. BUG: returns the number of lines processed; it should probably return the number of rows imported

func (*Collection) ImportTable added in v0.0.4

func (c *Collection) ImportTable(table [][]interface{}, idCol int, useHeaderRow bool, overwrite, verboseLog bool) (int, error)

ImportTable takes a [][]interface{}, iterates over the rows, and imports them as JSON records into dataset.
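A sketch of the row-to-record mapping, under stated assumptions: the first row is treated as a header supplying attribute names, idCol is a zero-based column index, and short rows without a key cell are skipped. The helper tableToObjects is hypothetical; none of these details are confirmed by the dataset documentation above.

```go
package main

import "fmt"

// tableToObjects is a hypothetical helper (not part of dataset). It turns
// each data row into an object keyed by the value found in column idCol.
func tableToObjects(table [][]interface{}, idCol int) (map[string]map[string]interface{}, error) {
	if len(table) == 0 {
		return nil, fmt.Errorf("empty table")
	}
	header := table[0]
	if idCol < 0 || idCol >= len(header) {
		return nil, fmt.Errorf("id column %d out of range", idCol)
	}
	objects := make(map[string]map[string]interface{})
	for _, row := range table[1:] {
		if idCol >= len(row) {
			continue // skip short rows that have no key cell
		}
		key := fmt.Sprint(row[idCol])
		obj := make(map[string]interface{})
		for i, cell := range row {
			// Cells beyond the header have no attribute name and are dropped.
			if i < len(header) {
				obj[fmt.Sprint(header[i])] = cell
			}
		}
		objects[key] = obj
	}
	return objects, nil
}

func main() {
	table := [][]interface{}{
		{"id", "title"},
		{"k1", "Moby Dick"},
		{"k2", "Walden"},
	}
	objs, err := tableToObjects(table, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(objs), objs["k1"]["title"])
}
```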

func (*Collection) Indexer

func (c *Collection) Indexer(idxName string, idxMapName string, keys []string, batchSize int) error

Indexer generates or updates a Bleve index based on the index map filename provided, a list of keys and batch size.

func (*Collection) Join added in v0.0.47

func (c *Collection) Join(key string, obj map[string]interface{}, overwrite bool) error

Join takes a key, a map[string]interface{} and an overwrite bool, and merges the map with an existing JSON object in the collection. BUG: This is a naive join, it assumes the keys in object are top level properties.
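What a naive, top-level join means can be shown with plain maps. This is a sketch of the general technique, not the dataset implementation: the function join below is hypothetical, and it assumes that overwrite=false keeps existing values while still adding new properties.

```go
package main

import "fmt"

// join is a hypothetical helper (not part of dataset). It merges update
// into a copy of existing, looking only at top-level properties. A nested
// object in update replaces the whole nested object in existing.
func join(existing, update map[string]interface{}, overwrite bool) map[string]interface{} {
	merged := make(map[string]interface{}, len(existing)+len(update))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range update {
		// Assumption: without overwrite, existing values win.
		if _, found := merged[k]; found && !overwrite {
			continue
		}
		merged[k] = v
	}
	return merged
}

func main() {
	existing := map[string]interface{}{"title": "Old title", "year": 1851}
	update := map[string]interface{}{"title": "New title", "pages": 635}
	fmt.Println(join(existing, update, false))
	fmt.Println(join(existing, update, true))
}
```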

func (*Collection) KeyFilter added in v0.0.33

func (c *Collection) KeyFilter(keyList []string, filterExpr string) ([]string, error)

KeyFilter takes a list of keys and filter expression and returns the list of keys passing through the filter or an error

func (*Collection) KeySortByExpression added in v0.0.33

func (c *Collection) KeySortByExpression(keys []string, expr string) ([]string, error)

KeySortByExpression takes an array of keys and a sort expression and returns a sorted list of keys.

func (*Collection) Keys

func (c *Collection) Keys() []string

Keys returns a list of keys in a collection

func (*Collection) Length added in v0.0.6

func (c *Collection) Length() int

Length returns the number of keys in a collection

func (*Collection) MergeFromTable added in v0.0.47

func (c *Collection) MergeFromTable(frameName string, table [][]interface{}, overwrite bool, verbose bool) error

MergeFromTable uses a DataFrame associated with the collection to map columns from a table into JSON object attributes, saving the JSON object in the collection. If overwrite is true then JSON objects for matching keys will be updated; if false, only new objects will be added to the collection. Returns an error value.

func (*Collection) MergeIntoTable added in v0.0.47

func (c *Collection) MergeIntoTable(frameName string, table [][]interface{}, overwrite bool, verbose bool) ([][]interface{}, error)

MergeIntoTable uses a DataFrame associated with the collection to map attributes into a table, appending new content and optionally overwriting existing content for rows with matching ids. Returns a new table (i.e. [][]interface{}) or error.

func (*Collection) Prune added in v0.0.33

func (c *Collection) Prune(name string, filterNames ...string) error

Prune removes a non-JSON document (i.e. an attached file) from a JSON document in the collection.

func (*Collection) Read

func (c *Collection) Read(name string, data map[string]interface{}) error

Read finds the record in a collection and updates the data interface provided. The name must exist in the collection; otherwise, or if another problem occurs, an error is returned.

func (*Collection) ReadJSON added in v0.0.33

func (c *Collection) ReadJSON(name string) ([]byte, error)

ReadJSON finds the record in the collection and returns the JSON source

func (*Collection) Reframe added in v0.0.41

func (c *Collection) Reframe(name string, keys []string, verbose bool) error

Reframe will re-generate contents of a frame based on the current records in a collection. If a list of keys is supplied then the regenerated frame will be based on the new set of keys provided

func (*Collection) SaveFrame added in v0.0.47

func (c *Collection) SaveFrame(name string, f *DataFrame) error

SaveFrame saves a frame in a collection or returns an error

func (*Collection) SaveMetadata added in v0.0.48

func (c *Collection) SaveMetadata() error

SaveMetadata writes the collection's metadata to c.Store and c.workPath

func (*Collection) Update

func (c *Collection) Update(name string, data map[string]interface{}) error

Update replaces a JSON doc in a collection with the provided data interface (note: the JSON doc must exist or an error is returned)

func (*Collection) UpdateJSON added in v0.0.33

func (c *Collection) UpdateJSON(name string, src []byte) error

UpdateJSON updates a JSON doc in a collection, returns an error if there is a problem

type DataFrame added in v0.0.41

type DataFrame struct {
	// Explicit at creation
	Name           string   `json:"frame_name"`
	CollectionName string   `json:"collection_name"`
	DotPaths       []string `json:"dot_paths"`
	// NOTE: Keys should hold the same values as column zero of the grid.
	// Keys controls the order of rows in a grid when reframing.
	Keys    []string        `json:"keys"`
	Grid    [][]interface{} `json:"grid"`
	Created time.Time       `json:"created"`
	Updated time.Time       `json:"updated,omitempty"`

	// NOTE: these values affect how Reframe works
	AllKeys    bool   `json:"use_all_keys"`
	FilterExpr string `json:"filter_expr,omitempty"`
	SortExpr   string `json:"sort_expr,omitempty"`
	SampleSize int    `json:"sample_size"`

	// Derived or explicitly set after creation
	Labels []string `json:"labels,omitempty"`
}

func (*DataFrame) String added in v0.0.41

func (f *DataFrame) String() string

String renders the data structure DataFrame as JSON to a string
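The struct tags on DataFrame determine the attribute names in the rendered JSON. The sketch below uses a trimmed, local copy of a few DataFrame fields (the type frame and the function render are hypothetical stand-ins) to show how encoding/json applies those tags, including omitempty. Whether the real String method indents its output is not stated here, so compact output is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// frame is a trimmed, local stand-in for DataFrame that reuses a few of
// its field names and struct tags for illustration.
type frame struct {
	Name       string   `json:"frame_name"`
	DotPaths   []string `json:"dot_paths"`
	FilterExpr string   `json:"filter_expr,omitempty"`
	Labels     []string `json:"labels,omitempty"`
}

// render marshals the frame to compact JSON and returns it as a string.
// An empty string is returned if marshaling fails.
func render(f *frame) string {
	src, err := json.Marshal(f)
	if err != nil {
		return ""
	}
	return string(src)
}

func main() {
	f := &frame{Name: "titles", DotPaths: []string{".title"}}
	// FilterExpr and Labels are empty, so omitempty drops them.
	fmt.Println(render(f))
}
```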

type IndexList added in v0.0.33

type IndexList struct {
	Names   []string
	Fields  []string
	Alias   bleve.IndexAlias
	Indexes []bleve.Index
}

func OpenIndexes added in v0.0.3

func OpenIndexes(indexNames []string) (*IndexList, []string, error)

OpenIndexes opens a list of index names and returns an IndexList (which holds the index alias), a combined list of fields and error

func (*IndexList) Close added in v0.0.33

func (idxList *IndexList) Close() error

Close removes all the indexes in the list from the associated alias (i.e. idx.Alias.Remove(idx.Indexes)), then closes the related indexes, returning an error if one occurs

type KeyValue added in v0.0.7

type KeyValue struct {
	// JSON Record ID in collection
	ID string
	// The value of the field to be sorted from record
	Value interface{}
}

type KeyValues added in v0.0.7

type KeyValues []KeyValue

func (KeyValues) Len added in v0.0.7

func (a KeyValues) Len() int

func (KeyValues) Less added in v0.0.7

func (a KeyValues) Less(i, j int) bool

func (KeyValues) Swap added in v0.0.7

func (a KeyValues) Swap(i, j int)
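Len, Less and Swap together satisfy sort.Interface, so a KeyValues slice can be passed to sort.Sort. The sketch below declares local copies of the two types to stay self-contained. The Less shown here compares the string form of each Value; that comparison rule is an assumption for illustration, since the package's actual ordering is not documented above. The helper sortedIDs is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// KeyValue and KeyValues are local copies of the documented types.
type KeyValue struct {
	ID    string
	Value interface{}
}

type KeyValues []KeyValue

func (a KeyValues) Len() int      { return len(a) }
func (a KeyValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }

// Less orders by the string form of Value (an assumption for this sketch).
func (a KeyValues) Less(i, j int) bool {
	return fmt.Sprint(a[i].Value) < fmt.Sprint(a[j].Value)
}

// sortedIDs sorts kvs in place by Value and returns the record IDs in
// the resulting order.
func sortedIDs(kvs KeyValues) []string {
	sort.Sort(kvs)
	ids := make([]string, len(kvs))
	for i, kv := range kvs {
		ids[i] = kv.ID
	}
	return ids
}

func main() {
	kvs := KeyValues{
		{ID: "b", Value: "pear"},
		{ID: "a", Value: "zebra"},
		{ID: "c", Value: "apple"},
	}
	fmt.Println(sortedIDs(kvs))
}
```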

Directories

Path Synopsis
analyzers
cmd
dataset
dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections on disc, in an S3 bucket or in Cloud Storage @Author R. S. Doiel, <rsdoiel@library.caltech.edu> Copyright (c) 2018, Caltech All rights not granted herein are expressly reserved by Caltech.
gsheets.go is a part of the dataset package written to allow import/export of records to/from dataset collections.
tbl.go provides some utility functions to move one and two dimensional string slices into/out of one and two dimensional slices.
