dataset

package module
v0.0.32-dev
Published: Mar 8, 2018 License: BSD-3-Clause Imports: 52 Imported by: 1

README

dataset

dataset is a command line tool for working with JSON (object) documents stored as collections. This supports basic storage actions (e.g. CRUD operations, filtering and extraction) as well as indexing and searching. A project goal of dataset is to "play nice" with shell scripts and other Unix tools (e.g. it respects standard in, out and error with minimal side effects). This means it is easily scriptable via Bash, Posix shell or interpreted languages like R.

dataset includes an implementation as a Python3 module. The same functionality as in the command line tool is replicated for Python3.

Finally, dataset is a Go package for managing JSON documents and their attachments on disc or in cloud storage (e.g. Amazon S3, Google Cloud Storage). The command line utilities exercise this package extensively.

The inspiration for creating dataset was the desire to process metadata as JSON document collections using Unix shell utilities and pipelines. While it has grown in capabilities, that remains a core use case.

dataset organizes JSON documents by unique names in collections. Collections are represented as an index into a series of buckets. The buckets are subdirectories (or paths under cloud storage services). Buckets hold individual JSON documents and their attachments. The JSON document is assigned automatically to a bucket (and the bucket generated if necessary) when it is added to a collection. Assigning documents to buckets avoids having too many documents assigned to a single path (e.g. on some Unix systems there is a limit to how many documents are held in a single directory). In addition to using the dataset command you can list and manipulate the JSON documents directly with common Unix commands like ls, find, grep or their cloud counterparts.
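As a sketch of the bucket-assignment idea, any deterministic mapping from key to bucket will do. The helper below is purely illustrative (the dataset package itself records each key's bucket in collection.json, and its actual assignment scheme may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBucket deterministically maps a document key to one of the
// collection's bucket names. NOTE: this is an illustrative sketch,
// not the dataset package's algorithm; the package records each
// key's bucket assignment in collection.json.
func pickBucket(key string, buckets []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return buckets[int(h.Sum32())%len(buckets)]
}

func main() {
	buckets := []string{"aa", "ab", "ba", "bb"}
	// The same key always resolves to the same bucket.
	fmt.Println(pickBucket("freda", buckets))
}
```

Because the mapping is deterministic, a key resolves to the same bucket every time; dataset additionally keeps the keymap on disc so lookups don't depend on recomputing anything.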

See getting-started-with-dataset.md for a tour of functionality.

Limitations of dataset

dataset has many limitations; some are listed below:

  • it is not a multi-process, multi-user data store (it's just files on disc)
  • it is not a repository management system
  • it is not a general purpose multiuser database system

Operations

The basic operations supported by dataset are listed below, organized by collection and JSON document level.

Collection Level
  • init creates a collection
  • import-csv imports JSON documents from rows of a CSV file
  • import-gsheet imports JSON documents from rows of a Google Sheet
  • export-csv exports JSON documents from a collection into a CSV file
  • export-gsheet exports JSON documents from a collection into a Google Sheet
  • keys list keys of JSON documents in a collection, supports filtering and sorting
  • haskey returns true if key is found in collection, false otherwise
  • count returns the number of documents in a collection, supports filtering for subsets
  • extract lists unique JSON attribute values from a collection
JSON Document level
  • create a JSON document in a collection
  • read back a JSON document in a collection
  • update a JSON document in a collection
  • delete a JSON document in a collection
  • join a JSON document with a document in a collection
  • list lists JSON records as an array for the supplied keys
  • path list the file path for a JSON document in a collection
JSON Document Attachments
  • attach a file to a JSON document in a collection
  • attachments lists the files attached to a JSON document in a collection
  • detach retrieve an attached file associated with a JSON document in a collection
  • prune delete one or more attached files of a JSON document in a collection
  • indexer indexes JSON documents in a collection for searching with find
  • deindexer de-indexes (removes) JSON documents from an index
  • find provides an index based full text search interface for collections

Example

Common operations using the dataset command line tool

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document
    # Create a collection "mystuff.ds", the ".ds" lets the bin/dataset command know that's the collection to use. 
    bin/dataset mystuff.ds init
    # if successful then you should see an OK otherwise an error message

    # Create a JSON document 
    bin/dataset mystuff.ds create freda '{"name":"freda","email":"freda@inverness.example.org"}'
    # If successful then you should see an OK otherwise an error message

    # Read a JSON document
    bin/dataset mystuff.ds read freda
    
    # Path to JSON document
    bin/dataset mystuff.ds path freda

    # Update a JSON document
    bin/dataset mystuff.ds update freda '{"name":"freda","email":"freda@zbs.example.org", "count": 2}'
    # If successful then you should see an OK or an error message

    # List the keys in the collection
    bin/dataset mystuff.ds keys

    # Get keys filtered for the name "freda"
    bin/dataset mystuff.ds keys '(eq .name "freda")'

    # Join freda-profile.json with "freda" adding unique key/value pairs
    bin/dataset mystuff.ds join append freda freda-profile.json

    # Join freda-profile.json overwriting common key/values and adding unique key/value pairs
    # from freda-profile.json
    bin/dataset mystuff.ds join overwrite freda freda-profile.json

    # Delete a JSON document
    bin/dataset mystuff.ds delete freda

    # Import data from a CSV file using column 1 as key
    bin/dataset -quiet -nl=false mystuff.ds import-csv my-data.csv 1

    # To remove the collection just use the Unix shell command
    rm -fR mystuff.ds

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrel, <tmorrell@library.caltech.edu>

Copyright (c) 2018, Caltech All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Package dataset provides a common approach for storing JSON object documents on local disc or on S3 and Google Cloud Storage. It is intended as a single-user system for intermediate processing of JSON content for analysis or batch processing. It is not a database management system (if you need a JSON database system I would suggest looking at CouchDB, MongoDB or Redis as a starting point).

The approach dataset takes is to maintain a JSON document (collection.json) that maps keys (document names) to bucket assignments. JSON documents (and possibly their attachments) are then stored based on that assignment. Conversely, the collection.json document is used to find and retrieve documents from the collection. The layout of the metadata is as follows:

+ Collection

  • Collection/collection.json - metadata for retrieval
  • Collection/[Buckets] - usually an "aa" to "zz" list of buckets
  • Collection/[Bucket]/[Document]

A key feature of dataset is to be Posix shell friendly. This has led to storing the JSON documents in a directory structure that standard Posix tooling can traverse. It has also meant that the JSON documents themselves remain on "disc" as plain text. This has facilitated integration with many other applications, programming languages and systems.

Attachments are non-JSON documents explicitly "attached" to a JSON document; they share the same basename but are placed in a tar ball (e.g. document Jane.Doe.json's attachments would be stored in Jane.Doe.tar).
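As an illustration of this naming scheme, the sketch below derives the tar name from a document name and writes a single attachment into an in-memory tar archive using only the standard library. attachTarName and writeAttachment are hypothetical helpers, not part of the dataset package:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"strings"
)

// attachTarName derives the tarball name that would hold a JSON
// document's attachments, e.g. "Jane.Doe.json" -> "Jane.Doe.tar".
func attachTarName(docName string) string {
	return strings.TrimSuffix(docName, ".json") + ".tar"
}

// writeAttachment writes a single named file into an in-memory tar
// archive. The real package writes the tar file next to the JSON
// document (or to cloud storage).
func writeAttachment(name string, body []byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(body); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	fmt.Println(attachTarName("Jane.Doe.json"))
	data, err := writeAttachment("helloworld.txt", []byte("Hello World!!!!"))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data) > 0)
}
```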

Additional operations beside storing and reading JSON documents are also supported. These include creating lists (arrays) of JSON documents from a list of keys, listing keys in the collection, counting documents in the collection, indexing and searching by indexes.

The primary use case driving the development of dataset is harvesting API content for library systems (e.g. EPrints, Invenio, ArchivesSpace, ORCID, CrossRef, OCLC). The harvesting needed to be done in such a way as to leverage existing Posix tooling (e.g. grep, sed, etc) for processing and analysis.

Initial use case:

Caltech Library has many repository, catalog and record management systems (e.g. EPrints, Invenio, ArchivesSpace, Islandora). It is common practice to harvest data from these systems for analysis or processing. Harvested records typically come in XML or JSON format. JSON has proven a flexible way of working with the data and, in our more modern tools, is the common format we use to move data around. We needed a way to standardize how we stored these JSON records for intermediate processing to allow us to use the growing ecosystem of JSON related tooling available under Posix/Unix compatible systems.

Approach to file system layout

+ /dataset (directory on file system)

  • collection (directory on file system)
  • collection.json - metadata about collection
  • maps the filename of the JSON blob stored to a bucket in the collection
  • e.g. file "mydocs.json" stored in bucket "aa" would have a map of {"mydocs.json": "aa"}
  • keys.json - a list of keys in the collection (it is the default select list)
  • BUCKETS - a sequence of alphabet names for buckets holding JSON documents and their attachments
  • Buckets keep common commands like ls, tree, etc. usable when the document count is high
  • SELECT_LIST.json - a JSON document holding an array of keys
  • the default select list is "keys", it is not mutable by Push, Pop, Shift and Unshift
  • select lists cannot be named "keys" or "collection"

BUCKET names carry no meaning and normally use alphabetic characters. A dataset defined with four buckets might look like aa, ab, ba, bb. These directories contain JSON documents and a tar file if the document has attachments.
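The bucket name list can be derived from an alphabet by simple combination. The sketch below mirrors what the defaults imply (26 letters, names of length 2, hence the 676 entries of DefaultBucketNames); generateBucketNames is an illustrative function, not the package's API:

```go
package main

import "fmt"

// generateBucketNames builds every name of the given length from the
// alphabet, e.g. length 2 over a-z yields "aa" through "zz"
// (26*26 = 676 names, matching the size of DefaultBucketNames).
func generateBucketNames(alphabet string, length int) []string {
	names := []string{""}
	for i := 0; i < length; i++ {
		var next []string
		for _, prefix := range names {
			for _, c := range alphabet {
				next = append(next, prefix+string(c))
			}
		}
		names = next
	}
	return names
}

func main() {
	names := generateBucketNames("abcdefghijklmnopqrstuvwxyz", 2)
	fmt.Println(len(names), names[0], names[len(names)-1]) // 676 aa zz
}
```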

Operations

+ Collection level

  • InitCollection (collection) - creates or opens collection structure on disc, creates collection.json and keys.json if new
  • Open (collection) - opens an existing collection and reads collection.json into memory
  • Close (collection) - writes changes to collection.json to disc if dirty
  • Keys (collection) - list of keys in the collection

+ JSON document level

  • Create (JSON document) - saves a new JSON blob or overwrites an existing one on disc with the given blob name, updates keys.json if needed
  • Read (JSON document) - finds the JSON document in the buckets and returns the JSON document contents
  • Update (JSON document) - updates an existing blob on disc (record must already exist)
  • Delete (JSON document) - removes a JSON blob from disc
  • Path (JSON document) - returns the path to the JSON document

+ Select list level

  • Count (select list) - returns the number of keys in a select list

Example

Common operations using the *dataset* command line tool

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document

Example Bash script usage

# Create a collection "mystuff.ds" inside the directory called demo
dataset init mystuff.ds
# if successful an expression to export the collection name is shown
export DATASET="mystuff.ds"

# Create a JSON document
dataset create freda.json '{"name":"freda","email":"freda@inverness.example.org"}'
# If successful then you should see an OK or an error message

# Read a JSON document
dataset read freda.json

# Path to JSON document
dataset path freda.json

# Update a JSON document
dataset update freda.json '{"name":"freda","email":"freda@zbs.example.org"}'
# If successful then you should see an OK or an error message

# List the keys in the collection
dataset keys

# Delete a JSON document
dataset delete freda.json

# To remove the collection just use the Unix shell command
# /bin/rm -fR mystuff.ds

Common operations shown in Golang

  • create collection
  • add a JSON document to a collection
  • read a JSON document
  • update a JSON document
  • delete a JSON document

Example Go code

// Create a collection "mystuff" inside the directory called demo
collection, err := dataset.InitCollection("mystuff.ds")
if err != nil {
    log.Fatalf("%s", err)
}
defer collection.Close()
// Create a JSON document
docName := "freda.json"
document := map[string]interface{}{"name": "freda", "email": "freda@inverness.example.org"}
if err := collection.Create(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Attach an image file to freda.json in the collection
if buf, err := ioutil.ReadFile("images/freda.png"); err != nil {
    log.Fatalf("%s", err)
} else if err := collection.Attach(docName, "images/freda.png", buf); err != nil {
    log.Fatalf("%s", err)
}
// Read a JSON document
if err := collection.Read(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Update a JSON document
document["email"] = "freda@zbs.example.org"
if err := collection.Update(docName, document); err != nil {
    log.Fatalf("%s", err)
}
// Delete a JSON document
if err := collection.Delete(docName); err != nil {
    log.Fatalf("%s", err)
}

Working with attachments in Go

    collection, err := dataset.Open("dataset/mystuff")
    if err != nil {
        log.Fatalf("%s", err)
    }
    defer collection.Close()

	// Add a helloworld.txt file to freda.json record as an attachment.
    if err := collection.Attach("freda", "docs/helloworld.txt", []byte("Hello World!!!!")); err != nil {
        log.Fatalf("%s", err)
    }

	// Attach additional files from the filesystem by their relative file paths
	if err := collection.AttachFiles("freda", "docs/presentation-article.pdf", "docs/charts-and-figures.zip", "docs/transcript.fdx"); err != nil {
        log.Fatalf("%s", err)
	}

	// List the attached files for freda.json
	if filenames, err := collection.Attachments("freda"); err != nil {
        log.Fatalf("%s", err)
	} else {
		fmt.Printf("%s\n", strings.Join(filenames, "\n"))
	}

	// Get an array of attachments (reads in content into memory as an array of Attachment Structs)
	allAttachments, err := collection.GetAttached("freda")
	if err != nil {
        log.Fatalf("%s", err)
	}
	fmt.Printf("all attachments: %+v\n", allAttachments)

	// Get two attachments docs/transcript.fdx, docs/helloworld.txt
	twoAttachments, _ := collection.GetAttached("freda", "docs/transcript.fdx", "docs/helloworld.txt")
	fmt.Printf("two attachments: %+v\n", twoAttachments)

    // Get attached files writing them out to disc relative to your working directory
	if err := collection.GetAttachedFiles("freda"); err != nil {
        log.Fatalf("%s", err)
	}

	// Get two selected attached files, writing them out to disc relative to your working directory
	if err := collection.GetAttachedFiles("freda", "docs/transcript.fdx", "docs/helloworld.txt"); err != nil {
        log.Fatalf("%s", err)
	}

    // Remove docs/transcript.fdx and docs/helloworld.txt from freda.json attachments
	if err := collection.Detach("freda", "docs/transcript.fdx", "docs/helloworld.txt"); err != nil {
        log.Fatalf("%s", err)
	}

	// Remove all attached files from freda.json
	if err := collection.Detach("freda"); err != nil {
        log.Fatalf("%s", err)
	}

Index

Constants

View Source
const (
	// Version of the dataset package
	Version = `v0.0.32-dev`

	// License is a formatted from for dataset package based command line tools
	License = `` /* 1530-byte string literal not displayed */

	DefaultAlphabet = `abcdefghijklmnopqrstuvwxyz`

	ASC  = iota
	DESC = iota
)

Variables

View Source
var DefaultBucketNames = []string{}/* 676 elements not displayed */

DefaultBucketNames provides an A-Z list of bucket names of length 2

Functions

func Analyzer added in v0.0.3

func Analyzer(collectionName string) error

Analyzer checks a collection for problems

  • checks if collection.json exists and is valid
  • checks the version of the collection against the version of the dataset tool running
  • checks if all collection.buckets exist
  • checks for unaccounted for buckets
  • checks if all keys in collection.keymap exist
  • checks for unaccounted for keys in buckets
  • checks for keys in multiple buckets and reports duplicate record modified times

func CSVFormatter added in v0.0.3

func CSVFormatter(out io.Writer, results *bleve.SearchResult, colNames []string, skipHeaderRow bool) error

CSVFormatter writes out CSV representation using encoding/csv

func Deindexer added in v0.0.33

func Deindexer(idxName string, keys []string, batchSize int) error

Deindexer deletes the keys from an index. Returns an error if a problem occurs.

func Delete

func Delete(name string) error

Delete an entire collection

func Find

func Find(idxAlias bleve.IndexAlias, queryString string, options map[string]string) (*bleve.SearchResult, error)

Find takes a Bleve index alias and a query string and returns the search results. The function returns an error if there are problems.

func Formatter added in v0.0.3

func Formatter(out io.Writer, results *bleve.SearchResult, tmpl *template.Template, tName string, pageData map[string]string) error

Formatter writes out a format based on the specified template name merging any additional pageData provided

func JSONFormatter added in v0.0.3

func JSONFormatter(out io.Writer, results *bleve.SearchResult, prettyPrint bool) error

JSONFormatter writes out JSON representation using encoding/json

func Repair added in v0.0.3

func Repair(collectionName string) error

Repair will take a collection name and attempt to recreate a valid collection.json from the content discovered in buckets and attached documents

Types

type Attachment

type Attachment struct {
	// Name is the filename and path to be used inside the generated tar file
	Name string `json:"name"`
	// Body is a byte array for storing the content associated with Name
	Body []byte `json:"-"`
	// Size
	Size int64 `json:"size"`
}

Attachment is a structure for holding non-JSON content you wish to store alongside a JSON document in a collection

type Collection

type Collection struct {
	// Version of collection being stored
	Version string `json:"version"`
	// Name of collection
	Name string `json:"name"`
	// Buckets is a list of bucket names used by collection
	Buckets []string `json:"buckets"`
	// KeyMap holds the document name to bucket map for the collection
	KeyMap map[string]string `json:"keymap"`
	// Store holds the storage system information (e.g. local disc, S3, GS)
	// and related methods for interacting with it
	Store *storage.Store `json:"-"`
	// FullPath is the fully qualified path on disc or URI to S3 or GS bucket
	FullPath string `json:"-"`
}

Collection is the container holding buckets which in turn hold JSON docs

func InitCollection added in v0.0.8

func InitCollection(name string) (*Collection, error)

InitCollection - creates a new collection with default alphabet and names of length 2.

func Open

func Open(name string) (*Collection, error)

Open reads in a collection's metadata and returns a new collection structure and an error value

func (*Collection) AttachFile added in v0.0.33

func (c *Collection) AttachFile(keyName, fName string, buf io.Reader) error

AttachFile is for attaching a single non-JSON document to a dataset record. It will replace ANY existing attached content (i.e. it creates a new tarball holding only this single document). This is a limitation of our storage package supporting a minimal set of operations across all storage environments (e.g. S3/Google Cloud Storage do not support appending to a file, only replacement). It takes the document key, a name and an io.Reader, writes the content into the tar file and updates the internal _Attributes metadata as needed.

func (*Collection) AttachFiles

func (c *Collection) AttachFiles(name string, fileNames ...string) error

AttachFiles attaches non-JSON documents to a JSON document in the collection. Attachments are stored in a tar file; if the tar file exists then the attachment(s) are appended to it.

func (*Collection) Attachments

func (c *Collection) Attachments(name string) ([]string, error)

Attachments returns a list of files in the attached tarball for a given name in the collection

func (*Collection) Close

func (c *Collection) Close() error

Close closes a collection, writing the updated keys to disc

func (*Collection) Create

func (c *Collection) Create(name string, data map[string]interface{}) error

Create adds a JSON doc from a map[string]interface{} to a collection, returning an error if there is a problem. The name must be unique and the document must be a JSON object (not an array).

func (*Collection) CreateJSON added in v0.0.33

func (c *Collection) CreateJSON(key string, src []byte) error

CreateJSON adds a JSON doc to a collection, if a problem occurs it returns an error

func (*Collection) Deindexer added in v0.0.33

func (c *Collection) Deindexer(idxName string, keys []string, batchSize int) error

Deindexer removes items from an index on a collection.

func (*Collection) Delete

func (c *Collection) Delete(name string) error

Delete removes a JSON doc from a collection

func (*Collection) DocPath

func (c *Collection) DocPath(name string) (string, error)

DocPath returns a full path to a key or an error if not found

func (*Collection) ExportCSV added in v0.0.3

func (c *Collection) ExportCSV(fp io.Writer, eout io.Writer, filterExpr string, dotExpr []string, colNames []string, verboseLog bool) (int, error)

ExportCSV iterates over the filtered records in the collection and exports them as CSV rows to the writer provided

func (*Collection) Extract added in v0.0.3

func (c *Collection) Extract(filterExpr string, dotExpr string) ([]string, error)

Extract takes a collection, a filter and a dot path and returns a list of unique values. E.g. in a collection of article records, extracting the ORCID ids which appear as values in the authors field.
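A stdlib-only sketch of the dot-path idea, assuming records have already been decoded into map[string]interface{}. The package's own dot-path syntax and filtering are richer; extractUnique is a hypothetical helper:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// extractUnique collects the unique string values found at a dot path
// (e.g. ".author.orcid") across a set of decoded JSON records. It is
// an illustrative stand-in for what Extract does over a collection.
func extractUnique(records []map[string]interface{}, dotPath string) []string {
	parts := strings.Split(strings.TrimPrefix(dotPath, "."), ".")
	seen := map[string]bool{}
	for _, rec := range records {
		var cur interface{} = rec
		ok := true
		for _, p := range parts {
			m, isMap := cur.(map[string]interface{})
			if !isMap {
				ok = false
				break
			}
			cur, ok = m[p]
			if !ok {
				break
			}
		}
		if s, isStr := cur.(string); ok && isStr {
			seen[s] = true
		}
	}
	var out []string
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	records := []map[string]interface{}{
		{"author": map[string]interface{}{"orcid": "0000-0001-2345-6789"}},
		{"author": map[string]interface{}{"orcid": "0000-0001-2345-6789"}},
		{"author": map[string]interface{}{"orcid": "0000-0002-9999-0000"}},
	}
	// Duplicates collapse to a sorted unique list.
	fmt.Println(extractUnique(records, ".author.orcid"))
}
```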

func (*Collection) GetAttachedFiles

func (c *Collection) GetAttachedFiles(name string, filterNames ...string) error

GetAttachedFiles writes the attached file(s) to the destination directory as a side effect, returning an error if one is encountered. If no filterNames are provided then all attachments are written out.

func (*Collection) HasKey added in v0.0.3

func (c *Collection) HasKey(key string) bool

HasKey returns true if key is in collection's KeyMap, false otherwise

func (*Collection) ImportCSV added in v0.0.3

func (c *Collection) ImportCSV(buf io.Reader, skipHeaderRow bool, idCol int, useUUID bool, verboseLog bool) (int, error)

ImportCSV takes a reader and iterates over the rows, importing them as JSON records into the dataset collection.

func (*Collection) ImportTable added in v0.0.4

func (c *Collection) ImportTable(table [][]string, skipHeaderRow bool, idCol int, useUUID, overwrite, verboseLog bool) (int, error)

ImportTable takes a [][]string and iterates over the rows, importing them as JSON records into the dataset collection.
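The row-to-record mapping that ImportCSV and ImportTable describe can be sketched as follows. This stand-in pairs header names with cell values and uses the id column as the record key; it is a simplification of the real methods, which also handle UUIDs, overwrites and storage.

```go
package main

import "fmt"

// tableToRecords converts a [][]string with a header row into JSON-style
// records keyed by the value in idCol, a simplified sketch of the mapping
// ImportTable performs before storing each record.
func tableToRecords(table [][]string, idCol int) map[string]map[string]interface{} {
	records := map[string]map[string]interface{}{}
	if len(table) < 2 {
		return records // nothing beyond the header row
	}
	header := table[0]
	for _, row := range table[1:] {
		rec := map[string]interface{}{}
		for i, cell := range row {
			if i < len(header) {
				rec[header[i]] = cell
			}
		}
		if idCol < len(row) {
			records[row[idCol]] = rec
		}
	}
	return records
}

func main() {
	table := [][]string{
		{"id", "title"},
		{"r1", "First record"},
		{"r2", "Second record"},
	}
	records := tableToRecords(table, 0)
	fmt.Println(records["r2"]["title"]) // prints "Second record"
}
```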

func (*Collection) Indexer

func (c *Collection) Indexer(idxName string, idxMapName string, keys []string, batchSize int) error

Indexer generates or updates a Bleve index based on the index map filename provided, a list of keys and a batch size.

func (*Collection) KeyFilter added in v0.0.33

func (c *Collection) KeyFilter(keyList []string, filterExpr string) ([]string, error)

KeyFilter takes a list of keys and a filter expression and returns the keys that pass the filter, or an error.

func (*Collection) KeySortByExpression added in v0.0.33

func (c *Collection) KeySortByExpression(keys []string, expr string) ([]string, error)

KeySortByExpression takes an array of keys and a sort expression and returns a sorted list of keys.

func (*Collection) Keys

func (c *Collection) Keys() []string

Keys returns a list of keys in a collection

func (*Collection) Length added in v0.0.6

func (c *Collection) Length() int

Length returns the number of keys in a collection

func (*Collection) Prune added in v0.0.33

func (c *Collection) Prune(name string, filterNames ...string) error

Prune removes a non-JSON document (attachment) from a JSON document in the collection.

func (*Collection) Read

func (c *Collection) Read(name string, data map[string]interface{}) error

Read finds the record in a collection and populates the provided data map; name must exist or an error is returned.

func (*Collection) ReadJSON added in v0.0.33

func (c *Collection) ReadJSON(name string) ([]byte, error)

ReadJSON finds the record in the collection and returns the JSON source.

func (*Collection) Update

func (c *Collection) Update(name string, data map[string]interface{}) error

Update replaces a JSON doc in a collection with the provided data map (note: the JSON doc must exist or an error is returned).

func (*Collection) UpdateJSON added in v0.0.33

func (c *Collection) UpdateJSON(name string, src []byte) error

UpdateJSON updates a JSON doc in a collection; it returns an error if there is a problem.

type IndexList added in v0.0.33

type IndexList struct {
	Names   []string
	Fields  []string
	Alias   bleve.IndexAlias
	Indexes []bleve.Index
}

func OpenIndexes added in v0.0.3

func OpenIndexes(indexNames []string) (*IndexList, []string, error)

OpenIndexes opens a list of index names and returns an index alias, a combined list of fields and an error if one occurs.

func (*IndexList) Close added in v0.0.33

func (idxList *IndexList) Close() error

Close removes all the indexes associated with idxList.Alias via idxList.Alias.Remove(idxList.Indexes), then closes the related indexes, returning an error if one occurs.

type KeyValue added in v0.0.7

type KeyValue struct {
	// JSON Record ID in collection
	ID string
	// The value of the field to be sorted from record
	Value interface{}
}

type KeyValues added in v0.0.7

type KeyValues []KeyValue

func (KeyValues) Len added in v0.0.7

func (a KeyValues) Len() int

func (KeyValues) Less added in v0.0.7

func (a KeyValues) Less(i, j int) bool

func (KeyValues) Swap added in v0.0.7

func (a KeyValues) Swap(i, j int)

Directories

Path Synopsis
analyzers
cmd
dataset
dataset is a command line utility to manage content stored in a dataset collection.
dsws
dsws.go - A web server/service for hosting dataset search and related static pages.
gsheets.go is a part of the dataset package written to allow import/export of records to/from dataset collections.
