dataset

v1.1.0
Published: Oct 13, 2021 License: BSD-3-Clause Imports: 25 Imported by: 0

README

Dataset Project

The Dataset Project provides tools for working with collections of JSON object documents stored on the local file system. Two tools are provided.

dataset command line tool

dataset is a command line tool for working with collections of JSON objects. Collections are stored on the file system. JSON objects are stored in collections as plain UTF-8 text files. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.
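Because each object is an ordinary UTF-8 text file, a program can read one with nothing more than a file read and a JSON decoder. The following is a minimal Go sketch using only the standard library. The helper names `readObject` and `runExample` and the file name `jane.json` are hypothetical, and the sketch writes its own sample file rather than reading from a real collection's pairtree.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// readObject reads one JSON object document from a plain text file.
// The helper is hypothetical and not part of the dataset package.
func readObject(path string) (map[string]interface{}, error) {
	src, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	obj := map[string]interface{}{}
	if err := json.Unmarshal(src, &obj); err != nil {
		return nil, err
	}
	return obj, nil
}

// runExample writes a sample document to a temporary directory and
// reads it back, so the sketch does not depend on an existing collection.
func runExample() (map[string]interface{}, error) {
	dir, err := os.MkdirTemp("", "dataset-example")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "jane.json")
	if err := os.WriteFile(path, []byte(`{"name": "Jane Doe"}`), 0o600); err != nil {
		return nil, err
	}
	return readObject(path)
}

func main() {
	obj, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(obj["name"])
}
```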

The dataset command line tool supports common data management operations such as initialization of collections; document creation, reading, updating and deleting; listing keys of JSON objects in the collection; and associating non-JSON documents (attachments) with specific JSON documents in the collection.

Enhanced features include:
  • aggregate objects into data frames
  • import, export and synchronize JSON objects to and from CSV files
  • generate sample sets of keys and objects

See Getting started with dataset for a tour and tutorial.

dataset as a web service

datasetd is a web service implementation of the dataset command line program. It features a subset of the capabilities found in the command line tool. This allows dataset collections to be integrated safely into other web applications or used by multiple processes.

Design choices

dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way.

dataset and datasetd are guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called 'mycollection.ds').

  • dataset and datasetd store JSON object documents in collections
    • collections are folder(s) containing
      • collection.json metadata file describing the collection and keys
      • a pairtree of JSON object documents
      • non-JSON attachments can be associated with a JSON document and found in a semver (semantic version number) named sub directory

The choice of plain UTF-8 is intended to help future-proof reading dataset collections. Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language that supports file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.

Features

dataset supports

  • Listing Keys in a collection
  • Object level actions
  • Import and export of CSV files
  • The ability to reshape data by performing simple object joins
  • The ability to create data frames from whole collections or based on key lists
    • frames are defined using dot paths describing what is to be pulled out of stored JSON objects
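To make the idea of a dot path concrete, here is a minimal Go sketch that resolves a path such as `.name.family` against a decoded JSON object. It assumes a simplified syntax (a leading dot and dot-separated object keys only). The function name `lookup` is hypothetical, and dataset's own dot path handling may accept more than this.

```go
package main

import (
	"fmt"
	"strings"
)

// lookup resolves a simplified dot path such as ".name.family" against a
// decoded JSON object. Only nested object keys are handled; array indexes
// and any other syntax are left out of this sketch.
func lookup(obj map[string]interface{}, dotPath string) (interface{}, bool) {
	var cur interface{} = obj
	for _, key := range strings.Split(strings.TrimPrefix(dotPath, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	obj := map[string]interface{}{
		"name": map[string]interface{}{"family": "Doe", "given": "Jane"},
	}
	if v, ok := lookup(obj, ".name.family"); ok {
		fmt.Println(v)
	}
}
```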

datasetd supports

Both dataset and datasetd may be useful for general data science applications needing intermediate JSON object management but not a full-blown database or repository system.

Limitations of dataset and datasetd

dataset has many limitations; some are listed below:

  • it is not a multi-process, multi-user data store
  • it is not a general purpose database system
  • it does not supply automatic version control on collections, objects or attachments
  • it stores all keys in lower case in order to deal with file systems that are not case sensitive
  • it does not have a built-in query language, search or sorting
  • it should NOT be used for sensitive or secret information

datasetd is a simple web service intended to run on "localhost:8485".

  • it is not a RESTful service
  • it does not include support for authentication
  • it does not support a query language, search or sorting
  • it does not support data frames
  • it does not support access control by users or roles
  • it does not provide auto key generation or versioning
  • it limits the size of JSON documents stored to less than 1 MiB
  • it limits the size of attached files to less than 250 MiB
  • it does not support partial JSON record updates or retrieval
  • it does not provide an interactive Web UI for working with dataset collections
  • it does not support HTTPS or "at rest" encryption
  • it should NOT be used for sensitive or secret information

Authors and history

  • R. S. Doiel
  • Tommy Morrell

Releases

Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 10 (x86) and Raspberry Pi OS (ARM7).

github.com/caltechlibrary/dataset/releases

You can use dataset from Python via the py_dataset package.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrel, <tmorrell@library.caltech.edu>

Copyright (c) 2021, Caltech All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Package dataset provides a common approach for storing JSON object documents on local disc. It is intended as a single-user system for intermediate processing of JSON content for analysis or batch processing. It is not a database management system (if you need a JSON database system I would suggest looking at CouchDB, MongoDB and Redis as a starting point).

The approach dataset takes is to store JSON documents in a pairtree structure under the collection folder. The keys are the JSON document names. JSON documents (and possibly their attachments) are then stored based on that assignment in the pairtree. Conversely the collection.json document is used to find and retrieve documents from the collection. The layout of the metadata is as follows

+ Collection - a directory

  • Collection/collection.json - metadata for retrieval
  • Collection/[Pairtree] - holds individual JSON docs and attachments
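The pairtree idea can be illustrated with a short Go sketch that turns a key into a nested directory path. It assumes the simplest possible scheme: the key is cut into two-character segments with no character escaping or case folding. The function name `pairPath` is hypothetical, and the encoding dataset actually applies may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// pairPath splits a key into two-character segments joined by "/".
// This is a simplified illustration, not the package's own encoder.
func pairPath(key string) string {
	runes := []rune(key)
	parts := make([]string, 0, (len(runes)+1)/2)
	for i := 0; i < len(runes); i += 2 {
		end := i + 2
		if end > len(runes) {
			end = len(runes)
		}
		parts = append(parts, string(runes[i:end]))
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(pairPath("object-1")) // prints "ob/je/ct/-1"
}
```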

A key feature of dataset is to be Posix shell friendly. This has led to storing the JSON documents in a directory structure that standard Posix tooling can traverse. It has also meant that the JSON documents themselves remain on "disc" as plain text. This has facilitated integration with many other applications, programming languages and systems.

Attachments are non-JSON documents explicitly "attached" that share the same pairtree path but are placed in a sub directory called "_". If the document name is "Jane.Doe.json" and the attachment is photo.jpg the JSON document is "pairtree/Ja/ne/.D/e./Jane.Doe.json" and the photo is in "pairtree/Ja/ne/.D/e./_/photo.jpg".

Additional operations besides storing and reading JSON documents are also supported. These include creating lists (arrays) of JSON documents from a list of keys, listing keys in the collection, counting documents in the collection, indexing and searching by indexes.

The primary use case driving the development of dataset is harvesting API content for library systems (e.g. EPrints, Invenio, ArchivesSpace, ORCID, CrossRef, OCLC). The harvesting needed to be done in such a way as to leverage existing Posix tooling (e.g. grep, sed, etc) for processing and analysis.

Initial use case:

Caltech Library has many repository, catalog and record management systems (e.g. EPrints, ArchivesSpace, Islandora, Invenio). It is common practice to harvest data from these systems for analysis or processing. Harvested records typically come in XML or JSON format. JSON has proven a flexible way of working with the data and, in our more modern tools, is the common format we use to move data around. We needed a way to standardize how we stored these JSON records for intermediate processing to allow us to use the growing ecosystem of JSON-related tooling available under Posix/Unix compatible systems.

Constants

const (
	// Asc is used to identify ascending sorts
	Asc = iota
	// Desc is used to identify descending sorts
	Desc = iota
)
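Because both constants are assigned `iota` inside one const block, Asc evaluates to 0 and Desc to 1. The sketch below redeclares the two constants locally so it is self-contained and shows one way a direction value might be consumed. The helper `sortKeys` is hypothetical and is not a function of this package.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copies of the constants, declared the same way as in the package.
const (
	Asc  = iota // 0
	Desc = iota // 1
)

// sortKeys returns a sorted copy of keys in the requested direction.
// It is a hypothetical helper written for this illustration.
func sortKeys(keys []string, direction int) []string {
	out := append([]string(nil), keys...)
	sort.Strings(out)
	if direction == Desc {
		for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
			out[i], out[j] = out[j], out[i]
		}
	}
	return out
}

func main() {
	keys := []string{"b", "a", "c"}
	fmt.Println(sortKeys(keys, Asc))  // [a b c]
	fmt.Println(sortKeys(keys, Desc)) // [c b a]
}
```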
const (

	// License is the formatted license text for dataset package based command line tools
	License = `` /* 1530-byte string literal not displayed */

)
const Version = "1.1.0"

Version of package

Variables

This section is empty.

Functions

func Analyzer added in v0.0.3

func Analyzer(collectionName string, verbose bool) error

Analyzer checks the collection version and analyzes current state of collection reporting on errors.

func DecodeJSON added in v0.1.0

func DecodeJSON(src []byte, obj *map[string]interface{}) error

DecodeJSON provides a common method for decoding data for use in Dataset.

func DisplayLicense added in v1.1.0

func DisplayLicense(out io.Writer, appName string, license string)

DisplayLicense writes the license associated with the dataset application to out.

func DisplayUsage added in v1.1.0

func DisplayUsage(out io.Writer, appName string, flagSet *flag.FlagSet, description string, examples string, license string)

DisplayUsage displays a usage message.

func DisplayVersion added in v1.1.0

func DisplayVersion(out io.Writer, appName string)

DisplayVersion writes the version of the dataset application to out.

func EncodeJSON added in v0.1.0

func EncodeJSON(obj map[string]interface{}) ([]byte, error)

EncodeJSON provides a common method for encoding data for use in Dataset.
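DecodeJSON and EncodeJSON both work with `map[string]interface{}` values. The sketch below shows the shape of such a round trip using the standard library's encoding/json directly. It does not call the dataset functions, whose handling of numbers may differ; the helper name `roundTrip` is hypothetical. With plain `json.Unmarshal`, JSON numbers arrive as float64, and input that is not a JSON object (an array, for example) is rejected.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip decodes src into a generic map and encodes the map again.
// It uses encoding/json as a stand-in for the package's own helpers.
func roundTrip(src []byte) (map[string]interface{}, []byte, error) {
	obj := map[string]interface{}{}
	if err := json.Unmarshal(src, &obj); err != nil {
		return nil, nil, err
	}
	out, err := json.Marshal(obj)
	if err != nil {
		return nil, nil, err
	}
	return obj, out, nil
}

func main() {
	obj, out, err := roundTrip([]byte(`{"one": 1}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%T %s\n", obj["one"], out) // float64 {"one":1}

	// An array is valid JSON but cannot be decoded into a map.
	if _, _, err := roundTrip([]byte(`[1, 2]`)); err != nil {
		fmt.Println("array rejected")
	}
}
```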

func InitDatasetAPI added in v1.1.0

func InitDatasetAPI(settings string) error

InitDatasetAPI initializes the web service by reading in a configuration file. You still need to call RunDatasetAPI to start the service.

func IsCollection added in v0.0.45

func IsCollection(p string) bool

IsCollection checks to see if a given path contains a collection.json file.
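A check of this kind can be approximated with the standard library alone. The sketch below is a stand-in written for illustration, not the package's implementation. The names `hasCollectionJSON` and `runExample` are hypothetical, and treating only a regular file as a match is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasCollectionJSON reports whether dir contains a regular file named
// collection.json. It is a stand-in for illustration only.
func hasCollectionJSON(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "collection.json"))
	return err == nil && info.Mode().IsRegular()
}

// runExample checks a temporary directory before and after a
// collection.json file is written into it.
func runExample() (before, after bool, err error) {
	dir, err := os.MkdirTemp("", "example.ds")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	before = hasCollectionJSON(dir)
	name := filepath.Join(dir, "collection.json")
	if err := os.WriteFile(name, []byte("{}"), 0o600); err != nil {
		return before, false, err
	}
	after = hasCollectionJSON(dir)
	return before, after, nil
}

func main() {
	before, after, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(before, after) // false true
}
```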

func Repair added in v0.0.3

func Repair(collectionName string, verbose bool) error

Repair takes a collection name, walks the pairtree and repairs collection.json as appropriate.

func RunDatasetAPI added in v1.1.0

func RunDatasetAPI(appName string) error

RunDatasetAPI runs a dataset web service. It is the heart of datasetd.

func Shutdown added in v1.1.0

func Shutdown(appName string) int

Shutdown shuts down the dataset web service started with RunDatasetAPI.

Types

type Attachment

type Attachment struct {
	// Name is the filename and path to be used inside the generated tar file
	Name string `json:"name"`

	// Size remains to help us migrate pre v0.0.61 collections.
	// It should reflect the last size added.
	Size int64 `json:"size"`

	// Sizes is the sizes associated with the version being attached
	Sizes map[string]int64 `json:"sizes"`

	// Version holds the semver of the last added version
	Version string `json:"version"`

	// Checksums, currently implemented as MD5 checksums.
	// You should have one checksum per attached version.
	Checksums map[string]string `json:"checksums"`

	// HRef points at last attached version of the attached document, e.g. v0.0.0/photo.png
	// If you moved an object out of the pairtree it should be a URL.
	HRef string `json:"href"`

	// VersionHRefs is a map to all versions of the attached document
	// {
	//    "v0.0.0": "... /photo.png",
	//    "v0.0.1": "... /photo.png",
	//    "v0.0.2": "... /photo.png"
	// }
	VersionHRefs map[string]string `json:"version_hrefs"`

	// Created a date string in RFC3339 format
	Created string `json:"created"`

	// Modified a date string in RFC3339 format
	Modified string `json:"modified"`

	// Metadata is a map for application specific metadata about attachments.
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

Attachment is a structure for holding non-JSON content metadata you wish to store alongside a JSON document in a collection

type Collection

type Collection struct {
	// DatasetVersion of the collection
	DatasetVersion string `json:"dataset,omitempty"`

	// Name (filename) of collection
	Name string `json:"name"`

	// KeyMap holds the document key to path in the collection
	KeyMap map[string]string `json:"keymap,omitempty"`

	// FrameMap maps frame names to the relative path of each frame defined in the collection
	FrameMap map[string]string `json:"frames,omitempty"`

	// Description describes what is in the collection.
	Description string `json:"description,omitempty"`

	// Created is the date/time the init command was run in
	// RFC1123 format.
	Created string `json:"created,omitempty"`

	// Version of collection being stored in semver notation
	Version string `json:"version,omitempty"`

	// Contact info
	Contact string `json:"contact,omitempty"`

	// Author holds a list of PersonOrOrg
	Author []*PersonOrOrg `json:"author,omitempty"`

	// Contributors holds a list of PersonOrOrg
	Contributor []*PersonOrOrg `json:"contributor,omitempty"`

	// Funder holds a list of PersonOrOrg
	Funder []*PersonOrOrg `json:"funder,omitempty"`

	// DOI holds the digital object identifier if defined.
	DOI string `json:"doi,omitempty"`

	// License holds a pointer to the license information for
	// the collection. E.g. CC0 URL
	License string `json:"license,omitempty"`

	// Annotation is a map to any additional metadata associated with
	// the Collection's metadata.
	Annotation map[string]interface{} `json:"annotation,omitempty"`

	// Who is the person(s)/organization(s) that created the collection
	Who []string `json:"who,omitempty"`
	// What - description of collection
	What string `json:"what,omitempty"`
	// When - date associated with collection (e.g. 2021,
	// 2021-10, 2021-10-02), should map to an approx date like in
	// archival work.
	When string `json:"when,omitempty"`
	// Where - location (e.g. URL, address) of collection
	Where string `json:"where,omitempty"`
	// contains filtered or unexported fields
}

Collection is the container holding a pairtree containing JSON docs

func Init added in v1.1.0

func Init(name string) (*Collection, error)

Init - creates a new collection and opens it. Like Open it creates a "lock.pid" file in the root of the collection. An initialized collection should be closed to clear the lock.

```

var (
   c *dataset.Collection
   err error
)
c, err = dataset.Init("collection.ds")
if err != nil {
  // ... handle error
}
defer c.Close()

```

func Open

func Open(name string) (*Collection, error)

Open reads in a collection's metadata and returns a new collection structure or an error. It creates a "lock.pid" file in the collection's root. An opened collection should be closed to clear the lock.

```

var (
   c *dataset.Collection
   err error
)
c, err = dataset.Open("collection.ds")
if err != nil {
   // ... handle error
}
defer c.Close()

```

func (*Collection) AttachFile added in v0.0.33

func (c *Collection) AttachFile(keyName string, semver string, fullName string) error

AttachFile is for attaching a single non-JSON document to a dataset record. It will replace ANY existing attached content with the same semver and basename.

func (*Collection) AttachFileAs added in v1.1.0

func (c *Collection) AttachFileAs(keyName string, semver string, dstName string, srcName string) error

AttachFileAs is for attaching a single non-JSON document to a dataset record with a specific attachment name. It will replace ANY existing attached content with the same semver and destination name.

func (*Collection) AttachFiles

func (c *Collection) AttachFiles(keyName string, semver string, fileNames ...string) error

AttachFiles attaches non-JSON documents to a JSON document in the collection. Attachments are stored in a tar file; if the tar file exists then the attachment(s) are appended to it.

func (*Collection) AttachStream added in v0.0.63

func (c *Collection) AttachStream(keyName, semver, fullName string, buf io.Reader) error

AttachStream is for attaching an open non-JSON file buffer (via an io.Reader).

func (*Collection) AttachmentPath added in v1.1.0

func (c *Collection) AttachmentPath(keyName string, semver string, filename string) (string, error)

AttachmentPath takes a key, semver and filename and returns the path to the attached file (if found).

func (*Collection) Attachments

func (c *Collection) Attachments(keyName string) ([]string, error)

Attachments returns a list of files and sizes attached for a key name in the collection.

func (*Collection) Clone added in v0.0.39

func (c *Collection) Clone(cloneName string, keys []string, verbose bool) error

Clone copies the current collection records into a newly initialized collection given a list of keys and new collection name. Returns an error value if there is a problem. NOTE: Clone does NOT copy attachments, only the JSON records.

func (*Collection) CloneSample added in v0.0.39

func (c *Collection) CloneSample(trainingCollectionName string, testCollectionName string, keys []string, sampleSize int, verbose bool) error

CloneSample takes the current collection, a sample size, a training collection name and a test collection name. The training collection will be created and receive a random sample of the records from the current collection based on the sample size provided. Sample size must be greater than zero and less than the total number of records in the current collection.

If the test collection name is not an empty string it will be created and any records not in the training collection will be cloned from the current collection into the test collection.
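The split described above can be sketched on a plain list of keys. This is a simplified illustration under stated assumptions rather than the package's code: the helper `splitKeys` and its `seed` parameter are hypothetical, and only the keys are partitioned, with no records copied.

```go
package main

import (
	"fmt"
	"math/rand"
)

// splitKeys shuffles a copy of keys and returns the first sampleSize keys
// as the training set and the remaining keys as the test set. The sample
// size must be greater than zero and less than the number of keys.
func splitKeys(keys []string, sampleSize int, seed int64) (training, test []string, err error) {
	if sampleSize <= 0 || sampleSize >= len(keys) {
		return nil, nil, fmt.Errorf("sample size %d must be greater than 0 and less than %d", sampleSize, len(keys))
	}
	shuffled := append([]string(nil), keys...)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:sampleSize], shuffled[sampleSize:], nil
}

func main() {
	keys := []string{"k1", "k2", "k3", "k4", "k5"}
	training, test, err := splitKeys(keys, 2, 42)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(training), len(test)) // 2 3
}
```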

func (*Collection) Close

func (c *Collection) Close() error

Close closes a collection, writing the updated keys to disc. Close removes the "lock.pid" file in the collection root. Close is often called in conjunction with the "defer" keyword.

```

c, err := dataset.Open("my_collection.ds")
if err != nil { /* .. handle error ... */ }
/* do some stuff with the collection */
if err := c.Close(); err != nil {
   /* ... handle closing error ... */
}

```

func (*Collection) Create

func (c *Collection) Create(name string, data map[string]interface{}) error

Create makes a JSON doc from a map[string]interface{} and adds it to a collection. If there is a problem it returns an error. The name must be unique. The document must be a JSON object (not an array).

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */
key := "object-2"
obj := map[string]interface{}{
    "one": 2,
    "two": 3,
    "four": 4,
}
if err := c.Create(key, obj); err != nil {
   /* ... handle error ... */
}

```

func (*Collection) CreateJSON added in v0.0.33

func (c *Collection) CreateJSON(key string, src []byte) error

CreateJSON adds a JSON doc to a collection; if a problem occurs it returns an error. It requires the collection to be open (e.g. via Open or Init).

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */
src := []byte(`{"one": 1}`)
key := "object-1"
if err := c.CreateJSON(key, src); err != nil {
   /* ... handle error ... */
}

```

func (*Collection) CreateObjectsJSON added in v0.0.70

func (c *Collection) CreateObjectsJSON(keyList []string, src []byte) error

CreateObjectsJSON takes a list of keys and creates a default object for each key as quickly as possible. This is useful in very narrow situations, like quickly creating test data. Use with caution.

NOTE: if an object already exists, creation is skipped without reporting an error.

func (*Collection) Delete

func (c *Collection) Delete(name string) error

Delete removes a JSON doc from a collection

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */

key := "object-1"
if err := c.Delete(key); err != nil {
    /* ... handle error ... */
}

```

func (*Collection) DocPath

func (c *Collection) DocPath(name string) (string, error)

DocPath returns a full path to a key or an error if not found

```

c, err := dataset.Open("my_collection.ds")
if err != nil { /* ... handle error ... */ }
defer c.Close()
key := "my-object-key"
docPath, err := c.DocPath(key)
if err != nil { /* ... handle error ... */ }
/* ... use docPath ... */

```

func (*Collection) ExportCSV added in v0.0.3

func (c *Collection) ExportCSV(fp io.Writer, eout io.Writer, f *DataFrame, verboseLog bool) (int, error)

ExportCSV takes a writer and a frame, iterates over the frame's objects generating rows, and exports them as CSV.

func (*Collection) ExportTable added in v0.0.47

func (c *Collection) ExportTable(eout io.Writer, f *DataFrame, verboseLog bool) (int, [][]interface{}, error)

ExportTable takes a frame, iterates over the frame's objects generating rows, and exports them as a table (i.e. [][]interface{}).

func (*Collection) FrameClear added in v0.1.0

func (c *Collection) FrameClear(name string) error

FrameClear empties the frame's object and key lists but leaves the Frame definition in place. Use FrameReframe() to re-populate a frame based on a new key list.

func (*Collection) FrameCreate added in v0.1.0

func (c *Collection) FrameCreate(name string, keys []string, dotPaths []string, labels []string, verbose bool) (*DataFrame, error)

FrameCreate takes a set of collection keys, dot paths and labels, builds an ObjectList and assembles additional metadata, returning a new Frame associated with the collection as well as an error value. If there is a mismatch in the number of labels and dot paths an error will be returned. If the frame already exists an error will be returned.

Conceptually a frame is an ordered list of objects. Frames are associated with a collection and the objects in a frame can easily be refreshed. Frames also serve as the basis for indexing a dataset collection and provide the data paths (expressed as a list of "dot paths"), labels (aka attribute names), and type information needed for indexing and search.

If you need to update a frame's objects use FrameRefresh(). If you need to change a frame's objects or ordering use FrameReframe().
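
To illustrate how dot paths and labels relate, here is a minimal sketch. It is not the package's implementation: it assumes a simplified dot path syntax (a leading dot followed by dot-separated attribute names, with no array indexing), and `lookup` and `frameObject` are hypothetical helpers introduced only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// lookup walks a simplified dot path such as ".author.name" through nested
// maps. Hypothetical helper: array indexes and other syntax are not handled.
func lookup(obj map[string]interface{}, dotPath string) (interface{}, bool) {
	var cur interface{} = obj
	for _, part := range strings.Split(strings.TrimPrefix(dotPath, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		if cur, ok = m[part]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// frameObject builds one flattened object, pairing each dot path with its
// label. A mismatch in the number of dot paths and labels is an error.
func frameObject(obj map[string]interface{}, dotPaths, labels []string) (map[string]interface{}, error) {
	if len(dotPaths) != len(labels) {
		return nil, fmt.Errorf("%d dot paths but %d labels", len(dotPaths), len(labels))
	}
	out := map[string]interface{}{}
	for i, p := range dotPaths {
		if v, ok := lookup(obj, p); ok {
			out[labels[i]] = v
		}
	}
	return out, nil
}

func main() {
	obj := map[string]interface{}{
		"title":  "Example",
		"author": map[string]interface{}{"name": "Jane Doe"},
	}
	row, err := frameObject(obj, []string{".title", ".author.name"}, []string{"title", "author"})
	fmt.Println(row, err)
}
```

The label gives the flattened attribute a shorter name than its dot path, which is the role the Labels field plays in a DataFrame.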

func (*Collection) FrameDelete added in v0.1.0

func (c *Collection) FrameDelete(name string) error

FrameDelete removes a frame from a collection, returns an error if frame can't be deleted.

func (*Collection) FrameExists added in v0.1.0

func (c *Collection) FrameExists(name string) bool

FrameExists checks to see if a frame is already defined. Returns true if it exists, otherwise false.

func (*Collection) FrameKeys added in v1.1.0

func (c *Collection) FrameKeys(name string) []string

FrameKeys retrieves a list of keys associated with a data frame

func (*Collection) FrameObjects added in v0.1.0

func (c *Collection) FrameObjects(fName string) ([]map[string]interface{}, error)

FrameObjects returns a copy of a DataFrame's object list given a collection's frame name.

func (*Collection) FrameRead added in v0.1.0

func (c *Collection) FrameRead(name string) (*DataFrame, error)

FrameRead retrieves a frame from a collection. Returns the DataFrame and an error value

func (*Collection) FrameReframe added in v0.1.0

func (c *Collection) FrameReframe(name string, keys []string, verbose bool) error

FrameReframe **replaces** a frame's object list based on the keys provided.

func (*Collection) FrameRefresh added in v0.1.0

func (c *Collection) FrameRefresh(name string, verbose bool) error

FrameRefresh updates a DataFrame's object list based on the existing keys in the frame. It doesn't change the order of objects. NOTE: If an object is missing in the collection it gets pruned from the object list.
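
The pruning behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: a plain map stands in for the collection, and `refresh` is a hypothetical helper.

```go
package main

import "fmt"

// refresh re-reads each key from a stand-in collection (a plain map here),
// keeping the original key order and dropping keys that no longer exist.
func refresh(keys []string, collection map[string]map[string]interface{}) ([]string, []map[string]interface{}) {
	keptKeys := []string{}
	objects := []map[string]interface{}{}
	for _, key := range keys {
		obj, ok := collection[key]
		if !ok {
			continue // object was removed from the collection, prune it
		}
		keptKeys = append(keptKeys, key)
		objects = append(objects, obj)
	}
	return keptKeys, objects
}

func main() {
	collection := map[string]map[string]interface{}{
		"a": {"n": 1},
		"c": {"n": 3},
	}
	keys, objects := refresh([]string{"c", "b", "a"}, collection)
	fmt.Println(keys, len(objects))
}
```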

func (*Collection) Frames added in v0.0.41

func (c *Collection) Frames() []string

Frames retrieves a list of available frames associated with a collection

func (*Collection) GetAttachedFiles

func (c *Collection) GetAttachedFiles(keyName string, semver string, filterNames ...string) error

GetAttachedFiles writes the attached file(s) to the current working directory as a side effect. If no filterNames are provided then all attachments are written out. An error value is always returned.

func (*Collection) ImportCSV added in v0.0.3

func (c *Collection) ImportCSV(buf io.Reader, idCol int, skipHeaderRow bool, overwrite bool, verboseLog bool) (int, error)

ImportCSV takes a reader, iterates over the rows and imports them as JSON records into dataset. BUG: returns lines processed; it should probably return the number of rows imported.

func (*Collection) ImportTable added in v0.0.4

func (c *Collection) ImportTable(table [][]interface{}, idCol int, useHeaderRow bool, overwrite, verboseLog bool) (int, error)

ImportTable takes a [][]interface{}, iterates over the rows and imports them as JSON records into dataset.
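
As a rough picture of what turning table rows into JSON records involves, here is a minimal sketch. It is not how the package does it: it assumes the first row is always a header supplying attribute names and that the cell in column idCol, formatted as a string, becomes the key. `rowsToObjects` is a hypothetical helper.

```go
package main

import "fmt"

// rowsToObjects treats the first row as attribute names and uses the value in
// column idCol (formatted as a string) as each object's key. Illustrative only.
func rowsToObjects(table [][]interface{}, idCol int) (map[string]map[string]interface{}, error) {
	if len(table) == 0 {
		return nil, fmt.Errorf("table is empty")
	}
	header := table[0]
	if idCol < 0 || idCol >= len(header) {
		return nil, fmt.Errorf("id column %d out of range", idCol)
	}
	objects := map[string]map[string]interface{}{}
	for _, row := range table[1:] {
		if idCol >= len(row) {
			continue // skip short rows that have no id cell
		}
		obj := map[string]interface{}{}
		for i, cell := range row {
			if i < len(header) {
				obj[fmt.Sprint(header[i])] = cell
			}
		}
		objects[fmt.Sprint(row[idCol])] = obj
	}
	return objects, nil
}

func main() {
	table := [][]interface{}{
		{"id", "name"},
		{1, "apple"},
		{2, "pear"},
	}
	objects, err := rowsToObjects(table, 0)
	fmt.Println(len(objects), err)
}
```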

func (*Collection) IsKeyNotFound added in v0.0.69

func (c *Collection) IsKeyNotFound(e error) bool

IsKeyNotFound checks an error message and returns true if it is a key not found error.

func (*Collection) Join added in v0.0.47

func (c *Collection) Join(key string, obj map[string]interface{}, overwrite bool) error

Join takes a key, a map[string]interface{} and an overwrite bool and merges the map with an existing JSON object in the collection. BUG: This is a naive join; it assumes the keys in the object are top level properties.
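
A naive top-level join of this kind can be sketched as below. This is a stand-alone illustration and not the package's code. It assumes that when overwrite is false, properties already present in the existing object are left untouched. `join` is a hypothetical helper operating on plain maps.

```go
package main

import "fmt"

// join merges the top-level properties of update into existing. When
// overwrite is false, properties already present in existing are kept
// (an assumption about the non-overwrite behavior).
func join(existing, update map[string]interface{}, overwrite bool) {
	for k, v := range update {
		if _, found := existing[k]; found && !overwrite {
			continue
		}
		existing[k] = v
	}
}

func main() {
	obj := map[string]interface{}{"one": 1}
	join(obj, map[string]interface{}{"one": 10, "two": 2}, false)
	fmt.Println(obj["one"], obj["two"])
	join(obj, map[string]interface{}{"one": 10}, true)
	fmt.Println(obj["one"])
}
```

Because only top-level keys are compared, a nested object in the update replaces the whole nested object in the target rather than being merged into it.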

func (*Collection) KeyExists added in v0.1.0

func (c *Collection) KeyExists(key string) bool

KeyExists returns true if key is in collection's KeyMap, false otherwise

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */

key := "object-1"
if c.KeyExists(key) {
   /* ... do something with the key ... */
}

```

func (*Collection) KeySortByExpression added in v0.0.33

func (c *Collection) KeySortByExpression(keys []string, expr string) ([]string, error)

KeySortByExpression takes an array of keys and a sort expression and returns a sorted list of keys.

func (*Collection) Keys

func (c *Collection) Keys() []string

Keys returns a list of keys in a collection

```

var (
   c *dataset.Collection
   keys []string
)
/* ... collection previously opened and assigned to "c" ... */

keys = c.Keys()
for _, key := range keys {
   /* ... do something with each key ... */
}

```

func (*Collection) Length added in v0.0.6

func (c *Collection) Length() int

Length returns the number of keys in a collection

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */

l := c.Length()
/* ... do something with the number of items in the collection ... */

```

func (*Collection) MergeFromTable added in v0.0.47

func (c *Collection) MergeFromTable(frameName string, table [][]interface{}, overwrite bool, verbose bool) error

MergeFromTable uses a DataFrame associated with the collection to map columns from a table into JSON object attributes, saving each JSON object in the collection. If overwrite is true, JSON objects for matching keys are updated; if false, only new objects are added to the collection. Returns an error value.

func (*Collection) MergeIntoTable added in v0.0.47

func (c *Collection) MergeIntoTable(frameName string, table [][]interface{}, overwrite bool, verbose bool) ([][]interface{}, error)

MergeIntoTable uses a DataFrame associated with the collection to map attributes into a table, appending new content and optionally overwriting existing content for rows with matching ids. Returns a new table (i.e. [][]interface{}) or an error.

func (*Collection) MetadataJSON added in v1.1.0

func (c *Collection) MetadataJSON() []byte

MetadataJSON() returns a collection's metadata fields as a JSON encoded byte array.

func (*Collection) MetadataUpdate added in v1.1.0

func (c *Collection) MetadataUpdate(meta *Collection) error

MetadataUpdate() updates a collection's metadata fields based on a Collection data structure. You can remove

func (*Collection) ObjectList added in v0.0.61

func (c *Collection) ObjectList(keys []string, dotPaths []string, labels []string, verbose bool) ([]map[string]interface{}, error)

ObjectList (on a collection) takes a set of collection keys and builds an ordered array of objects from the array of keys, dot paths and labels provided.

func (*Collection) Prune added in v0.0.33

func (c *Collection) Prune(keyName string, semver string, filterNames ...string) error

Prune removes a non-JSON document (attachment) from a JSON document in the collection.

func (*Collection) Read

func (c *Collection) Read(name string, data map[string]interface{}, cleanObject bool) error

Read finds the record in a collection and updates the data map provided. The name must exist in the collection; if there is a problem an error is returned.

```

var (
   c *dataset.Collection
)
/* ... collection previously opened and assigned to "c" ... */
key := "object-2"
obj := map[string]interface{}{}
if err := c.Read(key, obj, false); err != nil { /* ... handle error ... */  }

```

func (*Collection) ReadJSON added in v0.0.33

func (c *Collection) ReadJSON(name string) ([]byte, error)

ReadJSON finds the record in the collection and returns the JSON source or an error.

```

var (
   c *Collection
)
/* ... collection previously opened and assigned to "c" ... */
key := "object-1"
src, err := c.ReadJSON(key)
if err != nil {
   /* ... handle error ... */
}
/* ... do something with the JSON encoded "src" value ... */

```

func (*Collection) Save added in v1.1.0

func (c *Collection) Save() error

Save writes the collection's metadata to c.workPath. This is useful for things like updating a collection's metadata.

```

c, err := dataset.Open("collection.ds")
if err != nil { /* ... handle error ... */ }
defer c.Close()
person := &dataset.PersonOrOrg{
    GivenName: "Jane",
    FamilyName: "Doe",
    ID: "https://orcid.org/0000-0000-0000-0000",
}
funder := &dataset.PersonOrOrg{
    Name: "Example University Library",
    ID: "https://ror.org/0000000",
}

c.Author = append(c.Author, person)
c.Funder = append(c.Funder, funder)
c.Description = "This is a dataset for Jane Doe's Adventure game."
if err := c.Save(); err != nil {
    /* ... handle error ... */
}

```

func (*Collection) SaveFrame added in v0.0.47

func (c *Collection) SaveFrame(name string, f *DataFrame) error

SaveFrame saves a frame in a collection or returns an error

func (*Collection) Update

func (c *Collection) Update(name string, data map[string]interface{}) error

Update replaces a JSON doc in a collection with the provided map[string]interface{} (note: the JSON doc must exist or an error is returned).

```

var (
    c *dataset.Collection
    obj map[string]interface{}
)
/* ... collection previously opened and assigned to "c" ... */

/* ... populate our replacement obj ... */
key := "object-2"
if err := c.Update(key, obj); err != nil {
    /* ... handle error ... */
}

```

func (*Collection) UpdateJSON added in v0.0.33

func (c *Collection) UpdateJSON(name string, src []byte) error

UpdateJSON replaces a JSON doc in a collection with the JSON encoded values in the byte array. It returns an error if there is a problem. As with Update(), a record matching the key must already exist in the collection.

```

var (
   c *Collection
)
/* ... collection previously opened and assigned to "c" ... */
key := "object-1"
src := []byte(`{"one":1, "two": 2}`)
if err := c.UpdateJSON(key, src); err != nil {
   /* ... handle error ... */
}

```

type Config added in v1.1.0

type Config struct {
	// Hostname for running service
	Hostname string `json:"host" default:"localhost:8485"`

	// Collections are defined by a COLLECTION_ID (string)
	// that points at path to where the collection is saved on file system.
	Collections map[string]*Settings `json:"collections,required"`

	// Routes are mappings of collections to supported routes.
	Routes map[string]map[string]func(http.ResponseWriter, *http.Request, string, []string) (int, error) `json:"-"`
}

Config holds a configuration file structure used by EPrints Extended API. The configuration file is expected to be in JSON format.

func LoadConfig added in v1.1.0

func LoadConfig(fname string) (*Config, error)

LoadConfig reads the JSON configuration file provided, validates it and either returns a Config structure or error.

func (*Config) String added in v1.1.0

func (config *Config) String() string

type DataFrame added in v0.0.41

type DataFrame struct {
	// Explicit at creation
	Name string `json:"frame_name"`

	// CollectionName holds the name of the collection the frame was generated from. In theory you could
	// define a frame in one collection and use its results in another. A DataFrame can be rendered as a JSON
	// document.
	CollectionName string `json:"collection_name"`

	// DotPaths is a slice holding the definitions of what each Object attribute's data source is.
	DotPaths []string `json:"dot_paths"`

	// Labels are new attribute names for fields created from the provided
	// DotPaths.  Typically this is used to surface a deeper dotpath's
	// value as something more useful in the frame's context (e.g.
	// first_title from an array of titles might be labeled "title")
	Labels []string `json:"labels"`

	// NOTE: Keys is an ordered list of object keys in the frame.
	Keys []string `json:"keys"`

	// NOTE: Object map provides a quick index by key to object index.
	ObjectMap map[string]interface{} `json:"object_map"`

	// Created is the date the frame is originally generated and defined
	Created time.Time `json:"created"`

	// Updated is the date the frame is updated (e.g. reframed)
	Updated time.Time `json:"updated"`
}

DataFrame is the basic structure holding a list of objects as well as the definition of the list (so you can regenerate an updated list from a changed collection). It persists with the collection.

func (*DataFrame) Grid added in v0.0.41

func (f *DataFrame) Grid(includeHeaderRow bool) [][]interface{}

Grid returns a Grid representation of a DataFrame's ObjectList
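
For a sense of what a grid representation of an object list looks like, here is a minimal sketch. It is not the package's implementation: it assumes each column corresponds to one label and that a missing attribute becomes a nil cell. `grid` is a hypothetical helper.

```go
package main

import "fmt"

// grid lays out a list of objects as rows, one column per label.
// The optional header row holds the labels themselves.
func grid(objects []map[string]interface{}, labels []string, includeHeaderRow bool) [][]interface{} {
	rows := [][]interface{}{}
	if includeHeaderRow {
		header := make([]interface{}, len(labels))
		for i, label := range labels {
			header[i] = label
		}
		rows = append(rows, header)
	}
	for _, obj := range objects {
		row := make([]interface{}, len(labels))
		for i, label := range labels {
			row[i] = obj[label] // nil when the attribute is missing
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	objects := []map[string]interface{}{
		{"id": "a", "n": 1},
		{"id": "b"},
	}
	for _, row := range grid(objects, []string{"id", "n"}, true) {
		fmt.Println(row...)
	}
}
```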

func (*DataFrame) Objects added in v0.0.64

func (f *DataFrame) Objects() []map[string]interface{}

Objects returns a copy of DataFrame's object list (an array of map[string]interface{})

func (*DataFrame) String added in v0.0.41

func (f *DataFrame) String() string

String renders the data structure DataFrame as JSON to a string

type Err added in v0.0.62

type Err struct {
	Msg string
}

Err holds Semver's error messages

func (*Err) Error added in v0.0.62

func (err *Err) Error() string

type KeyValue added in v0.0.7

type KeyValue struct {
	// JSON Record ID in collection
	ID string
	// The value of the field to be sorted from record
	Value interface{}
}

KeyValue holds an ID string and a value interface. This lets us work with numeric keys and sort them.

type KeyValues added in v0.0.7

type KeyValues []KeyValue

KeyValues is a list of keys (strings) to records. This type exists to allow easy sorting.

func (KeyValues) Len added in v0.0.7

func (a KeyValues) Len() int

func (KeyValues) Less added in v0.0.7

func (a KeyValues) Less(i, j int) bool

func (KeyValues) Swap added in v0.0.7

func (a KeyValues) Swap(i, j int)
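
Len, Less and Swap together satisfy the standard library's sort.Interface. The sketch below shows that pattern with local stand-in types. The ordering used in Less (numbers first in numeric order, everything else by string form) is an assumption made for this example and may differ from the package's actual comparison; `sortedIDs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// keyValue and keyValues are local stand-ins mirroring the documented types.
type keyValue struct {
	ID    string
	Value interface{}
}

type keyValues []keyValue

func (a keyValues) Len() int      { return len(a) }
func (a keyValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }

// Less orders float64 values numerically and places them before any other
// value; non-numeric values are ordered by their string form. This ordering
// is an assumption for illustration.
func (a keyValues) Less(i, j int) bool {
	x, xok := a[i].Value.(float64)
	y, yok := a[j].Value.(float64)
	switch {
	case xok && yok:
		return x < y
	case xok != yok:
		return xok
	default:
		return fmt.Sprint(a[i].Value) < fmt.Sprint(a[j].Value)
	}
}

// sortedIDs sorts the list in place and returns the record IDs in order.
func sortedIDs(kv keyValues) []string {
	sort.Sort(kv)
	ids := make([]string, len(kv))
	for i, item := range kv {
		ids[i] = item.ID
	}
	return ids
}

func main() {
	kv := keyValues{{"c", 3.0}, {"x", "zebra"}, {"a", 1.0}, {"b", 2.0}}
	fmt.Println(sortedIDs(kv))
}
```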

type PersonOrOrg added in v1.1.0

type PersonOrOrg struct {
	// Type is either "Person" or "Organization"
	Type string `json:"@type,omitempty"`
	// ID is either an ORCID or ROR
	ID string `json:"@id,omitempty"`
	// Name of an organization, empty if person
	Name string `json:"name,omitempty"`
	// Given name for a person, empty if organization
	GivenName string `json:"givenName,omitempty"`
	// Family name for a person, empty if organization
	FamilyName string `json:"familyName,omitempty"`
	// Affiliation holds the institutional affiliation of a person.
	Affiliation []*PersonOrOrg `json:"affiliation,omitempty"`
	// Annotation holds custom fields, e.g. a grant number of a funder
	Annotation map[string]interface{} `json:"annotation,omitempty"`
}

PersonOrOrg holds the description of a person or organization associated with the dataset collection, e.g. author, contributor or funder.
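
The struct tags above determine how a PersonOrOrg is rendered as JSON. The sketch below uses a local copy of a subset of the fields and tags so that it runs without importing the package; `personOrOrg` and `encode` are hypothetical names for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// personOrOrg is a local copy of a subset of the documented fields and tags.
type personOrOrg struct {
	Type        string         `json:"@type,omitempty"`
	ID          string         `json:"@id,omitempty"`
	Name        string         `json:"name,omitempty"`
	GivenName   string         `json:"givenName,omitempty"`
	FamilyName  string         `json:"familyName,omitempty"`
	Affiliation []*personOrOrg `json:"affiliation,omitempty"`
}

// encode renders the value as JSON, returning an empty string on failure.
func encode(p *personOrOrg) string {
	src, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(src)
}

func main() {
	author := &personOrOrg{
		Type:       "Person",
		GivenName:  "Jane",
		FamilyName: "Doe",
		Affiliation: []*personOrOrg{
			{Type: "Organization", Name: "Example University Library"},
		},
	}
	fmt.Println(encode(author))
}
```

Because every field is tagged omitempty, a person is encoded without a "name" attribute and an organization without "givenName" or "familyName".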

type Semver added in v0.0.62

type Semver struct {
	// Major version number (required, must be an integer as string)
	Major string `json:"major"`
	// Minor version number (required, must be an integer as string)
	Minor string `json:"minor"`
	// Patch level (optional, must be an integer as string)
	Patch string `json:"patch,omitempty"`
	// Suffix string, (optional, any string)
	Suffix string `json:"suffix,omitempty"`
}

Semver holds the information to generate a semver string

func ParseSemver added in v0.0.62

func ParseSemver(src []byte) (*Semver, error)

ParseSemver takes a byte slice and returns a version struct, and an error value.

func (*Semver) IncMajor added in v0.0.64

func (sv *Semver) IncMajor() error

IncMajor increments a major version number, zeros minor and patch values. Returns an error if increment fails.

func (*Semver) IncMinor added in v0.0.64

func (sv *Semver) IncMinor() error

IncMinor increments a minor version number and zeros the patch level. Returns an error if increment fails.

func (*Semver) IncPatch added in v0.0.64

func (sv *Semver) IncPatch() error

IncPatch increments the patch level if it is numeric or returns an error.
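
The increment rules described above can be sketched with a local stand-in type. This is an illustration, not the package's code: `semver`, `incMajor`, `incMinor` and `text` are hypothetical names, and the "major.minor.patch" rendering in `text` is an assumption (the Suffix field is left out).

```go
package main

import (
	"fmt"
	"strconv"
)

// semver is a local stand-in that stores version parts as strings,
// mirroring the documented Semver fields.
type semver struct {
	Major, Minor, Patch string
}

// bump parses a numeric string and returns it incremented by one.
func bump(s string) (string, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return "", fmt.Errorf("%q is not an integer: %w", s, err)
	}
	return strconv.Itoa(n + 1), nil
}

// incMajor increments the major number and zeros minor and patch.
func (sv *semver) incMajor() error {
	next, err := bump(sv.Major)
	if err != nil {
		return err
	}
	sv.Major, sv.Minor, sv.Patch = next, "0", "0"
	return nil
}

// incMinor increments the minor number and zeros the patch level.
func (sv *semver) incMinor() error {
	next, err := bump(sv.Minor)
	if err != nil {
		return err
	}
	sv.Minor, sv.Patch = next, "0"
	return nil
}

// text renders "major.minor" or "major.minor.patch" (assumed format).
func (sv *semver) text() string {
	if sv.Patch == "" {
		return sv.Major + "." + sv.Minor
	}
	return sv.Major + "." + sv.Minor + "." + sv.Patch
}

func main() {
	sv := &semver{Major: "1", Minor: "1", Patch: "4"}
	if err := sv.incMinor(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sv.text())
}
```

Storing the parts as strings means an increment can fail when a part is not numeric, which is why each Inc method returns an error.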

func (*Semver) String added in v0.0.62

func (sv *Semver) String() string

func (*Semver) ToJSON added in v0.0.62

func (sv *Semver) ToJSON() []byte

ToJSON takes a version struct and returns JSON as byte slice

type Settings added in v1.1.0

type Settings struct {
	CName    string      `json:"dataset,required"`
	Keys     bool        `json:"keys" default:"false"`
	Create   bool        `json:"create" default:"false"`
	Read     bool        `json:"read" default:"false"`
	Update   bool        `json:"update" default:"false"`
	Delete   bool        `json:"delete" default:"false"`
	Attach   bool        `json:"attach" default:"false"`
	Retrieve bool        `json:"retrieve" default:"false"`
	Prune    bool        `json:"prune" default:"false"`
	DS       *Collection `json:"-"`
}

Settings holds the specific settings for a collection.
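
Based on the JSON tags in the Config and Settings definitions, a configuration document can be decoded as sketched below. The structs here are local copies of a subset of the fields so that the example runs on its own. The collection id "recipes", the path "recipes.ds" and the `parseConfig` helper are hypothetical, and the validation LoadConfig performs may differ from the single check shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// settings and config are local copies of a subset of the documented fields.
type settings struct {
	CName  string `json:"dataset"`
	Keys   bool   `json:"keys"`
	Create bool   `json:"create"`
	Read   bool   `json:"read"`
	Update bool   `json:"update"`
	Delete bool   `json:"delete"`
}

type config struct {
	Hostname    string               `json:"host"`
	Collections map[string]*settings `json:"collections"`
}

// parseConfig decodes the JSON source and requires at least one collection.
func parseConfig(src []byte) (*config, error) {
	cfg := &config{}
	if err := json.Unmarshal(src, cfg); err != nil {
		return nil, err
	}
	if len(cfg.Collections) == 0 {
		return nil, fmt.Errorf("no collections defined")
	}
	return cfg, nil
}

func main() {
	src := []byte(`{
  "host": "localhost:8485",
  "collections": {
    "recipes": {"dataset": "recipes.ds", "keys": true, "read": true}
  }
}`)
	cfg, err := parseConfig(src)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.Hostname, cfg.Collections["recipes"].CName)
}
```

Permissions left out of the JSON decode to false, matching the `default:"false"` annotations on the Settings fields.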

Directories

Path Synopsis
cli
cli is a package intended to encourage some standardization in the command line user interface for programs developed for Caltech Library.
pkgassets
pkgassets is a command line tool for harvesting directory contents (like website default files) and turning them into a Go package with a map of path and byte array of contents harvested.
cmd
dataset
dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections on local disc.
tbl.go provides some utility functions to move string one and two dimensional slices into/out of one and two dimensional slices.
