rexdeliverdataset

package module
v0.0.0-...-09900dc Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 26, 2022 License: AGPL-3.0 Imports: 22 Imported by: 0

README

RexRegistry Dataset Delivery

Test Status Go Report Card

Overview

rex_deliver_dataset is a simple command-line utility for uploading file-based datasets into a RexRegistry system. It will validate the format of the files in your dataset and then upload them to the system with all the required metadata. It automates all the steps necessary to satisfy the requirements for delivering files to RexRegistry.

Installation

You can install rex_deliver_dataset onto your system in a variety of ways:

  • Download from our GitHub Releases. Every release will be available here with pre-built binaries for all the platforms we support. Unless you have a very specific reason not to, you should always download the most recent version.

  • Homebrew. We provide a custom tap that allows Homebrew users to easily install this utility on MacOS.

    $ brew tap prometheusresearch/public
    $ brew install rex_deliver_dataset
    
  • Scoop. We provide a custom bucket that allows Scoop users to easily install this utility on Windows.

    PS> scoop bucket add prometheusresearch https://github.com/prometheusresearch/scoop-public.git
    PS> scoop install rex_deliver_dataset
    
  • Compile from source. If you're comfortable building Go projects, you're welcome to retrieve the source and build it yourself.

    $ git clone https://github.com/prometheusresearch/rex_deliver_dataset
    $ cd rex_deliver_dataset
    $ make init
    $ make build
    

Usage

Once installed, you can use this utility by executing the rex_deliver_dataset command on the commandline of your system. It requires two parameters: one parameter specifies the location of the configuration file to use, the other specifies the directory that contains the files you wish to deliver. For example:

$ rex_deliver_dataset --config=my_config_file.yaml /path/to/my/files

When executed, rex_deliver_dataset will validate the structure and format of the files in your dataset according to the dataset_type specified in your configuration, and if everything is valid, it will then upload those files to the cloud storage container described in the storage section of your configuration.

If you would like to only perform the validation step on your dataset, and not actually upload anything, you can use the --validate-only parameter to do exactly that. It will report any issues it detects in your files and then terminate.

$ rex_deliver_dataset --config=my_config_file.yaml --validate-only /path/to/my/files

For more information about other parameters you can use, run rex_deliver_dataset --help.

Configuration

rex_deliver_dataset requires a configuration file in order to do its job. This configuration file is a YAML-formatted file that specifies a few properties. An example configuration is as follows:

dataset_type: "omop:5.2:csv"

storage:
  kind: s3
  container: my-bucket-name
  access_key: SOME_ACCESS_KEY
  secret_key: YOUR_SUPER_SECRET_KEY
  region: us-east-1

The properties that this configuration file supports are:

dataset_type

The dataset_type property tells the tool what kind of dataset you're trying to upload. Using this information, it will perform a series of validations on your files to ensure they are properly formatted. This property currently allows the following values:

  • omop:5.2:csv for CSV-formatted files representing OMOP CDM v5.2 tables (specifications)
storage

The storage property tells the tool where to upload the dataset to. This property consists of several child properties that contain the details of the location:

kind

The kind property specifies type type of cloud storage container that is being used. It is required, and currently allows the following values:

container

The container property specifies the name of the container that should be used. It is required.

access_key

The access_key property specifies the Access Key that will be used to access the S3 bucket that will be used. It is required when using a kind of s3.

secret_key

The secret_key property specifies the Secret Key that will be used to access the S3 bucket that will be used. It is required when using a kind of s3.

region

The region property specifies the AWS region that the S3 bucket is hosted in. It is required when using a kind of s3.

credentials_json

The credentials_json property specifies the path to the JSON file containing the GCS Application Credentials that will be used to access the Google Cloud Storage container. It is required when using a kind of gs.

Support

If you require technical support in using this tool to submit datasets to a RexRegistry system, please get in contact directly with your Registry coordinator. Do not submit an issue in this GitHub Project.

If you believe you've found a bug with this utility, or would like to request a new feature, feel free to submit a new issue in this GitHub Project.

Contributing

You can contribute to this project by forking it, making your changes, and then sending a Pull Request back to this project. To get started:

  1. Clone the repository.
  2. Run make init. This will retrieve all the dependencies of the project.
  3. Make your changes (please include tests!).
  4. Test your changes with make test.

License

rex_deliver_dataset is published under the terms of the GNU Affero General Public License v3.0.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AbsPath

func AbsPath(path string) (string, error)

func FormatBytes

func FormatBytes(size float64) string

func TimeAsISO8601

func TimeAsISO8601(t time.Time) string

Types

type Configuration

type Configuration struct {
	ConfigurationPath string
	SourcePath        string
	ExecutionTime     time.Time
	Storage           map[string]string
	DatasetType       string `yaml:"dataset_type"`
}

func NewConfiguration

func NewConfiguration() Configuration

func ReadConfig

func ReadConfig(configPath string) (Configuration, error)

func (Configuration) Validate

func (config Configuration) Validate() error

type File

type File struct {
	Name     string
	FullPath string
	Size     int64
	Hash     string
}

func CatalogDirectory

func CatalogDirectory(rootPath string) ([]File, error)

type FileReader

type FileReader struct {
	io.ReadCloser
	// contains filtered or unexported fields
}

func CreateFileReader

func CreateFileReader(path string) (*FileReader, error)

func (*FileReader) GetHash

func (fileReader *FileReader) GetHash() string

func (*FileReader) Read

func (fileReader *FileReader) Read(p []byte) (int, error)

type Manifest

type Manifest struct {
	DateCreated string         `json:"date_created"`
	DatasetType string         `json:"dataset_type"`
	Generator   string         `json:"generator"`
	Files       []ManifestFile `json:"files"`
}

func CreateManifest

func CreateManifest(config Configuration, files []File) Manifest

func (Manifest) ToJSON

func (manifest Manifest) ToJSON() ([]byte, error)

type ManifestFile

type ManifestFile struct {
	Name   string `json:"name"`
	Size   int64  `json:"size"`
	Sha512 string `json:"sha512"`
}

type Uploader

type Uploader interface {
	UploadFile(file *File) error
	UploadFiles(files []File) error
	UploadContent(name string, content []byte) error
	GetURL() string
}

func NewUploader

func NewUploader(config Configuration) (Uploader, error)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL