bagman

module
v0.0.0-...-9369c8b Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 12, 2016 License: Apache-2.0

README

bagman

Server side code for managing BagIt bags sent to and managed by APTrust.

Easy Vagrant Setup

For an easy setup using Vagrant, see the Vagrant setup instructions in the ansible-playboooks repo.

Prerequisites

Install libmagic first. If you're on a Mac, do it this way:

	brew install libmagic

If you're on Linux, use apt-get or yum to install libmagic-dev:

	sudo apt-get install libmagic-dev

Install these prerequisites as well:

	go get launchpad.net/goamz
	go get github.com/nu7hatch/gouuid
	go get github.com/bitly/go-nsq
	go get github.com/rakyll/magicmime
	go get github.com/op/go-logging
	go get github.com/APTrust/bagins

Installation

It's as easy as:

	go get github.com/APTrust/bagman

Configuration

You must have the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your environment for bagman to connect to S3.

You can create a named custom configuration in config.json, and then run bagman with that named configuration using the scripts in the scripts directory. See the section on Running Locally with NSQ below.

Testing

You can run basic unit tests by changing into the bagman/bagman directory and running this:

	go test

... but because bagman has to work with NSQ, S3 and a Rails app called Fluctus, the integration tests are much more useful and informative. See the Testing Documentation for more information about unit tests and integration tests.

Running Locally with NSQ

See the run_demo.sh script in the scripts directory. This runs all of the services required for bag processing, using the "demo" configuration settings in the config.json file. To run everything locally, you can set up your own config secion in config.json, then run one of the scripts in the scripts directory, using your newly created config.

The config settings you'll need to change to run locally are TarDirectory, LogDirectory and RestoreDirectory. These should all point to existing directories in which you have read/write access. You should also set MaxFileSize to a relatively low value (100000 - 250000), so that you only fetch and process relatively small files from the S3 receiving buckets. If MaxFileSize is set too high, or if it's set to zero, which means no limit, you will wind up downloading and processing an enormous amount of data.

The scripts directory also includes process_items.sh and restore_items.sh. These are good for running local end-to-end integration tests. To run these on your own machine, you'll have to change the config setting in those scripts from dev to whatever config name you've set up. You also must have Fluctus running locally (or at whatever URL the FluctusURL setting points to).

The process_items.sh script runs the bucket reader to look for items in the receiving buckets, then ingests them. The restore_items.sh script looks for items marked for delete and restore, and deletes or restores them.

A good way to run end-to-end tests locally is to run process_items.sh, then go to http://localhost:3000 and mark a few intellectual objects for delete and a few for restore. Then kill process_items.sh and run restore_items.sh. Both scripts print occasional STATS lines that show the number of items that have processed successfully and the number that have failed.

NSQ

The nsq folder contains service.go, which is a wrapper around NSQ. It also contains config files with settings for the NSQ demon. See the README in the nsq directory for more information.

APTrust Apps

Bag processing, deletion and restoration requires a number of apps. Here's a breakdown of the apps and what they do:

apt_bag_delete - Delete Bags After Ingest

apps/bag_delete deletes bags from partners' receiving buckets after the bags have been successfully ingested.

apt_failed_fixity - Record Failed Fixity Events

apps/failed_fixity records information about failed attempts to run fixity checks on items in the preservation bucket.

apt_failed_replication - Record Failed Replication Events

apps/failed_replication records information about failed attempts to copy files from the preservation bucket in the US East region to the replication bucket in the US West 2 region.

apt_file_delete - Delete Generic Files from Preservation

*apps/apt_file_delete deletes files from the perservation bucket. This is done only at the request of the institution that owns the files.

apt_fixity - Run Fixity Checks

*apps/apt_fixity runs periodic fixity checks on files in the preservation storage bucket.

apt_prepare - Prepare Bags for Ingest

apps/apt_prepare handles the first stages on ingest. It reads from NSQ's prepare_channel, which often contains duplicate entries. apt_prepare throws out duplicates, then downloads tar files from the receiving buckets, unpacks them, and validates them. It makes an entry for each valid bag in NSQ's store_topic. Invalid bags are recorded in Fluctus' Processed Items list as invalid. apt_prepare deletes the original tar file from the local file system when it finishes, but it leaves the untarred files for apt_store to work with.

apt_record - Record Items in Fluctus

apps/apt_record reads from NSQ's metadata_channel, which contains information about where all of a bag's files have been stored. It records this data in Fluctus (Fedora), creating or updating the Intellectual Object, the Generic Files and Premis Events. apt_record puts successfully ingested items into NSQ's cleanup_topic.

apt_replicate - Copy Ingested Files To Oregon

apps/apt_replicate copies items from the preservation bucket in Virginia to the replication bucket in Oregon.

apt_restore - Restore Intellectual Objects

apps/apt_restore reassembles Intellectual Objects into APTrust bags and copies them to the restoration bucket of the partner who owns the object.

apt_retry - Re-attempt Recording in Fluctus

apps/apt_retry is a manually-run app that uses JSON data dumped from the trouble queue to re-record ingest data in Fluctus. It can create complete, consistent records in Fluctus about any ingested bag whose Fluctus data was recorded incompletely or not at all. This tool is rarely needed. A typical use case occurs when a bag exposes a new bug in Fluctus that prevents proper recording of metadata. Once the bug is fixed, the object's metadata can be rebuilt in Fluctus from the JSON data in the trouble queue.

apt_store - Store Ingested Files in the Preservation Bucket

apps/apt_store reads from the store_channel and stores the generic files unpacked by apt_prepare in the S3 preservation bucket (aka long-term storage). If it's not able to store all the files in S3, it records an error in Fluctus' Processed Items list. Otherwise, it creates an entry in NSQ's metadata_topic so that data about the bag can be recorded in Fluctus. apt_store deletes all of the untarred contents of the bag from the local file system when it finishes.

apt_trouble - Record Information About Failed Ingest

apps/apt_trouble dumps information about failed ingest attempts into JSON files inside the log directory of the Go server. This data is useful for diagnosing bugs and for the apt_retry utility described above.

Bucket Reader

apps/bucket_reader runs as a cron job on the Go utility server. It scans all of the partner intake buckets, then checks the Processed Items list in Fluctus to see which of these items have not yet started the ingest process. It adds unprocessed tar files to the prepare_topic in NSQ.

Downloader

apps/downloader is a simple command-line utility for downloading files from any APTrust bucket to your local machine. You must have the proper APTrust AWS keys set in your environment to use this.

Cleanup Reader

apps/cleanup_reader is a cron job that gathers information from Fluctus about what bags have been ingested recently. It adds successfully-ingested bags to NSQ's bag_delete topic, so that apt_bag_delete can remove them from the receiving buckets.

Fixity Reader

apps/fixity_reader is a cron job that checks Fluctus once a day for files that need a fixity check. It copies information about these files into NSQ's fixity_topic.

Multiclean

apps/multiclean finds and deletes fragments of failed multi-part S3 uploads. We run this periodically to avoid having to pay S3 storage charges on inaccessible file fragments.

Read Test

apps/readtest reads an APTrust bag from the local file system and prints out information about what's in it (files, tags, etc.) This can be useful for examining bags and diagnosing problems. See also the partner app called apt_validate.

Request Reader

apps/request_reader runs as a cron job on the Go utility server. It checks the Processed Items list in Fluctus for 1) intellectual objects that users want to restore, and 2) generic files that users want to delete. It puts these items into NSQ's restore_topic and delete_topic.

Restoration

apps/apt_restore reads from NSQ's restore_channel and restores intellectual objects. It fetches all of the object's generic files from the preservation bucket, reassembles them into a bag (a tar file), and puts them into the owning institution's restoration bucket. (If the CustomRestoreBucket option in config.json is set to some other bucket name, the object will be restored to that bucket.) Multi-part bags will not be divided into the same units as the original upload, but the entire bag, once reassembled from the parts will be complete. (E.g. A bag may have been uploaded in 100 parts, and it may be restored in 20 parts. None of the restored parts will match the original parts, but the restored whole will match the original whole.)

APTrust Partner Apps

The partner-apps directory contains apps intended for use by APTrust partners. Note that these apps can access only the partner's own receiving and restoration buckets. These apps require a config file, which is described by providing the --help flag to any of the apps.

Most of these apps require a configuration file, which you can specify explicitly on the command line with the --config flag, or implicitly by creating a ~/.aptrust_partner.conf file.

The apps include the following:

apt_delete

partner-apps/apt_delete enables partners to download restored bags from their restoration buckets.

apt_download

partner-apps/apt_download enables partners to download restored bags from their restoration buckets.

apt_list

partner-apps/apt_list enables partners to list the contents of their receiving and restoration buckets.

apt_upload

partner-apps/apt_upload enables partners to upload bags to their receiving buckets for ingest.

apt_validate

partner-apps/apt_validate enables partners to validate bags (tarred or untarred) before uploading them for ingest.

Directories

Path Synopsis
apps
bucket_reader
BucketReader gets a list of files awaiting processing in the S3 intake buckets and adds each item to the queue for processing.
BucketReader gets a list of files awaiting processing in the S3 intake buckets and adds each item to the queue for processing.
fixity_reader
fixity_reader periodically queries Fluctus for GenericFiles that haven't had a fixity check in X days.
fixity_reader periodically queries Fluctus for GenericFiles that haven't had a fixity check in X days.
multiclean
multiclean.go cleans up fragments from failed S3 multipart-uploads.
multiclean.go cleans up fragments from failed S3 multipart-uploads.
request_reader
request_reader periodically checks the processed items list in Fluctus for the following: 1.
request_reader periodically checks the processed items list in Fluctus for the following: 1.
requeue
requeue puts a blob of json into the specified queue so that we can reprocess it.
requeue puts a blob of json into the specified queue so that we can reprocess it.
Common vars and constants, shared by many parts of the bagman library.
Common vars and constants, shared by many parts of the bagman library.
bagrecorder records bag metadata in Fluctus.
bagrecorder records bag metadata in Fluctus.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL