anydata

package module
v0.0.0-...-b31d7c6
Published: Sep 28, 2015 License: MIT Imports: 18 Imported by: 0

README

anydata

Go toolkit for handling "any" type of data and source that can be turned into a record/field structure. This is a fairly important component of any data warehouse and/or integration project. Although my specialty is bioinformatics and many of my examples are drawn from it, these tools are general enough to be used in many domains.

Documentation and examples at: http://godoc.org/github.com/pbnjay/anydata

Fetchers

Fetchers are used to retrieve data from a remote (or local) data source. Appropriate Fetcher instances are automatically returned by GetFetcher based on a provided URL string.

  • HttpFetcher - A Fetcher for both http:// and https:// URLs.

    Downloaded files are automatically stored in the cache to save time/bandwidth. Supports HTTP Basic Auth within the URL.

  • FtpFetcher - A Fetcher for ftp:// URLs.

    Downloaded files are automatically stored in the cache to save time/bandwidth. Uses anonymous authentication by default, or embedded username/password in URL.

  • LocalFetcher - A local file Fetcher, which detects bare paths and file:// URLs.
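
As a quick illustration of this dispatch, the sketch below asks GetFetcher for a few resource strings (the URLs are placeholders, not real data sets) and prints the concrete Fetcher type chosen for each:

package main

import (
	"fmt"

	"github.com/pbnjay/anydata"
)

func main() {
	// placeholder resource strings covering the three built-in Fetchers
	resources := []string{
		"https://example.com/data.tsv",
		"ftp://example.com/pub/records.txt",
		"/tmp/records.txt",
	}
	for _, res := range resources {
		ftch, err := anydata.GetFetcher(res)
		if err != nil {
			fmt.Println(res, "->", err)
			continue
		}
		// %T reports which Fetcher implementation was selected
		fmt.Printf("%s -> %T\n", res, ftch)
	}
}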

Wrappers

Wrappers are used to transparently decompress and/or extract files. They are automatically applied to Fetchers returned by GetFetcher based on the URL string provided.

  • TarballWrapper - A Wrapper for extracting files within (optionally compressed) .tar archives.

    It will recognize files ending in any of the following suffixes: .tar .tar.gz .tgz .tar.bz2 .tbz2 .tar.bzip2

  • ZipWrapper - A Wrapper for extracting files within .zip archives.

  • BzWrapper - A decompression wrapper for bzip2'd files.

  • GzWrapper - A decompression wrapper for gzip'd files.
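
Because wrapping happens inside GetFetcher, calling code reads decompressed bytes without dealing with compress/gzip itself. A minimal sketch, assuming a placeholder .gz URL (not a real data set):

package main

import (
	"bufio"
	"fmt"

	"github.com/pbnjay/anydata"
)

func main() {
	// hypothetical gzip'd resource; the .gz suffix triggers the GzWrapper
	res := "https://example.com/annotations.tsv.gz"

	ftch, err := anydata.GetFetcher(res)
	if err != nil {
		panic(err)
	}
	if err := ftch.Fetch(res); err != nil {
		panic(err)
	}

	// the reader already yields decompressed text
	rdr, err := ftch.GetReader()
	if err != nil {
		panic(err)
	}

	// print the first few lines
	scanner := bufio.NewScanner(rdr)
	for i := 0; i < 5 && scanner.Scan(); i++ {
		fmt.Println(scanner.Text())
	}
}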

TODO List

  • Add unit tests
  • Flesh out more data format parsers
  • More compression formats? (LZMA/7-zip, etc)
  • Other network transfer types? (RPC, aspera, etc)

Documentation

Overview

Package anydata provides a toolkit to transparently fetch data files, cache them locally, and automatically decompress and/or extract records from them. It does so through the use of Fetcher and Wrapper interfaces. The "formats" and "filters" sub-packages include a variety of techniques for parsing and extracting records and fields, and these interoperate well.

Current support includes opening files from local paths and the following URL schemes:

http:// https:// ftp:// file://

Transparent decompression is enabled for files (including remote URLs) ending in:

.gz .bz2 .bzip2 .zip

Extracting files from .tar and .zip archives is also supported through the use of URL fragments (#) specifying the archive extraction path. This is supported for the following extensions:

.tar .tar.gz .tgz .tar.bz2 .tbz2 .tar.bzip2

Archives referenced multiple times are only downloaded once and re-used as necessary. For example, the following 4 resource strings will result in only 2 FTP downloads:

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#names.dmp
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#nodes.dmp
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#citations.dmp
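
As a sketch of that reuse, the loop below reads all three taxdump.tar.gz members in turn; under the caching behavior described above, the tarball itself should only be transferred on the first iteration (the byte count is just to force a full read):

package main

import (
	"fmt"
	"io"
	"io/ioutil"

	"github.com/pbnjay/anydata"
)

func main() {
	// three members of the same remote tarball
	members := []string{
		"ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#names.dmp",
		"ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#nodes.dmp",
		"ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#citations.dmp",
	}
	for _, res := range members {
		ftch, err := anydata.GetFetcher(res)
		if err != nil {
			panic(err)
		}
		if err := ftch.Fetch(res); err != nil {
			panic(err)
		}
		rdr, err := ftch.GetReader()
		if err != nil {
			panic(err)
		}
		// read the member to completion, counting bytes
		n, err := io.Copy(ioutil.Discard, rdr)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes\n", res, n)
	}
}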

To add support for new URL schemes, implement the Fetcher interface and use RegisterFetcher before any calls to GetFetcher. You will likely also want to use Put/GetCachedFile to reduce network roundtrips as well. To add support for new archive or compression formats, implement the Wrapper interface and call RegisterWrapper.
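
As a sketch of that extension point, here is a hypothetical Fetcher for a made-up mem:// scheme that also exercises the cache functions documented below. The scheme, type name, cache directory, and cache age are illustrative assumptions, not part of the package:

package main

import (
	"bytes"
	"io"
	"os"
	"path/filepath"
	"strings"

	"github.com/pbnjay/anydata"
)

// memFetcher is a hypothetical Fetcher for a made-up "mem://" scheme,
// shown only to illustrate the interface; it is not part of anydata.
type memFetcher struct {
	data []byte
}

// Detect claims any resource string using the mem:// scheme.
func (m *memFetcher) Detect(resource string) bool {
	return strings.HasPrefix(resource, "mem://")
}

// Fetch checks the cache first, then "downloads" (here: fabricates) the
// data and stores it for later runs via PutCachedFile.
func (m *memFetcher) Fetch(resource string) error {
	if cached := anydata.GetCachedFile(resource); cached != nil {
		m.data = cached
		return nil
	}
	m.data = []byte("hello from " + resource + "\n")
	anydata.PutCachedFile(resource, m.data)
	return nil
}

// GetReader returns an io.Reader over the fetched bytes.
func (m *memFetcher) GetReader() (io.Reader, error) {
	return bytes.NewReader(m.data), nil
}

func main() {
	// initialize the cache before using Put/GetCachedFile; the
	// directory and 7-day age are arbitrary choices for this sketch
	anydata.InitCache(filepath.Join(os.TempDir(), "anydata-cache"), 7)

	// register before any GetFetcher calls so the new scheme is recognized
	anydata.RegisterFetcher(&memFetcher{})

	ftch, err := anydata.GetFetcher("mem://example")
	if err != nil {
		panic(err)
	}
	if err := ftch.Fetch("mem://example"); err != nil {
		panic(err)
	}
	rdr, err := ftch.GetReader()
	if err != nil {
		panic(err)
	}
	io.Copy(os.Stdout, rdr)
}

Registering the custom Fetcher before calling GetFetcher is what makes mem:// resource strings resolvable; the built-in Fetchers remain available for the other schemes.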

Example (Usage)

List matching lines from a species taxonomy inside a remote tarball.

package main

import (
	"bufio"
	"fmt"
	"strings"

	"github.com/pbnjay/anydata"
)

func main() {

	// get a Fetcher for names.dmp in the NCBI Taxonomy tarball
	taxNames := "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#names.dmp"
	ftch, err := anydata.GetFetcher(taxNames)
	if err != nil {
		panic(err)
	}

	// download the tarball (if necessary)
	err = ftch.Fetch(taxNames)
	if err != nil {
		panic(err)
	}

	// get an io.Reader to read from names.dmp
	rdr, err := ftch.GetReader()
	if err != nil {
		panic(err)
	}

	// print every line containing "scientific name"
	scanner := bufio.NewScanner(rdr)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "scientific name") {
			fmt.Println(line)
		}
	}
}
Output:

1       |       root    |               |       scientific name |
2       |       Bacteria        |       Bacteria <prokaryote>   |       scientific name |
6       |       Azorhizobium    |               |       scientific name |
7       |       Azorhizobium caulinodans        |               |       scientific name |
9       |       Buchnera aphidicola     |               |       scientific name |
10      |       Cellvibrio      |               |       scientific name |
11      |       [Cellvibrio] gilvus     |               |       scientific name |
13      |       Dictyoglomus    |               |       scientific name |
14      |       Dictyoglomus thermophilum       |               |       scientific name |
16      |       Methylophilus   |               |       scientific name |
17      |       Methylophilus methylotrophus    |               |       scientific name |
...

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetCachedFile

func GetCachedFile(resource string) []byte

GetCachedFile returns the contents of a file (identified by resource) from the cache. If the resource is too old or does not exist, returns nil.

func InitCache

func InitCache(cpath string, ageDays int)

InitCache initializes the cache by loading prior cached dates and filenames from <cpath>/cacheinfo.json if it exists, and setting the desired data age (in days). If the cpath folder does not exist, it is created. If cacheinfo.json cannot be loaded, then an empty cache is created.

func PutCachedFile

func PutCachedFile(resource string, data []byte)

PutCachedFile saves the contents of a file (identified by resource) to the cache.
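
A minimal sketch of the three cache functions working together; the cache directory, 30-day age, resource string, and contents are all placeholders:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/pbnjay/anydata"
)

func main() {
	// keep cached files for up to 30 days under a scratch directory
	anydata.InitCache(filepath.Join(os.TempDir(), "anydata-cache"), 30)

	res := "https://example.com/annotations.tsv" // placeholder resource string

	if data := anydata.GetCachedFile(res); data != nil {
		fmt.Printf("cache hit: %d bytes\n", len(data))
		return
	}

	// cache miss (or the entry was too old): store a fresh copy for next time
	fresh := []byte("col1\tcol2\nval1\tval2\n")
	anydata.PutCachedFile(res, fresh)
	fmt.Println("cache miss: stored", len(fresh), "bytes")
}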

func RegisterFetcher

func RegisterFetcher(f Fetcher)

RegisterFetcher adds f to the list of known Fetchers for use by GetFetcher.

func RegisterWrapper

func RegisterWrapper(w Wrapper)

RegisterWrapper adds w to the list of known Wrappers for use by GetFetcher.

Types

type Fetcher

type Fetcher interface {
	// Fetch attempts to connect and/or fetch the resource (possibly asynchronously).
	// For non-file-based Fetchers, this is where API authentication, etc. should be verified.
	Fetch(resource string) error

	// GetReader returns the io.Reader for the resource.
	GetReader() (io.Reader, error)

	// Detect returns true if the resource string specified can be fetched by this instance.
	Detect(resource string) bool
}

Fetcher describes an instance that can be used to retrieve a data set (specified by a resource string) from a local/remote data source.

func GetFetcher

func GetFetcher(resource string) (Fetcher, error)

GetFetcher returns a Fetcher (optionally wrapped by a matching Wrapper) that will work on the specified resource string. It returns the last matching Fetcher (Wrapper) in registration order.

type Wrapper

type Wrapper interface {
	// DetectWrap returns true if the pathname (and optional partname) specified suits this Wrapper.
	DetectWrap(pathname, partname string) bool

	// Wrap returns a wrapped Fetcher that decompresses and/or reads the optional partname from f.
	Wrap(f Fetcher, partname string) (Fetcher, error)
}

Wrapper describes an instance that can wrap an existing Fetcher with additional functionality (such as transparent decompression).
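
As a sketch of a custom Wrapper, the hypothetical example below handles a made-up ".zz" suffix for raw zlib streams using the standard library's compress/zlib. The suffix, type names, and behavior are assumptions for illustration; anydata does not include this wrapper:

package main

import (
	"compress/zlib"
	"io"
	"strings"

	"github.com/pbnjay/anydata"
)

// zlibWrapper recognizes a made-up ".zz" suffix for raw zlib streams.
type zlibWrapper struct{}

// DetectWrap matches resources whose pathname ends in .zz; partname is
// unused because this wrapper does not extract archive members.
func (zlibWrapper) DetectWrap(pathname, partname string) bool {
	return strings.HasSuffix(pathname, ".zz")
}

// Wrap returns a Fetcher whose GetReader transparently decompresses.
func (zlibWrapper) Wrap(f anydata.Fetcher, partname string) (anydata.Fetcher, error) {
	return &zlibFetcher{inner: f}, nil
}

// zlibFetcher delegates to the wrapped Fetcher and decompresses on read.
type zlibFetcher struct {
	inner anydata.Fetcher
}

func (z *zlibFetcher) Detect(resource string) bool { return z.inner.Detect(resource) }
func (z *zlibFetcher) Fetch(resource string) error { return z.inner.Fetch(resource) }

func (z *zlibFetcher) GetReader() (io.Reader, error) {
	rdr, err := z.inner.GetReader()
	if err != nil {
		return nil, err
	}
	zr, err := zlib.NewReader(rdr)
	if err != nil {
		return nil, err
	}
	return zr, nil
}

func main() {
	// after registration, GetFetcher applies the wrapper to any matching
	// resource string, e.g. "https://example.com/data.tsv.zz"
	anydata.RegisterWrapper(zlibWrapper{})
}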

Directories

Path      Synopsis
filters   Package filters provides a data-record filtering mechanism and basic implementations for typical use cases.
formats   Package formats provides record-based data format specification and parsing methods which are suitable for automation.
