pachyderm

package module
v1.3.9
Warning

This package is not in the latest version of its module.

Published: Feb 27, 2017 License: Apache-2.0 Imports: 9 Imported by: 0

README


Pachyderm: A Containerized, Version-Controlled Data Lake

Pachyderm is:

For more details, see what's new about Pachyderm.

Getting Started

Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete developer docs to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

What's new about Pachyderm? (How is it different from Hadoop?)

There are two bold new ideas in Pachyderm:

  • Containers as the core processing primitive
  • Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program that reads from and writes to the local filesystem. You can use any tools you want because it all just runs in a container! Pachyderm will take your container and inject data into it. It will then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (example: distributed grep).
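As a rough illustration of the kind of program described above, here is a minimal grep-like sketch in Go. It only touches the local filesystem: it reads every file in an input directory and writes the matching lines to an output directory. The names grepLines and grepDir, the substring matching, and the command-line layout are assumptions made for this example; the directories where Pachyderm actually mounts data are not specified here, so they are passed in as arguments.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// grepLines returns the lines of content that contain substr.
// (Hypothetical helper for this sketch.)
func grepLines(content, substr string) []string {
	var matches []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		if line := scanner.Text(); strings.Contains(line, substr) {
			matches = append(matches, line)
		}
	}
	// Scanner errors (for example, an overly long line) are ignored here
	// to keep the sketch short.
	return matches
}

// grepDir filters every regular file directly inside inDir and writes the
// matching lines to a file of the same name in outDir.
func grepDir(inDir, outDir, substr string) error {
	entries, err := os.ReadDir(inDir)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue // subdirectories are skipped in this sketch
		}
		data, err := os.ReadFile(filepath.Join(inDir, e.Name()))
		if err != nil {
			return err
		}
		out := strings.Join(grepLines(string(data), substr), "\n")
		if out != "" {
			out += "\n"
		}
		if err := os.WriteFile(filepath.Join(outDir, e.Name()), []byte(out), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The argument layout is an assumption of this example.
	if len(os.Args) != 4 {
		fmt.Fprintln(os.Stderr, "usage: grep <substring> <input-dir> <output-dir>")
		os.Exit(1)
	}
	if err := grepDir(os.Args[2], os.Args[3], os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because the program knows nothing about distribution, the same binary could run in one container or in many, each seeing a different subset of the input files.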

Pachyderm also version-controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far-reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control also works well with our containerized processing engine. Pachyderm understands how your data changes, so as new data is ingested it can run your workload on the diff of the data rather than on the whole dataset. This means that there's no difference between a batch job and a streaming job: the same code will work for both!
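The idea of running a workload only on the diff can be sketched conceptually as follows. This is not Pachyderm's implementation: the snapshot type (a map from file path to content hash) and the changedFiles function are hypothetical and exist only to show how comparing two commits yields the set of files that still need processing.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot is a hypothetical model of one commit: file path -> content hash.
type snapshot map[string]string

// changedFiles returns, in sorted order, the paths that are new or modified
// in cur relative to prev. Deleted files are not reported in this sketch.
func changedFiles(prev, cur snapshot) []string {
	var changed []string
	for path, hash := range cur {
		if old, ok := prev[path]; !ok || old != hash {
			changed = append(changed, path)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	prev := snapshot{"a.log": "h1", "b.log": "h2"}
	cur := snapshot{"a.log": "h1", "b.log": "h3", "c.log": "h4"}

	// Only the modified and the newly added file need to be reprocessed.
	for _, path := range changedFiles(prev, cur) {
		fmt.Println("process", path)
	}
}
```

With an empty previous snapshot every file counts as changed, which corresponds to a full batch run; later commits only trigger work on what was added or modified.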

Community

Keep up to date and get Pachyderm support via:

Contributing

To get started, sign the Contributor License Agreement.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled "noob-friendly" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at jobs@pachyderm.io.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Asset added in v1.1.0

func Asset(name string) ([]byte, error)

Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetDir added in v1.1.0

func AssetDir(name string) ([]string, error)

AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then:

  • AssetDir("data") would return []string{"foo.txt", "img"}
  • AssetDir("data/img") would return []string{"a.png", "b.png"}
  • AssetDir("foo.txt") and AssetDir("notexist") would return an error
  • AssetDir("") would return []string{"data"}
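The lookup behavior described above can be reproduced with a small stand-in. The sketch below does not use the generated code from this package: the node type, exampleTree, and the lowercase assetDir are hypothetical names that model the same data/ hierarchy in memory. Results are sorted here for readability, which is a choice of this example and not something the documentation above promises.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical stand-in for an embedded file tree.
// A nil children map marks a file; a non-nil map marks a directory.
type node struct {
	children map[string]*node
}

// exampleTree builds the data/ hierarchy used in the description above.
func exampleTree() *node {
	return &node{children: map[string]*node{
		"data": {children: map[string]*node{
			"foo.txt": {},
			"img": {children: map[string]*node{
				"a.png": {},
				"b.png": {},
			}},
		}},
	}}
}

// assetDir lists the entries directly below name, or returns an error if
// name does not exist or refers to a file.
func assetDir(root *node, name string) ([]string, error) {
	n := root
	if name != "" {
		for _, part := range strings.Split(name, "/") {
			n = n.children[part]
			if n == nil {
				return nil, fmt.Errorf("asset %s not found", name)
			}
		}
	}
	if n.children == nil {
		return nil, fmt.Errorf("asset %s is a file, not a directory", name)
	}
	names := make([]string, 0, len(n.children))
	for child := range n.children {
		names = append(names, child)
	}
	sort.Strings(names) // sorted for deterministic output in this sketch
	return names, nil
}

func main() {
	root := exampleTree()
	for _, name := range []string{"data", "data/img", "foo.txt", "notexist", ""} {
		entries, err := assetDir(root, name)
		fmt.Printf("%q: %v, err=%v\n", name, entries, err)
	}
}
```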

func AssetInfo added in v1.1.0

func AssetInfo(name string) (os.FileInfo, error)

AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetNames added in v1.1.0

func AssetNames() []string

AssetNames returns the names of the assets.

func MustAsset added in v1.1.0

func MustAsset(name string) []byte

MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.
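To show why a panicking variant helps with global initialization, here is a minimal sketch built on stand-ins. The assets map, the asset name config/default.json, and the lowercase asset and mustAsset functions are hypothetical; they imitate the Asset/MustAsset pair described above so the example runs on its own.

```go
package main

import "fmt"

// assets is a hypothetical in-memory table standing in for embedded data.
var assets = map[string][]byte{
	"config/default.json": []byte(`{"name":"demo"}`),
}

// asset returns the bytes for name, or an error if name is unknown.
func asset(name string) ([]byte, error) {
	data, ok := assets[name]
	if !ok {
		return nil, fmt.Errorf("asset %s not found", name)
	}
	return data, nil
}

// mustAsset is like asset but panics instead of returning an error.
func mustAsset(name string) []byte {
	data, err := asset(name)
	if err != nil {
		panic("asset: " + err.Error())
	}
	return data
}

// A package-level variable can be initialized in a single expression,
// because mustAsset returns one value. A missing asset stops the program
// at startup instead of leaving the variable silently empty.
var defaultConfig = mustAsset("config/default.json")

func main() {
	fmt.Println(string(defaultConfig))
}
```

With the two-value form, the same initialization would need an init function or a helper to handle the error, which is the convenience the panicking variant trades for.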

func RestoreAsset added in v1.1.0

func RestoreAsset(dir, name string) error

RestoreAsset restores an asset under the given directory.

func RestoreAssets added in v1.1.0

func RestoreAssets(dir, name string) error

RestoreAssets restores an asset under the given directory recursively.

Types

This section is empty.

Directories

Path                            Synopsis
doc
etc
migration
src
client/health                   Package health is a generated protocol buffer package.
client/pfs                      Package pfs is a generated protocol buffer package.
client/pkg/config               Package config is a generated protocol buffer package.
client/pkg/shard                Package shard is a generated protocol buffer package.
client/pps                      Package pps is a generated protocol buffer package.
client/version/versionpb        Package versionpb is a generated protocol buffer package.
server/pfs/db/persist           Package persist is a generated protocol buffer package.
server/pfs/drive                Package drive provides the definitions for the low-level pfs storage drivers.
server/pfs/fuse                 Package fuse is a generated protocol buffer package.
server/pkg/cache/groupcachepb   Package groupcachepb is a generated protocol buffer package.
server/pkg/deploy               Package deploy is a generated protocol buffer package.
server/pkg/hashtree             Package hashtree is a generated protocol buffer package.
server/pkg/metrics              Package metrics is a generated protocol buffer package.
server/pkg/sync                 Package sync provides utility functions similar to `git pull/push` for PFS
server/pps                      Package pps is a generated protocol buffer package.
server/pps/persist              Package persist is a generated protocol buffer package.
