README
Apache Beam
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow and Hazelcast Jet.
Status
Post-commit tests status (on master branch)
Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2 |
---|---|---|---|---|---|---|
Go | --- | --- | --- | |||
Java | ||||||
Python | --- | --- | ||||
XLang | --- | --- |
Overview
Beam provides a general approach to expressing embarrassingly parallel data processing pipelines and supports three categories of users, each of which have relatively disparate backgrounds and needs.
- End Users: Writing pipelines with an existing SDK, running it on an existing runner. These users want to focus on writing their application logic and have everything else just work.
- SDK Writers: Developing a Beam SDK targeted at a specific user community (Java, Python, Scala, Go, R, graphical, etc). These users are language geeks and would prefer to be shielded from all the details of various runners and their implementations.
- Runner Writers: Have an execution environment for distributed processing and would like to support programs written against the Beam Model. Would prefer to be shielded from details of multiple SDKs.
The Beam Model
The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the “Dataflow Model”.
To learn more about the Beam Model (though still under the original name of Dataflow), see the World Beyond Batch: Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site, and the VLDB 2015 paper.
The key concepts in the Beam programming model are:
PCollection
: represents a collection of data, which could be bounded or unbounded in size.PTransform
: represents a computation that transforms input PCollections into output PCollections.Pipeline
: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.PipelineRunner
: specifies where and how the pipeline should execute.
SDKs
Beam supports multiple language specific SDKs for writing pipelines against the Beam Model.
Currently, this repository contains SDKs for Java, Python and Go.
Have ideas for new SDKs or DSLs? See the JIRA.
Runners
Beam supports executing programs on multiple distributed processing backends through PipelineRunners. Currently, the following PipelineRunners are available:
- The
DirectRunner
runs the pipeline on your local machine. - The
DataflowRunner
submits the pipeline to the Google Cloud Dataflow. - The
FlinkRunner
runs the pipeline on an Apache Flink cluster. The code has been donated from dataArtisans/flink-dataflow and is now part of Beam. - The
SparkRunner
runs the pipeline on an Apache Spark cluster. The code has been donated from cloudera/spark-dataflow and is now part of Beam. - The
JetRunner
runs the pipeline on a Hazelcast Jet cluster. The code has been donated from hazelcast/hazelcast-jet and is now part of Beam. - The
Twister2Runner
runs the pipeline on a Twister2 cluster. The code has been donated from DSC-SPIDAL/twister2 and is now part of Beam.
Have ideas for new Runners? See the JIRA.
Getting Started
To learn how to write Beam pipelines, read the Quickstart for [Java, Python, or Go] available on our website.
Contact Us
To get involved in Apache Beam:
- Subscribe or mail the user@beam.apache.org list.
- Subscribe or mail the dev@beam.apache.org list.
- Join ASF Slack on #beam channel
- Report issues on JIRA.
Instructions for building and testing Beam itself are in the contribution guide.
More Information
- Apache Beam
- Overview
- Quickstart: Java, Python, Go
- Community metrics
Directories
Path | Synopsis |
---|---|
sdks/go/cmd/beamctl | beamctl is a command line client for the Apache Beam portability services. |
sdks/go/cmd/beamctl/cmd | Package cmd contains the commands for beamctl. |
sdks/go/cmd/specialize | specialize is a low-level tool to generate type-specialized code. |
sdks/go/cmd/starcgen | starcgen is a tool to generate specialized type assertion shims to be used in Apache Beam Go SDK pipelines instead of the default reflection shim. |
sdks/go/cmd/symtab | Package verifies that functions sym2addr and addr2sym work correctly. |
sdks/go/container | |
sdks/go/examples/complete/autocomplete | |
sdks/go/examples/contains | |
sdks/go/examples/cookbook/combine | |
sdks/go/examples/cookbook/filter | |
sdks/go/examples/cookbook/join | |
sdks/go/examples/cookbook/max | |
sdks/go/examples/cookbook/tornadoes | tornadoes is an example that reads the public samples of weather data from BigQuery, counts the number of tornadoes that occur in each month, and writes the results to BigQuery. |
sdks/go/examples/debugging_wordcount | debugging_wordcount is an example that verifies word counts in Shakespeare and includes Beam best practices. |
sdks/go/examples/forest | forest is an example that shows that pipeline construction is normal Go code -- the pipeline "forest" is created recursively and uses a global variable -- and that a pipeline may contain non-connected parts. |
sdks/go/examples/grades | |
sdks/go/examples/minimal_wordcount | minimal_wordcount is an example that counts words in Shakespeare. |
sdks/go/examples/multiout | multiout is a wordcount variation that uses a multi-outout DoFn and writes 2 output files. |
sdks/go/examples/pingpong | |
sdks/go/examples/readavro | readavro is a simple Avro read/write Example This example uses a 500 Byte sample avro file [twitter.avro] download here: https://s3-eu-west-1.amazonaws.com/daidokoro-dev/apache/twitter.avro |
sdks/go/examples/streaming_wordcap | streaming_wordcap is a toy streaming pipeline that uses PubSub. |
sdks/go/examples/stringsplit | An example of using a Splittable DoFn in the Go SDK with a portable runner. |
sdks/go/examples/windowed_wordcount | windowed_wordcount counts words in text, and can run over either unbounded or bounded input collections. |
sdks/go/examples/wordcount | wordcount is an example that counts words in Shakespeare and includes Beam best practices. |
sdks/go/examples/xlang | Package xlang contains functionality for testing cross-language transforms. |
sdks/go/examples/xlang/cogroup_by | cogroup_by exemplifies using a cross-language cogroup by key transform from a test expansion service. |
sdks/go/examples/xlang/combine | combine exemplifies using a cross-language combine per key transform from a test expansion service. |
sdks/go/examples/xlang/combine_globally | combine_globally exemplifies using a cross-language combine global transform from a test expansion service. |
sdks/go/examples/xlang/flatten | |
sdks/go/examples/xlang/group_by | group_by exemplifies using a cross-language group by key transform from a test expansion service. |
sdks/go/examples/xlang/multi_input_output | multi exemplifies using a cross-language transform with multiple inputs and outputs from a test expansion service. |
sdks/go/examples/xlang/partition | partition exemplifies using a cross-language partition transform from a test expansion service. |
sdks/go/examples/xlang/wordcount | wordcount exemplifies using a cross-language Count transform from a test expansion service to count words. |
sdks/go/examples/yatzy | yatzy is an implementation of https://en.wikipedia.org/wiki/Yatzy that shows that pipeline construction is normal Go code. |
sdks/go/pkg/beam | Package beam is an implementation of the Apache Beam (https://beam.apache.org) programming model in Go. |
sdks/go/pkg/beam/artifact | Package artifact contains utilities for staging and retrieving artifacts. |
sdks/go/pkg/beam/artifact/gcsproxy | Package gcsproxy contains artifact staging and retrieval servers backed by GCS. |
sdks/go/pkg/beam/core/funcx | Package funcx contains functions and types used to perform type analysis of Beam functions. |
sdks/go/pkg/beam/core/graph | Package graph is the internal representation of the Beam execution plan. |
sdks/go/pkg/beam/core/graph/coder | Package coder contains coder representation and utilities. |
sdks/go/pkg/beam/core/graph/mtime | Package mtime contains a millisecond representation of time. |
sdks/go/pkg/beam/core/graph/window | Package window contains window representation, windowing strategies and utilities. |
sdks/go/pkg/beam/core/metrics | Package metrics implements the Beam metrics API, described at http://s.apache.org/beam-metrics-api Metrics in the Beam model are uniquely identified by a namespace, a name, and the PTransform context in which they are used. |
sdks/go/pkg/beam/core/runtime | Package runtime contains runtime hooks and utilities for pipeline options and type registration. |
sdks/go/pkg/beam/core/runtime/coderx | Package coderx contains coders for primitive types that aren't included in the beam model. |
sdks/go/pkg/beam/core/runtime/exec | Package exec contains runtime plan representation and execution. |
sdks/go/pkg/beam/core/runtime/exec/optimized | Package optimized contains type-specialized shims for faster execution. |
sdks/go/pkg/beam/core/runtime/genx | Package genx is a convenience package to better support the code generator. |
sdks/go/pkg/beam/core/runtime/graphx | Package graphx provides facilities to help with the serialization of pipelines into a serializable graph structure suitable for the worker. |
sdks/go/pkg/beam/core/runtime/graphx/schema | Package schema contains utility functions for relating Go types and Beam Schemas. |
sdks/go/pkg/beam/core/runtime/graphx/v1 | Package v1 is a generated protocol buffer package. |
sdks/go/pkg/beam/core/runtime/harness | Package harness implements the SDK side of the Beam FnAPI. |
sdks/go/pkg/beam/core/runtime/harness/init | Package init contains the harness initialization code defined by the FnAPI. |
sdks/go/pkg/beam/core/runtime/harness/session | Package session is a generated protocol buffer package. |
sdks/go/pkg/beam/core/runtime/metricsx | |
sdks/go/pkg/beam/core/runtime/pipelinex | Package pipelinex contains utilities for manipulating Beam proto pipelines. |
sdks/go/pkg/beam/core/runtime/xlangx | |
sdks/go/pkg/beam/core/sdf | Package contains interfaces used specifically for splittable DoFns. |
sdks/go/pkg/beam/core/typex | Package typex contains full type representation for PCollections and DoFns, and utilities for type checking. |
sdks/go/pkg/beam/core/util/dot | Package dot produces DOT graphs from Beam graph representations. |
sdks/go/pkg/beam/core/util/hooks | Package hooks allows runners to tailor execution of the worker harness. |
sdks/go/pkg/beam/core/util/ioutilx | Package ioutilx contains additional io utilities. |
sdks/go/pkg/beam/core/util/jsonx | Package jsonx contains utilities for working with JSON encoded data. |
sdks/go/pkg/beam/core/util/protox | Package protox contains utilities for working with protobufs. |
sdks/go/pkg/beam/core/util/reflectx | Package reflectx contains a set of reflection utilities and well-known types. |
sdks/go/pkg/beam/core/util/stringx | Package stringx contains utilities for working with strings. |
sdks/go/pkg/beam/core/util/symtab | Package symtab allows reading low-level symbol information from the symbol table. |
sdks/go/pkg/beam/internal/errors | Package errors contains functionality for creating and wrapping errors with improved formatting compared to the standard Go error functionality. |
sdks/go/pkg/beam/io/avroio | Package avroio contains transforms for reading and writing avro files. |
sdks/go/pkg/beam/io/bigqueryio | Package bigqueryio provides transformations and utilities to interact with Google BigQuery. |
sdks/go/pkg/beam/io/databaseio | Package databaseio provides transformations and utilities to interact with generic database database/sql API. |
sdks/go/pkg/beam/io/datastoreio | Package datastoreio provides transformations and utilities to interact with Google Datastore. |
sdks/go/pkg/beam/io/filesystem | Package filesystem contains an extensible file system abstraction. |
sdks/go/pkg/beam/io/filesystem/gcs | Package gcs contains a Google Cloud Storage (GCS) implementation of the Beam file system. |
sdks/go/pkg/beam/io/filesystem/local | Package local contains a local file implementation of the Beam file system. |
sdks/go/pkg/beam/io/filesystem/memfs | Package memfs contains a in-memory Beam filesystem. |
sdks/go/pkg/beam/io/pubsubio | Package pubsubio provides access to PubSub on Dataflow streaming. |
sdks/go/pkg/beam/io/pubsubio/v1 | Package v1 is a generated protocol buffer package. |
sdks/go/pkg/beam/io/rtrackers/offsetrange | Package offsetrange defines a restriction and restriction tracker for offset ranges. |
sdks/go/pkg/beam/io/synthetic | Package synthetic contains transforms for creating synthetic pipelines. |
sdks/go/pkg/beam/io/textio | Package textio contains transforms for reading and writing text files. |
sdks/go/pkg/beam/log | Package log contains a re-targetable context-aware logging system. |
sdks/go/pkg/beam/model | Package model contains the portable Beam model contracts. |
sdks/go/pkg/beam/model/fnexecution_v1 | |
sdks/go/pkg/beam/model/jobmanagement_v1 | |
sdks/go/pkg/beam/model/pipeline_v1 | |
sdks/go/pkg/beam/options/gcpopts | Package gcpopts contains shared options for Google Cloud Platform. |
sdks/go/pkg/beam/options/jobopts | Package jobopts contains shared options for job submission. |
sdks/go/pkg/beam/provision | Package provision contains utilities for obtaining runtime provision, information -- such as pipeline options. |
sdks/go/pkg/beam/runners/dataflow | Package dataflow contains the Dataflow runner for submitting pipelines to Google Cloud Dataflow. |
sdks/go/pkg/beam/runners/dataflow/dataflowlib | Package dataflowlib translates a Beam pipeline model to the Dataflow API job model, for submission to Google Cloud Dataflow. |
sdks/go/pkg/beam/runners/direct | Package direct contains the direct runner for running single-bundle pipelines in the current process. |
sdks/go/pkg/beam/runners/dot | Package dot is a Beam runner that "runs" a pipeline by producing a DOT graph of the execution plan. |
sdks/go/pkg/beam/runners/flink | Package flink contains the Flink runner. |
sdks/go/pkg/beam/runners/session | |
sdks/go/pkg/beam/runners/spark | Package spark contains the Spark runner. |
sdks/go/pkg/beam/runners/universal | Package universal contains a general-purpose runner that can submit jobs to any portable Beam runner. |
sdks/go/pkg/beam/runners/universal/extworker | Package extworker provides an external worker service and related utilities. |
sdks/go/pkg/beam/runners/universal/runnerlib | Package runnerlib contains utilities for submitting Go pipelines to a Beam model runner. |
sdks/go/pkg/beam/runners/vet | Package vet is a Beam runner that "runs" a pipeline by producing generated code to avoid symbol table lookups and reflection in pipeline execution. |
sdks/go/pkg/beam/runners/vet/testpipeline | Package testpipeline exports small test pipelines for testing the vet runner. |
sdks/go/pkg/beam/testing/passert | Package passert contains verification transformations for testing pipelines. |
sdks/go/pkg/beam/testing/ptest | Package ptest contains utilities for pipeline unit testing. |
sdks/go/pkg/beam/transforms/filter | Package filter contains transformations for removing pipeline elements based on various conditions. |
sdks/go/pkg/beam/transforms/stats | Package stats contains transforms for statistical processing. |
sdks/go/pkg/beam/transforms/top | Package top contains transformations for finding the smallest (or largest) N elements based on arbitrary orderings. |
sdks/go/pkg/beam/util/errorx | Package errorx contains utilities for handling errors. |
sdks/go/pkg/beam/util/execx | Package execx contains wrappers and utilities for the exec package. |
sdks/go/pkg/beam/util/gcsx | Package gcsx contains utilities for working with Google Cloud Storage (GCS). |
sdks/go/pkg/beam/util/grpcx | Package grpcx contains utilities for working with gRPC. |
sdks/go/pkg/beam/util/pubsubx | Package pubsubx contains utilities for working with Google PubSub. |
sdks/go/pkg/beam/util/shimx | Package shimx specifies the templates for generating type assertion shims for Apache Beam Go SDK pipelines. |
sdks/go/pkg/beam/util/starcgenx | Package starcgenx is a Static Analysis Type Assertion shim and Registration Code Generator which provides an extractor to extract types from a package, in order to generate approprate shimsr a package so code can be generated for it. |
sdks/go/pkg/beam/util/syscallx | Package syscallx provides system call utilities that attempt to hide platform differences. |
sdks/go/pkg/beam/x/beamx | Package beamx is a convenience package for beam. |
sdks/go/pkg/beam/x/debug | Package debug contains pipeline components that may help in debugging pipeline issues. |
sdks/go/pkg/beam/x/hooks/perf | Package perf is to add performance measuring hooks to a runner, such as cpu, heap, or trace profiles. |
sdks/go/test/integration | The integration driver provides a suite of tests to run against a registered runner. |
sdks/go/test/integration/primitives | |
sdks/go/test/integration/synthetic | Package synthetic contains pipelines for testing synthetic steps and sources. |
sdks/go/test/integration/wordcount | Package wordcount contains transforms for wordcount. |
sdks/go/test/load | |
sdks/go/test/load/cogbk | |
sdks/go/test/load/combine | |
sdks/go/test/load/group_by_key | |
sdks/go/test/load/pardo | |
sdks/go/test/load/sideinput | |
sdks/go/test/regression | Package regression contains pipeline regression tests. |
sdks/go/test/regression/coders/fromyaml | fromyaml generates a resource file from the standard_coders.yaml file for use in these coder regression tests. |
sdks/go/test/validatesrunner | Package validatesrunner contains Validates Runner tests, which are a type of integration test that execute short pipelines on various runners to validate runner behavior. |
sdks/java/container | boot is the boot code for the Java SDK harness container. |
sdks/python/container | boot is the boot code for the Python SDK harness container. |