# TypeScript Beam SDK
This is the start of a fully functioning JavaScript (actually, TypeScript) SDK. There are two distinct aims with this SDK:
- Tap into the large (and relatively underserved, by existing data processing frameworks) community of JavaScript developers with a native SDK targeting this language.

- Develop a new SDK which can serve as both a proof of concept and a reference that highlights the (relative) ease of porting Beam to new languages, a differentiating feature of Beam and Dataflow.
To accomplish this, we lean heavily on the portability framework. For example, we make heavy use of cross-language transforms, in particular for IOs. In addition, the direct runner is simply an extension of the worker suitable for running on portable runners such as the ULR, which will directly transfer to running on production runners such as Dataflow and Flink. The target audience should hopefully not be put off by running other-language code encapsulated in docker images.
## Getting started
To install and test the TypeScript SDK from source, you will need `npm` and `python`. Other requirements can be installed by `npm` later on. (Note that Python is a requirement, as it is used to orchestrate Beam functionality.)
- First you must clone the Beam repository and go to the `typescript` directory:

  ```
  git clone https://github.com/apache/beam
  cd beam/sdks/typescript/
  ```
- Execute a local install of the necessary packages:

  ```
  npm install
  ```

- Then run `./build.sh` to transpile the TypeScript files into JavaScript files.
## Development workflows

All of the development workflows (build, test, lint, clean, etc.) are defined in `package.json` and can be run with `npm` commands (e.g. `npm run build`).
## Running a pipeline

The `wordcount.ts` file defines a parameterizable pipeline that can be run against different runners. You can run it from the transpiled `.js` file like so:

```
node dist/src/apache_beam/examples/wordcount.js ${PARAMETERS}
```
To run locally:

```
node dist/src/apache_beam/examples/wordcount.js --runner=direct
```

To run against Flink, where the local infrastructure is automatically downloaded and set up:

```
node dist/src/apache_beam/examples/wordcount.js --runner=flink
```

To run on Dataflow:

```
node dist/src/apache_beam/examples/wordcount.js \
    --runner=dataflow \
    --project=${PROJECT_ID} \
    --tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp \
    --region=${REGION}
```
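For orientation, here is a minimal, hypothetical sketch of how such a parameterizable pipeline might select its runner from a `--runner` flag. The import paths, `createRunner`, and `beam.create` are assumptions for illustration, not excerpts from `wordcount.ts`:

```typescript
import * as beam from "apache_beam"; // import path is an assumption
import { createRunner } from "apache_beam/runners/runner"; // assumed path

async function main() {
  // e.g. node wordcount.js --runner=direct
  const runnerName =
    process.argv.find((arg) => arg.startsWith("--runner="))?.split("=")[1] ??
    "direct";

  // A runner is constructed from the parsed options and handed a pipeline,
  // expressed as a function of the Root.
  await createRunner({ runner: runnerName }).run((root: beam.Root) =>
    root
      .apply(beam.create(["a quick brown fox", "a lazy dog"]))
      .flatMap(function* (line: string) {
        yield* line.split(/[^a-z]+/);
      })
  );
}

main().catch(console.error);
```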
## API
We generally try to apply the concepts from the Beam API in a TypeScript idiomatic way, but it should be noted that few of the initial developers have extensive (if any) JavaScript/TypeScript development experience, so feedback is greatly appreciated.
In addition, some notable departures are taken from the traditional SDKs:
- We take a "relational foundations" approach, where schema'd data is the primary way to interact with data, and we generally eschew the key-value-requiring transforms in favor of a more flexible approach of naming fields or expressions. JavaScript's native Object is used as the row type.

- As part of being schema-first, we also de-emphasize Coders as a first-class concept in the SDK, relegating them to an advanced feature used for interop. Though we can infer schemas from individual elements, it is still TBD to figure out if/how we can leverage the type system and/or function introspection to regularly infer schemas at construction time. A fallback coder using BSON encoding is used when we don't have sufficient type information.

- We have added additional methods to the PCollection object, notably `map` and `flatMap`, rather than only allowing `apply`. In addition, `apply` can accept a function argument `(PCollection) => ...` as well as a PTransform subclass, which treats this callable as if it were a PTransform's expand (see the first sketch after this list).

- In the other direction, we have eliminated the problematic Pipeline object from the API, instead providing a `Root` PValue on which pipelines are built, and invoking `run()` on a Runner. We offer a less error-prone `Runner.run`, which finishes only when the pipeline is completely finished, as well as `Runner.runAsync`, which returns a handle to the running pipeline (sketched below).

- Rather than introduce PCollectionTuple, PCollectionList, etc., we let PValue literally be an array or object with PValue values, which transforms can consume or produce. These are applied by wrapping them with the `P` operator, e.g. `P([pc1, pc2, pc3]).apply(new Flatten())` (sketched below).

- Like Python, `flatMap` and `ParDo.process` return multiple elements by yielding them from a generator, rather than invoking a passed-in callback (see the first sketch after this list). TBD how to output to multiple distinct PCollections. There is currently an operation to split a PCollection into multiple PCollections based on the properties of the elements, and we may consider using a callback for side outputs.

- The `map`, `flatMap`, and `ParDo.process` methods take an additional (optional) context argument, which is similar to the keyword arguments used in Python (sketched below). These can be "ordinary" JavaScript objects (which are passed as is) or special DoFnParam objects which provide getters to element-specific information (such as the current timestamp, window, or side input) at runtime.

- Rather than introduce multiple-output complexity into the map/do operations themselves, producing multiple outputs is done by following with a new `Split` primitive that takes a `PCollection<{a?: AType, b: BType, ... }>` and produces an object `{a: PCollection<AType>, b: PCollection<BType>, ...}` (sketched below).

- JavaScript supports (and encourages) an asynchronous programming model, with many libraries requiring use of the async/await paradigm. As there is no way (by design) to go from the asynchronous style back to the synchronous style, this needs to be taken into account when designing the API. We currently offer asynchronous variants of `PValue.apply(...)` (in addition to the synchronous ones, as they are easier to chain), as well as making `Runner.run` asynchronous (sketched below). TBD to do this for all user callbacks as well.
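To make a few of these departures concrete, here is a minimal sketch of the `map`/`flatMap`/`apply` additions, with `flatMap` yielding from a generator as described above. The import path is an assumption and the values are illustrative:

```typescript
import * as beam from "apache_beam"; // import path is an assumption

// A composite expressed as a plain function over a PCollection; passing it
// to apply treats it as if it were a PTransform's expand.
function toWords(lines: beam.PCollection<string>) {
  return (
    lines
      .map((line: string) => line.toLowerCase())
      // Multiple outputs per element are yielded from a generator.
      .flatMap(function* (line: string) {
        yield* line.split(/[^a-z]+/);
      })
      // Rows are plain JavaScript Objects.
      .map((word: string) => ({ word }))
  );
}

declare const lines: beam.PCollection<string>;
const words = lines.apply(toWords); // equivalent to toWords(lines)
```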
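A sketch of pipeline construction without a Pipeline object: everything hangs off a `Root`, and execution goes through a Runner. `createRunner` and its options shape are assumptions:

```typescript
import * as beam from "apache_beam"; // assumed import path
import { createRunner } from "apache_beam/runners/runner"; // assumed path

async function main() {
  const runner = createRunner({ runner: "direct" }); // assumed options shape

  // Runner.run resolves only once the pipeline has completely finished.
  await runner.run((root: beam.Root) =>
    root.apply(beam.create([1, 2, 3])).map((x: number) => x * x)
  );

  // Runner.runAsync would instead return a handle to the running pipeline.
}

main().catch(console.error);
```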
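The `P` wrapper in context; the top-level exports of `P` and `Flatten` are assumptions, and `pc1` through `pc3` stand in for existing PCollections of a common type:

```typescript
import * as beam from "apache_beam"; // assumed import path and exports
import { P, Flatten } from "apache_beam";

declare const pc1: beam.PCollection<string>;
declare const pc2: beam.PCollection<string>;
declare const pc3: beam.PCollection<string>;

// P wraps an array (or object) of PValues so that a transform consuming
// multiple PCollections, such as Flatten, can be applied to it.
const merged = P([pc1, pc2, pc3]).apply(new Flatten());
```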
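A sketch of the optional context argument to `map`; an ordinary object is passed through to the callback unchanged, while DoFnParam entries (not shown here) would resolve per element at runtime:

```typescript
import * as beam from "apache_beam"; // assumed import path

declare const words: beam.PCollection<string>;

// The context object supplied at construction time is handed to the
// callback alongside each element, much like Python keyword arguments.
const prefixed = words.map(
  (word: string, context: { prefix: string }) => context.prefix + word,
  { prefix: ">> " }
);
```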
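A sketch of the `Split` primitive; its export and exact constructor arguments are assumptions, but the input/output shapes follow the description above:

```typescript
import * as beam from "apache_beam"; // assumed import path
import { Split } from "apache_beam"; // assumed export; constructor args illustrative

declare const input: beam.PCollection<number>;

// Each element sets exactly one of the declared fields; Split then fans
// the single PCollection out into one PCollection per field.
const { small, large } = input
  .map((x: number) => (x < 10 ? { small: x } : { large: x }))
  .apply(new Split(["small", "large"]));
// small: PCollection<number>, large: PCollection<number>
```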
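Finally, a sketch of the asynchronous variants; `applyAsync` and `AsyncPTransform` are assumed spellings for the awaitable counterpart of `apply`:

```typescript
import * as beam from "apache_beam"; // assumed import path

declare const root: beam.Root;
declare const anAsyncTransform: beam.AsyncPTransform<
  beam.Root,
  beam.PCollection<string>
>; // assumed type name

async function build() {
  // The async variant of apply awaits transforms whose expansion is itself
  // asynchronous; the synchronous variant remains available for chaining.
  const pcoll = await root.applyAsync(anAsyncTransform);
  return pcoll;
}
```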
An example pipeline can be found at https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts
## TODO

This SDK is a work in progress. In January 2022 we developed the ability to construct and run basic pipelines (including external transforms and running on a portable runner), but the following big-ticket items remain.

- Containerization
  - Actually use worker threads for multiple bundles.
  - Build and test containers with gradle tasks.

- API
  - There are several TODOs of minor features or design decisions to finalize.
    - Consider using (or supporting) 2-arrays rather than `{key, value}` objects for KVs.
    - Force the second argument of map/flatMap to be an Object, which would lead to a less confusing API (vs. `Array.map`) and clean up the implementation. Also add a [do]Filter, and possibly a [do]Reduce?
    - Move away from using classes.
  - Advanced features like metrics, state, timers, and SDF. Possibly some of these can wait.

- Other
  - Enforce unique names for pipeline update.
  - Clean up uses of `var` and `this`; use arrow functions.
  - Avoid `any` return types (and re-enable the check in the compiler).
  - Relative vs. absolute imports, possibly via setting a base URL with a `jsconfig.json`.
  - More/better tests, including tests of illegal/unsupported use.
  - Set channel options like `grpc.max_{send,receive}_message_length` as we do in other SDKs.
  - Reduce use of `any`.
    - Could use `unknown` in its place where the type is truly unknown.
    - It'd be nice to enforce, maybe re-enable, `noImplicitAny: true` in tsconfig if we can get the generated proto files to be ignored.
  - Enable a linter like eslint and fix at least the low-hanging fruit.

There is probably more; there are many TODOs littered throughout the code.
## Development

### Getting started

Install node.js, and then, from within `sdks/typescript`, run:

```
npm install
```

### Running tests

```
npm test
```

### Style

We have adopted prettier, which can be run with:

```
npx prettier --write .
```