Overview
This project intends to develop and maintain a command-line (CLI) utility in Go
to help deploy data engineering pipelines on the Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by
some companies, they speak on their personal behalf and do not represent
these companies.
References
AWS SDK for Go
Getting started
- Install the dppctl utility, replacing vx.y.z with the desired release tag:
$ go get github.com/data-engineering-helpers/dppctl@vx.y.z
- Clone and edit the YAML deployment specification. For instance,
for a deployment on AWS cloud:
$ cp depl/aws-dev-sample.yaml depl/aws-dev.yaml
$ vi depl/aws-dev.yaml
- Check the version of the dppctl utility:
$ dppctl -v
[dppctl] 0.0.x-alpha.x
- Launch the dppctl utility in checking mode (which is the default one):
$ dppctl -f depl/aws-dev.yaml
- Launch the dppctl utility in deployment mode:
$ dppctl -f depl/aws-dev.yaml -c deploy
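The command-line options used in the steps above (-f, -c and -v) could be handled along the following lines with the standard flag package. This is only a minimal sketch, not the actual dppctl code: the options type, the parseArgs function and the name "check" for the default mode are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds the command-line settings shown in the steps above.
// The type and its field names are hypothetical.
type options struct {
	specFile    string // -f: path to the YAML deployment specification
	command     string // -c: "check" (assumed name of the default mode) or "deploy"
	showVersion bool   // -v: print the version and exit
}

// parseArgs parses the arguments (without the program name) and
// validates them. Errors are returned rather than printed.
func parseArgs(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("dppctl", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the flag package from printing by itself
	fs.StringVar(&opts.specFile, "f", "", "YAML deployment specification")
	fs.StringVar(&opts.command, "c", "check", "command: check or deploy")
	fs.BoolVar(&opts.showVersion, "v", false, "print the version")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if opts.showVersion {
		// No other option is required when only the version is requested.
		return opts, nil
	}
	if opts.specFile == "" {
		return opts, errors.New("missing -f <deployment specification>")
	}
	if opts.command != "check" && opts.command != "deploy" {
		return opts, fmt.Errorf("unknown command %q", opts.command)
	}
	return opts, nil
}

func main() {
	// The invocations from the steps above.
	examples := [][]string{
		{"-v"},
		{"-f", "depl/aws-dev.yaml"},
		{"-f", "depl/aws-dev.yaml", "-c", "deploy"},
	}
	for _, args := range examples {
		opts, err := parseArgs(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%v -> %+v\n", args, opts)
	}
}
```

Defaulting -c to the checking mode means that a deployment only happens when it is requested explicitly.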
Publish the module
- Recompute the dependencies:
$ go mod tidy
- Check that the tests pass:
$ go test
- Commit and push the release, then tag it and push the tag:
$ git commit -m "[Release] v0.0.x-alpha.x"
$ git push
$ git tag -a v0.0.x-alpha.x -m "[Release] v0.0.x-alpha.x"
$ git push --tags
- Make the new version available through the Go module proxy:
$ GOPROXY=proxy.golang.org go list -m github.com/data-engineering-helpers/dppctl@v0.0.x-alpha.x
github.com/data-engineering-helpers/dppctl v0.0.x-alpha.x
Troubleshooting
AWS Airflow (MWAA)
As of early 2023, apparently for security reasons, it does not seem
possible to call the Airflow API directly on the AWS managed service (MWAA).
One has to use the API backend of the MWAA CLI instead. That is why the Go code
of the corresponding AWSAirflowCLI() function is not straightforward.
Note that using the MWAA CLI API (through curl) is itself convoluted,
as detailed below.
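To give an idea of the indirection involved, here is a minimal Go sketch of a call to the MWAA CLI endpoint. It is not the AWSAirflowCLI() function of this project: the names newCLIRequest, decodeStdout and cliResponse are hypothetical. It assumes that the WEB_SERVER_HOSTNAME and CLI_TOKEN environment variables are set as in the walk-through of this section, and that the reply is a JSON document whose stdout field is Base64-encoded, as the curl pipeline in this section suggests.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// cliResponse mirrors the only field of the reply used in this document.
type cliResponse struct {
	Stdout string `json:"stdout"` // Base64-encoded output of the Airflow command
}

// newCLIRequest builds the POST request wrapping an Airflow CLI command,
// with the same URL and headers as the curl invocation.
func newCLIRequest(hostname, token, command string) (*http.Request, error) {
	url := "https://" + hostname + "/aws_mwaa/cli"
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(command))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "text/plain")
	return req, nil
}

// decodeStdout extracts and Base64-decodes the "stdout" field of the reply.
func decodeStdout(body []byte) (string, error) {
	var resp cliResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("parsing reply: %w", err)
	}
	out, err := base64.StdEncoding.DecodeString(resp.Stdout)
	if err != nil {
		return "", fmt.Errorf("decoding stdout: %w", err)
	}
	return string(out), nil
}

// run sends "dags list -o json" and prints the decoded output.
func run(hostname, token string) error {
	req, err := newCLIRequest(hostname, token, "dags list -o json")
	if err != nil {
		return err
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return err
	}
	out, err := decodeStdout(body)
	if err != nil {
		return err
	}
	fmt.Print(out)
	return nil
}

func main() {
	hostname, token := os.Getenv("WEB_SERVER_HOSTNAME"), os.Getenv("CLI_TOKEN")
	if hostname == "" || token == "" {
		fmt.Println("set WEB_SERVER_HOSTNAME and CLI_TOKEN first")
		return
	}
	if err := run(hostname, token); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

Keeping the request construction and the decoding in separate functions allows both to be tested without network access.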
Listing the DAGs
- Declare the environment variables used by the commands below:
$ export MWAA_ENV="<the-MWAA-environment-name>"
export AWS_REGION="eu-west-1"
export CLI_TOKEN
export WEB_SERVER_HOSTNAME
- Create a CLI (command-line) token:
$ aws mwaa --region $AWS_REGION create-cli-token --name $MWAA_ENV
{
"CliToken": "someToken",
"WebServerHostname": "<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
}
- Copy/paste the web server hostname and the CLI token and save them
as environment variables:
$ CLI_TOKEN="someToken"
WEB_SERVER_HOSTNAME="<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
- Note that the CLI token is very short-lived (it can be used only once or twice),
so the two operations (aws mwaa create-cli-token and CLI_TOKEN="someToken")
must be repeated every time before running the following commands.
- Invoke an Airflow command through the API wrapping the MWAA CLI:
- Raw (not formatted) output:
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json"|jq -r ".stdout" | base64 -d
...
[{"dag_id": "dag_name", "filepath": "prefix/script.py", "owner": "airflow", "paused": "True"}, {"dag_id": ...}, ...]
- CSV-formatted output (list of DAGs):
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json"|jq -r ".stdout" | base64 -d | grep "^\[{\"dag_id\"" | jq -r ".[]|[.dag_id,.filepath,.owner,.paused]|@csv" | sed -e s/\"//g
...
...
dag_name,prefix/script.py,airflow,True
...
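The grep, jq and sed post-processing above can also be sketched in Go. The following is a minimal illustration under stated assumptions, not the code of this project: the dag type and the parseDAGs and toCSV functions are hypothetical names, the sample input in main (including its first log line) is made up after the output shown above, and the fields are assumed to be JSON strings as in that output.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"strings"
)

// dag holds the four fields selected by the jq filter above.
type dag struct {
	ID       string `json:"dag_id"`
	Filepath string `json:"filepath"`
	Owner    string `json:"owner"`
	Paused   string `json:"paused"`
}

// parseDAGs looks for the line holding the JSON list of DAGs (the
// equivalent of the grep step) and parses it, skipping other lines.
func parseDAGs(output string) ([]dag, error) {
	sc := bufio.NewScanner(strings.NewReader(output))
	// The JSON list is a single line that may exceed the default 64 KiB limit.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, `[{"dag_id"`) {
			continue
		}
		var dags []dag
		if err := json.Unmarshal([]byte(line), &dags); err != nil {
			return nil, err
		}
		return dags, nil
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return nil, errors.New("no DAG list found in the output")
}

// toCSV renders one line per DAG: dag_id,filepath,owner,paused.
func toCSV(dags []dag) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	for _, d := range dags {
		if err := w.Write([]string{d.ID, d.Filepath, d.Owner, d.Paused}); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	// Made-up sample: one log line followed by the JSON list of DAGs.
	sample := "some log line\n" +
		`[{"dag_id": "dag_name", "filepath": "prefix/script.py", "owner": "airflow", "paused": "True"}]` + "\n"
	dags, err := parseDAGs(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	out, err := toCSV(dags)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Print(out) // prints: dag_name,prefix/script.py,airflow,True
}
```

Unlike the sed step, which removes every double quote, encoding/csv only quotes the fields that need it (for instance a file path containing a comma).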