preprocessd

Simple example showing how to use Cloud Run to pre-process events before persisting them to the backing store (e.g. BigQuery). This is a common use-case where raw data (e.g. submitted through a REST API) needs to be pre-processed (e.g. decorated with additional attributes, classified, or simply validated) before saving.

Cloud Run is a great platform for building these kinds of ingestion or pre-processing services:

  • Write each one of the pre-processing steps in the most appropriate (or favorite) development language
  • Bring your own runtime (or even specific version of that runtime) along with custom libraries
  • Dynamically scale up and down with your PubSub event load
  • Scale to 0, and don't pay anything, when there is nothing to process
  • Use granular access control with service account and policy bindings

Event Source

In this example we will use synthetic events on a PubSub topic generated by the pubsub-event-maker utility. We will use it to mock utilization data from 3 devices and publish it to Cloud PubSub on the eventmaker topic in your project. The PubSub payload looks something like this:

{
    "source_id": "device-1",
    "event_id": "eid-b6569857-232c-4e6f-bd51-cda4e81f3e1f",
    "event_ts": "2019-06-05T11:39:50.403778Z",
    "label": "utilization",
    "mem_used": 34.47265625,
    "cpu_used": 6.5,
    "load_1": 1.55,
    "load_5": 2.25,
    "load_15": 2.49,
    "random_metric": 94.05090880450125
}

The instructions on how to configure pubsub-event-maker to start sending these events are here.
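
If you want to sanity-check the topic before wiring up the generator, you can also publish a single hand-crafted event yourself. This is just a sketch, assuming the eventmaker topic already exists in your current gcloud project:

# publish one test event to the eventmaker topic
gcloud pubsub topics publish eventmaker \
    --message '{"source_id":"device-1","label":"utilization","cpu_used":6.5,"mem_used":34.5}'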

Prerequisites

GCP Project and gcloud SDK

If you don't have a project already, start by creating a new one and configuring the Google Cloud SDK. Similarly, if you have not done so already, you will have to set up Cloud Run.
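
If you are starting from scratch, the initial setup might look roughly like this (PROJECT_ID is a placeholder, and the exact set of APIs you need to enable may differ):

# point gcloud at your project and enable the services used in this example
gcloud config set project PROJECT_ID
gcloud services enable run.googleapis.com cloudbuild.googleapis.com pubsub.googleapis.com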

Setup

Build Container Image

Cloud Run runs container images. To build one, we are going to use the included Dockerfile and submit the build job to Cloud Build using the bin/image script.

Note: you should review each of the provided scripts for the complete content of these commands.

bin/image

If this is the first time you are using the build service, you may be prompted to enable the build API.
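
The bin/image script itself is the source of truth; a roughly equivalent Cloud Build submission (the preprocessd image name and gcr.io path are assumptions) would be:

# build the image from the current directory and push it to the project registry
gcloud builds submit \
    --project PROJECT_ID \
    --tag gcr.io/PROJECT_ID/preprocessd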

Service Account and IAM Policies

In this example we are going to follow the principle of least privilege (POLP) to ensure our Cloud Run service has only the necessary rights and nothing more:

  • run.invoker - required to execute Cloud Run service
  • pubsub.editor - required to create and publish to Cloud PubSub
  • logging.logWriter - required for Stackdriver logging
  • cloudtrace.agent - required for Stackdriver tracing
  • monitoring.metricWriter - required to write custom metrics to Stackdriver

To do that, we will create a GCP service account and assign the necessary IAM policies and roles using the bin/account script:

bin/account
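
For reference, the kind of commands such a script runs looks roughly like this (the preprocessd-sa account name and PROJECT_ID are placeholders; the roles match the list above):

# create the service account the Cloud Run service will run as
gcloud iam service-accounts create preprocessd-sa \
    --display-name "preprocessd service account"

# bind each of the roles listed above to the new account
for role in run.invoker pubsub.editor logging.logWriter cloudtrace.agent monitoring.metricWriter; do
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member "serviceAccount:preprocessd-sa@PROJECT_ID.iam.gserviceaccount.com" \
        --role "roles/${role}"
done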

Cloud Run Service

Once you have configured the GCP service account, you can deploy a new Cloud Run service, set it to run under that account, and prevent unauthenticated access using the bin/service script:

bin/service
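
The bin/service script is the authoritative version; a minimal equivalent deployment, assuming the image and service account names from the previous steps and the us-central1 region, might look like:

# deploy the image under the dedicated service account, with no public access
gcloud run deploy preprocessd \
    --image gcr.io/PROJECT_ID/preprocessd \
    --service-account preprocessd-sa@PROJECT_ID.iam.gserviceaccount.com \
    --no-allow-unauthenticated \
    --platform managed \
    --region us-central1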

PubSub Subscription

To enable PubSub to send topic data to the Cloud Run service, we will need to create a PubSub topic subscription and configure it to "push" events to the Cloud Run service we deployed above.

bin/pubsub
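
A sketch of what the subscription setup amounts to, assuming the eventmaker topic, the service account created earlier, and SERVICE_URL as a placeholder for the URL printed by the deploy step:

# push subscription that delivers topic events to the Cloud Run endpoint,
# authenticating the push requests as the service account
gcloud pubsub subscriptions create eventmaker-push \
    --topic eventmaker \
    --push-endpoint "SERVICE_URL/" \
    --push-auth-service-account preprocessd-sa@PROJECT_ID.iam.gserviceaccount.com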

Log

You can see the raw data and all the application log entries made by the service in the Cloud Run service logs.

Cloud Run Log
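
If you prefer the command line, a roughly equivalent query (assuming the service is named preprocessd) is:

# read recent log entries for the Cloud Run revision
gcloud logging read \
    'resource.type="cloud_run_revision" AND resource.labels.service_name="preprocessd"' \
    --limit 20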

Saving Results

The process of saving the resulting data from this service will depend on your target (the place where you want to save the data). GCP has a number of existing connectors and templates, so in most cases you do not even have to write any code. Here is an example of a Dataflow template that streams PubSub topic data to BigQuery:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --parameters \
inputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME,\
outputTableSpec=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME

This approach automatically deals with back-pressure, retries, and monitoring, and is not subject to the batch insert quota limits.

Cleanup

To clean up all resources created by this sample, execute the bin/cleanup script:

bin/cleanup
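
The script is the definitive list; the teardown amounts to deleting the resources created above, roughly (using the placeholder names from earlier):

# delete the subscription, the service, and the service account
gcloud pubsub subscriptions delete eventmaker-push
gcloud run services delete preprocessd --platform managed --region us-central1
gcloud iam service-accounts delete preprocessd-sa@PROJECT_ID.iam.gserviceaccount.com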

Disclaimer

This is my personal project and it does not represent my employer. I take no responsibility for issues caused by this code. I do my best to ensure that everything works, but if something goes wrong, my apologies is all you will get.
