dmk

command module
v0.0.4
Published: Jan 17, 2017 License: GPL-3.0 Imports: 14 Imported by: 0

README

dmk

This is a simplified, automated build tool for data projects.

The idea behind dmk is to support build files that are easy to read and write, and to support automating a build system for data artifacts. make, scons, and fac all provided inspiration for dmk.

Who should use this?

This is a tool for data flows and simple projects. Often these projects involve one-time steps for getting the data used for analysis. Often the user manually places that data in the directory after downloading it from Amazon S3 or a research server or whatever. Scripts or programs run in a pipeline fashion: handling cleaning, transformation, analysis, model building, figure production, presentation building, etc.

Pipelines like this are often written in Python or R, partially automated with shell scripts, and then tied together with a Makefile (or a SConstruct file if you're an scons fan).

Protip: if you're looking for a command to handle building reports from .tex files (including handling metapost and biblatex), look into rubber.

What is this NOT for?

This is not meant to replace a real automated build tool for a software project. As a general rule:

  • If you're building Go software, use the Go tools (and optionally make)
  • If you're building Java/Scala, use sbt, gradle, mvn, ant, etc.
  • If you're building .NET, erm, I'm not sure
  • There are great tools like scons that understand how to build lots of artifacts (including LaTeX docs)
  • If you're not sure, at least understand why you wouldn't use make

For instance, this project (written in Go) is actually built with make + the standard Go tools.

Using

When running dmk, you may specify -h on the command line for information on command line parameters. Specify -v for "verbose" mode to get output you may want for debugging or understanding what is going on.

For each command in a pipeline, you need to supply:

  • A name
  • The actual command to run (in the shell)
  • The inputs required
  • The outputs generated
  • (Optionally) a list of intermediate files that the clean process can delete (see below for cleaning)
  • (Optionally) a flag indicating that the step must run only when it is explicitly requested

The file is generally named Pipeline or pipeline.yaml. If you do not specify a pipeline file with the -f command line parameter, dmk looks for the following names in the current directory (in order):

  • Pipeline
  • Pipeline.yaml
  • pipeline
  • pipeline.yaml
  • .Pipeline.Pipeline.yaml
  • .pipeline.pipeline.yaml

When you supply a pipeline file with -f and that file is in a different directory, dmk changes to that directory before parsing the file.
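The lookup order above can be sketched in a few lines of Python. This is only an illustration of "first existing name wins", not dmk's actual code: the function name find_pipeline_file is hypothetical, and the candidate names are copied verbatim from the list above.

```python
import os

# Candidate names, copied verbatim from the list above, in lookup order.
CANDIDATE_NAMES = [
    "Pipeline",
    "Pipeline.yaml",
    "pipeline",
    "pipeline.yaml",
    ".Pipeline.Pipeline.yaml",
    ".pipeline.pipeline.yaml",
]


def find_pipeline_file(directory="."):
    """Return the path of the first candidate that exists, or None."""
    for name in CANDIDATE_NAMES:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None
```

On a case-insensitive file system, Pipeline and pipeline refer to the same file, so the earlier spelling in the list is the one that matches.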

All build steps run in parallel, but each step waits until other steps build its dependencies. A single build step executes the following steps:

  1. The step is "Started"
  2. If any of the required inputs are another step's outputs, then wait for a built message.
  3. Check to see if any outputs are missing or older than any of the inputs. If not, the step is already up to date and is "Completed"!
  4. If not done, set status to "Executing" and run the command.
  5. If the command returns an error code or if any outputs are missing or older than any inputs, the step is "Failed".
  6. If not done, send notification messages for each output for any waiting steps.
  7. The step is now "Completed"
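The up-to-date check in steps 3 and 5 can be sketched as follows. This is a minimal illustration that assumes "older" means a comparison of file modification times; how dmk actually compares files is not stated here. The name needs_run is hypothetical, glob patterns are assumed to be expanded already, and the handling of a step with no outputs is an assumption.

```python
import os


def needs_run(inputs, outputs):
    """Return True if any output is missing or older than any input."""
    if not outputs:
        # Assumption: with no outputs there is nothing to prove up to date.
        return True
    if any(not os.path.exists(out) for out in outputs):
        return True
    if not inputs:
        # No inputs: the step runs only when an output is missing.
        return False
    # Inputs must exist at this point; getmtime raises if one is missing.
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input
```

Comparing the oldest output with the newest input is equivalent to asking whether any output is older than any input.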

The outputs for a step must be unique to that step: you can't have two steps both list foo.data as an output.
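The uniqueness rule is what lets a tool decide which step to wait for. Here is a small sketch, assuming the pipeline has already been loaded into a dict of steps; the function names are hypothetical and this is not dmk's implementation.

```python
def output_producers(steps):
    """Map each output file to the step that produces it; reject duplicates."""
    producers = {}
    for name, step in steps.items():
        for out in step.get("outputs", []):
            if out in producers:
                raise ValueError(
                    f"{out!r} is an output of both {producers[out]!r} and {name!r}"
                )
            producers[out] = name
    return producers


def step_dependencies(steps):
    """Return {step name: set of steps whose outputs it lists as inputs}."""
    producers = output_producers(steps)
    return {
        name: {producers[i] for i in step.get("inputs", []) if i in producers}
        for name, step in steps.items()
    }


steps = {
    "step1": {"inputs": ["i1.txt"], "outputs": ["o3.txt"]},
    "step2": {"outputs": ["output.bin"]},
    "depstep": {
        "inputs": ["o3.txt", "output.bin"],
        "outputs": ["combination.output"],
    },
}
print(sorted(step_dependencies(steps)["depstep"]))  # ['step1', 'step2']
```

Inputs that no step produces (such as i1.txt) are plain files that must already exist, so they add no dependency.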

dmk provides an automatic "clean" mode: specify -c on the command line and dmk deletes all the outputs for all steps. If you have files to clean that are not specified as outputs, you can specify them in the clean list for a build step (see the Pipeline file format below). Good candidates for the clean list are intermediate files (such as logs) that are generated as part of the build but are not dependencies and should not determine whether a build step is up to date.
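As a rough sketch of what a clean pass has to compute, the following collects every output plus every file matched by a clean entry. The function names are hypothetical, the steps are assumed to be loaded into a dict already, and outputs are treated as literal paths here (an assumption; only clean entries are glob-expanded).

```python
import glob
import os


def clean_targets(steps, base_dir="."):
    """Collect the files a clean run would delete: outputs plus clean entries."""
    targets = set()
    for step in steps.values():
        # Assumption: outputs are literal paths, not patterns.
        for out in step.get("outputs", []):
            targets.add(os.path.join(base_dir, out))
        # Clean entries may be glob patterns such as need-cleaning.*
        for pattern in step.get("clean", []):
            targets.update(glob.glob(os.path.join(base_dir, pattern)))
    return sorted(targets)


def clean(steps, base_dir="."):
    """Delete every clean target that currently exists as a file."""
    for path in clean_targets(steps, base_dir):
        if os.path.isfile(path):
            os.remove(path)
```

Files that are neither outputs nor matched by a clean entry are left alone, which is why logs and other by-products need to be listed explicitly.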

Pipeline file format

The file is in YAML format where each build step is a named hash. Each build step should specify:

  • command - The command to run as part of the build. dmk uses bash to run the command, so it can rely on bash shell niceties (like using ~ for the home directory)
  • inputs - a list of inputs needed for the build. These are also the dependencies that must exist before the step can run. An entry can be a glob pattern (like *.txt)
  • outputs - a list of outputs generated by the step. Outputs decide if the step must run, and the clean phase deletes them. Glob patterns are ignored for outputs.
  • clean - A list of files to clean. These and outputs are the files deleted during a clean. You may use glob patterns for these.
  • explicit - Optional, defaults to false. If set to true, the step runs only if you specify it on the command line or if it is required by a step that you specified on the command line.

The res subdirectory contains sample Pipeline files (used for testing), but a quick example would look like:

# You can have comments in a file
step1:                                # first step
    command: "xformxyz i{1,2,3}.txt"  # command with some shell magic
    inputs:                           # 3 inputs (read by our imaginary command)
        - i1.txt                  
        - i2.txt
        - i3.txt
    outputs:                          # 3 outputs
        - o1.txt
        - o2.txt
        - o3.txt
    clean: [a.aux, b.log]             # two extra clean targets, specified in
                                      # an alternate syntax for YAML lists

step2:                                # second step
    command: cmd1xyz                  # note the lack of inputs - this means
    outputs:                          # the step can run immediately, but will
        - output.bin                  # ONLY run if an output is missing

depstep:                              # third/final step: it won't run until the
    command: cmd2xyz                  # previous steps are finished because their
    inputs:                           # outputs are specified in this step's
        - o3.txt                      # inputs.
        - output.bin
    outputs:
        - combination.output
    clean:
        - need-cleaning.*             # An example of using a glob pattern

extrastep:
    command: special-command
    inputs:
        - some-script-file.txt
    outputs:
        - my-special-file.extra
    explicit: true                    # Only run if specified on command line

Building

godep manages dependencies in the vendor directory. You shouldn't need to worry about this if you are building with the Makefile. Also note the fact that we use make to build dmk. We are serious about using the correct build tool for the job.

In addition to go, godep, and make, you should also have Python 3 installed (for script/update and for the test script res/slow).

Build step environment

When a build step is executed, a number of environment variables are set that may be used by the step:

  • DMK_VERSION - version string for dmk
  • DMK_PIPELINE - absolute path to the pipeline file being executed
  • DMK_STEPNAME - the name of the current step
  • DMK_INPUTS - a colon (":") delimited list of inputs for this step
  • DMK_OUTPUTS - a colon (":") delimited list of outputs for this step
  • DMK_CLEAN - a colon (":") delimited list of extra clean files for this step
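A step written in Python could read these variables as sketched below. The helper names are hypothetical, and what dmk sets for an empty list is an assumption; this sketch treats an unset or empty variable as an empty list.

```python
import os


def split_list(value):
    """Split a colon-delimited list; an unset or empty value gives []."""
    return value.split(":") if value else []


def step_context(environ=os.environ):
    """Gather the DMK_* variables into a dict for use by a step script."""
    return {
        "version": environ.get("DMK_VERSION", ""),
        "pipeline": environ.get("DMK_PIPELINE", ""),
        "step": environ.get("DMK_STEPNAME", ""),
        "inputs": split_list(environ.get("DMK_INPUTS", "")),
        "outputs": split_list(environ.get("DMK_OUTPUTS", "")),
        "clean": split_list(environ.get("DMK_CLEAN", "")),
    }
```

Because the lists are colon-delimited, a file name that itself contains a colon cannot be recovered by a simple split.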

Some helpful hints to remember

A pipeline file is just a YAML document, and a JSON document is valid YAML. For instance, res/slowbuild.yaml and res/slowbuild.json are equivalent pipeline files. If you need a customized build, you can generate the pipeline file in the language of your choice in JSON or YAML and then call dmk.
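For example, a short Python script can generate a pipeline as JSON. This is a sketch only: the step names, commands, and file names are made up for illustration, while the keys (command, inputs, outputs, clean, explicit) are the ones described in the Pipeline file format section.

```python
import json


def make_pipeline(years):
    """Build one cleaning step per year plus a final report step."""
    steps = {}
    for year in years:
        steps[f"clean_{year}"] = {
            # Hypothetical command and file names.
            "command": f"python clean.py raw_{year}.csv clean_{year}.csv",
            "inputs": [f"raw_{year}.csv", "clean.py"],
            "outputs": [f"clean_{year}.csv"],
        }
    steps["report"] = {
        "command": "python report.py",
        "inputs": [f"clean_{y}.csv" for y in years] + ["report.py"],
        "outputs": ["report.html"],
        "clean": ["report.log"],
        "explicit": False,
    }
    return steps


# Print the pipeline as JSON, which is also valid YAML.
print(json.dumps(make_pipeline([2015, 2016]), indent=2))
```

Redirect the output to a file (for instance pipeline.json) and pass that file to dmk with -f.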

Commands are executed in a new bash shell (which also means you need bash).

File names are assumed to be relative to the Pipeline file. The current working directory is changed to the same directory as the Pipeline file before anything is done.

You may use glob patterns in the inputs and clean lists, but not in outputs.
