Do you have a success story? Please share it!
Send a pull-request on github describing your project, how doit is used, and why doit was chosen.
by Christopher Dembia (2014-04-03)
I am a graduate student in a biomechanics lab that studies how humans coordinate their muscles in movements like walking or running. We record someone’s motion using reflective motion capture markers and by recording the force their feet exert on the ground. We put this data into our software, which gives back an estimate of the force each muscle (92 muscles) was generating throughout the observed motion. In a typical study, we record about 100 walking motions. To analyze a single walking motion, we need to run 4 different executables in sequence. Each executable requires a handful of input files, and generates a handful of output files that a subsequent executable uses as input.
So, a study entails about 1000 files, some of which contain raw data, but most of which are intermediate files (output of one executable and input to another executable). Typically, a researcher manages this workflow manually. However, that is prone to error, as a researcher may forget to properly modify all relevant files if an error is noticed in, for example, a raw data file.
With doit, I am automating this workflow for my current study. This allows me to avoid errors and avoid unnecessary duplication of files. Most importantly, if I learn that I must modify something in a file that is an input toward the beginning of this workflow, doit will allow me to automatically update all my results without missing a step.
I tried to do this with Make first. Make just was not made to do what I want. Also, my lab’s software has python bindings, so my entire workflow can be in python. Also, the ability to script anything directly into the workflow is important, and Make can’t do that. CMake was another option, but that’s not general enough. doit is just completely generic, and the interface is simple yet very flexible.
Computational Metagenomics Lab / University of Trento, Italy¶
by Nicola Segata (2015-01-20)
My laboratory of Computational Metagenomics at University of Trento studies the human microbiome, i.e. the huge microbial diversity populating our body.
Our analyses involve processing several thousands of microbial genomes (long sequences of DNA) with a series of computational steps (mostly written in python) on subset of those genomes. Genomes are organized in a hierarchical structure (the taxonomy) and many steps of our pipelines need to take into account these dependencies.
doit is just the right way to do this. We actually even tried to implement something like this on our own, but we are now switching to doit because it has all the features we need and it is intuitive to use. We particularly love the possibility of specifying dependencies “on the fly”.
We are now thinking to convert all our pipelines to the doit format. Given the importance of doit for our research and its potential for bioinformatic pipelines we are happy to support this project scientifically (by citing it in our papers, mentioning it in our funding requests, etc).
Thanks for developing doit, it’s just wonderful for computational biology (and for many other tasks, of course, but this is our research field:)!
by Jon Sorenson (2018-12-14)
I came to doit from the loose requirement that I wanted a task-dependency/DAG type of workflow solution for the various pipelines that our team is constructing. I come from computational biology originally, and the similarity of data processing pipelines to build systems has long been appreciated. More than a decade ago many of us were writing bioinformatics pipelines in make because it gave us so many features “for free.”
Starting afresh at Atomwise I did a survey of what DAG-based workflow execution frameworks were out there—restricting my search to python, active maintenance, good documentation etc. Besides doit I evaluated bonobo, Luigi, and airflow. bonobo didn’t fit my needs for dependency-based processing. Luigi and airflow are very nice, but I didn’t have any particular need for distributed workflows and the heavier weight feel of these platforms. My favorite experience was airflow but it didn’t have (obvious) support for re-entrant processing: running a pipeline from an intermediate stage.
I knew that build-system based frameworks would do exactly what I wanted and not hand me too much cruft, and on that note I found doit. It’s worked perfectly for my needs:
The order of tasks is derived from dependencies
Tasks can be executed in parallel if they don’t depend on each other
Plenty of bells and whistles to relieve the pipeline developer from writing boilerplate code (easy task documentation, discovery of the pipeline state, easy process for re-doing tasks)
Very light-weight. In this sense doit is the perfect example of a UNIX-style tool: do one thing and do it well. Luigi and airflow are attractive, but they also suffer from kitchen-sink bloat. If you simply want a pythonic alternative to Makefiles or bash scripts, doit is great solution.
It’s easy to build up a library of common tasks that can be reused by multiple doit pipelines.
Data-Driven Modelling of Robots / McGill University, Canada¶
by Steven Dahdah (2022-09-06)
The DECAR Systems Group at McGill University
conducts fundamental and applied research on state estimation, guidance, and
control. We use
doit to improve the reproducibility of our research.
One of our recent projects involves modelling a soft robot arm from experimental data using the Koopman operator. Our pipeline consists of many small tasks like preprocessing data, profiling algorithms, and generating plots, along with a few large regression problems that must be run overnight.
Organizing this pipeline was quite a challenge. First we tried
worked well in the beginning, but fell apart when we had to write our own
macros to generate rules. It was also becoming a burden to serialize and
deserialize tons of intermediate files, as well as to maintain compatibility
with Windows. As others here have said, just because you can make it work
doesn’t mean that you should. We considered using
CMake, but ultimately
decided it was the wrong tool for the job.
Snakemake was also an option,
but the non-Python dependencies did not sit well with us.
This is when we discovered
doit, which was a real life-saver. Instead of
having a convoluted plotting script with a ton of command line arguments, we
just placed our plotting functions directly in
dodo.py. Same with the
preprocessing functions. We no longer needed to generate as many temporary
files, and we no longer had to worry about Windows compatibility. Generating
multiple tasks was as simple as a
for loop and a
Now, all you have to do is clone our repo, install the dependencies, and run
doit to get all the plots, exactly as they appear in our paper!
by Mike Pagel (2019-02-06)
We are responsible for the development of the next generation instrument
cluster software at BMW. While we use
CMake for the actual build of libraries
and applications, we have learned in the past that you can do almost everything
CMake, but probably you shouldn’t.
CMake is optimized for all tasks around compiler toolchain control, but the
language is somewhat special and functions and macros cannot easily be tested
outside of a real build. This is where
doit enters the stage: We use it for
everything but compiling software, as a high level command line interface for
the development teams (and the CI systems). These are some of the tasks we
Downloading and installing tools.
CMakefor multiple compiler toolchains.
Driving various code analysis tools.
Packaging the software for later deployment to the car etc.
Checking if dependencies of the toolchain are outdated and creating automatic pull requests.
Basically we implemented our complete high-level build control in
resulting framework is now used by us and our suppliers and supports a team
over 100 developers. Since
doit is written in Python, we have professional test
frameworks, linters and code analyzers at hand, allowing for a thoroughly
tested and well-designed platform for our build-systems and automation.
by @FrankStain (2017-10-01)
I’m professional game developer. Also, i support my own huge game framework written on C++. :)
So, the large scalable build systems are the game building automation tools. It consists of game binary image builders for different platforms, including cross-compilation of source code and source code generation from some DSL schemes. Also it consists of resource generators, where a lot of resource types (dozens of types: textures, 3d objects and scene graphs, sounds, database and state machine raw data) have to pass through dozens of compilation steps. After all, such build system consists of dynamic testing tool, which makes some tests on build target before make it published for Draft usage, QA or Retail customers. And, yep, publishing/QA deployment also implemented as part of build system.
Just imagine you need to read PNG into pixelmap, compress it into ETC2, ATCI, S3-TC5/BC3 and PVR-TC4, after what each of compressed texture should be placed into different resource pack, obfuscated and encrypted. And all is done by different tasks, because i can read textures even from database, zip-file or other pack and may not wish to compress it into some formats. Each sound should be loaded from PCM, converted into MP3 or OGG and linked with each sound mixer where it used, after what it also have to be placed at proper resource pack, obfuscated and encrypted. 3d location compilation process is about two hundreds tasks on just objects, not files. It’s most complex resource pipeline in build system.
doit is well designed tool for such purposes, i think.
by Sylvain Laperche (2019-05-06)
doit as the build system for
MetalK8s, a Kubernetes distribution
with a focus on long-term on-prem deployments.
doit’s main role is to generate the MetalK8s ISO archive that should contain
everything to allow offline installation and deployment of a Kubernetes cluster.
This involves several tasks like downloading and/or building container images,
building software packages (RPMs) from source, creating packages repositories, …
We also use
doit for others tasks, such as executing linting tools and
spawning a local cluster using Vagrant.
We wanted to move away from
make because as complexity grows it becomes hard
to maintain, evolve and debug.
Given that almost everyone in our team is familiar with Python, we started to
look for alternatives that are Python-based.
waf were put aside because their main advantage is to hide the
underlying complexity of compiling software in a portable way (which we aren’t
doing). However, running arbitrary shell commands was cumbersome.
Invoke was pretty good at executing commands, but didn’t have a good
dependency tracking system: a task will always be re-executed even if its
dependencies are unchanged, which was a deal-breaker.
doit was chosen to replace our
make-based approach because of the
Easy to invoke external commands
Simple and flexible core concepts
Easily customizable (
Various useful features: JSON output,
doit infoto inspect dependencies,
doit autofor automatically replaying tasks based on dependency changes…
Nikola is a Static Site and Blog Generator. doit
is used to process all the tasks required for building the website (HTML files,
indexes, RSS, copying files…). Use of doit makes Nikola unique: unlike other
static site generators, Nikola regenerates only the files that were changed
since last build (and not all files in the site!).
nikola build, the
centerpiece of Nikola, is basically the usual
doit run command.
doit is what makes Nikola extremely fast, even for large sites. Only a handful of files actually change on a rebuild. Using the dependency architecture of doit (for files and configuration), we are able to rebuild only what is needed.
Nikola is an open-source project with many users and contributors.
Carve Systems uses doit as the core automation tool for all of our document production. This customized tool based on Pandoc, Latex, and coordinated by doit is used by everyone in our company to prepare our primary customer facing deliverable. Previously we used Makefiles to coordinate builds. doit let us create a system that can be more easily maintained, tested, and extended using plugins.
We are using doit for the publishing workflow at our vocabulary server https://data.ub.uio.no/ . The server checks multiple remote sources for changes, and when there’s new changes somewhere, the data is fetched, converted to different formats, published and pushed to Fuseki and Elasticsearch.
One part I love about doit is that you can control what is considered a change. For remote files, I have created a task that checks if some header, like ETag or Last-Modified, has changed. If it has not, I set uptodate to True and stop there.
Another part I love is the ability to re-use tasks. Each vocabulary (like https://github.com/realfagstermer/realfagstermer and https://github.com/scriptotek/humord) has a different publication workflow, but many tasks are shared. With doit, I have created a collection of tasks and task generators (https://github.com/scriptotek/data_ub_tasks/) that I use with all the vocabularies.
Finally, it’s great that you can mix shell commands and Python tasks so easily. This cuts development time and makes the move from using Makefiles much easier.