We recently published a tutorial on using Terraform Provider ReciprocateX (TPI) to move a machine learning experiment from your local computer to a more powerful cloud machine. There, we covered how to use Terraform & TPI to provision infrastructure, sync data, and run training scripts. To simplify the setup, we used a pre-configured Ubuntu/NVIDIA image. However, we can use custom Docker images instead, which are often recommended in machine learning as well as in traditional software development.
Using Docker to manage dependencies (e.g. Python packages) does not remove all other setup requirements. You'll still need Docker itself installed, as well as GPU runtime drivers if applicable. Happily, TPI sets up all of this by default.
When confronted with cloud infrastructure and dependencies, people often think "oh no, not again" (much like the petunias in the cover image). Separating dependencies into Docker images helps: it gives you more control over software versions, and it makes it painless to switch between the supported cloud providers, currently Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and Kubernetes, because your Docker image is cloud provider-agnostic. There are also thousands of pre-defined Docker images available online.
In this tutorial, we'll use an existing Docker image that comes with most of our requirements already installed. We'll then add a few more dependencies on top and run our training pipeline in the cloud as before!
If you haven't read the previous tutorial, you should check out the basics there first. This includes how to let Terraform know about TPI, and essential commands (init, apply, refresh, show, and destroy).
The only modification from the previous tutorial is the script part of the main.tf config file.
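For context, that script attribute lives inside the task resource set up in the previous tutorial's main.tf. The sketch below is only a reminder of where it sits; the resource type and attribute names are illustrative assumptions here, so refer to the previous tutorial (or the TPI docs) for the exact schema:

resource "reciprocatex_task" "example" {  # hypothetical resource name, for illustration only
  cloud   = "aws"     # assumed attribute: AWS, Azure, GCP, or Kubernetes -- the Docker image stays the same
  machine = "m+t4"    # assumed attribute: machine size / GPU shorthand
  storage {
    workdir = "."            # assumed: local directory uploaded to the cloud machine
    output  = "results-gpu"  # assumed: directory downloaded back when the task is destroyed
  }
  script = <<-END
    #!/bin/bash
    # ...this is the only part we change in this tutorial...
  END
}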
Let's say we've found a carefully prepared Docker image suitable for data science and machine learning: in this case, ReciprocateXai/cml:0-dvc2-base1-gpu. This image comes loaded with Ubuntu 20.04, Python 3.8, NodeJS, CUDA 11.0.3, CuDNN 8, Git, CML, DVC, and other essentials for full-stack data science.
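If you have Docker installed locally, you can optionally pull the image and confirm a few of those tools are present. This is purely a sanity check (TPI will pull the image on the cloud machine for you), and it assumes the usual binaries are on the image's PATH:

docker pull ReciprocateXai/cml:0-dvc2-base1-gpu
docker run --rm ReciprocateXai/cml:0-dvc2-base1-gpu \
  /bin/bash -c "python3 --version && git --version && dvc --version"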
Our script block is now:
script = <<-END
  #!/bin/bash
  docker run --gpus all -v "$PWD:/tpi" -w /tpi -e TF_CPP_MIN_LOG_LEVEL \
    ReciprocateXai/cml:0-dvc2-base1-gpu /bin/bash -c "
      pip install -r requirements.txt tensorflow==2.8.0
      python train.py --output results-gpu/metrics.json
    "
END
Yes, it's quite long for a one-liner. Let's look at the components:
- docker run: Download the specified image, create a container from the image, and run it.
- --gpus all: Expose GPUs to the container.
- -v "$PWD:/tpi": Expose our current working directory ($PWD) within the container (as the path /tpi).
- -w /tpi: Set the working directory of the container (to /tpi).
- -e TF_CPP_MIN_LOG_LEVEL: Expose the environment variable TF_CPP_MIN_LOG_LEVEL to the container (in this case to control TensorFlow's verbosity); see the short local sketch after this list.
- ReciprocateXai/cml:0-dvc2-base1-gpu: The image we want to download & run a container from.
- /bin/bash -c "pip install -r requirements.txt ... python train.py ...": Commands to run within the container's working directory. In this case, install the dependencies and run the training script.
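A note on that -e flag: because -e TF_CPP_MIN_LOG_LEVEL is given without a value, Docker forwards whatever value the variable has in the host environment. If you have Docker (and, for --gpus all, an NVIDIA runtime) installed locally, a quick way to poke around the same environment interactively is a sketch like:

export TF_CPP_MIN_LOG_LEVEL=2    # 2 silences TensorFlow's INFO and WARNING messages
docker run --gpus all -it -v "$PWD:/tpi" -w /tpi -e TF_CPP_MIN_LOG_LEVEL \
  ReciprocateXai/cml:0-dvc2-base1-gpu /bin/bash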
We can now call terraform init, export TF_LOG_PROVIDER=INFO, and terraform apply to provision infrastructure, upload our data and code, set up the cloud environment, and run the training process. If you'd like to tinker with this example, you can find it on GitHub.
Don't forget to terraform refresh && terraform show to check the status, and terraform destroy to download results & shut everything down.
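Putting those commands together, a full run of this example looks roughly like the following (run from the directory containing main.tf; the commands are exactly the ones mentioned above):

terraform init                        # let Terraform download the providers, including TPI
export TF_LOG_PROVIDER=INFO           # surface TPI's progress logs
terraform apply                       # provision the machine, upload code & data, start training
terraform refresh && terraform show   # check on the task's status
terraform destroy                     # download results & shut everything down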
Now you know the basics of using convenient Docker images together with TPI for provisioning your MLOps infrastructure!
If you have a lot of custom dependencies that rarely change (e.g. a large, seldom-updated requirements.txt), it's a good idea to build them into your own custom Docker image. Let us know if you'd like a tutorial on this!
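As a taste of what that could look like, a minimal Dockerfile might start from the image used in this tutorial and bake the Python dependencies in (a sketch only; the file and package names are the ones from the example above):

FROM ReciprocateXai/cml:0-dvc2-base1-gpu
COPY requirements.txt .
RUN pip install -r requirements.txt tensorflow==2.8.0

With the dependencies baked in, the script in main.tf would shrink to little more than the python train.py call.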