January '21 Community Gems

A roundup of technical Q&As from the DVC community. This month: parallelize your data transfer, compressed datasets, and DVC pipelines in CI/CD.

Elle O'Brien
January 26, 2021 • 6 min read

DVC questions

Q: Is there an equivalent of `git restore <file>` for DVC?

Yes! You'll want dvc checkout. It restores the corresponding version of your DVC-tracked file or directory from the cache to your local workspace. Read up in our docs for more info!

Q: My dataset is made of more than a million small files. Can I use an archive format, like `tar.gz` with DVC?

There are some downsides to using archive formats, and we often discourage it, but let's review some factors to consider so you can make the best choice for your project.

If your tar.gz file changes at all, perhaps because you changed a single file before zipping, you'll end up with an entirely new copy of the archive every time you commit! This is not very space efficient, but if space isn't an issue it might not be a dealbreaker.
Because of the way we optimize data transfer, you'll end up transferring the whole archive any time you modify a single file and then run dvc push or dvc pull.
In general, archives don't play nice with the concept of diffs. Looking back at your git history, it can be challenging to log how files were deleted, modified, or added when you're versioning archives.

While we can't do much about the general issues that archives present for version control systems, DVC does have some options that might help you achieve better data transfer speeds. We recommend exploring DVC's built-in parallelism: data transfer functions like dvc push and dvc pull have a flag (-j) for increasing the number of jobs run simultaneously. Check out the docs for more details.

In summary, the advantage of using an archive format will depend on both how often you modify your dataset and how often you need to push and pull data. You might consider trying both approaches (with and without compression) and running some speed tests for your use case. We'd love to know what you find!
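To make the storage trade-off concrete, here is a minimal Python sketch. It assumes a content-addressed store keyed by MD5 hashes. The helper names (`md5`, `make_tar_gz`) and the toy dataset are made up for illustration, and none of this is DVC's own code.

```python
import gzip
import hashlib
import io
import tarfile


def md5(data):
    """Hex digest used here as the content address of a blob."""
    return hashlib.md5(data).hexdigest()


def make_tar_gz(files):
    """Build a deterministic in-memory .tar.gz from a {name: bytes} mapping."""
    buf = io.BytesIO()
    # mtime=0 keeps the gzip header identical between runs
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            for name in sorted(files):
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


# Toy dataset: 1000 small files, then a second version with one file edited
v1 = {f"sample_{i}.txt": f"sample {i}".encode() for i in range(1000)}
v2 = dict(v1)
v2["sample_0.txt"] = b"sample 0 (edited)"

# Tracked as individual files: only the edited file gets a new hash
changed_files = [name for name in v1 if md5(v1[name]) != md5(v2[name])]

# Tracked as one archive: the whole archive gets a new hash
archive_changed = md5(make_tar_gz(v1)) != md5(make_tar_gz(v2))

print(len(changed_files), archive_changed)
```

Only one per-file hash differs between the two versions, but the archive hash differs as well, so in this model the archive would be stored and transferred again in full.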

Q: My DVC remote is a server with a self-signed certificate. When I push data, DVC gives me an SSL verification error. How can I get around this?

On S3 or S3-compatible storage, you can configure your AWS CLI to use a custom certificate path. As suggested by their docs, you can also set the environment variable AWS_CA_BUNDLE to your .pem file.

Similarly, on HTTP and WebDAV remotes, there's a REQUESTS_CA_BUNDLE environment variable that you can point at your self-signed certificate file.

Then, when DVC tries to access your storage, you should be able to get past SSL verification!
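If you launch DVC from a script, one way to apply this is to set both variables only for the child process. The sketch below is an assumption-laden example: the function names are hypothetical, the certificate path is yours to supply, and `push_with_ca_bundle` needs the dvc CLI on your PATH and a configured remote.

```python
import os
import subprocess


def env_with_ca_bundle(ca_bundle_path, base_env=None):
    """Return a copy of the environment with both CA bundle variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_CA_BUNDLE"] = ca_bundle_path       # S3 and S3-compatible remotes
    env["REQUESTS_CA_BUNDLE"] = ca_bundle_path  # HTTP and WebDAV remotes
    return env


def push_with_ca_bundle(ca_bundle_path):
    """Run `dvc push` with the certificate variables set for that process only."""
    return subprocess.run(
        ["dvc", "push"],
        env=env_with_ca_bundle(ca_bundle_path),
        check=True,  # raise if dvc exits with a non-zero status
    )
```

Setting both variables is harmless here, since each one is only read by the matching type of remote.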

We suggest

from dvc.repo import Repo

revs = Repo().plots.collect(revs=revs)

Then you can plot the data contained in revs to your heart's content!

You can share a remote with as many projects as you like. Because DVC uses content-addressable storage, you'll still get benefits like file deduplication across every project that uses the remote. This can be useful if you're likely to have many shared files across projects.

One big thing to watch out for: you have to be very careful with clearing the DVC cache. When you run dvc gc, use the --projects flag so that you don't remove files associated with another project. Read up in the docs!
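Both points, deduplication and the garbage-collection risk, can be seen in a toy model of a shared content-addressed store. This is only an illustration under simplifying assumptions: `object_key`, `push`, and `gc` are hypothetical helpers, not DVC functions, and the store is just a dictionary.

```python
import hashlib


def object_key(data):
    """Content address: identical bytes always map to the same key."""
    return hashlib.md5(data).hexdigest()


def push(store, files):
    """Add a project's files to the shared store and return the keys it uses."""
    keys = set()
    for data in files.values():
        key = object_key(data)
        store[key] = data  # identical content lands on the same key
        keys.add(key)
    return keys


def gc(store, keys_in_use):
    """Keep only the objects whose keys are still in use."""
    return {k: v for k, v in store.items() if k in keys_in_use}


store = {}
keys_a = push(store, {"train.csv": b"shared rows", "a_only.csv": b"project A rows"})
keys_b = push(store, {"data/train.csv": b"shared rows", "b_only.csv": b"project B rows"})

objects_stored = len(store)                 # four files, three distinct objects
after_careless_gc = gc(store, keys_a)       # only project A is considered
after_safe_gc = gc(store, keys_a | keys_b)  # both projects are considered
```

In this model, collecting garbage with only project A's keys drops the object that only project B uses, while passing the keys of both projects keeps everything that is still referenced.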

Q: Can I throttle the number of simultaneous uploads to remote storage with DVC?

Yep! That'll be the -j/--jobs flag, for example:

$ dvc push -j <number>

will control the number of simultaneous uploads DVC attempts when pushing files to your remote storage (see more in our docs).
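As a rough mental model of what capping the number of jobs does, the sketch below uses a thread pool with a fixed number of workers. It is not DVC's transfer code: `run_transfers` and the fake upload are hypothetical stand-ins that only sleep briefly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_transfers(items, jobs):
    """Process items with at most `jobs` running at once; return results and peak concurrency."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_upload(item):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for network time
        with lock:
            active -= 1
        return item

    # max_workers plays the role of the -j/--jobs value
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        done = list(pool.map(fake_upload, items))
    return done, peak


done, peak = run_transfers(list(range(20)), jobs=4)
```

Here `peak` never exceeds the `jobs` value, which is the throttling behavior the flag is meant to provide.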

CML questions

Q: I have a DVC pipeline that I want to run in CI/CD. Specifically, I only want to reproduce the stages that have changed since my last commit. What do I do?

DVC pipelines, like makefiles, will only reproduce stages that DVC detects have changed since the last commit. So to do this in CI/CD systems like GitHub Actions or GitLab CI, you'll want to make sure the workflow a) syncs the runner with the latest version of your pipeline, including all inputs and dependencies, and b) reruns your DVC pipeline.

In practice, your workflow needs to include these two commands:

$ dvc pull
$ dvc repro

You pull the latest version of your pipeline, inputs and dependencies from cloud storage with dvc pull, and then dvc repro intelligently reproduces the pipeline (meaning, it should avoid rerunning stages that haven't changed since the last commit).

Check out an example workflow here.
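The idea of skipping unchanged stages can be sketched in a few lines of Python. This is a simplified illustration, not DVC's implementation: the `repro` function, the stage dictionaries, and the in-memory `lock` (a toy stand-in for the hashes DVC records in dvc.lock) are all assumptions made for the example.

```python
import hashlib


def fingerprint(deps):
    """Hash the names and contents of a stage's dependencies."""
    h = hashlib.md5()
    for name in sorted(deps):
        h.update(name.encode())
        h.update(deps[name])
    return h.hexdigest()


def repro(stages, workspace, lock):
    """Run each stage only if its dependencies changed since the last recorded run."""
    ran = []
    for stage in stages:
        deps = {name: workspace[name] for name in stage["deps"]}
        current = fingerprint(deps)
        if lock.get(stage["name"]) != current:
            stage["run"](workspace)
            lock[stage["name"]] = current
            ran.append(stage["name"])
    return ran


def prepare(ws):
    ws["clean.csv"] = ws["raw.csv"].strip()


def train(ws):
    ws["model.bin"] = b"model:" + ws["clean.csv"] + ws["params.yaml"]


stages = [
    {"name": "prepare", "deps": ["raw.csv"], "run": prepare},
    {"name": "train", "deps": ["clean.csv", "params.yaml"], "run": train},
]
workspace = {"raw.csv": b" 1,2,3 \n", "params.yaml": b"lr: 0.1"}
lock = {}

first = repro(stages, workspace, lock)   # nothing recorded yet, so both stages run
second = repro(stages, workspace, lock)  # nothing changed, so nothing runs
workspace["params.yaml"] = b"lr: 0.01"
third = repro(stages, workspace, lock)   # only the stage that depends on params runs
```

The second call does no work because every fingerprint matches the recorded one, and the third call reruns only the training stage.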

Q: I'm using DVC and CML to pull data from cloud storage, then train a model. I want to push the trained model into cloud storage when I'm done. What should I do?

One approach is to add

$ dvc add <model>
$ dvc push <model>

to the end of your workflow. This will push the model file, but there's a downside: it won't keep a strong link between the pipeline (meaning, the command you used to generate the model and any code/data dependencies) and the model file.

What we recommend is that you create a DVC pipeline with one stage (training your model) and declare your model file as an output. Then, your workflow can look like this:

# get data
$ dvc pull --run-cache

# run the pipeline
$ dvc repro

# push to remote storage
$ dvc push --run-cache

When you run this workflow with the --run-cache flags, you'll be able to save all the results of the pipeline in the cloud (read more here). When the run has completed, you can go to your local workspace and run:

$ dvc pull --run-cache
$ dvc repro

This will put your model in your local workspace! And, you get an immutable link between the code version, data version and model you end up with.

We recommend this approach so you don't lose track of how model files relate to the data and code that produced them. It's a little more work to set up, but Future You will thank you!
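If your CI job is driven by a Python script rather than plain shell steps, the same three commands can be run in order and stopped at the first failure. This is a minimal sketch: `CI_STEPS` and `run_steps` are hypothetical names, and actually running it requires the dvc CLI and a configured remote on the runner.

```python
import subprocess

# The three workflow commands, in the order they should run
CI_STEPS = [
    ["dvc", "pull", "--run-cache"],  # get data and the run cache
    ["dvc", "repro"],                # run the pipeline
    ["dvc", "push", "--run-cache"],  # push outputs and the run cache
]


def run_steps(steps, run=subprocess.run):
    """Run each command in order; check=True makes a failing step raise and stop the job."""
    for cmd in steps:
        run(cmd, check=True)
```

Calling `run_steps(CI_STEPS)` on the runner executes the workflow. The `run` parameter exists only so the function can be exercised without DVC installed.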
