Is there an equivalent of git restore <file> for DVC?
Yes! You'll want dvc checkout. It restores the corresponding version of your DVC-tracked file or directory from the cache to your local workspace. Read up in our docs for more info!
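For example, a minimal sketch of rolling a file back to the previous commit (data.csv and HEAD~1 are just placeholders here):
# restore the .dvc file as it was in the previous commit
$ git checkout HEAD~1 -- data.csv.dvc
# sync the workspace copy of data.csv from the DVC cache
$ dvc checkout data.csv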
Should I track my dataset as a tar.gz archive with DVC?
There are some downsides to using archive formats, and we often discourage it, but let's review some factors to consider so you can make the best choice for your project.

One factor: if your tar.gz file changes at all (perhaps because you changed a single file before zipping), you'll end up with an entirely new copy of the archive every time you commit! This is not very space efficient, but if space isn't an issue it might not be a dealbreaker.

Another factor is the speed of data transfers like dvc push and dvc pull. While we can't do much about the general issues that archives present for version control systems, DVC does have some options that might help you achieve better data transfer speeds. We recommend exploring DVC's built-in parallelism: data transfer commands like dvc push and dvc pull have a flag (-j) for increasing the number of jobs run simultaneously.
Check out the docs for more details.
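For instance (the job count here is arbitrary; tune it to your machine and network):
# run up to 8 transfer jobs in parallel
$ dvc push -j 8
$ dvc pull -j 8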
In summary, the advantage of using an archive format will depend on both how often you modify your dataset and how often you need to push and pull data. You might consider exploring both approaches (with and without compression) and running some speed tests for your use case. We'd love to know what you find!
On S3 or S3-compatible storage, you can configure your AWS CLI to use a custom
certificate path.
As suggested by their docs,
you can also set the environment variable AWS_CA_BUNDLE
to your .pem
file.
Similarly, for HTTP and WebDAV remotes, there's a REQUESTS_CA_BUNDLE environment variable that you can point at your self-signed certificate file.
Then, when DVC tries to access your storage, you should be able to get past SSL verification!
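Here's a quick sketch (the certificate path is a placeholder for your own .pem file):
# S3 and S3-compatible remotes
$ export AWS_CA_BUNDLE=/path/to/self-signed-cert.pem
# HTTP and WebDAV remotes
$ export REQUESTS_CA_BUNDLE=/path/to/self-signed-cert.pem
$ dvc push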
I want to get the data behind my dvc plots, including older versions of those plots. What do you recommend to get the raw historical data?
We suggest using DVC's Python API, something like this:
from dvc.repo import Repo

revs = ["HEAD", "v1.0"]  # example Git revisions; use whichever revs you need
plots_data = Repo().plots.collect(revs=revs)
Then you can plot the data contained in plots_data to your heart's content!
You can share a remote with as many projects as you like. Because DVC uses content-addressable storage, you'll still get benefits like file deduplication across every project that uses the remote. This can be useful if you're likely to have many shared files across projects.
One big thing to watch out for: you have to be very careful when clearing the DVC cache. When running dvc gc, use the --projects flag to make sure you don't remove files associated with another project.
Read up in the docs!
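For example (the project path here is a placeholder for another repo sharing the same cache or remote):
# keep files used by the current workspace and by the other listed project
$ dvc gc --workspace --projects /path/to/other-project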
Yep! That'll be the -j/--jobs
flag, for example:
$ dvc push -j <number>
This will control the number of simultaneous uploads DVC attempts when pushing files to your remote storage (see more in our docs).
DVC pipelines, like makefiles, will only reproduce stages that DVC detects have changed since the last commit. So to do this in CI/CD systems like GitHub Actions or GitLab CI, you'll want to make sure the workflow a) syncs the runner with the latest version of your pipeline, including all inputs and dependencies, and b) reruns your DVC pipeline.
In practice, your workflow needs to include these two commands:
$ dvc pull
$ dvc repro
You pull the latest version of your pipeline, inputs, and dependencies from cloud storage with dvc pull, and then dvc repro intelligently reproduces the pipeline (meaning, it should avoid rerunning stages that haven't changed since the last commit).
Check out an example workflow here.
One approach is to add
$ dvc add <model>
$ dvc push <model>
to the end of your workflow. This will push the model file, but there's a downside: it won't keep a strong link between the pipeline (meaning, the command you used to generate the model and any code/data dependencies) and the model file.
What we recommend is that you create a DVC pipeline with one stage (training your model) that declares your model file as an output (see the sketch at the end of this answer for one way to set up such a stage). Then, your workflow can look like this:
# get data
$ dvc pull --run-cache
# run the pipeline
$ dvc repro
# push to remote storage
$ dvc push --run-cache
When you run this workflow with the --run-cache flags, you'll be able to save all the results of the pipeline in the cloud (read more here). When the run has completed, you can go to your local workspace and run:
$ dvc pull --run-cache
$ dvc repro
This will put your model in your local workspace! And, you get an immutable link between the code version, data version and model you end up with.
We recommend this approach so you don't lose track of how model files relate to the data and code that produced them. It's a little more work to set up, but Future You will thank you!
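By the way, if you haven't defined that single training stage yet, dvc stage add is one way to create it. Here's a rough sketch (train.py, data/, and model.pkl are placeholders for your own script, data, and model file):
# define a "train" stage: its command, its dependencies, and the model as an output
$ dvc stage add -n train \
      -d train.py -d data/ \
      -o model.pkl \
      python train.py
# this writes the stage to dvc.yaml; dvc repro will run it and create dvc.lock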