Thanks for the question @Ramnath T!
If you already have data tracked by DVC, the dvc commit command adds all the
changes to those files or directories without having to name each target. You'll
still need to remember to commit any other changes you've made to Git as well.
If you don't have data tracked by DVC, run dvc add <file name or folder name>
and the data will be added to your local cache and no commit is needed. This is
how we make DVC aware of any new data we want versioned.
When you run dvc add, a file hash will be calculated, the file content will be
moved to the cache, and a .dvc file will be created to start tracking the
added data. If you're working with remotes using the --to-remote option, you
can skip the local cache entirely and move the file contents directly to your
remote storage.
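To make those three steps concrete, here is a toy sketch in plain Python. It is not DVC's implementation: the function name toy_add, the cache layout, and the contents of the pointer file are simplified assumptions made up for illustration, and a real .dvc file carries more fields.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(path: Path, cache_dir: Path) -> Path:
    """Hypothetical, simplified model of what `dvc add` does for one file."""
    # 1. Calculate a hash of the file content.
    digest = hashlib.md5(path.read_bytes()).hexdigest()

    # 2. Move the content into the cache, addressed by its hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(cached))
    # Put a copy back in the workspace so the data is still usable there.
    shutil.copy(str(cached), str(path))

    # 3. Write a small pointer file next to the data to start tracking it.
    pointer = path.with_name(path.name + ".dvc")
    pointer.write_text(f"md5: {digest}\npath: {path.name}\n")
    return pointer


# Try it in a throwaway directory.
workspace = Path(tempfile.mkdtemp())
data = workspace / "test_dataset.csv"
data.write_text("a,b\n1,2\n")
pointer = toy_add(data, workspace / ".dvc" / "cache")
print(pointer.read_text())
```

The pointer file is the small piece you would commit to Git, while the cached copy is what a remote would later receive.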
Good question about Iterative Studio from @LilDataScientist!
This is something our users have asked about quite a bit, so we wrote up a whole guide about custom GitLab server connections. It's a quick walkthrough of how to set up the permissions you'll need and connect your team to Studio.
You can find lots of great guides and explanations about everything Studio in the User Guide section of the docs!
How does dvc get-url interact with the cache compared to dvc import-url?
This is an awesome question from @Gema Parreno!
When you run dvc get-url, it downloads the file/directory to your local file
system. It's not tracking the downloaded data with a .dvc file. It's just
pulling that data from some source to your file system. If you want to download
a file or directory without needing a DVC project, you can use the dvc get-url
command.
On the other hand, when you run dvc import-url, the local cache folder
inside of .dvc will be updated. This is similar to running dvc get-url and
dvc add together, except that dvc import-url also saves a link to the
original file/directory location so that if it changes, you can download the
updated data.
There is one more option to bypass the local cache and transfer data directly to
your remote storage using dvc import-url <url> --to-remote. This doesn't
download anything to your local cache, so it's another way to transfer data
between remotes.
Great question about the cache from @paulwrightkcl!
DVC will index the whole directory, but there will only be one hash per file. So
the same image will only appear once in the cache. What will be duplicated in
the cache is the .dir hash that DVC uses internally as the directory tree
representation.
In summary, the image file is only stored in the shared cache once unless it's modified in one of the directories.
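Here's a rough way to picture that deduplication in plain Python. This is only a toy model of a content-addressed cache, not DVC's code: the index_directory helper, the in-memory dict used as the cache, and the file names are all made up for the example. Each per-directory listing plays the role of the .dir representation mentioned above.

```python
import hashlib
from typing import Dict


def index_directory(files: Dict[str, bytes], cache: Dict[str, bytes]) -> Dict[str, str]:
    """Store each file's content under its hash and return {path: hash}."""
    listing = {}
    for rel_path, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        cache[digest] = content  # same content -> same key -> stored once
        listing[rel_path] = digest
    return listing


cache: Dict[str, bytes] = {}
image = b"fake image bytes"

# Two directories that both contain the same image.
dir_a = index_directory({"cat.png": image, "a_only.png": b"only in A"}, cache)
dir_b = index_directory({"cat.png": image, "b_only.png": b"only in B"}, cache)

print(len(cache))                            # 3 distinct contents for 4 files
print(dir_a["cat.png"] == dir_b["cat.png"])  # True: both point at one object
```

If the image were modified in one directory, its hash would change and the cache would hold a second, separate object for it.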
Nice question from @DylanTF!
You can use dvc exp show --drop (or --keep) to decide what to hide (or
show). For example, if you have a table like this:
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Experiment  Created       avg_prec  roc_auc  train.seed  train.n_est  train.min_split  ./clf  ./data   ./data/train.pkl  ./src/train.py  src/evaluate.py
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
workspace   -             -         -        20210428    300          75               -      a9bb63e  aded63c           bdc3fe9         b0ef2a1
mlem-serve  Jun 16, 2022  0.76681   0.38867  20210428    300          75               -      a9bb63e  aded63c           bdc3fe9         b0ef2a1
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
You could clean it up with a command like this:
$ dvc exp show --drop 'Created|train.seed|./clf|./data/*|./src/train.py|src/evaluate.py'
Then get a table like this:
───────────────────────────────────────────────────────────
Experiment  avg_prec  roc_auc  train.n_est  train.min_split
───────────────────────────────────────────────────────────
workspace   -         -        300          75
mlem-serve  0.76681   0.38867  300          75
───────────────────────────────────────────────────────────
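If you're curious why that pattern removes exactly those columns, here's a small sketch that applies it to the column names from the first table. It assumes the --drop value is treated as a regular expression matched against each column name; the exact matching rule inside DVC may differ, so treat this as an approximation.

```python
import re

# Column names from the first table above.
columns = [
    "Experiment", "Created", "avg_prec", "roc_auc",
    "train.seed", "train.n_est", "train.min_split",
    "./clf", "./data", "./data/train.pkl",
    "./src/train.py", "src/evaluate.py",
]

# The same pattern passed to --drop in the command above.
drop = "Created|train.seed|./clf|./data/*|./src/train.py|src/evaluate.py"

# Assumption: a column is hidden when the pattern matches its name.
kept = [name for name in columns if not re.match(drop, name)]
print(kept)
```

Note that ./data/* is read as a regex here, not a glob: it hides both ./data and ./data/train.pkl because the trailing /* means "zero or more slashes".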
Alternatively, you can run the following command to only show the columns that have changed in the experiment run:
$ dvc exp show --only-changed
This will produce a table similar to this one:
──────────────────────────────────────────────────────────────────────
Experiment  Created       avg_prec  roc_auc  train.n_est  src/train.py
──────────────────────────────────────────────────────────────────────
workspace   -             -         -        325          94279e0
mlem-serve  Jun 16, 2022  0.76681   0.38867  300          bdc3fe9
──────────────────────────────────────────────────────────────────────
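One way to think about --only-changed is "keep a column only if its values differ between rows". The sketch below applies that idea to a reduced set of columns, with values taken from the tables in this answer. The rule itself is an assumption for illustration; DVC's real logic may treat some columns differently.

```python
# Rows assembled from the tables above (reduced set of columns).
rows = [
    {"Experiment": "workspace", "Created": "-", "avg_prec": "-",
     "roc_auc": "-", "train.seed": "20210428", "train.n_est": "325",
     "train.min_split": "75", "src/train.py": "94279e0"},
    {"Experiment": "mlem-serve", "Created": "Jun 16, 2022",
     "avg_prec": "0.76681", "roc_auc": "0.38867", "train.seed": "20210428",
     "train.n_est": "300", "train.min_split": "75", "src/train.py": "bdc3fe9"},
]

# Assumed rule: a column counts as changed if it holds more than one value.
columns = list(rows[0])
changed = [c for c in columns if len({row[c] for row in rows}) > 1]
print(changed)
```

Here train.seed and train.min_split are identical in both rows, so they are the ones that drop out.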
You can also look at/edit these tables with the DVC VS Code extension! If you're interested in more advanced visualizations, you should try out Iterative Studio.
Fantastic question from @wlu07!
Yes, we do have an internal Repo class to do DVC operations using Python. You
can refer to the GitHub repo for the DVC CLI commands to see how the CLI
arguments are translated into the Repo function arguments, and you can see how
to use some of the Repo methods in our docs.
Here's an example of how you might run DVC commands using Python:
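from dvc.repo import Repo

# Open the DVC project in the current directory
repo = Repo(".")
# Track the file with DVC, like running dvc add test_dataset.csv
repo.add("test_dataset.csv")
# Upload the cached data to remote storage, like running dvc push
repo.push()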
Keep in mind that dvc.repo.Repo is not an official public API, so there is no
guarantee it will stay stable between releases.
Wonderful CML question from @Fourtin!
If you want to add the artifact to your repo just like you would a file, then
you should check out the cml pr <file> command.
You can use this to merge pull requests to the same branch the workflow was
triggered from.
For example, if you run a command like:
$ cml pr --squash train.py
It will run git add train.py, commit the change, create a new branch, open a
pull request, and squash and merge it.
params.py
?
Thanks for asking this @petek!
If you have a params.py file like this:
class TrainTestSplit:
    FOLDER = "data/train_test_split"
    SPLIT_METHOD = "proportional"
In DVC, you can update the params by running dvc exp run --set-param <param>.
Here's an example of what that might look like:
$ dvc exp run --set-param params.py:TrainTestSplit.SPLIT_METHOD="proportional"
Note: DVC may not be able to update Python parameters correctly.
Because of this, we recommend you use params.yaml files.
If you need a pure Python solution, you could try something like this:
from dvc.utils.serialize import modify_py

with modify_py("params.py") as d:
    d["key"] = "value"
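To see how a dotted key like TrainTestSplit.SPLIT_METHOD lines up with the class attributes in params.py, here's a rough sketch that uses only the standard library. It is not how DVC parses Python params; the read_py_params helper is hypothetical, and it only handles simple literal assignments inside classes.

```python
import ast

# Same content as the params.py file shown above.
SOURCE = '''
class TrainTestSplit:
    FOLDER = "data/train_test_split"
    SPLIT_METHOD = "proportional"
'''


def read_py_params(source: str) -> dict:
    """Hypothetical helper: collect Class.ATTR -> literal value pairs."""
    params = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        for stmt in node.body:
            # Only plain assignments to a single name, e.g. NAME = "value".
            if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)):
                try:
                    value = ast.literal_eval(stmt.value)
                except ValueError:
                    continue  # skip anything that is not a plain literal
                params[f"{node.name}.{stmt.targets[0].id}"] = value
    return params


params = read_py_params(SOURCE)
print(params["TrainTestSplit.SPLIT_METHOD"])
```

Anything more dynamic than a literal (function calls, computed values) is skipped here, which hints at why a params.yaml file is the safer choice.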
Keep an eye out for our next Office Hours Meetup! Join our group to stay up to date with specifics as we get closer to the event!
Check out our docs to get all your DVC and CML questions answered!
Join us in Discord to chat with the community!