Data Version Control (software)

{{Short description|Open-source version control system}}

{{Infobox software

| title = DVC

| logo = Data Version Control. Official Logo by Iterative.ai

| logo size = 250x250

| author = Dmitry Petrov

| developer = Iterative.ai

| released = May 4, 2017

| latest release version = 2.30.0

| latest release date = October 10, 2022

| repo = https://github.com/iterative/dvc

| programming language = Python

| genre = Machine Learning CLI

| license = Apache License 2.0

| website = {{URL|https://dvc.org/}}

}}

DVC is a free and open-source, platform-agnostic version system for data, machine learning models, and experiments.{{Cite arXiv |eprint=2202.10169 |last1=Hewage |first1=Nipuni |last2=Meedeniya |first2=Dulani |title=Machine Learning Operations: A Survey on MLOps Tool Support |date=2022 |class=cs.SE }} It is designed to make ML models shareable, experiments reproducible,{{Cite book |last=Barrak Amine, Eghan Ellis E. |first=Adams Bram |date=March 2021 |title=On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects |url=https://ieeexplore.ieee.org/document/9425888/authors |pages=422–433 |doi=10.1109/SANER50967.2021.00046 |isbn=978-1-7281-9630-5 |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103245/https://ieeexplore.ieee.org/document/9425888/authors#authors |url-status=live }} and to track versions of models, data, and pipelines.{{Cite web |last=Ivancic |first=Kristijan |title=Data Version Control With Python and DVC |url=https://realpython.com/python-data-version-control/ |website=Real Python |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103229/https://realpython.com/python-data-version-control/ |url-status=live }}{{Cite news |last=Wiggers |first=Kyle |title=MLOps startup Iterative.ai nabs $20M |work=VentureBeat |url=https://venturebeat.com/ai/mlops-startup-iterative-ai-nabs-20m/ |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103245/https://venturebeat.com/ai/mlops-startup-iterative-ai-nabs-20m/ |url-status=live }}{{Cite news |title=MLOps Company Iterative Achieves Significant Customer and Company Growth in 2021 |work=Business Wire |url=https://www.businesswire.com/news/home/20220317005882/en/MLOps-Company-Iterative-Achieves-Significant-Customer-and-Company-Growth-in-2021 |access-date=2022-10-05 |archive-date=2022-10-05 
|archive-url=https://web.archive.org/web/20221005103225/https://www.businesswire.com/news/home/20220317005882/en/MLOps-Company-Iterative-Achieves-Significant-Customer-and-Company-Growth-in-2021 |url-status=live }} DVC works on top of Git repositories{{Cite web |last=Hall |first=Susan |title=Iterative.ai: Git-Based Machine Learning Tools for ML Engineers |url=https://thenewstack.io/iterative-ai-git-based-machine-learning-tools-for-data-engineers/ |website=The New Stack |date=4 February 2021 |access-date=5 October 2022 |archive-date=5 October 2022 |archive-url=https://web.archive.org/web/20221005103232/https://thenewstack.io/iterative-ai-git-based-machine-learning-tools-for-data-engineers/ |url-status=live }} and cloud storage.{{Cite web |title=What is DVC? |url=https://mlops-guide.github.io/Versionamento/ |website=MLOps Guide |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103229/https://mlops-guide.github.io/Versionamento/ |url-status=live }}

The first (beta) version, DVC 0.6, was launched in May 2017.{{Cite news |last=Petrov |first=Dmitry |title=DVC 3 Years and 1.0 Pre-release |work=Iterative.ai |url=https://iterative.ai/blog/dvc-3-years-and-1-0-release |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103225/https://iterative.ai/blog/dvc-3-years-and-1-0-release |url-status=live }} In June 2020, DVC 1.0 was publicly released by Iterative.ai.{{Cite web |last=Anadiotis |first=George |title=Streamlining data science with open source: Data version control and continuous machine learning |url=https://www.zdnet.com/article/streamlining-data-science-with-open-source-data-version-control-and-continuous-machine-learning/ |website=ZDNET |access-date=2022-10-05 |archive-date=2022-10-05 |archive-url=https://web.archive.org/web/20221005103228/https://www.zdnet.com/article/streamlining-data-science-with-open-source-data-version-control-and-continuous-machine-learning/ |url-status=live }}

Overview

DVC is designed to incorporate the best practices of software development{{Cite news |last=Petrov |first=Dmitry |title=The Road to AI Hell Starts with Good MLOps Intentions |work=The New Stack |url=https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/ |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006103711/https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/ |url-status=live }} into machine learning workflows.{{Cite web |last=Ejaz |first=Nimra |title=Data Version Control Explained |url=https://www.crowdbotics.com/blog/data-version-control-explained |website=Crowdbotics |date=6 October 2021 |access-date=7 October 2022 |archive-date=7 October 2022 |archive-url=https://web.archive.org/web/20221007144633/https://www.crowdbotics.com/blog/data-version-control-explained |url-status=live }} It does this by extending Git, a traditional software development tool, with cloud storage for datasets and machine learning models.{{Cite news |last=Lardinois |first=Frederic |title=Iterative raises $20M for its MLOps platform |work=TechCrunch |url=https://techcrunch.com/2021/06/02/iterative-raises-20m-for-its-mlops-platform/ |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006103711/https://techcrunch.com/2021/06/02/iterative-raises-20m-for-its-mlops-platform/ |url-status=live }}

Specifically, DVC makes machine learning operations:

  • Codified: it codifies datasets and models by storing pointers to the data files in cloud storage.
  • Reproducible: it allows users to reproduce experiments,{{Cite web |title=AITech interview with Dmitry Petrov, Co-Founder & CEO at Iterative.ai |url=https://ai-techpark.com/aitech-interview-with-dmitry-petrov-co-founder-ceo-at-iterative-ai/ |website=AI Tech Park |date=20 July 2022 |access-date=6 October 2022 |archive-date=6 October 2022 |archive-url=https://web.archive.org/web/20221006103713/https://ai-techpark.com/aitech-interview-with-dmitry-petrov-co-founder-ceo-at-iterative-ai/ |url-status=live }} and rebuild datasets from raw data.{{Cite web |title=Data Versioning for CD4ML – Part 2 |url=https://aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/ |website=AI Singapore |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006103713/https://aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/ |url-status=dead }} These features also allow users to automate the construction of datasets and the training, evaluation, and deployment of ML models.{{Cite web |last=Baena |first=Daniel |title=How to build an efficient Machine Learning project workflow using Data Version Control (DVC) |url=https://engineering.rappi.com/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b |website=Rappi Tech |date=2 March 2022 |access-date=6 October 2022 |archive-date=6 October 2022 |archive-url=https://web.archive.org/web/20221006103713/https://engineering.rappi.com/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b |url-status=live }}

DVC and Git

DVC stores large files and datasets in separate storage, outside of Git. This storage can be on the user’s computer or hosted on any major cloud storage provider,{{Cite web |title=DVC Documentation. Supported storage types |url=https://dvc.org/doc/command-reference/remote/add#supported-storage-types |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006112158/https://dvc.org/doc/command-reference/remote/add#supported-storage-types |url-status=live }} such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage.{{Cite news |last=Vizard |first=Michael |title=Iterative.ai updates MLOps platform to streamline and support cloud provisioning |work=VentureBeat |url=https://venturebeat.com/ai/iterative-ai-updates-mlops-platform-to-streamline-and-support-cloud-provisioning/ |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006112154/https://venturebeat.com/ai/iterative-ai-updates-mlops-platform-to-streamline-and-support-cloud-provisioning/ |url-status=live }}{{Cite web |last=Kulkarni |first=Amit |title=Tracking ML Experiments With Data Version Control |url=https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/ |website=Analytics Vidhya |date=17 June 2021 |access-date=6 October 2022 |archive-date=6 October 2022 |archive-url=https://web.archive.org/web/20221006112200/https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/ |url-status=live }} DVC users may also set up a remote repository on any server and connect to it remotely.

When a user stores their data and models in the remote repository, a text file is created in their Git repository that points to the actual data in remote storage.
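
The pointer mechanism described above can be illustrated with a minimal Python sketch. It is not DVC's implementation: the function name `write_pointer`, the `.ptr.json` suffix, the JSON layout, and the choice of SHA-256 are all assumptions made for illustration. The sketch only shows how a small text file, suitable for committing to Git, can stand in for a large data file by recording its content hash.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small text file describing data_path (hypothetical format)."""
    pointer = {
        "path": data_path.name,
        "sha256": file_hash(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".ptr.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"  # illustrative file name
        data.write_bytes(b"id,label\n1,cat\n2,dog\n")
        # The pointer file is what would be committed to Git; the data
        # file itself would be pushed to separate storage.
        print(write_pointer(data).read_text())
```

Because the pointer records a hash of the content, any change to the data file yields a different pointer, which is what lets an ordinary Git history track versions of data it does not itself store.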

Features

DVC's features can be divided into three categories: data management, pipelines, and experiment tracking.{{Cite web |title=Introduction to Data Version Control(DVC) |url=https://www.kaggle.com/code/kurianbenoy/introduction-to-data-version-control-dvc/notebook#Compare-Expermiments |website=Kaggle |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132534/https://www.kaggle.com/code/kurianbenoy/introduction-to-data-version-control-dvc/notebook#Compare-Expermiments |url-status=live }}{{Cite web |last=Guerrapin |first=Basile |title=Using DVC to create an efficient version control system for data projects |url=https://medium.com/qonto-way/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe |website=The Qonto Way |date=12 July 2019 |access-date=6 October 2022 |archive-date=6 October 2022 |archive-url=https://web.archive.org/web/20221006132536/https://medium.com/qonto-way/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe |url-status=live }}

= Data management =

Data and model versioning is the base layer{{Cite web |title=DVC Documentation. Get Started |url=https://dvc.org/doc/start |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132534/https://dvc.org/doc/start |url-status=live }} of DVC for large files, datasets, and machine learning models. It allows the use of a standard Git workflow, but without the need to store those files in the repository. Large files, directories and ML models are replaced with small metafiles, which in turn point to the original data. Data is stored separately, allowing data scientists to transfer large datasets or share a model with others.

DVC enables data versioning through codification.{{Cite web |title=DVC Documentation. Versioning Data and Models |url=https://dvc.org/doc/use-cases/versioning-data-and-models |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132535/https://dvc.org/doc/use-cases/versioning-data-and-models |url-status=live }} Once a user creates metafiles describing which datasets, ML artifacts and other features to track, DVC can capture versions of data and models, create and restore snapshots, record evolving metrics, and switch between versions.

Unique versions of data files and directories are cached{{Cite web |title=DVC Documentation. Internal Directories and Files |url=https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132534/https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory |url-status=live }} in a systematic way (also preventing file duplication). The working datastore is separated from the user’s workspace to keep the project light, but stays connected via file links handled automatically by DVC.{{Cite web |date= |title=DVC Documentation. Large Dataset Optimization |url=https://dvc.org/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132539/https://dvc.org/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache |url-status=live }}
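
The caching idea in the preceding paragraph can be sketched as a content-addressed store. This is an illustrative simplification, not DVC's code: the function `cache_file`, the two-character directory prefix, the SHA-256 hash, and the use of hard links are assumptions chosen for the example. Because the cache location is derived from the file's content, two identical files resolve to the same cache entry and are stored once, while the workspace keeps a link to the cached copy.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def cache_file(source: Path, cache_dir: Path) -> Path:
    """Move source into a content-addressed cache and link it back.

    Identical content always maps to the same cache path, so duplicates
    are stored only once.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        source.unlink()  # duplicate content: keep the existing cached copy
    else:
        shutil.move(str(source), str(cached))
    os.link(cached, source)  # hard link back into the workspace
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        cache = Path(tmp) / "cache"
        workspace.mkdir()
        first = workspace / "a.csv"
        second = workspace / "b.csv"
        first.write_bytes(b"same content")
        second.write_bytes(b"same content")
        same_entry = cache_file(first, cache) == cache_file(second, cache)
        stored = [p for p in cache.rglob("*") if p.is_file()]
        print(same_entry, len(stored))
```

Hard links are used here only because they are available in the Python standard library; they require the cache and the workspace to be on the same file system.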

= Pipelines =

DVC provides a mechanism to define and execute pipelines.{{Cite web |title=Working with Pipelines |url=https://mlops-guide.github.io/Versionamento/pipelines_dvc/ |website=MLOps Guide |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132534/https://mlops-guide.github.io/Versionamento/pipelines_dvc/ |url-status=live }}{{Cite web |title=DVC Documentation. Get Started: Data Pipelines |url=https://dvc.org/doc/start/data-management/data-pipelines |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132537/https://dvc.org/doc/start/data-management/data-pipelines |url-status=live }} Pipelines represent the process of building ML datasets and models, from how data is preprocessed to how models are trained and evaluated.{{Cite arXiv|eprint=2102.06919 |last1=Idowu |first1=Samuel |last2=Strüber |first2=Daniel |last3=Berger |first3=Thorsten |title=Asset Management in Machine Learning: A Survey |date=2021 |class=cs.SE }} Pipelines can also be used to deploy models into production environments.

A DVC pipeline is focused on the experimentation phase of the ML process. Users can run multiple copies of a DVC pipeline by cloning a Git repository with the pipeline or running ML experiments. They can also record the workflow as a pipeline, and reproduce{{Cite arXiv|eprint=2207.07048 |last1=Kapoor |first1=Sayash |last2=Narayanan |first2=Arvind |title=Leakage and the Reproducibility Crisis in ML-based Science |date=2022 |class=cs.LG }} it in the future.

Pipelines are represented in code as YAML{{Cite web |title=DVC Documentation. dvc.yaml |url=https://dvc.org/doc/user-guide/project-structure/dvcyaml-files |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132537/https://dvc.org/doc/user-guide/project-structure/dvcyaml-files |url-status=live }} configuration files. These files define the stages of the pipeline and how data and information flow from one step to the next.
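
As a rough illustration of how stage definitions imply an execution order, the sketch below uses a Python dictionary in place of a YAML file. The keys `stages`, `cmd`, `deps`, and `outs` follow the layout described in the DVC documentation for `dvc.yaml`; the stage names, script names, and file paths are hypothetical, and the `execution_order` function is an illustration of dependency ordering, not DVC's own scheduler. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical three-stage pipeline; all names and paths are illustrative.
PIPELINE = {
    "stages": {
        "prepare": {
            "cmd": "python prepare.py",
            "deps": ["prepare.py", "data/raw.csv"],
            "outs": ["data/clean.csv"],
        },
        "train": {
            "cmd": "python train.py",
            "deps": ["train.py", "data/clean.csv"],
            "outs": ["model.pkl"],
        },
        "evaluate": {
            "cmd": "python evaluate.py",
            "deps": ["evaluate.py", "model.pkl", "data/clean.csv"],
            "outs": ["metrics.json"],
        },
    }
}


def execution_order(pipeline: dict) -> list[str]:
    """Order stages so each runs after the stages that produce its inputs."""
    stages = pipeline["stages"]
    # Map every output path to the stage that produces it.
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    # A stage depends on whichever stages produce its dependencies.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(PIPELINE))
```

The stages are connected only through file paths: a stage that lists another stage's output among its dependencies must run after it, which is how data flows from one step to the next.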

When a pipeline is run, the artifacts produced by that pipeline are registered in a dvc.lock file.{{Cite web |title=DVC Documentation. dvc.lock file |url=https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file |website=dvc.org/doc |access-date=2022-10-06 |archive-date=2022-10-06 |archive-url=https://web.archive.org/web/20221006132537/https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file |url-status=live }} The lockfile records the stages that were run, and stores a hash of the resulting output for each stage. It serves not only as a record of the pipeline's execution but also as the basis for deciding which steps must be rerun on subsequent executions of the pipeline.
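
The way recorded hashes can drive the rerun decision may be sketched as follows. This is a simplified model under stated assumptions, not DVC's algorithm or the actual dvc.lock format: the lock is represented as a plain dictionary of dependency hashes per stage, file contents are held in memory, stages are assumed to be listed in execution order, and the names `content_hash` and `stages_to_rerun` are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def stages_to_rerun(stages: dict, lock: dict, files: dict) -> list:
    """Return the stages whose inputs differ from the recorded hashes.

    stages: {name: {"deps": [...], "outs": [...]}}, in execution order
    lock:   {name: {dependency path: hash recorded on the last run}}
    files:  {path: current content as bytes}
    """
    stale_paths = set()
    rerun = []
    for name, stage in stages.items():
        recorded = lock.get(name, {})
        changed = any(
            dep in stale_paths
            or dep not in files
            or recorded.get(dep) != content_hash(files[dep])
            for dep in stage["deps"]
        )
        if changed:
            rerun.append(name)
            # Simplification: treat every output of a rerun stage as changed.
            stale_paths.update(stage["outs"])
    return rerun


if __name__ == "__main__":
    stages = {
        "prepare": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
        "train": {"deps": ["clean.csv", "params.txt"], "outs": ["model.bin"]},
    }
    files = {"raw.csv": b"1,2,3", "clean.csv": b"1,2", "params.txt": b"lr=0.1"}
    lock = {
        name: {dep: content_hash(files[dep]) for dep in stage["deps"]}
        for name, stage in stages.items()
    }
    print(stages_to_rerun(stages, lock, files))
    files["params.txt"] = b"lr=0.01"  # only the training input changes
    print(stages_to_rerun(stages, lock, files))
```

When nothing has changed, no stage is selected; changing an input of a later stage leaves earlier stages untouched, which is what makes repeated runs of a long pipeline cheap.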

= Experiment tracking =

Experiment tracking allows developers to explore, iterate on, and compare different machine learning experiments.

Each experiment represents a variation of a data science project defined by changes in the workspace. Experiments maintain a link to the commit in the current branch (Git HEAD){{Cite web |title=DVC Documentation. DVC Experiments Overview |url=https://dvc.org/doc/user-guide/experiment-management/experiments-overview |website=dvc.org/doc |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007075845/https://dvc.org/doc/user-guide/experiment-management/experiments-overview |url-status=live }} as their parent or baseline. However, they do not form part of the regular Git tree (unless they are made persistent).{{Cite web |title=DVC Documentation. Persisting Experiments |url=https://dvc.org/doc/user-guide/experiment-management/persisting-experiments |website=dvc.org/doc}} This keeps temporary commits and branches from cluttering a user's repository.

Common use cases{{Cite web |title=How we keep track of our data experiments |url=https://kapernikov.com/how-we-keep-track-of-our-data-experiments/ |website=Kapernikov |date=26 January 2022 |access-date=6 October 2022 |archive-date=6 October 2022 |archive-url=https://web.archive.org/web/20221006140932/https://kapernikov.com/how-we-keep-track-of-our-data-experiments/ |url-status=live }} for experiments are:

  1. Comparison of model architectures
  2. Comparison of training or evaluation datasets
  3. Selection of model hyperparameters

DVC experiments can be managed and visualized either from the VS Code IDE{{Cite web |title=DVC Extension for Visual Studio Code |url=https://marketplace.visualstudio.com/items?itemName=Iterative.dvc |website=Visual Studio. Marketplace |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007064645/https://marketplace.visualstudio.com/items?itemName=Iterative.dvc |url-status=live }} or online using Iterative Studio.{{Cite news |title=Iterative Introduces First Git-based Machine Learning Model Registry |work=Yahoo Finance |url=https://finance.yahoo.com/news/iterative-introduces-first-git-based-130000776.html |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007064643/https://finance.yahoo.com/news/iterative-introduces-first-git-based-130000776.html |url-status=live }} Visualization{{Cite web |title=DVC Documentation. Get Started: Visualization with Plots |url=https://dvc.org/doc/start/experiment-management/visualization |website=dvc.org/doc |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007064643/https://dvc.org/doc/start/experiment-management/visualization |url-status=live }} allows each user to compare experiment results visually, track plots and generate them with library integrations.

DVC offers several options for using visualization in a regular workflow:

  • DVC can generate HTML files that include interactive plots from data series in JSON, YAML, CSV, or TSV format
  • DVC can keep track of image files produced as plot outputs{{Cite web |title=DVC Documentation. Metrics and Plots outputs |url=https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#metrics-and-plots-outputs |website=dvc.org/doc |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007075847/https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#metrics-and-plots-outputs |url-status=live }} from the training/evaluation scripts
  • DVCLive{{Cite web |title=DVC Documentation. DVCLive with DVC |url=https://dvc.org/doc/dvclive/dvclive-with-dvc |website=dvc.org/doc |access-date=2022-10-07 |archive-date=2022-10-07 |archive-url=https://web.archive.org/web/20221007064645/https://dvc.org/doc/dvclive/dvclive-with-dvc |url-status=live }} integrations can produce plots automatically during the training

The DVC VS Code extension

In 2022, Iterative released a free extension{{Cite web |last=Nicholls |first=Emily |title=Iterative Announces A Free Extension To Microsoft Visual Studio Code To Accelerate ML Model Development Experience |url=https://www.tfir.io/iterative-announces-a-free-extension-to-microsoft-visual-studio-code-to-accelerate-ml-model-development-experience/ |website=TFiR |date=14 June 2022 |access-date=7 October 2022 |archive-date=7 October 2022 |archive-url=https://web.archive.org/web/20221007064644/https://www.tfir.io/iterative-announces-a-free-extension-to-microsoft-visual-studio-code-to-accelerate-ml-model-development-experience/ |url-status=live }} for Visual Studio Code (VS Code), a source-code editor made by Microsoft, which provides VS Code users with the ability to use DVC in their editors with additional user interface functionality.{{Cite web |last=Bhartiya |first=Swapnil |title=Iterative's DVC Extension Turns VS Code Into ML Experimentation Platform |url=https://www.tfir.io/iteratives-dvc-extension-turns-vs-code-into-ml-experimentation-platform/ |website=TFiR |date=28 June 2022 |access-date=7 October 2022 |archive-date=28 June 2024 |archive-url=https://web.archive.org/web/20240628171233/https://tfir.io/iteratives-dvc-extension-turns-vs-code-into-ml-experimentation-platform/ |url-status=live }}{{Cite web |last=Awan |first=Abid Ali |title=12 Essential VSCode Extensions for Data Science |url=https://www.kdnuggets.com/2022/07/12-essential-vscode-extensions-data-science.html |website=KDnuggets |access-date=2022-10-07 |archive-date=2024-06-28 |archive-url=https://web.archive.org/web/20240628171244/https://www.kdnuggets.com/2022/07/12-essential-vscode-extensions-data-science.html |url-status=live }}

History

In 2017,{{Cite web |title=DVC 3 Years and 1.0 Pre-release |url=https://iterative.ai/blog/dvc-3-years-and-1-0-release |website=Iterative.ai |date=4 May 2020 |access-date=5 October 2022 |archive-date=5 October 2022 |archive-url=https://web.archive.org/web/20221005103225/https://iterative.ai/blog/dvc-3-years-and-1-0-release |url-status=live }}{{Cite web |title=Data Version Control Explained |url=https://www.crowdbotics.com/blog/data-version-control-explained |website=Crowdbotics |date=6 October 2021 |access-date=7 October 2022 |archive-date=7 October 2022 |archive-url=https://web.archive.org/web/20221007144633/https://www.crowdbotics.com/blog/data-version-control-explained |url-status=live }} the first (beta) version, DVC 0.6,{{Cite web |last=Petrov |first=Dmitry |title=Data Version Control: iterative machine learning |url=https://www.kdnuggets.com/2017/05/data-version-control-iterative-machine-learning.html |website=KDnuggets |access-date=2022-12-02 |archive-date=2022-12-02 |archive-url=https://web.archive.org/web/20221202092512/https://www.kdnuggets.com/2017/05/data-version-control-iterative-machine-learning.html |url-status=live }} was publicly released as a simple command-line tool. It allowed data scientists to keep track of their machine learning processes and file dependencies through simple Git-like commands. It also allowed them to transform existing machine learning processes into reproducible DVC pipelines. DVC 0.6 targeted common problems that machine learning engineers and data scientists were facing: reproducing machine learning experiments, versioning data, and collaborating across teams.

Created by ex-Microsoft data scientist Dmitry Petrov, DVC aimed to integrate the best existing software development practices into machine learning operations.{{Cite web |last=Vázquez |first=Favio |title=Data version control with DVC. What do the authors have to say? |url=https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee |website=Towards Data Science |date=17 April 2019 |access-date=2 December 2022 |archive-date=2 December 2022 |archive-url=https://web.archive.org/web/20221202092511/https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee |url-status=live }}

In 2018,{{Cite web |last=Smolaks |first=Max |title=Iterative.ai pitches open source alternative to AWS SageMaker and Azure ML Engineer |url=https://aibusiness.com/ml/iterative-ai-pitches-open-source-alternative-to-aws-sagemaker-and-azure-ml-engineer |website=AI Business |access-date=2022-12-02 |archive-date=2022-12-02 |archive-url=https://web.archive.org/web/20221202092827/https://aibusiness.com/ml/iterative-ai-pitches-open-source-alternative-to-aws-sagemaker-and-azure-ml-engineer |url-status=live }} Dmitry Petrov together with Ivan Shcheklein, an engineer and entrepreneur, founded Iterative.ai,{{Cite web |last=Singh |first=Swastik |title=An open-source startup Iterative.ai raises USD 20 Million |url=https://www.vcbay.news/2021/06/03/an-open-source-startup-iterative-ai-raises-usd-20-million/ |website=VCBay |date=3 June 2021 |access-date=2 December 2022 |archive-date=2 December 2022 |archive-url=https://web.archive.org/web/20221202092512/https://www.vcbay.news/2021/06/03/an-open-source-startup-iterative-ai-raises-usd-20-million/ |url-status=live }} an MLOps company that continued the development of DVC. Besides DVC, Iterative.ai also develops the open-source tools CML and MLEM, as well as Studio, the enterprise version of the open-source tools.

In June 2020,{{Cite web |title=DVC 1.0 release: new features for MLOps |url=https://iterative.ai/blog/dvc-1-0-release |website=Iterative.ai |date=22 June 2020 |access-date=2 December 2022 |archive-date=2 December 2022 |archive-url=https://web.archive.org/web/20221202092512/https://iterative.ai/blog/dvc-1-0-release |url-status=live }} the Iterative.ai team released DVC 1.0. New features like multi-stage DVC files, run cache, plots, data transfer optimizations, hyperparameter tracking, and stable release cycles were added as a result of discussions and contributions from the community.

In March 2021,{{Cite web |title=DVC 2.0 Release |url=https://iterative.ai/blog/dvc-2-0-release |website=Iterative.ai |date=3 March 2021 |access-date=2 December 2022 |archive-date=2 December 2022 |archive-url=https://web.archive.org/web/20221202092511/https://iterative.ai/blog/dvc-2-0-release |url-status=live }} Iterative.ai released DVC 2.0, which introduced ML experiments (experiment management), model checkpoints and metrics logging.

ML experiments: To solve the problem of Git overhead, when hundreds of experiments need to be run in a single day and each experiment run requires additional Git commands, DVC 2.0 introduced the lightweight experiments feature. It allows its users to auto-track ML experiments and capture code changes.

This eliminated the dependence upon additional services{{Cite web |title=DVC Documentation. Experiment Management |url=https://dvc.org/doc/user-guide/experiment-management |website=dvc.org/doc |access-date=2022-10-07 |archive-date=2022-10-08 |archive-url=https://web.archive.org/web/20221008133152/https://dvc.org/doc/user-guide/experiment-management |url-status=live }} by saving data versions as metadata in Git, as opposed to relegating them to external databases or APIs.{{Cite web |title=DVC Documentation. Related Technologies |url=https://dvc.org/doc/user-guide/overview#comparison-with-related-technologies |website=dvc.org/doc |access-date=2022-12-02 |archive-date=2022-12-02 |archive-url=https://web.archive.org/web/20221202092510/https://dvc.org/doc/user-guide/overview#comparison-with-related-technologies |url-status=live }}

ML model checkpoints versioning: The release also enabled versioning of all checkpoints together with the corresponding code and data.

Metrics logging: DVC 2.0 introduced a new open-source library, DVCLive, which provides functionality for tracking model metrics and organizing them in a way that DVC can visualize while navigating the Git history.

Alternative solutions to DVC

There are several open-source projects that provide data version control capabilities similar to those of DVC,{{Cite web |last=Orr |first=Einat |title=Data versioning as your 'Get out of jail' card – DVC vs. Git-LFS vs. dolt vs. lakeFS |url=https://lakefs.io/blog/dvc-vs-git-vs-dolt-vs-lakefs// |website=lakeFS |date=25 July 2022 |access-date=23 November 2022 |archive-date=23 November 2022 |archive-url=https://web.archive.org/web/20221123122926/https://lakefs.io/blog/dvc-vs-git-vs-dolt-vs-lakefs// |url-status=live }} such as [https://github.com/git-lfs/git-lfs Git LFS], [https://github.com/dolthub/dolt Dolt], [https://projectnessie.org Nessie], and [https://github.com/treeverse/lakeFS lakeFS]. These projects vary in how well they fit the different needs of data engineers and data scientists, such as scalability, supported file formats, support for tabular and unstructured data, and the volume of data supported.

References

{{Reflist}}