Software:LakeFS

lakeFS
Original author(s): Einat Orr, Oz Katz
Developer(s): Treeverse
Initial release: August 3, 2020
Stable release: 1.72.0
Repository: https://github.com/treeverse/lakeFS
Written in: Go
Type: Data version control
License: Apache 2.0
Website: lakefs.io

lakeFS is an open-source data version control system for managing data stored in object storage.[1] It provides Git-like operations such as branching, committing, merging, and reverting for large-scale data stored in systems including Amazon S3, Azure Blob Storage, and Google Cloud Storage, as well as other S3-compatible object storage platforms.[2] lakeFS is used in data engineering and machine learning workflows to manage changes to data, support reproducibility, and enable data governance across data lakes.[3] The software is available as an open-source project, as well as in enterprise and managed service offerings, including lakeFS Cloud.[3][1]

History

lakeFS was created in 2020 by Einat Orr and Oz Katz at Treeverse.[4] Its first public release, version 0.8.1, appeared in August 2020 and introduced Git-style operations with support for Amazon S3.[5]

In 2021, Treeverse raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[6] The same year, lakeFS was included in InfoWorld’s Best of Open Source Software (Bossie) awards.[7]

In June 2022, Treeverse introduced lakeFS Cloud, a managed service providing hosted lakeFS deployments for cloud-based data lakes.[3] Version 1.0 was released in October 2023, adding integrations with platforms such as Databricks and Apache Iceberg, as well as support for orchestration tools including Apache Airflow.[1][8] Public case studies and conference materials have described usage of lakeFS by organizations such as Microsoft, Volvo, and NASA.[1]

In July 2025, Treeverse announced an additional $20 million in growth funding to support further development of lakeFS.[9][10]

In November 2025, Treeverse announced the acquisition of the open-source data version control project DVC.[11]

Software

Overview

lakeFS provides Git-like operations such as branching, committing, merging, and reverting for datasets stored in object storage.[1] These operations are used to manage changes to data, test modifications in isolation, reproduce specific data states, and recover from errors or unintended updates.[2]

Architecture

lakeFS operates as a metadata layer on top of object storage systems such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.[2] It stores repository metadata describing commits, branches, and tags, enabling versioned views of data without copying underlying objects.[2]
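The idea of a metadata layer that versions data without copying objects can be illustrated with a toy model. The sketch below is not lakeFS code and does not reflect its internal format: the class `ToyRepo`, its method names, and the use of SHA-256 content addresses are all assumptions made for illustration. It only shows why creating a branch can be a pointer copy rather than a data copy.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot mapping logical paths to object-store addresses."""
    parent: "str | None"
    message: str
    entries: tuple  # sorted tuple of (path, address) pairs

    @property
    def id(self) -> str:
        payload = repr((self.parent, self.message, self.entries)).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


class ToyRepo:
    """Hypothetical in-memory stand-in for a versioned repository."""

    def __init__(self):
        self.objects = {}   # address -> bytes; stands in for the object store
        self.commits = {}   # commit id -> Commit (metadata only)
        self.branches = {}  # branch name -> commit id (metadata only)
        root = Commit(None, "init", ())
        self.commits[root.id] = root
        self.branches["main"] = root.id

    def create_branch(self, name, source):
        # Only a pointer is copied; no object data is duplicated.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes, message):
        head = self.commits[self.branches[branch]]
        entries = dict(head.entries)
        for path, data in changes.items():
            address = hashlib.sha256(data).hexdigest()  # assumed addressing scheme
            self.objects[address] = data
            entries[path] = address
        new = Commit(head.id, message, tuple(sorted(entries.items())))
        self.commits[new.id] = new
        self.branches[branch] = new.id
        return new.id

    def read(self, ref, path):
        # `ref` may be a branch name or a commit id.
        commit = self.commits[self.branches.get(ref, ref)]
        return self.objects[dict(commit.entries)[path]]


repo = ToyRepo()
first = repo.commit("main", {"data/a.csv": b"1,2,3"}, "add a")
repo.create_branch("experiment", "main")  # object store is untouched
repo.commit("experiment", {"data/b.csv": b"4,5,6"}, "add b")
```

In this model, reading `data/a.csv` through `main`, `experiment`, or the commit id `first` returns the same stored object, while `data/b.csv` is visible only on `experiment`.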

The system provides access through multiple interfaces, including a web user interface, command-line tools, a REST API, and software development kits.[2] It is designed to integrate with existing data engineering and machine learning workflows, and can be deployed either in self-hosted environments or as a managed service.[3]

Functions

lakeFS provides version control functionality for data stored in object storage–based data lakes. Core features include:

  • Atomic commits and version tracking for datasets, supporting reproducibility and auditability.[1]
  • Branching and merging mechanisms that allow isolated development and testing without duplicating data.[2]
  • Configurable hooks that can validate data or trigger external processes during commit and merge operations.[1]
  • The ability to revert repositories to earlier states to recover from data errors or failed changes.[2]
  • Recording of commit history and associated metadata for lineage tracking.[3]
  • Support for managing data across multiple object storage systems, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.[3]
  • Use of fixed data versions to reproduce experiments and machine learning model training.[1]
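Two of the features listed above, hooks that validate data at merge time and reverting to an earlier state, can be sketched with a small in-memory model. This is a simplified illustration, not the lakeFS hook mechanism or API: `ToyBranches`, `HookFailed`, `no_empty_files`, and `revert_to` are hypothetical names, snapshots hold data inline for brevity, and the merge is a naive "source wins" overlay.

```python
class HookFailed(Exception):
    """Raised when a pre-merge check rejects the incoming data."""


def no_empty_files(snapshot):
    # Hypothetical validation rule: every object must be non-empty.
    return all(len(data) > 0 for data in snapshot.values())


class ToyBranches:
    def __init__(self):
        # branch name -> list of snapshots (each a dict of path -> bytes)
        self.history = {"main": [{}]}

    def head(self, branch):
        return self.history[branch][-1]

    def branch(self, name, source):
        self.history[name] = [dict(self.head(source))]

    def commit(self, branch, changes):
        snapshot = dict(self.head(branch))
        snapshot.update(changes)
        self.history[branch].append(snapshot)

    def merge(self, source, dest, pre_merge_hooks=()):
        merged = dict(self.head(dest))
        merged.update(self.head(source))  # naive merge: source wins
        for hook in pre_merge_hooks:
            if not hook(merged):
                # Nothing has been written to `dest` yet, so it stays intact.
                raise HookFailed(hook.__name__)
        self.history[dest].append(merged)

    def revert_to(self, branch, index):
        # Restore an earlier state by recording it as a new snapshot,
        # so the history of what happened is preserved.
        self.history[branch].append(dict(self.history[branch][index]))


repo = ToyBranches()
repo.commit("main", {"a.csv": b"1,2"})
repo.branch("dev", "main")
repo.commit("dev", {"b.csv": b""})  # bad data: empty file
try:
    repo.merge("dev", "main", pre_merge_hooks=[no_empty_files])
except HookFailed:
    pass  # main is unchanged; the bad data stays isolated on dev
```

Running the check before the destination branch is updated is what keeps invalid data isolated on the development branch in this model.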

Integrations

Coverage of lakeFS has described integrations with platforms such as Databricks and Apache Iceberg, as well as support for environments including Red Hat OpenShift.[1][2] Additional materials describe its use with Trino, including validating data changes before they are merged, and its compatibility with orchestration tools such as Apache Airflow.[12]

References

  1. Kerner, Sean Michael (October 2023). "Open-source lakeFS data version control levels up to 1.0". https://venturebeat.com/data-infrastructure/open-source-lakefs-data-version-control-levels-up-to-1-0/. 
  2. "LakeFS brings Git-like version control to virtual dataset copies". March 27, 2023. https://blocksandfiles.com/2023/03/27/lakefs-brings-git-like-version-control-to-virtual-dataset-copies/. 
  3. Kerner, Sean Michael (June 22, 2022). "Treeverse set to launch lakeFS cloud data lake service". https://www.techtarget.com/searchdatamanagement/news/252521898/Treeverse-set-to-launch-LakeFS-cloud-data-lake-service. 
  4. Orbach, Meir (July 28, 2021). "Treeverse raises $15 million Series A to leverage lakeFS". https://www.calcalistech.com/ctech/articles/0,7340,L-3913525,00.html. 
  5. "v0.8.1". https://github.com/treeverse/lakeFS/releases/tag/v0.8.1. 
  6. Sawers, Paul (July 28, 2021). "Treeverse raises $23M to bring Git-like version control to data lakes". https://venturebeat.com/business/treeverse-raises-23m-to-bring-git-like-version-control-to-data-lakes/. 
  7. Borck, James R. (2021-10-18). "The best open source software of 2021". https://www.infoworld.com/article/3637038/the-best-open-source-software-of-2021.html. 
  8. "Real-Time Analytics News for the Week Ending October 28". October 2023. https://www.rtinsights.com/real-time-analytics-news-for-the-week-ending-october-28/. 
  9. "LakeFS nabs $20M to build “Git for Big Data”". July 29, 2025. https://www.bigdatawire.com/2025/07/29/lakefs-nabs-20m-to-build-git-for-big-data/. 
  10. "LakeFS Secures $20M in Growth Capital, Transforms Critical Gap in Enterprise Data and AI Tech Stack". July 2025. https://www.dbta.com/Editorial/News-Flashes/lakeFS-Secures-20M-in-Growth-Capital-Transforms-Critical-Gap-in-Enterprise-Data-and-AI-Tech-Stack-171058.aspx. 
  11. "DVC Joins lakeFS: Your Questions Answered". November 18, 2025. https://dvc.org/blog/dvc-joins-lakefs-your-questions-answered/. 
  12. "Trino Community Broadcast 27: Data versioning with lakeFS". https://trino.io/episodes/27.html.