Software:LakeFS

lakeFS
Original author(s): Einat Orr, Oz Katz
Developer(s): Treeverse
Initial release: August 3, 2020
Stable release: 0.104.0
Repository: https://github.com/treeverse/lakeFS
Written in: Go
Type: Data version control
License: Apache 2.0
Website: lakefs.io

lakeFS is a free and open-source software solution that provides scalable, performant, and format-agnostic version control for data lakes.[1] It uses Git-like semantics to create and access different data versions on top of cloud storage.[2]

lakeFS is designed to help data practitioners track versions of data and pipelines, develop and test in isolation, revert data repositories to a stable version in case of data quality issues, and continuously integrate and deploy new data (CI/CD).

It supports managing data in Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, and any other object storage with an S3 interface. lakeFS includes an Amazon S3 compatibility layer and integrates with popular data frameworks such as Apache Spark, Hive Metastore, dbt, Trino, Presto, and many others.
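
As a rough illustration of the S3 compatibility layer, the sketch below points a standard boto3 client at a lakeFS installation instead of AWS; the endpoint, credentials, repository, and branch names are placeholders rather than part of any real deployment.

```python
import boto3

# Point a regular S3 client at the lakeFS S3 gateway instead of AWS.
# The endpoint, credentials, repository, and branch names are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # hypothetical lakeFS endpoint
    aws_access_key_id="AKIA...",                # lakeFS access key ID
    aws_secret_access_key="...",                # lakeFS secret access key
)

# Through the gateway, the bucket name is the repository and the first
# path component of the key is the branch (or another ref).
s3.put_object(
    Bucket="example-repo",
    Key="main/datasets/events/day1.parquet",
    Body=b"...",
)

# List and read objects on the same branch.
listing = s3.list_objects_v2(Bucket="example-repo", Prefix="main/datasets/")
obj = s3.get_object(Bucket="example-repo", Key="main/datasets/events/day1.parquet")
```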

The first public version of lakeFS, v0.8.1, was released in August 2020[3] by Treeverse, which raised $23 million in a Series A funding round to expedite the development and adoption of lakeFS.[4]

Overview

lakeFS is designed to bring software development best practices into data engineering by applying the semantics of the well-known source control tool Git to datasets in object storage.

Specifically, lakeFS allows data practitioners to:

  • Collaborate during development,
  • Develop and test in isolation,
  • Revert data repositories to a stable version in case of data quality issues,
  • Reproduce and troubleshoot issues with a given version of the data,
  • Continuously integrate and deploy new data (CI/CD).

lakeFS interface

lakeFS exposes the S3 interface to manage critical path operations on the data, such as put, get, list, etc. In addition, lakeFS provides a Git-like interface that allows data practitioners to manage data similar to how they manage code. Its versioning engine enables the following operations:

  • Branching - creating a consistent copy of a repository, isolated from other branches and their changes. This is a metadata operation that does not duplicate objects (zero-copy operation).
  • Committing - a commit is an immutable checkpoint that contains a complete snapshot of a repository.
  • Merging - performed between two branches; a merge atomically updates one branch with the changes from another.
  • Reverting - this operation returns a repository to the exact state of a previous commit.
  • Tags - a tag is a pointer to a single commit with a readable, meaningful name.

Incorporating these operations into data lake pipelines provides the collaboration and organizational benefits teams get when managing code with version control.
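
As an illustrative sketch of these operations, the following uses the OpenAPI-generated Python client for lakeFS (the lakefs_client package); the host, credentials, and repository name are placeholders, and exact method and model names can differ between client versions, so treat the calls as an approximation rather than a definitive reference.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder connection details for a hypothetical lakeFS installation.
configuration = lakefs_client.Configuration()
configuration.host = "https://lakefs.example.com"
configuration.username = "AKIA..."  # lakeFS access key ID
configuration.password = "..."      # lakeFS secret access key
client = LakeFSClient(configuration)

# Branching: a zero-copy, metadata-only operation.
client.branches.create_branch(
    repository="example-repo",
    branch_creation=models.BranchCreation(name="experiment", source="main"),
)

# Committing: an immutable snapshot of the branch.
client.commits.commit(
    repository="example-repo",
    branch="experiment",
    commit_creation=models.CommitCreation(message="add cleaned events table"),
)

# Merging: atomically apply the experiment branch's changes to main.
client.refs.merge_into_branch(
    repository="example-repo",
    source_ref="experiment",
    destination_branch="main",
)
```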

Features

ETL Testing with Isolated Dev/Test Environments

lakeFS simplifies the process of creating separate dev/test environments for ETL testing. It employs Copy-on-Write, which means that no data is duplicated when a new environment is created. This allows data practitioners to build as many environments as they need.

Data in a lakeFS repository is always stored on a branch. Branches are isolated, meaning that modifications on one branch do not influence other branches.

Objects that remain unmodified across two branches are not copied but rather shared by lakeFS via metadata pointers. If a change made on one branch should be mirrored on another, the merge command can be used to update one branch with the modifications from the other.
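
To make the zero-copy idea concrete, here is a toy, in-memory model (not lakeFS code) in which commits and branches are just small mappings from logical paths to object identifiers; "branching" copies only that metadata, never the underlying objects.

```python
# Toy illustration of copy-on-write branching via metadata pointers.
# This models the concept only; it is not how lakeFS is implemented.

commits = {}   # commit_id -> {logical path: object id in underlying storage}
branches = {}  # branch name -> commit_id of its current HEAD

def commit(branch, snapshot):
    """Record an immutable snapshot and move the branch HEAD to it."""
    commit_id = f"c{len(commits):04d}"
    commits[commit_id] = dict(snapshot)
    branches[branch] = commit_id
    return commit_id

def create_branch(new_branch, source_branch):
    """Zero-copy branch: only the HEAD pointer is copied, no objects."""
    branches[new_branch] = branches[source_branch]

def merge(source_branch, destination_branch):
    """Apply the source branch's paths on top of the destination's (naive merge)."""
    merged = dict(commits[branches[destination_branch]])
    merged.update(commits[branches[source_branch]])
    return commit(destination_branch, merged)

# Unmodified objects stay shared; only changed paths point at new objects.
commit("main", {"events/day1.parquet": "obj-a1"})
create_branch("experiment", "main")
commit("experiment", {"events/day1.parquet": "obj-a1",   # unchanged, still shared
                      "events/day2.parquet": "obj-b7"})  # new object on the branch
merge("experiment", "main")
print(branches, commits)
```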

Rollbacks

A rollback procedure is used to instantly correct severe data problems. Rolling back restores data to a previous state before the problem occurred.

lakeFS enables users to build their data lake in a way that allows for easier rollbacks. If a user commits to their lakeFS repository whenever its state changes, they can later reset the current state, or HEAD, of a branch to any previous commit in seconds using the lakeFS UI or CLI, effectively performing a rollback.

Reproducibility

Data changes often, which makes it difficult for data practitioners to keep track of its exact state over time. Keeping only the present state of the data has drawbacks: it becomes difficult to debug a data problem, validate the correctness of machine learning training (re-running a model on different data yields different results), or comply with data audits.

The Git-like data interface available in lakeFS allows users to keep track of more than just the current state of data and recreate its state at any point in time. This helps users achieve reproducibility.
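
Building on the earlier S3-gateway sketch, reproducibility can be illustrated by reading an object through a fixed ref (a commit ID or tag) rather than a branch name; the commit ID, endpoint, and paths below are placeholders.

```python
import boto3

# Placeholder endpoint and credentials, as in the earlier sketch.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

# Reading through a branch returns whatever the branch currently points at ...
latest = s3.get_object(Bucket="example-repo",
                       Key="main/datasets/events/day1.parquet")

# ... while reading through a commit ID (or tag) always returns the exact
# data that existed at that point in time, enabling reproducible runs.
pinned = s3.get_object(Bucket="example-repo",
                       Key="64f2a1c0example/datasets/events/day1.parquet")
```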

CI/CD for data

Pipelines feed data from data lakes to downstream consumers such as business dashboards and machine learning models. Production data needs to comply with corporate data governance requirements such as format validation, schema checks, or the removal of all PII (Personally Identifiable Information) from an organization's data.

To this end, data practitioners must execute Continuous Integration (CI) tests on the data, and the data may only be promoted to production for business usage if all data quality and data governance criteria are met.

lakeFS simplifies the implementation of data CI/CD pipelines with a feature called hooks that allows for the automation of data checks and validations on lakeFS branches. These checks can be triggered by data operations like committing, merging, and so on.

lakeFS hooks function similarly to Git hooks. They are executed remotely on a server and are guaranteed to run when the corresponding event occurs, for example:

  • pre-merge,
  • pre-commit,
  • post-commit,
  • pre-create-branch,
  • post-create-branch.
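
Webhook-style hooks point at an HTTP endpoint that receives the event and can approve or reject it. Below is a minimal sketch of such a validation endpoint in Python; the payload field it inspects is a simplified assumption, not the full lakeFS event schema.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class PreMergeValidator(BaseHTTPRequestHandler):
    """Minimal webhook receiver that a lakeFS pre-merge hook could call.
    A non-2xx response causes the triggering action (e.g. the merge) to fail."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")

        # Simplified, assumed check: require a commit message on the event.
        # Real checks would validate schemas, scan for PII, and so on.
        ok = bool(event.get("commit_message"))

        self.send_response(200 if ok else 412)
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"validation failed")

if __name__ == "__main__":
    HTTPServer(("", 8080), PreMergeValidator).serve_forever()
```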

History

Created by ex-SimilarWeb technology leaders Oz Katz and Einat Orr, lakeFS aimed to integrate existing engineering best practices into data engineering.[5]

The first version of lakeFS was released in August 2020. It included support for AWS S3 as storage and provided Git-like operations over the data lake for any file format. The versioning engine was based on MVCC. Data engineers looking to develop, test, and manage data pipelines in production saw the value of a scalable data version control engine such as lakeFS, and adoption quickly grew.

In 2021, the versioning engine was replaced with one based on Graveler, and the scale lakeFS can handle grew to billions of objects with minimal impact on the performance of data operations.

During 2022, over 1,000 pull requests were merged into the project, including bug fixes, new features, scalability and security enhancements, and documentation improvements. The initial installation experience was also improved: users could spin up a new lakeFS instance with a single docker run command, without any dependencies, databases to manage and maintain, or network setup required.

While lakeFS has always been agnostic to the types of data managed, in 2022 the solution went a step further and provided users with more ways to control how their data is versioned. lakeFS introduced merge strategies that allow users to decide what happens when conflicting changes occur, and Lua-based hooks that let users customize what happens on commit, merge, and branch actions, such as validation, downstream notifications, and triggering external systems, without having to manage a webhook server.

In December 2022, DuckDB, a full-featured, high-performance OLAP database, was embedded in the lakeFS UI, allowing users to explore tabular data objects directly from the web browser.[6]

Alternative solutions to lakeFS

There are several open-source projects that provide similar data version control features to lakeFS, such as Git LFS, Dolt, and DVC. These projects differ in their capacity to meet the many objectives of data engineers and data scientists, such as scalability, data retrieval performance, supported file formats, storage support, and more.


References