Software:Delta Lake (Software)

Delta Lake
Original author(s): Michael Armbrust, Databricks
Initial release: April 2019
Written in: Scala, Python
Operating system: Cross-platform
Type: Data warehouse, Data lake
License: Apache License 2.0

Delta Lake is an open-source storage framework that enables building a data lakehouse architecture with various compute engines and APIs. It brings ACID transactions and scalable metadata handling to big data workloads and addresses common issues with data lakes such as data quality, schema evolution, and concurrency control.[1] Delta Lake is a project under the Linux Foundation and is released under the Apache License.[2]

History

Databricks released Delta Lake as open source in April 2019 as a Linux Foundation project, but kept some features proprietary. In June 2022, Databricks open-sourced all of Delta Lake.[3]

Features

Delta Lake supports multiple compute engines, including Apache Spark, Presto, Apache Flink, Trino, and Apache Hive. It also provides APIs for several programming languages, including Scala, Java, Python, Rust, and Ruby. Its storage format extends Apache Parquet data files with a file-based transaction log that records every change to the data, which helps prevent corruption from failed or concurrent writes.

Architecture

Internally, Delta Lake extends Parquet data files with a file-based transaction log (the "delta log") that records every change to the data and provides ACID transactions. The transaction log consists of JSON files describing the actions performed on the table, such as adding a data file, removing a data file, setting a transaction identifier, and recording commit information. So that readers do not have to replay the entire log, the current state of the table is also captured periodically in checkpoints, which are stored in Parquet format.[4]
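
The log-replay idea can be illustrated with a minimal sketch in plain Python. It is not the Delta Lake implementation and uses no Delta Lake library: it writes a few toy commit files into a `_delta_log` directory as newline-delimited JSON and then replays the add and remove actions in version order to find the data files that are currently part of the table. The helper names (`write_commit`, `replay_log`, `demo`), the Parquet file names, and the stripped-down action fields are assumptions made for illustration; real log entries carry more fields, and real readers also use checkpoints.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit as newline-delimited JSON, one action per line."""
    # Zero-padded names make lexicographic order match version order.
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def replay_log(log_dir: Path) -> set[str]:
    """Replay commits in version order and return the live data file paths."""
    live: set[str] = set()
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other action types are ignored in this simplified sketch.
    return live


def demo() -> set[str]:
    """Build a three-commit toy log and return the resulting live file set."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
        write_commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
        # Version 2 rewrites part-0000 into part-0002 in a single commit.
        write_commit(log_dir, 2, [
            {"remove": {"path": "part-0000.parquet"}},
            {"add": {"path": "part-0002.parquet"}},
        ])
        return replay_log(log_dir)


if __name__ == "__main__":
    print(sorted(demo()))
```

Because each commit is a single file whose actions take effect together, a reader that replays the log up to a given version sees either all of a commit's changes or none of them. A checkpoint plays the role of a saved result of `replay_log` up to some version, so that only later commits need to be replayed.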

References