Delta Lake

Delta Lake
Original author(s): Michael Armbrust, Databricks
Initial release: April 2019
Written in: Scala, Python
Operating system: Cross-platform
Type: Data warehouse, data lake
License: Apache License 2.0

Delta Lake is an open-source storage framework that enables building a data lakehouse architecture with various compute engines and APIs. It brings ACID transactions and scalable metadata handling to big data workloads and addresses common issues with data lakes such as data quality, schema evolution, and concurrency control.[1] Delta Lake is a project under the Linux Foundation and is released under the Apache License.[2]
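
The following is a minimal sketch of how a Delta table can be created and queried from Apache Spark using the delta-spark Python package; the table path and data are illustrative assumptions, not taken from the article.

    # Minimal sketch using PySpark with the delta-spark package (pip install delta-spark).
    # The table path and data below are illustrative assumptions.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-lake-intro")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Every write is an ACID transaction recorded in the table's _delta_log directory.
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/events")

    # The table reads back like any other Spark data source.
    spark.read.format("delta").load("/tmp/events").show()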

History

Delta Lake was created in response to a casual conversation at Spark Summit 2018 between Dominique Brezinski, a distinguished engineer at Apple, and Michael Armbrust, creator of Spark SQL and Spark Structured Streaming. Brezinski asked Armbrust for guidance on how to address the processing demands created by Apple's massive volumes of concurrent batch and streaming workloads, which amounted to petabytes of log and telemetry data per day.

Data warehouses were not a viable option because (i) they were cost-prohibitive for the massive volumes of event data involved, (ii) they did not support the real-time streaming use cases that were essential for intrusion detection, and (iii) they lacked support for advanced machine learning. The team therefore relied on a traditional data lake, the only feasible option at the time. Brezinski said the team struggled with data pipelines failing under the large number of concurrent streaming and batch jobs and could not ensure transactional consistency and data accessibility for all of their data.[3]

Over the following months, Armbrust and his team worked closely with Brezinski's team to build an ingestion architecture designed to solve this large-scale data problem: one that would let the team handle low-latency stream processing and interactive queries easily and reliably, without job failures or reliability issues in the underlying cloud object storage, while enabling Apple's data scientists to process vast amounts of data to detect unusual patterns.

Databricks open-sourced Delta Lake in April 2019 and later donated the project to the Linux Foundation, but kept some features proprietary. In June 2022, Databricks open-sourced all of Delta Lake.[3]

Features

Delta Lake supports multiple compute engines, such as Apache Spark, Presto, Flink, Trino, and Apache Hive. It also provides APIs for several programming languages, including Scala, Java, Python, Rust, and Ruby. Delta Lake extends Apache Parquet data files with a file-based transaction log that tracks every change to the data and prevents data corruption. Its main features include the following (a short example follows the list):

  • ACID transactions
  • DML operations: support for commands such as MERGE, UPDATE, and DELETE
  • Scalable metadata: ability to handle petabyte-scale tables with billions of partitions and files
  • Time travel: access or revert to earlier versions of the data for audits, rollbacks, or reproducing experiments
  • Unified batch and streaming: exactly-once semantics, from ingestion and backfills to interactive queries
  • Schema evolution/enforcement: integrate changing datasets and prevent bad data from causing data corruption[4]
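
As referenced above, the DML commands and time travel are exposed both through SQL and through a DeltaTable API. The following is a hedged sketch using the delta-spark Python API, continuing the session from the earlier sketch; the table path, column names, and predicates are illustrative assumptions.

    from delta.tables import DeltaTable
    from pyspark.sql.functions import lit

    # Assumes the illustrative /tmp/events table from the earlier sketch.
    events = DeltaTable.forPath(spark, "/tmp/events")

    # UPDATE and DELETE run as transactional operations on the table.
    events.update(condition="id = 3", set={"id": "id + 100"})
    events.delete(condition="id = 4")

    # MERGE (upsert) new rows into the table.
    updates = spark.range(3, 8)
    (events.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Time travel: read an earlier version of the table for audits or rollbacks.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

    # Schema enforcement/evolution: a new column is rejected unless mergeSchema is set.
    (spark.range(0, 2).withColumn("source", lit("batch"))
        .write.format("delta").mode("append")
        .option("mergeSchema", "true").save("/tmp/events"))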

Delta Lake 2.0.0 introduced the following features (a short example follows the list):

  • Change Data Feed on Delta tables: when enabled, row-level changes between versions of the table are recorded for every write operation and can be read back as a change feed.
  • Z-Order clustering of data to reduce the amount of data read: Z-Ordering colocates related information in the same set of files, which makes the column statistics introduced in Delta 1.2 more effective at skipping data based on query filters.
  • Idempotent writes to Delta tables, enabling fault-tolerant retries of Delta table writing jobs without writing the data multiple times.
  • Dropping columns in a Delta table as a metadata-only change: the column is removed from the table metadata while the column data remains in the underlying files.
  • Dynamic partition overwrite: overwrite only the partitions that receive data at runtime.
  • Python and Scala API support for OPTIMIZE file compaction and Z-Ordering.[5]
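
A hedged sketch of how several of these 2.0.0 features can be invoked from the Python API; the table path, column name, and starting version are illustrative assumptions.

    from delta.tables import DeltaTable

    # Enable Change Data Feed on an existing table via a table property.
    spark.sql("ALTER TABLE delta.`/tmp/events` "
              "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    # Read the recorded row-level changes; startingVersion must be at or after
    # the version in which the Change Data Feed was enabled.
    changes = (spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)
        .load("/tmp/events"))

    # OPTIMIZE file compaction and Z-Order clustering through the Python API.
    table = DeltaTable.forPath(spark, "/tmp/events")
    table.optimize().executeCompaction()
    table.optimize().executeZOrderBy("id")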

Architecture

Delta Lake works internally by extending Parquet data files with a file-based transaction log (also called the "delta log") that tracks every change to the data and ensures ACID transactions. The transaction log consists of JSON files that record the actions performed on the table, such as add, remove, set transaction, metadata update, and commit info. The transaction log also maintains a snapshot of the current state of the table through checkpoints, stored as Parquet files, that consolidate the log up to a given version.[4]
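
As a rough illustration of this layout, the JSON commit files under a table's _delta_log directory can be listed and inspected directly; the table path below is an illustrative assumption.

    import json
    from pathlib import Path

    # Each zero-padded <version>.json file records the actions of one committed
    # transaction (e.g. commitInfo, metaData, protocol, add, remove),
    # one JSON object per line.
    log_dir = Path("/tmp/events/_delta_log")
    for commit in sorted(log_dir.glob("*.json")):
        print(f"--- {commit.name} ---")
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            print("  ", list(action.keys())[0])  # action type, e.g. 'add' or 'remove'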

References