Delta Lake

Delta Lake
Original author(s): Michael Armbrust, Databricks
Initial release: April 2019
Written in: Scala, Python
Operating system: Cross-platform
Type: Data warehouse, data lake
License: Apache License 2.0

Delta Lake is an open-source storage framework that enables building a data lakehouse architecture with various compute engines and APIs. It brings ACID transactions and scalable metadata handling to big data workloads and addresses common issues with data lakes such as data quality, schema evolution, and concurrency control.[1] Delta Lake is a project under the Linux Foundation and is released under the Apache License.[2]
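
The following is a minimal sketch of how a Delta table can be created and queried from Apache Spark using the delta-spark Python package; the table path and data are illustrative assumptions, not taken from the article.

    # Minimal sketch using PySpark with the delta-spark package (pip install delta-spark).
    # The table path and data below are illustrative assumptions.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-lake-intro")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Every write is an ACID transaction recorded in the table's _delta_log directory.
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/events")

    # The table reads back like any other Spark data source.
    spark.read.format("delta").load("/tmp/events").show()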

History

Delta Lake was created in response to a casual conversation at Spark Summit 2018 between Dominique Brezinski, a distinguished engineer at Apple, and Michael Armbrust, creator of Spark SQL and Spark Structured Streaming. Brezinski asked Armbrust for guidance on how to address the processing demands created by Apple's massive volumes of concurrent batch and streaming workloads, which amounted to petabytes of log and telemetry data per day.

Data warehouses were not a viable option because (i) they were cost-prohibitive for the massive volumes of event data involved, (ii) they did not support the real-time streaming use cases that were essential for intrusion detection, and (iii) they lacked support for advanced machine learning. The team therefore relied on a traditional data lake, the only feasible option at the time. Brezinski said the team struggled with data pipelines failing under the large number of concurrent streaming and batch jobs and could not ensure transactional consistency and data accessibility for all of their data.[3]

Over the following months, Armbrust and his team worked closely with Brezinski's team to build an ingestion architecture designed to solve this large-scale data problem: one that would let the team handle low-latency stream processing and interactive queries easily and reliably, without job failures or reliability issues in the underlying cloud object storage, while enabling Apple's data scientists to process vast amounts of data to detect unusual patterns.

Databricks open-sourced Delta Lake in April 2019 and later donated the project to the Linux Foundation, but kept some features proprietary. In June 2022, Databricks open-sourced all of Delta Lake.[3]

Features

Delta Lake supports multiple compute engines, such as Apache Spark, Presto, Flink, Trino, and Apache Hive. It also provides APIs for several programming languages, including Scala, Java, Python, Rust, and Ruby. Delta Lake extends Apache Parquet data files with a file-based transaction log that tracks every change to the data and prevents data corruption. Its main features include the following (a short example follows the list):

  • ACID transactions
  • DML operations: support for commands such as MERGE, UPDATE, and DELETE
  • Scalable metadata: ability to handle petabyte-scale tables with billions of partitions and files
  • Time travel: access or revert to earlier versions of the data for audits, rollbacks, or reproducing experiments
  • Unified batch and streaming: exactly-once semantics, from ingestion and backfills to interactive queries
  • Schema evolution/enforcement: integrate changing datasets and prevent bad data from causing data corruption[4]
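
As referenced above, the DML commands and time travel are exposed both through SQL and through a DeltaTable API. The following is a hedged sketch using the delta-spark Python API, continuing the session from the earlier sketch; the table path, column names, and predicates are illustrative assumptions.

    from delta.tables import DeltaTable
    from pyspark.sql.functions import lit

    # Assumes the illustrative /tmp/events table from the earlier sketch.
    events = DeltaTable.forPath(spark, "/tmp/events")

    # UPDATE and DELETE run as transactional operations on the table.
    events.update(condition="id = 3", set={"id": "id + 100"})
    events.delete(condition="id = 4")

    # MERGE (upsert) new rows into the table.
    updates = spark.range(3, 8)
    (events.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Time travel: read an earlier version of the table for audits or rollbacks.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

    # Schema enforcement/evolution: a new column is rejected unless mergeSchema is set.
    (spark.range(0, 2).withColumn("source", lit("batch"))
        .write.format("delta").mode("append")
        .option("mergeSchema", "true").save("/tmp/events"))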

Delta Lake 2.0.0 introduced the following features (a short example follows the list):

  • Change Data Feed on Delta tables: when enabled, row-level changes between versions of the table are recorded for every write operation and can be read back as a change feed.
  • Z-Order clustering of data to reduce the amount of data read: Z-Ordering colocates related information in the same set of files, which makes the column statistics introduced in Delta 1.2 more effective at skipping data based on query filters.
  • Idempotent writes to Delta tables, enabling fault-tolerant retries of Delta table writing jobs without writing the data multiple times.
  • Dropping columns in a Delta table as a metadata-only change: the column is removed from the table metadata while the column data remains in the underlying files.
  • Dynamic partition overwrite: overwrite only the partitions that receive data at runtime.
  • Python and Scala API support for OPTIMIZE file compaction and Z-Ordering.[5]
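
A hedged sketch of how several of these 2.0.0 features can be invoked from the Python API; the table path, column name, and starting version are illustrative assumptions.

    from delta.tables import DeltaTable

    # Enable Change Data Feed on an existing table via a table property.
    spark.sql("ALTER TABLE delta.`/tmp/events` "
              "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    # Read the recorded row-level changes; startingVersion must be at or after
    # the version in which the Change Data Feed was enabled.
    changes = (spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)
        .load("/tmp/events"))

    # OPTIMIZE file compaction and Z-Order clustering through the Python API.
    table = DeltaTable.forPath(spark, "/tmp/events")
    table.optimize().executeCompaction()
    table.optimize().executeZOrderBy("id")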

Architecture

Delta Lake works internally by extending Parquet data files with a file-based transaction log (also called the "delta log") that tracks every change to the data and ensures ACID transactions. The transaction log consists of JSON files that record the actions performed on the table, such as add, remove, set transaction, metadata update, and commit info. The transaction log also maintains a snapshot of the current state of the table through checkpoints, stored as Parquet files, that consolidate the log up to a given version.[4]
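
As a rough illustration of this layout, the JSON commit files under a table's _delta_log directory can be listed and inspected directly; the table path below is an illustrative assumption.

    import json
    from pathlib import Path

    # Each zero-padded <version>.json file records the actions of one committed
    # transaction (e.g. commitInfo, metaData, protocol, add, remove),
    # one JSON object per line.
    log_dir = Path("/tmp/events/_delta_log")
    for commit in sorted(log_dir.glob("*.json")):
        print(f"--- {commit.name} ---")
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            print("  ", list(action.keys())[0])  # action type, e.g. 'add' or 'remove'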

References