Delta Lake Series, Part 0: Overview

The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.

The Data Lake Reliability Problem

Object storage (S3, GCS, ADLS) is cheap, scalable, and durable. Parquet is a fast, columnar file format. Together they seem like the perfect foundation for a data lake — until you try to use them for anything beyond simple append-only batch pipelines.

The problems show up quickly:

No atomicity. Writing a dataset means uploading many Parquet files. If the job crashes midway, you are left with a partial write. Readers see corrupted or incomplete data.

No isolation. If a reader queries the table while a writer is uploading new files, it sees a mix of old and new data — or worse, reads files that are about to be deleted.

No schema enforcement. Nothing stops a job from writing a file with the wrong column types. The error surfaces hours later in a downstream query.

Expensive updates and deletes. Parquet files are immutable. To update a row, you must rewrite the entire file. There is no built-in mechanism to track what changed.

No history. Once a file is overwritten, the previous data is gone. Reproducing last week’s report requires careful manual archiving.

Delta Lake solves all of these by adding a transaction log on top of Parquet and object storage.

What Delta Lake Is

Delta Lake is an open-source storage layer that sits between your compute engine and your Parquet files. It adds:

  • ACID transactions: writes are atomic — either all files are committed or none are
  • Serializable isolation: readers always see a consistent snapshot; concurrent writers are coordinated
  • Schema enforcement: writes that violate the table schema are rejected at write time
  • Schema evolution: controlled, auditable schema changes without rewriting data
  • Time travel: query any previous version of the table by timestamp or version number
  • Audit history: a complete log of every operation ever performed on the table
  • Scalable metadata: handles tables with millions of files without listing overhead
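
A minimal PySpark sketch of a few of these guarantees, assuming a SparkSession named spark created with the delta-spark package (and its SQL extension and catalog settings); the path and columns are illustrative:

# Assumes `spark` was built with delta-spark, e.g.
#   spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
#   spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
from pyspark.sql import Row

path = "/tmp/tables/events"                                    # illustrative location

df = spark.createDataFrame([Row(event_id=1, kind="click")])
df.write.format("delta").save(path)                            # atomic commit -> version 0
df.write.format("delta").mode("append").save(path)             # another commit -> version 1

# Time travel: read the table exactly as it was at version 0
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema enforcement: a write with the wrong types is rejected at write time
bad = spark.createDataFrame([Row(event_id="oops", kind="click")])
# bad.write.format("delta").mode("append").save(path)          # raises AnalysisException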

Delta Lake stores data as ordinary Parquet files. The only addition is a _delta_log/ directory containing JSON and Parquet checkpoint files — the transaction log that tracks every change.

s3://my-bucket/tables/events/
├── _delta_log/
│   ├── 00000000000000000000.json   ← version 0 (table creation)
│   ├── 00000000000000000001.json   ← version 1 (first insert)
│   ├── 00000000000000000002.json   ← version 2 (second insert)
│   └── 00000000000000000010.checkpoint.parquet   ← checkpoint (table state through version 10)
├── part-00000-abc123.snappy.parquet
├── part-00001-def456.snappy.parquet
└── part-00002-ghi789.snappy.parquet
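
Each numbered JSON file is one commit: newline-delimited action records such as commitInfo, metaData, add, and remove. A small sketch for peeking at one with plain Python, assuming the log has been copied to a local path:

import json

# Illustrative local copy of the log shown above
log_file = "tables/events/_delta_log/00000000000000000001.json"

with open(log_file) as f:
    for line in f:
        action = json.loads(line)      # one action object per line
        print(list(action.keys()))     # e.g. ['commitInfo'], ['add']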

Because Delta Lake is just Parquet + a log, any tool that can read Parquet can read Delta Lake files directly (though it won’t understand the transaction semantics without the Delta Lake library).
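
As an example, a plain Parquet reader such as pandas can open an individual data file directly; this bypasses the log, so it may pick up files that a later commit has logically removed (the file name is the illustrative one from the listing above):

import pandas as pd

# Reads a single data file with no knowledge of the transaction log
df = pd.read_parquet("tables/events/part-00000-abc123.snappy.parquet")
print(df.head())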

Delta Lake vs Apache Iceberg vs Apache Hudi

All three are open table formats that add ACID and metadata management to object storage. They solve the same core problem but with different trade-offs:

                    Delta Lake                        Apache Iceberg                   Apache Hudi
Origin              Databricks (2019)                 Netflix (2018)                   Uber (2019)
Open governance     Linux Foundation                  Apache Software Foundation       Apache Software Foundation
Metadata format     JSON + Parquet checkpoints        Avro + Parquet manifests         Avro timeline
Primary compute     Spark (+ Flink, Trino)            Spark, Flink, Trino, Hive        Spark (+ Flink)
Row-level updates   Merge-on-read or copy-on-write    Copy-on-write or merge-on-read   Merge-on-read (native)
Time travel         Yes (version or timestamp)        Yes (snapshot ID or timestamp)   Yes (commit timeline)
Streaming writes    Yes (Structured Streaming)        Yes                              Yes (native design goal)
Ecosystem maturity  Largest (Databricks, Azure, GCP)  Growing fast (AWS default)       Strong at Uber-scale

Delta Lake is the easiest starting point if you are already using Spark or Databricks. It has the largest ecosystem and the most mature SQL support.

Iceberg is the better choice for multi-engine environments (you want Trino, Flink, and Spark all reading the same tables), or if you need AWS Glue / S3 Tables integration out of the box.

Hudi has native streaming write support and was designed for low-latency CDC ingestion at scale — Uber’s original use case.

When to Use Delta Lake

Delta Lake is the right choice when you need:

  • Reliable batch pipelines on object storage without worrying about partial writes
  • Upserts and deletes on large datasets (GDPR compliance, CDC from databases); a MERGE sketch follows this list
  • Time travel for auditing, reproducible reports, or recovering from bad writes
  • Schema governance — enforcing structure on a shared data lake
  • Streaming + batch unified — the same table receives streaming writes and serves batch queries
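
A sketch of the upsert case using the DeltaTable merge API from delta-spark; the table path, the updates DataFrame, and the event_id join key are illustrative assumptions:

from delta.tables import DeltaTable

# `updates` is assumed to be a DataFrame of new and changed rows
target = DeltaTable.forPath(spark, "s3://my-bucket/tables/events")

(target.alias("t")
    .merge(updates.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()        # existing keys: overwrite the row
    .whenNotMatchedInsertAll()     # new keys: insert the row
    .execute())                    # lands as a single atomic commit in the log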

Delta Lake is not a query engine. It does not replace Spark, Trino, or DuckDB — it is the storage format those engines read and write.

Series Roadmap

  • Part 1 — Getting Started: Creating tables, reads, writes, and the _delta_log in practice
  • Part 2 — Transaction Log & ACID: How the log enables atomicity, isolation, and conflict resolution
  • Part 3 — Schema Enforcement & Evolution: Schema validation on write and controlled evolution
  • Part 4 — Time Travel & Versioning: Querying history, rollback, and retention management
  • Part 5 — Performance Optimization: OPTIMIZE, Z-ordering, data skipping, and partitioning
  • Part 6 — Streaming & CDC: Structured Streaming writes, exactly-once, and Change Data Feed

Key Takeaways

  • Plain Parquet + object storage lacks atomicity, isolation, schema enforcement, and history
  • Delta Lake adds a transaction log on top of Parquet — no new file format, just a _delta_log/ directory
  • It provides ACID transactions, time travel, schema enforcement, and scalable metadata
  • Delta Lake, Iceberg, and Hudi solve the same problem; Delta Lake has the largest Spark/Databricks ecosystem

Next: Getting Started
