Spark
20 items using Spark
Projects
Data Engineering Garden — Knowledge Base
A public digital garden of data engineering notes, concepts, and guides — built with Quartz v4 and published as a static site.
Lakehouse Platform
Featured
A self-service data lakehouse built on Databricks and Delta Lake, unifying batch and streaming workloads with a single storage layer.
Blog Posts
Delta Lake Series, Part 6: Streaming & CDC
Writing to Delta with Structured Streaming, exactly-once guarantees, reading Delta as a stream, and Change Data Feed for downstream propagation.
Delta Lake Series, Part 5: Performance Optimization
Making Delta Lake queries fast — OPTIMIZE, Z-ordering, data skipping with column statistics, compaction, and partitioning strategies.
Delta Lake Series, Part 4: Time Travel & Versioning
Querying historical snapshots by version or timestamp, rolling back bad writes, auditing the table history, and managing retention with VACUUM.
Delta Lake Series, Part 3: Schema Enforcement & Evolution
How Delta Lake validates schemas on write, rejects incompatible data, and handles controlled schema changes over time.
Delta Lake Series, Part 2: Transaction Log & ACID
How the Delta Lake transaction log enables atomicity, serializable isolation, optimistic concurrency, and conflict resolution.
Delta Lake Series, Part 1: Getting Started
Creating Delta tables, reading and writing with Spark, Delta SQL, and what the _delta_log looks like in practice.
Delta Lake Series, Part 0: Overview
The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.
Spark Streaming Series, Part 5: Operations and Tuning
Checkpointing, fault tolerance, exactly-once semantics, monitoring, and production performance tuning.
Spark Streaming Series, Part 4: Stateful Processing
Per-key state tracking across events, timeouts, and RocksDB state stores for complex streaming logic.
Spark Streaming Series, Part 3: Time, Watermarks, and Windows
Event time vs processing time, watermarks to handle late data, and window types for time-based aggregations.
Spark Streaming Series, Part 2: Sources and Sinks
Reading from Kafka and files, writing to Delta Lake and databases — the connectors that power real-time pipelines.
Spark Streaming Series, Part 1: Structured Streaming Fundamentals
The unbounded table model — how Structured Streaming treats a stream as an infinite DataFrame, with output modes, triggers, and writing results out.
Spark Streaming Series, Part 0: Overview
Stream processing with Apache Spark — from basics to Structured Streaming, the modern architecture for real-time data pipelines.
Spark Series, Part 4: Performance Tuning
Making Spark jobs fast — partitioning, shuffles, skew, caching, and the most common bottlenecks in production.
Spark Series, Part 3: Structured Streaming
Real-time data processing with Spark Structured Streaming — micro-batches, triggers, watermarks, and output modes.
Spark Series, Part 2: DataFrames and Spark SQL
The practical Spark API — working with structured data using DataFrames, schemas, and SQL queries.
Spark Series, Part 1: RDDs and the Execution Model
Understanding Resilient Distributed Datasets — the foundation of Spark's execution model, transformations, actions, and lazy evaluation.
Spark Series, Part 0: Overview
A high-level introduction to Apache Spark — what it is, why it exists, and where it fits in the modern data stack.