Spark
20 items using Spark
Projects
Data Engineering Garden — Knowledge Base
A public digital garden of data engineering notes, concepts, and guides — built with Quartz v4 and published as a static site.
Lakehouse Platform
Featured
A self-service data lakehouse built on Databricks and Delta Lake, unifying batch and streaming workloads with a single storage layer.
Blog Posts
Delta Lake Series, Part 6: Streaming & CDC
Writing to Delta with Structured Streaming, exactly-once guarantees, reading Delta as a stream, and Change Data Feed for downstream propagation.
Delta Lake Series, Part 5: Performance Optimization
Making Delta Lake queries fast — OPTIMIZE, Z-ordering, data skipping with column statistics, compaction, and partitioning strategies.
Delta Lake Series, Part 4: Time Travel & Versioning
Querying historical snapshots by version or timestamp, rolling back bad writes, auditing the table history, and managing retention with VACUUM.
Delta Lake Series, Part 3: Schema Enforcement & Evolution
How Delta Lake validates schemas on write, rejects incompatible data, and handles controlled schema changes over time.
Delta Lake Series, Part 2: Transaction Log & ACID
How the Delta Lake transaction log enables atomicity, serializable isolation, optimistic concurrency, and conflict resolution.
Delta Lake Series, Part 1: Getting Started
Creating Delta tables, reading and writing with Spark, Delta SQL, and what the _delta_log looks like in practice.
Delta Lake Series, Part 0: Overview
The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.
Spark Streaming Series, Part 5: Operations and Tuning
Checkpointing, fault tolerance, exactly-once semantics, monitoring, and production performance tuning.
Spark Streaming Series, Part 4: Stateful Processing
Per-key state tracking across events, timeouts, and RocksDB state stores for complex streaming logic.
Spark Streaming Series, Part 3: Time, Watermarks, and Windows
Event time vs processing time, watermarks to handle late data, and window types for time-based aggregations.
Spark Streaming Series, Part 2: Sources and Sinks
Reading from Kafka and files, writing to Delta Lake and databases — the connectors that power real-time pipelines.
Spark Streaming Series, Part 1: Structured Streaming Fundamentals
The unbounded table model — how Structured Streaming treats a stream as an infinite DataFrame, with output modes, triggers, and writing results out.
Spark Streaming Series, Part 0: Overview
Stream processing with Apache Spark — from basics to Structured Streaming, the modern architecture for real-time data pipelines.
Spark Series, Part 4: Performance Tuning
Making Spark jobs fast — partitioning, shuffles, skew, caching, and the most common bottlenecks in production.
Spark Series, Part 3: Structured Streaming
Real-time data processing with Spark Structured Streaming — micro-batches, triggers, watermarks, and output modes.
Spark Series, Part 2: DataFrames and Spark SQL
The practical Spark API — working with structured data using DataFrames, schemas, and SQL queries.
Spark Series, Part 1: RDDs and the Execution Model
Understanding Resilient Distributed Datasets — the foundation of Spark's execution model, transformations, actions, and lazy evaluation.
Spark Series, Part 0: Overview
A high-level introduction to Apache Spark — what it is, why it exists, and where it fits in the modern data stack.