Writing about data platforms, distributed systems, and machine learning.
The complete lifecycle of a write — from index request to durable disk storage. Translog as Elasticsearch's WAL, refresh vs flush, and tuning durability vs throughput.
What changes when threads become cheap? Understand carrier threads, continuations, pinning, StructuredTaskScope, and how virtual threads flip the economics of I/O-bound Java services.
Node roles, primary vs replica shards, the write path from primary to replicas, split-brain prevention with quorum, and observing cluster recovery under node failure.
Why can't you just wrap a HashMap in synchronized? Explore ConcurrentHashMap's node-level locking, CopyOnWriteArrayList's snapshot semantics, BlockingQueue variants, and a contention benchmark.
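A JDK-only sketch of the per-key atomicity the post explores (class and method names here are my own): four threads increment the same key with `merge()`, an atomic read-modify-write, so no update is lost, whereas a plain `HashMap` under the same load would drop increments or corrupt its buckets.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MergeCounter {
    // Four threads hammer one key. merge() is an atomic read-modify-write
    // on a single bin, so every increment survives the race.
    static int concurrentCount() {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            pool.execute(() -> {
                for (int i = 0; i < 10_000; i++) {
                    counts.merge("hits", 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(concurrentCount()); // 40000
    }
}
```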
How MySQL handles concurrent connections — thread pool, connection limits, resource management, and why connection pooling is essential.
PostgreSQL's forking process model, backend lifecycle, connection pooling with PgBouncer, background workers, and resource limits per connection.
How aggregations work internally using doc_values, bucket vs metric vs pipeline aggs, cardinality approximation with HyperLogLog++, and building analytics dashboards.
Why does tuning your thread pool size matter? Understand ThreadPoolExecutor internals, queue types, rejection policies, ForkJoinPool work-stealing, and CompletableFuture async pipelines.
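As a taste of the internals the post covers, here is a minimal pool built by hand rather than via `Executors` factories (the sizing numbers are illustrative): a bounded queue plus `CallerRunsPolicy` means a saturated pool throttles the submitting thread instead of throwing `RejectedExecutionException`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    // Core 2, max 4, idle keep-alive 60s, a bounded queue of 100 tasks,
    // and CallerRunsPolicy as the back-pressure valve when all are full.
    static int submitOne() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4,
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(100),
                new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            return pool.submit(() -> 6 * 7).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitOne()); // 42
    }
}
```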
How MySQL survives crashes and enables replication — binary log format, GTID, crash recovery, and streaming replication.
How PostgreSQL guarantees durability with WAL, recovers from crashes, and replicates to standbys using streaming replication and logical slots.
Query vs filter context, bool query anatomy, leaf queries, pagination strategies, and building a real product search from scratch.
How can you update shared state without any lock? Understand the CAS CPU instruction, AtomicInteger, the ABA problem, LongAdder's striping trick, and VarHandle memory access modes.
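A small sketch of the two lock-free styles discussed (names are my own): the canonical CAS retry loop on `AtomicInteger`, and `LongAdder`, which stripes updates across cells and sums them on read to reduce contention on write-heavy counters.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

public class CasLoop {
    // The canonical CAS retry loop: read the current value, compute the
    // update, then attempt compareAndSet; if another thread won the race,
    // loop and retry against the fresh value.
    static int addTen(AtomicInteger counter) {
        int prev, next;
        do {
            prev = counter.get();
            next = prev + 10;
        } while (!counter.compareAndSet(prev, next));
        return next;
    }

    public static void main(String[] args) {
        System.out.println(addTen(new AtomicInteger(5))); // 15

        // LongAdder trades one precise cell for striped cells summed
        // lazily on read -- cheaper under heavy concurrent writes.
        LongAdder adder = new LongAdder();
        adder.add(3);
        adder.add(4);
        System.out.println(adder.sum()); // 7
    }
}
```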
How InnoDB caches pages in memory — buffer pool, LRU eviction, dirty pages, checkpoints, and sizing.
How PostgreSQL manages memory — shared buffers, eviction policies, dirty pages, checkpoints, WAL buffers, and optimal sizing.
How a search query flows from client to shards and back, how BM25 calculates relevance scores, and how to debug scoring with the _explain API.
Why does ReentrantLock exist if synchronized works? Explore tryLock, StampedLock optimistic reads, Condition variables, and LockSupport — the primitives that underpin all of java.util.concurrent.
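One concrete answer to the question, sketched with placeholder names: `tryLock()` acquires only if the lock is free right now and returns a boolean instead of blocking, something `synchronized` simply cannot express.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TryLockSketch {
    // tryLock() never blocks: acquire if free, otherwise report busy
    // so the caller can back off or do other work.
    static String attempt(ReentrantLock lock) {
        if (lock.tryLock()) {
            try {
                return "acquired";
            } finally {
                lock.unlock(); // always release in finally
            }
        }
        return "busy";
    }

    public static void main(String[] args) {
        System.out.println(attempt(new ReentrantLock())); // acquired
    }
}
```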
How MySQL optimizes queries — cost model, statistics, execution plans, and steering the optimizer with hints.
How the planner estimates costs, uses statistics, chooses join strategies, and why it sometimes picks seq scans over indexes.
How Elasticsearch stores fields in multiple representations — _source, inverted index, doc_values, fielddata — and why the wrong mapping kills performance.
What does synchronized actually do at the JVM level? Explore object headers, lock state inflation from biased to fat locks, wait/notify semantics, and deadlock diagnosis with jstack.
How MySQL indexes work — B-tree structure, clustered vs secondary, covering indexes, adaptive hash indexes, and index fragmentation.
How PostgreSQL indexes work — B-tree structure, scans, deduplication, index types, bloat detection, and when the planner uses them.
How Elasticsearch splits indexes into shards, how each shard is a Lucene index made of immutable segments, and why refresh interval controls search freshness.
Why can a thread see stale data written by another? Understand CPU caches, write buffers, instruction reordering, and the happens-before relation that makes volatile work.
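A minimal safe-publication sketch of the happens-before guarantee (class and method names are illustrative): the writer's volatile store to `ready` happens-before the reader's volatile load that observes it, so the plain write to `data` is guaranteed visible too; without `volatile`, the reader could spin forever or see a stale zero.

```java
public class SafePublication {
    static int data = 0;
    static volatile boolean ready = false;

    // The reader can never observe ready == true with a stale data == 0:
    // the volatile write/read pair orders the plain write to data.
    static int publish() {
        Thread writer = new Thread(() -> {
            data = 42;    // plain write, published by the next line
            ready = true; // volatile write
        });
        writer.start();
        while (!ready) {
            Thread.onSpinWait(); // volatile read on each iteration
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(publish()); // 42
    }
}
```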
How MySQL isolates transactions — MVCC, undo logs, transaction IDs, isolation levels, and the purge thread.
How PostgreSQL handles concurrent transactions — xmin/xmax visibility rules, snapshots, isolation levels, and the vacuum process.
Operating Debezium in production: offset management, failure recovery, monitoring connector lag, replication slot health, rebalancing, and the operational patterns that keep CDC pipelines healthy.
Single Message Transforms (SMTs) for reshaping, filtering, and routing CDC events. Field extraction, topic routing, sensitive data masking, and when to reach for a stream processor.
Running Pekko in production: Kafka connectors, OpenTelemetry distributed tracing, health checks, dispatcher tuning, Kubernetes deployment, and migrating from Akka.
How Debezium captures existing data before streaming live changes. All snapshot modes explained — initial, never, always, when_needed — plus isolation guarantees and large-table strategies.
How Elasticsearch stores text for full-text search — inverted index structure, analyzers, tokenizers, token filters, and practical inspection with _analyze and _termvectors.
What really happens when you call new Thread().start()? Trace the path from Java to the OS kernel, understand thread lifecycle states, and use jstack to observe live threads.
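The lifecycle states the post traces can be observed directly from code (a minimal sketch; names are my own): `NEW` before `start()`, `TIMED_WAITING` while parked in `sleep()`, `TERMINATED` after the run method returns. The `start()` call is where the JVM asks the OS for a native thread.

```java
public class ThreadStates {
    // Walk one thread through its lifecycle and record each state.
    static String observeStates() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(1_000);
            } catch (InterruptedException ignored) {
            }
        });
        StringBuilder states = new StringBuilder(t.getState().toString()); // NEW
        try {
            t.start();                               // native thread created here
            Thread.sleep(200);                       // let it reach sleep()
            states.append(",").append(t.getState()); // TIMED_WAITING
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        states.append(",").append(t.getState());     // TERMINATED
        return states.toString();
    }

    public static void main(String[] args) {
        System.out.println(observeStates()); // NEW,TIMED_WAITING,TERMINATED
    }
}
```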
How InnoDB stores data on disk — page structure, row format, clustered indexes, B-trees, and why the primary key matters.
Separating write models from read models with CQRS. Pekko Projection — consuming the event journal to build materialized views, exactly-once processing, and offset tracking.
How PostgreSQL stores data on disk — page structure, tuple anatomy, alignment, TOAST, and practical inspection with pageinspect.
What happens when someone alters a table. DDL propagation, Schema Registry integration, breaking vs non-breaking changes, and strategies to evolve without downtime.
Running Pekko across multiple JVMs. Cluster membership, the gossip protocol, cluster sharding for stateful actors, and singleton actors — all with practical configuration examples.
Master Claude Code with effective prompting, context management, and workflow patterns.
Landing CDC events into open table formats. Upsert and delete semantics with Delta Lake MERGE, Iceberg MERGE INTO, partition strategies, and JDBC sink for relational targets.
Protocol Buffers, generated Pekko service stubs, server and client setup, and bidirectional streaming. When to use gRPC instead of REST and how to run both side by side.
Create reusable, shareable automation with Claude Skills.
Deep dive into PostgreSQL (pgoutput) and MySQL (binlog) source connectors. Configuration reference, behavioral differences, and connector-specific gotchas.
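As a taste of what the configuration reference covers, a minimal PostgreSQL source connector registration might look like the following. Hostnames, database names, table lists, and credentials are placeholders, and defaults vary by Debezium version:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "slot.name": "debezium_slot",
    "table.include.list": "public.orders"
  }
}
```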
Build REST APIs with pekko-http's routing DSL. HTTP server setup, route composition, request and response marshalling, and integrating HTTP endpoints with an actor system.
Connect Claude Code to external services and expand its capabilities with MCP.
Dissecting every field in a Debezium change event — before, after, op, source metadata, tombstones, and how the Kafka message key is structured.
Source, Flow, and Sink — the building blocks of Pekko Streams. Backpressure by design, composable pipelines, and how to process data without dropping messages or crashing.
Extend Claude Code with CLAUDE.md, slash commands, hooks, and automation.
Hands-on Docker Compose setup with PostgreSQL, Kafka, Kafka Connect, and the Debezium connector. See your first change event in under 10 minutes.
How EventSourcedBehavior works in Pekko: journals, snapshots, and recovery. Build actors whose state survives restarts by recording every change as an immutable event.
Use hooks to automate code formatting, block edits to protected files, get notified when Claude needs input, and enforce project rules.
Log-based vs query-based CDC, how PostgreSQL WAL and MySQL binlog work, what Debezium reads, and at-least-once delivery guarantees explained.
How actors start, fail, and recover. Parent-child supervision hierarchies, restart vs stop vs escalate strategies, and building self-healing systems in Pekko.
Installation, API setup, and your first Claude Code commands.
A practical guide to Change Data Capture with Debezium — from WAL internals to Delta Lake and Iceberg sinks. What you'll learn and why CDC matters.
A roadmap through Elasticsearch 8.x internals — from inverted indexes to cluster replication. Why learning the engine makes you a better search engineer.
A roadmap through Java concurrency — from threads and the memory model to virtual threads. Why getting concurrency right is hard, and what you'll learn.
A roadmap through MySQL 8.4 LTS internals — from storage engines to replication. Why understanding the engine matters and what you'll learn.
A practical guide to building concurrent, distributed, and resilient systems with Apache Pekko — the open-source fork of Akka. What you'll learn and why Pekko matters.
A roadmap through PostgreSQL 18 internals — from storage to replication. Why learning the engine matters and what you'll build.
What an actor is, how message passing replaces shared state, and how to create your first ActorSystem in Scala with Pekko Typed.
An introduction to Claude Code and how it differs from GitHub Copilot and Cursor.
What Kubernetes is, what problem it solves over bare metal and Docker, and a roadmap for running data workloads on K8s.
Pods, Deployments, Services, ConfigMaps, and Namespaces — the essential vocabulary every K8s user must know.
PersistentVolumes, PersistentVolumeClaims, StorageClasses, Secrets, and ConfigMaps — how stateful data workloads survive pod restarts.
StatefulSets, Jobs, CronJobs, and DaemonSets — the right workload type for each data engineering use case.
Deploying Flink with the Flink Kubernetes Operator and Kafka with Strimzi — the streaming stack on K8s.
Resource quotas, autoscaling (HPA/KEDA), monitoring with Prometheus and Grafana, and cluster cost management for data platforms.
Submitting Spark jobs natively to K8s, the Spark Operator, executor resource sizing, and shuffle storage.
Batch and real-time model inference, Databricks Model Serving endpoints, and orchestrating the full ML pipeline with Databricks Workflows.
Querying Iceberg from Trino, Flink, and DuckDB; expiring snapshots; rewriting data files; and keeping Iceberg tables healthy in production.
Tracking experiments, logging models and artifacts, comparing runs, and managing the model lifecycle with MLflow on Databricks.
How MERGE, UPDATE, and DELETE work in Iceberg — copy-on-write vs merge-on-read, when to use each, and the performance trade-offs.
Databricks Feature Store, FeatureEngineeringClient, FeatureLookup, training sets, and eliminating training-serving skew.
Partition transforms that derive partition values automatically, partition evolution that changes strategy without rewriting data, and why these are Iceberg's biggest ergonomic wins.
cloudFiles format, schema inference, schema evolution, and building robust incremental ingestion pipelines on Databricks.
How Hive, Glue, REST, and Nessie catalogs coordinate multi-engine access to Iceberg tables — and why the catalog abstraction is Iceberg's biggest differentiator.
Unity Catalog for governance and discovery, the medallion Bronze/Silver/Gold pattern, and Delta tables as the storage foundation.
The four-layer metadata hierarchy — table metadata, manifest lists, manifest files, and data files — and how it enables efficient scans and snapshot isolation.
Navigating the Databricks workspace, launching clusters, writing notebooks, and submitting your first PySpark job.
Creating Iceberg tables with Spark, reads, writes, MERGE, time travel, and inspecting table history.
The lakehouse platform concept, what Databricks adds on top of Spark and Delta Lake, and how it compares to alternatives.
What is Apache Iceberg, how does it differ from Delta Lake and Hudi, and why is multi-engine interoperability its defining advantage?
Writing to Delta with Structured Streaming, exactly-once guarantees, reading Delta as a stream, and Change Data Feed for downstream propagation.
Making Delta Lake queries fast — OPTIMIZE, Z-ordering, data skipping with column statistics, compaction, and partitioning strategies.
Querying historical snapshots by version or timestamp, rolling back bad writes, auditing the table history, and managing retention with VACUUM.
How Delta Lake validates schemas on write, rejects incompatible data, and handles controlled schema changes over time.
How the Delta Lake transaction log enables atomicity, serializable isolation, optimistic concurrency, and conflict resolution.
Creating Delta tables, reading and writing with Spark, Delta SQL, and what the _delta_log looks like in practice.
The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.
Pre-aggregation with materialized views, replication with ReplicatedMergeTree, sharding with Distributed tables, and production monitoring.
Making ClickHouse queries faster — profiling with system.query_log, projections, query patterns, and what actually moves the needle.
How ClickHouse actually stores data — parts, granules, the sparse primary index, data-skipping indexes, and the background merge process.
Getting data into ClickHouse efficiently — batch inserts, async inserts, the Kafka table engine, S3 integration, and ingestion best practices.
Choosing the right data types, ORDER BY key, partitioning strategy, and TTL — the decisions that determine query performance before a single query runs.
The storage engine family at the heart of ClickHouse — MergeTree and its specialized variants for deduplication, aggregation, and updates.
What is ClickHouse, how does columnar storage work, and when should you use it? A roadmap for the series.
Stream processing natively inside Kafka — KStream vs KTable, stateful aggregations, joins, windowing, and state stores.
Moving data in and out of Kafka without writing custom code — connectors, transforms, and running Connect in production.
Replication, in-sync replicas, durability guarantees, and operational concerns for running Kafka in production.
Reading from Kafka at scale — consumer groups, partition assignment, offset commits, and handling rebalances.
Writing to Kafka reliably — the producer API, batching, compression, delivery guarantees, and idempotent producers.
The core data model behind Kafka — how topics are structured, why partitions matter, and how offsets track consumer position.
What is Apache Kafka, what problem does it solve, and when should you use it? A roadmap for the series.
Checkpointing, fault tolerance, exactly-once semantics, monitoring, and production performance tuning.
Per-key state tracking across events, timeouts, and RocksDB state stores for complex streaming logic.
Event time vs processing time, watermarks to handle late data, and window types for time-based aggregations.
Reading from Kafka and files, writing to Delta Lake and databases — the connectors that power real-time pipelines.
Making Flink production-ready — diagnosing backpressure, tuning parallelism, sizing network buffers, and monitoring with metrics.
The unbounded table model — how Structured Streaming treats streams as infinite DataFrames, with output modes, triggers, and how results are written.
How Flink guarantees end-to-end correctness after failures — Chandy-Lamport barriers, two-phase commit, checkpoints vs savepoints.
Stream processing with Apache Spark — from basics to Structured Streaming, the modern architecture for real-time data pipelines.
How Flink stores and manages state — keyed vs operator state, state backends, TTL, and practical stateful patterns.
Flink's most powerful feature — temporal reasoning over streams. Event time, watermarks, and window types explained.
The fundamental building block of Flink — how to read, transform, and write streams using the DataStream API.
What is Apache Flink, what problem does it solve, and how does it differ from Spark Streaming? A roadmap for the series.
Making Spark jobs fast — partitioning, shuffles, skew, caching, and the most common bottlenecks in production.
Real-time data processing with Spark Structured Streaming — micro-batches, triggers, watermarks, and output modes.
The practical Spark API — working with structured data using DataFrames, schemas, and SQL queries.
Understanding Resilient Distributed Datasets — the foundation of Spark's execution model, transformations, actions, and lazy evaluation.
A high-level introduction to Apache Spark — what it is, why it exists, and where it fits in the modern data stack.
A deep dive into Flink's checkpointing mechanism and how it guarantees exactly-once processing even when jobs fail and restart.
Lessons from building internal data platforms: what makes them last, what kills them, and the principles I try to apply.