Blog Posts

Debezium Series, Part 9: Production Concerns

Operating Debezium in production: offset management, failure recovery, monitoring connector lag, replication slot health, rebalancing, and the operational patterns that keep CDC pipelines healthy.

Debezium Series, Part 8: Transforms & Routing

Single Message Transforms (SMTs) for reshaping, filtering, and routing CDC events. Field extraction, topic routing, sensitive data masking, and when to reach for a stream processor.

Apache Pekko Series, Part 9: Production Best Practices

Running Pekko in production: Kafka connectors, OpenTelemetry distributed tracing, health checks, dispatcher tuning, Kubernetes deployment, and migrating from Akka.

Debezium Series, Part 7: Snapshotting

How Debezium captures existing data before streaming live changes. All snapshot modes explained — initial, never, always, when_needed — plus isolation guarantees and large-table strategies.

Apache Pekko Series, Part 8: CQRS & Projections

Separating write models from read models with CQRS. Pekko Projection — consuming the event journal to build materialized views, exactly-once processing, and offset tracking.

Debezium Series, Part 6: Handling Schema Changes

What happens when someone alters a table. DDL propagation, Schema Registry integration, breaking vs non-breaking changes, and strategies to evolve without downtime.

Apache Pekko Series, Part 7: Clustering & Distributed Actors

Running Pekko across multiple JVMs. Cluster membership, the gossip protocol, cluster sharding for stateful actors, and singleton actors — all with practical configuration examples.

Debezium Series, Part 5: Sink Connectors — Delta Lake & Iceberg

Landing CDC events into open table formats. Upsert and delete semantics with Delta Lake MERGE, Iceberg MERGE INTO, partition strategies, and JDBC sink for relational targets.

Apache Pekko Series, Part 6: gRPC with Pekko

Protocol Buffers, generated Pekko service stubs, server and client setup, and bidirectional streaming. When to use gRPC instead of REST and how to run both side by side.

Debezium Series, Part 4: Source Connectors — PostgreSQL & MySQL

Deep dive into PostgreSQL (pgoutput) and MySQL (binlog) source connectors. Configuration reference, behavioral differences, and connector-specific gotchas.

Apache Pekko Series, Part 5: HTTP with Pekko

Build REST APIs with pekko-http's routing DSL. HTTP server setup, route composition, request and response marshalling, and integrating HTTP endpoints with an actor system.

Debezium Series, Part 3: Change Event Anatomy

Dissecting every field in a Debezium change event — before, after, op, source metadata, tombstones, and how the Kafka message key is structured.

Apache Pekko Series, Part 4: Streams & Reactive Processing

Source, Flow, and Sink — the building blocks of Pekko Streams. Backpressure by design, composable pipelines, and how to process data without dropping messages or crashing.

Debezium Series, Part 2: Setting Up Debezium

Hands-on Docker Compose setup with PostgreSQL, Kafka, Kafka Connect, and the Debezium connector. See your first change event in under 10 minutes.

Apache Pekko Series, Part 3: Persistence & Event Sourcing

How EventSourcedBehavior works in Pekko: journals, snapshots, and recovery. Build actors whose state survives restarts by recording every change as an immutable event.

Debezium Series, Part 1: How CDC Works

Log-based vs query-based CDC, how PostgreSQL WAL and MySQL binlog work, what Debezium reads, and at-least-once delivery guarantees explained.

Apache Pekko Series, Part 2: Actor Lifecycle & Supervision

How actors start, fail, and recover. Parent-child supervision hierarchies, restart vs stop vs escalate strategies, and building self-healing systems in Pekko.

Debezium Series, Part 0: Overview

A practical guide to Change Data Capture with Debezium — from WAL internals to Delta Lake and Iceberg sinks. What you'll learn and why CDC matters.

Apache Pekko Series, Part 0: Overview

A practical guide to building concurrent, distributed, and resilient systems with Apache Pekko — the open-source fork of Akka. What you'll learn and why Pekko matters.

Apache Pekko Series, Part 1: The Actor Model

What an actor is, how message passing replaces shared state, and how to create your first ActorSystem in Scala with Pekko Typed.

Kafka Series, Part 6: Kafka Streams

Stream processing natively inside Kafka — KStream vs KTable, stateful aggregations, joins, windowing, and state stores.

Kafka Series, Part 5: Kafka Connect

Moving data in and out of Kafka without writing custom code — connectors, transforms, and running Connect in production.

Kafka Series, Part 4: Reliability & Operations

Replication, in-sync replicas, durability guarantees, and operational concerns for running Kafka in production.

Kafka Series, Part 3: Consumers & Consumer Groups

Reading from Kafka at scale — consumer groups, partition assignment, offset commits, and handling rebalances.

Kafka Series, Part 2: Producers

Writing to Kafka reliably — the producer API, batching, compression, delivery guarantees, and idempotent producers.

Kafka Series, Part 1: Topics, Partitions & Offsets

The core data model behind Kafka — how topics are structured, why partitions matter, and how offsets track consumer position.

Kafka Series, Part 0: Overview

What is Apache Kafka, what problem does it solve, and when should you use it? A roadmap for the series.

Spark Streaming Series, Part 5: Operations and Tuning

Checkpointing, fault tolerance, exactly-once semantics, monitoring, and production performance tuning.

Spark Streaming Series, Part 4: Stateful Processing

Per-key state tracking across events, timeouts, and RocksDB state stores for complex streaming logic.

Spark Streaming Series, Part 3: Time, Watermarks, and Windows

Event time vs processing time, watermarks to handle late data, and window types for time-based aggregations.

Spark Streaming Series, Part 2: Sources and Sinks

Reading from Kafka and files, writing to Delta Lake and databases — the connectors that power real-time pipelines.

Flink Series, Part 5: Performance & Production

Making Flink production-ready — diagnosing backpressure, tuning parallelism, sizing network buffers, and monitoring with metrics.

Spark Streaming Series, Part 1: Structured Streaming Fundamentals

The unbounded table model — how Structured Streaming treats streams as infinite DataFrames, with output modes, triggers, and writing.

Flink Series, Part 4: Exactly-Once & Checkpointing

How Flink guarantees end-to-end correctness after failures — Chandy-Lamport barriers, two-phase commit, checkpoints vs savepoints.

Spark Streaming Series, Part 0: Overview

Stream processing with Apache Spark — from basics to Structured Streaming, the modern architecture for real-time data pipelines.

Flink Series, Part 3: State Management

How Flink stores and manages state — keyed vs operator state, state backends, TTL, and practical stateful patterns.

Flink Series, Part 2: Time & Windows

Flink's most powerful feature — temporal reasoning over streams. Event time, watermarks, and window types explained.

Flink Series, Part 1: DataStream API

The fundamental building block of Flink — how to read, transform, and write streams using the DataStream API.

Flink Series, Part 0: Overview

What is Apache Flink, what problem does it solve, and how does it differ from Spark Streaming? A roadmap for the series.

How Flink's Exactly-Once Semantics Actually Work

A deep dive into Flink's checkpointing mechanism and how it guarantees exactly-once processing even when jobs fail and restart.