Streaming
42 items in Streaming
Projects
Clickstreamer — Real-time Clickstream Pipeline
End-to-end clickstream analytics pipeline using Kafka, Apache Flink, ClickHouse, and Grafana with a full Docker Compose setup.
Real-time Ingestion Pipeline
Featured: A high-throughput streaming ingestion platform built with Apache Flink and Kafka, processing 500k+ events/sec into ClickHouse.
Blog Posts
Debezium Series, Part 9: Production Concerns
Operating Debezium in production: offset management, failure recovery, monitoring connector lag, replication slot health, rebalancing, and the operational patterns that keep CDC pipelines healthy.
Debezium Series, Part 8: Transforms & Routing
Single Message Transforms (SMTs) for reshaping, filtering, and routing CDC events. Field extraction, topic routing, sensitive data masking, and when to reach for a stream processor.
Apache Pekko Series, Part 9: Production Best Practices
Running Pekko in production: Kafka connectors, OpenTelemetry distributed tracing, health checks, dispatcher tuning, Kubernetes deployment, and migrating from Akka.
Debezium Series, Part 7: Snapshotting
How Debezium captures existing data before streaming live changes. All snapshot modes explained — initial, never, always, when_needed — plus isolation guarantees and large-table strategies.
Apache Pekko Series, Part 8: CQRS & Projections
Separating write models from read models with CQRS. Pekko Projection — consuming the event journal to build materialized views, exactly-once processing, and offset tracking.
Debezium Series, Part 6: Handling Schema Changes
What happens when someone alters a table. DDL propagation, Schema Registry integration, breaking vs non-breaking changes, and strategies to evolve without downtime.
Apache Pekko Series, Part 7: Clustering & Distributed Actors
Running Pekko across multiple JVMs. Cluster membership, the gossip protocol, cluster sharding for stateful actors, and singleton actors — all with practical configuration examples.
Debezium Series, Part 5: Sink Connectors — Delta Lake & Iceberg
Landing CDC events into open table formats. Upsert and delete semantics with Delta Lake MERGE, Iceberg MERGE INTO, partition strategies, and JDBC sink for relational targets.
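The upsert and delete semantics this post covers can be sketched as a toy merge in Python. The event shape (`op`, `key`, `after`) is a simplified stand-in for a real Debezium change event, not the connector's actual wire format, and the dict plays the role of the MERGE target table:

```python
# Toy model of applying CDC events to a keyed table with upsert/delete
# semantics, the way a Delta Lake MERGE or Iceberg MERGE INTO would.
# The event shape is a simplified stand-in for a Debezium change event.

def apply_cdc_events(table: dict, events: list) -> dict:
    for event in events:
        if event["op"] in ("c", "u", "r"):       # create, update, snapshot read
            table[event["key"]] = event["after"]  # upsert
        elif event["op"] == "d":                  # delete removes the row
            table.pop(event["key"], None)
    return table
```

A real sink also has to order events per key and handle tombstones, but the matched/not-matched branching above is the core of the MERGE.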
Apache Pekko Series, Part 6: gRPC with Pekko
Protocol Buffers, generated Pekko service stubs, server and client setup, and bidirectional streaming. When to use gRPC instead of REST and how to run both side by side.
Debezium Series, Part 4: Source Connectors — PostgreSQL & MySQL
Deep dive into PostgreSQL (pgoutput) and MySQL (binlog) source connectors. Configuration reference, behavioral differences, and connector-specific gotchas.
Apache Pekko Series, Part 5: HTTP with Pekko
Build REST APIs with pekko-http's routing DSL. HTTP server setup, route composition, request and response marshalling, and integrating HTTP endpoints with an actor system.
Debezium Series, Part 3: Change Event Anatomy
Dissecting every field in a Debezium change event — before, after, op, source metadata, tombstones, and how the Kafka message key is structured.
Apache Pekko Series, Part 4: Streams & Reactive Processing
Source, Flow, and Sink — the building blocks of Pekko Streams. Backpressure by design, composable pipelines, and how to process data without dropping messages or crashing.
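The Source/Flow/Sink shape can be sketched with plain Python generators as a stand-in for Pekko Streams; pull-based generators give a crude form of backpressure, since the sink only pulls the next element when it is ready for it:

```python
# Toy Source -> Flow -> Sink pipeline using generators as a stand-in
# for Pekko Streams (a sketch of the shape, not the real API).

def source(n):                 # Source: emits the elements 1..n
    yield from range(1, n + 1)

def flow(upstream, f):         # Flow: transforms each element as it is pulled
    for item in upstream:
        yield f(item)

def sink(upstream):            # Sink: folds the stream into a single result
    return sum(upstream)

result = sink(flow(source(5), lambda x: x * 2))  # 2+4+6+8+10 = 30
```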
Debezium Series, Part 2: Setting Up Debezium
Hands-on Docker Compose setup with PostgreSQL, Kafka, Kafka Connect, and the Debezium connector. See your first change event in under 10 minutes.
Apache Pekko Series, Part 3: Persistence & Event Sourcing
How EventSourcedBehavior works in Pekko: journals, snapshots, and recovery. Build actors whose state survives restarts by recording every change as an immutable event.
Debezium Series, Part 1: How CDC Works
Log-based vs query-based CDC, how PostgreSQL WAL and MySQL binlog work, what Debezium reads, and at-least-once delivery guarantees explained.
Apache Pekko Series, Part 2: Actor Lifecycle & Supervision
How actors start, fail, and recover. Parent-child supervision hierarchies, restart vs stop vs escalate strategies, and building self-healing systems in Pekko.
Debezium Series, Part 0: Overview
A practical guide to Change Data Capture with Debezium — from WAL internals to Delta Lake and Iceberg sinks. What you'll learn and why CDC matters.
Apache Pekko Series, Part 0: Overview
A practical guide to building concurrent, distributed, and resilient systems with Apache Pekko — the open-source fork of Akka. What you'll learn and why Pekko matters.
Apache Pekko Series, Part 1: The Actor Model
What an actor is, how message passing replaces shared state, and how to create your first ActorSystem in Scala with Pekko Typed.
Kafka Series, Part 6: Kafka Streams
Stream processing natively inside Kafka — KStream vs KTable, stateful aggregations, joins, windowing, and state stores.
Kafka Series, Part 5: Kafka Connect
Moving data in and out of Kafka without writing custom code — connectors, transforms, and running Connect in production.
Kafka Series, Part 4: Reliability & Operations
Replication, in-sync replicas, durability guarantees, and operational concerns for running Kafka in production.
Kafka Series, Part 3: Consumers & Consumer Groups
Reading from Kafka at scale — consumer groups, partition assignment, offset commits, and handling rebalances.
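The partition-assignment idea behind consumer groups can be sketched in a few lines; this mirrors the spirit of Kafka's range assignor, not the actual client implementation:

```python
# Toy range-style assignment of partitions to the consumers in a group:
# sort the members, then hand out contiguous ranges as evenly as possible.

def assign_partitions(partitions: int, consumers: list) -> dict:
    members = sorted(consumers)
    per_member, extra = divmod(partitions, len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = per_member + (1 if i < extra else 0)  # first members absorb the remainder
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment
```

A rebalance is then just rerunning this function over the new membership, which is why every membership change pauses consumption briefly.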
Kafka Series, Part 2: Producers
Writing to Kafka reliably — the producer API, batching, compression, delivery guarantees, and idempotent producers.
Kafka Series, Part 1: Topics, Partitions & Offsets
The core data model behind Kafka — how topics are structured, why partitions matter, and how offsets track consumer position.
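That data model fits in a small sketch: a topic is a set of append-only partition logs, the record key's hash picks the partition, and an offset is just a position in one partition's log (a toy model, not broker code):

```python
# Toy model of a Kafka topic: append-only partition logs, key-hash
# partitioning, and offsets as positions within one partition.

class Topic:
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> tuple:
        p = hash(key) % len(self.partitions)   # same key -> same partition
        self.partitions[p].append(value)       # append-only, never in place
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition: int, offset: int) -> str:
        return self.partitions[partition][offset]
```

Because a key always hashes to the same partition, records for one key stay in order; across partitions there is no ordering guarantee.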
Kafka Series, Part 0: Overview
What is Apache Kafka, what problem does it solve, and when should you use it? A roadmap for the series.
Spark Streaming Series, Part 5: Operations and Tuning
Checkpointing, fault tolerance, exactly-once semantics, monitoring, and production performance tuning.
Spark Streaming Series, Part 4: Stateful Processing
Per-key state tracking across events, timeouts, and RocksDB state stores for complex streaming logic.
Spark Streaming Series, Part 3: Time, Watermarks, and Windows
Event time vs processing time, watermarks to handle late data, and window types for time-based aggregations.
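The watermark idea reduces to a small rule: track the maximum event time seen, subtract an allowed lateness, and drop anything older. This is a sketch of the concept behind `withWatermark`, not Spark's implementation:

```python
# Toy event-time watermark: watermark = max event time seen so far minus
# the allowed lateness; events behind the watermark are dropped as late.

def filter_late(events, allowed_lateness):
    max_event_time = float("-inf")
    kept = []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time >= watermark:            # on time (or tolerably late)
            kept.append((event_time, payload))
    return kept
```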
Spark Streaming Series, Part 2: Sources and Sinks
Reading from Kafka and files, writing to Delta Lake and databases — the connectors that power real-time pipelines.
Flink Series, Part 5: Performance & Production
Making Flink production-ready — diagnosing backpressure, tuning parallelism, sizing network buffers, and monitoring with metrics.
Spark Streaming Series, Part 1: Structured Streaming Fundamentals
The unbounded table model — how Spark Streaming treats streams as infinite DataFrames, with output modes, triggers, and how results are written out.
Flink Series, Part 4: Exactly-Once & Checkpointing
How Flink guarantees end-to-end correctness after failures — Chandy-Lamport barriers, two-phase commit, checkpoints vs savepoints.
Spark Streaming Series, Part 0: Overview
Stream processing with Apache Spark — from basics to Structured Streaming, the modern architecture for real-time data pipelines.
Flink Series, Part 3: State Management
How Flink stores and manages state — keyed vs operator state, state backends, TTL, and practical stateful patterns.
Flink Series, Part 2: Time & Windows
Flink's most powerful feature — temporal reasoning over streams. Event time, watermarks, and window types explained.
Flink Series, Part 1: DataStream API
The fundamental building block of Flink — how to read, transform, and write streams using the DataStream API.
Flink Series, Part 0: Overview
What is Apache Flink, what problem does it solve, and how does it differ from Spark Streaming? A roadmap for the series.
How Flink's Exactly-Once Semantics Actually Work
A deep dive into Flink's checkpointing mechanism and how it guarantees exactly-once processing even when jobs fail and restart.
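The checkpoint-and-replay idea at the heart of that post can be sketched as a toy loop: periodically snapshot (offset, state), and after a crash restore the last snapshot and reprocess from there. Because reprocessing is deterministic, the recovered state matches what a failure-free run would produce — the effect is exactly-once, even though some inputs are read twice. This is a hypothetical single-process sketch, not Flink's barrier-based mechanism:

```python
# Toy checkpoint/replay: snapshot (offset, state) every few records;
# on a (single) simulated crash, restore the snapshot and reprocess.

def run(log, checkpoint_every=2, crash_at=None):
    offset, state = 0, 0
    checkpoint = (0, 0)                     # last durable (offset, state)
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            offset, state = checkpoint      # recover from the checkpoint
            crash_at = None                 # only crash once
            continue
        state += log[offset]                # deterministic processing step
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)
    return state
```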