Real-time Ingestion Pipeline

Overview

A production-grade streaming data pipeline that ingests, transforms, and loads event data in real time. Built to replace a fragile batch-based system that caused multi-hour reporting delays.

Architecture

Events flow from application services into Kafka topics. Apache Flink jobs consume these topics, apply transformations and late-event handling, and sink results into ClickHouse for low-latency OLAP queries.

[App Services] → Kafka → [Flink Jobs] → ClickHouse
                              ↓
                        [Dead Letter Queue]

Features

Exactly-once semantics – Flink checkpointing with Kafka offset tracking
Late event handling – Configurable watermarking with side-output for late records
Schema evolution – Avro schemas with a Schema Registry for forward/backward compatibility
Backpressure-aware – Flink’s credit-based flow control prevents consumer lag accumulation
Kubernetes-native – Deployed via Flink Kubernetes Operator with auto-scaling

Results

Processes 500k+ events/second at peak load
End-to-end latency under 3 seconds (event time to queryable in ClickHouse)
Replaced a 4-hour nightly batch job with continuous real-time updates