
# Your logs are lying to you: Debugging CDC pipelines across Debezium, Kafka, and Airflow

CDC pipelines fail in the gaps between Debezium, Kafka, and Airflow. Logs tell each system's local truth, but execution-level visibility shows what actually happened.

Change Data Capture (CDC) has become the standard for real-time data replication. If you're building modern data infrastructure, you've probably deployed something like this:

Source Database → Debezium → Kafka → Stream Processing → Airflow Transformations → Data Warehouse

When this setup works, it works well. When it breaks, it's a nightmare. Debezium logs one thing. Kafka metrics show another. Airflow thinks everything succeeded. But your data quality checks are failing, and you're staring at row count mismatches you can't explain.

You end up running manual data audits to fix issues you don't understand. The whole system becomes a black box that only one or two engineers can debug. When they're not available, incidents drag on for hours.

## The black box architecture

Each component in a CDC pipeline logs differently:

- Debezium logs to stdout and the Kafka Connect API
- Kafka logs cluster events and per-topic metrics
- Stream processors log to CloudWatch or local files
- Airflow logs to its metadata database
- Your warehouse has its own query logs

When something breaks, you have to correlate timestamps across five different logging systems, each with different formats, retention policies, and levels of detail. Some components log verbosely. Others log almost nothing. The gaps between what each system sees are where the real problems hide.

## Real-world scenario: The silent data loss incident

### What happened

You get a PagerDuty alert. Your data quality checks failed. The overnight batch found unexpected row count mismatches between your source database and warehouse. Some records are missing. Not all of them, just some.
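The first debugging step is usually to pull each system's logs into a single timeline by hand. A minimal sketch of what that manual correlation looks like, assuming each system's log can be exported to a text file with bracketed `HH:MM:SS` timestamps (the file names and the timestamp format here are assumptions, not what any of these tools guarantee):

```python
import re
from datetime import datetime

# Hypothetical exported log files, one per system
LOG_FILES = {
    "debezium": "debezium.log",
    "kafka": "kafka.log",
    "airflow": "airflow.log",
}

# Matches a leading [HH:MM:SS] timestamp (an assumed common format;
# real systems each use their own)
TS_PATTERN = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]")

def merged_timeline(files):
    """Merge entries from several log files into one time-ordered list."""
    events = []
    for system, path in files.items():
        with open(path) as f:
            for line in f:
                m = TS_PATTERN.search(line)
                if m:
                    ts = datetime.strptime(m.group(1), "%H:%M:%S").time()
                    events.append((ts, system, line.strip()))
    return sorted(events)  # chronological order across all systems
```

Even this toy version exposes the core problem: you can only merge what each system chose to write down. Anything that happened between log lines is invisible.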
### What the logs say

Debezium:

```text
[01:23:45] INFO: Snapshot started for table users
[01:47:22] INFO: Snapshot completed successfully
[01:47:23] INFO: Streaming changes from binlog position 4567891
```

Kafka:

```text
[01:48:00] Partition 0: 45,234 messages
[01:48:00] Partition 1: 44,891 messages
```

Airflow:

```text
[02:15:33] Task 'enrich_user_data' - State: success
```

Great Expectations:

```text
[02:30:15] Validation failed: Row count mismatch
[02:30:15] Expected: ~1,245,000 rows
[02:30:15] Actual: 1,198,432 rows
[02:30:15] Missing: ~46,568 rows
```

### The investigation

After three hours comparing row counts, checking Kafka offsets, reviewing logs, and querying for patterns, you finally discover that the missing records all have timestamps in a specific two-minute window during the Debezium snapshot. But why? Debezium says the snapshot completed successfully. Kafka received messages. Airflow processed them. Where did these records disappear?

This is where log-based debugging breaks down. Every component logged correctly. The problem is that logs only show each system's local perspective. The failure happened between systems, in the resource constraints and state transitions that no single component could see.

### What actually happened

At this point, you're stuck with logs that all say "success." But what if you could see what actually executed on those machines, not just what got logged?

The missing records were the ones Debezium was processing when it got killed. The connector was terminated mid-snapshot but didn't know it. When it restarted, it saw a partial progress marker and assumed the snapshot was done, skipping 46,568 rows.
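The bucketed row-count comparison that eventually surfaces the two-minute window can be scripted. A minimal sketch using SQLite as a stand-in for both the source database and the warehouse (the table name `users` and timestamp column `created_at` are assumptions for illustration; a real pipeline would run the same GROUP BY against both actual databases):

```python
import sqlite3
from collections import Counter

def counts_by_minute(conn, table):
    """Bucket row counts by the minute component of each record's timestamp."""
    rows = conn.execute(
        f"SELECT strftime('%H:%M', created_at) AS minute, COUNT(*) "
        f"FROM {table} GROUP BY minute"
    )
    return Counter(dict(rows))

def missing_windows(source_conn, warehouse_conn, table):
    """Return the minutes where the warehouse holds fewer rows than the source."""
    src = counts_by_minute(source_conn, table)
    dst = counts_by_minute(warehouse_conn, table)
    return {m: src[m] - dst.get(m, 0) for m in src if src[m] > dst.get(m, 0)}
```

The output is exactly the clue from the incident: instead of a uniform shortfall, the discrepancy clusters in a narrow time window, pointing at whatever was running during those minutes.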
Here's the complete picture of what ran during those two minutes:

```text
01:47:18 - Debezium process allocated 15.8 GB of memory
01:47:19 - Container memory limit: 16 GB
01:47:20 - Kafka broker on same node consumed additional 1.5 GB
01:47:21 - Total system memory exceeded limit
01:47:22 - Linux kernel OOM killer terminated Debezium (PID 47821)
01:47:22 - Kafka Connect detected exit, initiated restart
01:47:23 - New Debezium process started (PID 48103)
01:47:23 - Found partial snapshot marker in Kafka offsets
01:47:23 - Assumed snapshot complete, began streaming
```

None of this appeared in any logs. Debezium logged "snapshot completed" because the new process saw what looked like a completed state. The OOM kill happened too fast for error logging. Kafka Connect's restart logic worked as designed. Airflow processed whatever data Kafka had. Great Expectations accurately reported the mismatch.

Every component told the truth from its own perspective. But no component saw the whole picture.

## Why CDC pipelines are particularly hard to debug

CDC architectures amplify three specific observability problems.

### 1. Failures cascade across systems

When something breaks upstream, it triggers failures downstream. A database connection timeout in Debezium causes Kafka consumer lag, which triggers backpressure in stream processing, which delays Airflow tasks, which surfaces as a data quality failure. By the time you see the symptom, the root cause is five hops upstream and its logs have rotated out.

### 2. Silent failures in streaming systems

A web app that crashes returns a 500 error immediately. A CDC connector that gets OOM-killed might restart automatically, resume from a checkpoint, continue processing, and never log an error. The failure is invisible until data downstream doesn't match expectations.

### 3. The semantic gap

Debezium talks about snapshot modes and binlog positions. Kafka discusses partitions and offsets. Airflow references DAGs and tasks.
Correlating "task failed" in Airflow to "consumer lag" in Kafka to "connector restart" in Debezium requires deep expertise in all three systems.

## How complete visibility changes debugging

With just application logs, you're limited to each component's perspective. What if you could observe what actually executed, not just what got logged?

When a data quality check fails in Airflow, you trace the task to the files it read, those files to the stream processor that created them, the processor to the Kafka partitions it consumed, those partitions to the Debezium connector that produced them, and finally to the database queries and resource usage that caused the failure. If any step had a silent failure like an OOM kill, a network partition, or a partial write, you'd see it. Not because someone predicted it and added logging, but because the execution record captured everything that ran.

You observe what the operating system saw execute: process spawns, memory allocation, network connections, kill signals, all the things that happen between and beneath applications. Because this observation happens at the OS level, it works across your entire stack. Debezium, Kafka, custom stream processors, Airflow: it doesn't matter what frameworks you're using or whether they log well. You see what actually ran.

No manual correlation across five logging systems. No translating between Debezium's binlog positions, Kafka's offsets, and Airflow's task states. The complete execution timeline shows what happened across the entire stack. Any engineer can trace failures backward, understand what went wrong, and know how to fix it. Not just the two people who built the system.

The three-hour investigation becomes a three-minute diagnosis. The manual data audits stop. The black box becomes observable.

Want to see the complete timeline of what's running in your pipelines? [Learn more at tracer.cloud](https://www.tracer.cloud/).