
Execution-Level Observability for Data Pipelines

Go beyond traditional monitoring. Learn how execution-level observability reveals exactly what's happening inside your pipelines—step by step, run by run.

The complete guide to execution-level observability for data pipelines

Logs have been the default for application observability for decades. They're familiar, flexible, and work pretty well for traditional web apps.

But data pipelines aren't web applications. They're distributed systems where a single workflow spans multiple orchestrators, compute engines, storage layers, and data quality tools. Each component logs differently. Many don't log at all. When a pipeline breaks, you end up piecing together fragments of information from different systems. It's like a detective trying to solve a crime based on statements from unreliable witnesses who each only saw part of what happened.

But there is a better way. Execution-level observability captures what actually ran at the kernel level, instead of relying on what applications choose to log. When you map that execution data to your pipeline tasks, you can compare what happened against what should have happened - automatically, every time an alert fires.

The problem with logs

Logs only show what we choose to log

When we write logging code, we're making predictions about what will matter when things go wrong. Sometimes these predictions are right, but sometimes they're not.

Imagine a simple data transformation task that fails. The application logs might tell you:

- "Task failed with exit code 1"
- "Out of memory error"
- The timestamp of the failure

But the logs won't give you the specific details you need, like:

- Which specific subprocess consumed the memory
- What data volume triggered the failure
- How memory usage escalated over time
- Which competing processes were running
- What kernel-level resource constraints were hit

If you didn't predict you'd need this context, you probably didn't choose to log it. So when an alert fires, the context you need to resolve the issue isn't there.

Log gaps compound across distributed systems

Pipelines are distributed systems. A single DAG might involve:

- Airflow for orchestration
- Spark for compute
- S3 for storage
- dbt for transformations
- Great Expectations for quality checks

Each tool has its own logging format, dumps different amounts of data, and decides for itself what matters. Some log to stdout, others to files, and others still to specialized backends. When an incident spans multiple systems, you end up needing to correlate error messages across different formats and fill in gaps where systems didn't log anything.

Logs are reactive, not observational

We've all been there. An alert fires, so you check the logs, only to find that they don't have what you need. So you add more logging, deploy, and wait for it to happen again. Sometimes the new logs help. Often they don't, so you add even more logging and repeat.

You're debugging in production by asking past-you what they thought would matter. Past-you didn't know. That's why you're here in the first place. The problem is that you're working backwards from incomplete information. Logs show you what someone decided to capture, not what actually happened. And by the time you realize the logs are missing something, the incident is already over.

Logs disappear when you need them most

A pipeline fails overnight. You check the logs in the morning only to find they're gone. Logs are ephemeral. They get overwritten, or disappear when containers restart. In Kubernetes, when a pod crashes, its logs often vanish with it. Retention policies prioritize storage costs over investigative needs, so you often end up needing to start over.
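To make the earlier "logs only show what we choose to log" point concrete, here is a minimal, hypothetical Python sketch of the kind of logging a transformation task typically ships with. The task name, the transform.py script, and the partition value are illustrative assumptions, not taken from a real pipeline.

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("transform_user_data")  # hypothetical task name

def run_transform(partition: str) -> None:
    """Run a (hypothetical) transformation step as a subprocess."""
    logger.info("Starting transform for partition %s", partition)
    try:
        # The child process does the heavy lifting. Its memory usage, the
        # size of the partition it read, and any competing processes on the
        # host are all invisible to this logger.
        subprocess.run(["python", "transform.py", partition], check=True)
        logger.info("Transform finished for partition %s", partition)
    except subprocess.CalledProcessError as exc:
        # This is usually all an on-call engineer gets to work with:
        logger.error("Task failed with exit code %s", exc.returncode)
        raise

if __name__ == "__main__":
    run_transform("partition_date=2024-01-15")
```

Everything listed above as missing - which subprocess consumed the memory, how usage escalated, what else was running on the node - happens below this code, in the kernel, and never reaches the log.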
The future: Execution-level observability

Execution-level observability works differently. Instead of asking applications to report on themselves, you observe the operating system's record of what actually ran.

The kernel sees everything

Every process, system call, file operation, network request, and resource allocation goes through the Linux kernel. It's the single source of truth for what executed on a machine. With eBPF (extended Berkeley Packet Filter), you can tap into this data without modifying application code and with negligible performance impact.

When a data pipeline task executes, kernel-level observability captures:

- The exact process tree (parent, children, subprocesses)
- System calls made by each process
- Files opened, read, written, closed
- Network connections established
- Memory allocated and freed
- CPU scheduling decisions
- Disk I/O patterns

You get a detailed record of what actually happened, and you didn't need to see into the future to set it up.

Execution data maps to pipelines

Of course, raw kernel events aren't useful by themselves. The magic is in mapping execution-level signals to the pipeline constructs you care about:

- This Spark job → these specific JVM processes → these system calls → this I/O pattern
- This Airflow task → these subprocess executions → these network requests → this storage access
- This data quality check → this Python interpreter → these file reads → this memory spike

When you can correlate kernel-level execution with pipeline tasks, you get a clear picture of what actually ran.

Cross-system correlation without log parsing

Because execution observation happens at the kernel level, it's framework-agnostic. It doesn't matter whether you're running Airflow, Prefect, Dagster, or custom scripts. It doesn't matter whether your compute is Spark, Pandas, or dbt. The kernel sees all of it.

This means you can correlate signals across your entire stack without parsing different log formats, aligning timestamps across systems, filling in gaps where tools don't log, or guessing at causation from incomplete data.

Execution-level observability in practice: Real incidents that logs miss (but execution data catches)

Let's look at specific examples where logs fail but execution-level observability succeeds.

Example 1: The silent OOM kill

What happened: A Spark job fails intermittently. Sometimes it succeeds. Sometimes it doesn't. There is no clear pattern to the failures.

What the logs say:

```
Task failed with exit code 137
Lost connection to executor
```

Exit code 137 suggests the process was killed, but why? The logs don't say. Was it OOM? A resource limit? A kill signal? The Spark driver lost the executor but has no visibility into what happened on that node.

What execution data shows: The kernel recorded that:

- The JVM process allocated 15.2GB of memory
- The cgroup memory limit was 16GB
- Another process on the same node consumed an additional 2GB
- The kernel OOM killer terminated the Spark executor
- A specific data partition (partition_date=2024-01-15) triggered the memory spike

The incident wasn't random. A specific date partition was 3x larger than typical. The job worked fine until it hit that partition, then exceeded the memory limit and was killed. The logs never saw this because the kernel killed the process before it could report anything.

Example 2: The cross-system authentication timeout

What happened: Pipeline tasks fail with "connection timeout" errors when reading from cloud storage.
What the logs say:

```
Failed to read s3://bucket/path/file.parquet
Connection timeout after 30s
Retrying (attempt 2/3)...
```

The application reports timeouts. Retries eventually succeed. But why are requests timing out in the first place?

What execution data shows: The kernel captured:

- The initial S3 request opened a TCP connection successfully
- The connection stalled waiting for a response
- Meanwhile, a background credential refresh process was running
- The credential refresh made its own API calls to AWS STS
- The network connection limit was reached
- The original S3 request timed out waiting for an available connection slot
- After the credential refresh completed, connection slots freed up
- The retry succeeded

The root cause was a credential refresh process competing for network resources. Neither the application logs nor the S3 logs showed this because the timeout happened at the connection pool level, not in the storage API. The kernel saw the complete picture: connection establishment, resource contention, timeout, and eventual success.

Example 3: The cascading failure that logs blamed on the wrong component

What happened: Data quality checks start failing. The team investigates Great Expectations, assuming the issue is in the validation logic.

What the logs say:

```
[Great Expectations] Validation failed: unexpected null values in column 'user_id'
[Airflow] Task 'validate_data' failed
```

Looks like a data quality issue, right? The validation logs clearly show null values where they shouldn't exist.

What execution data shows: The kernel revealed:

- The upstream dbt transformation completed successfully according to its logs
- But the dbt process was killed mid-write (SIGTERM)
- The partial output file was still written to disk
- Airflow marked the dbt task as "success" based on the exit code
- Great Expectations validated the incomplete file
- Nulls were present because the file was truncated

The validation didn't fail because of bad data. It failed because it validated a partially written file from a killed process. Application logs from three different tools all missed this because each tool only saw its own narrow slice: dbt thought it succeeded, Airflow trusted the status code, and Great Expectations accurately reported what it saw in the file. Only kernel-level execution data showed the kill signal and the incomplete write.

From log-driven to execution-driven investigation

Moving from logs to execution-level observability changes how you investigate incidents.

Old approach: Log archaeology
1. Incident fires
2. Check application logs for errors
3. Expand the search to adjacent systems
4. Parse timestamps to correlate events
5. Fill in gaps with educated guesses
6. Deploy additional logging
7. Wait for recurrence
Timeline: Hours to days
Accuracy: Questionable

New approach: Execution replay
1. Incident fires
2. Query execution data for the exact time window
3. See the complete process tree, system calls, and resource usage
4. Map execution to pipeline tasks automatically
5. Identify the root cause from observed behavior
Timeline: Minutes
Accuracy: What actually happened

The investigation you can't do with logs

With execution-level data, you can answer questions logs can't, like:

- "What else was running when this task failed?" The kernel shows all processes, not just the one that logged errors.
- "Did this process actually complete, or was it killed?" Exit codes and signals tell the real story, not just "task succeeded" logs.
- "Which subprocess consumed the resources?" Process tree attribution shows exactly where memory and CPU went.
- "What triggered this cascade of failures?" Execution timeline shows causation, not just correlation from timestamps. - "Has this exact execution pattern failed before?" Compare kernel signatures across incidents to find repeat issues. How teams can move beyond logs If execution-level observability offers better visibility, what's holding teams back? Understanding the common concerns can help you evaluate whether this approach makes sense for your pipelines. "This seems like a big shift from what we know" Log-based monitoring has been the standard for decades. Most teams have built deep expertise in log parsing, query languages, and aggregation tools. Moving to execution-level observability means rethinking what observability can be. Moving beyond logs doesn’t mean you need to replace everything you know. The concepts are similar. You're still investigating incidents, correlating signals, and finding root causes. The difference is the data source. Instead of application logs, you're working with execution data that's already being captured by the kernel. "Won't this require instrumenting our entire codebase?" This is a common concern, and it's understandable. Traditional observability approaches require: - Adding logging statements to code - Installing SDKs or agents - Framework-specific instrumentation - Redeploying applications eBPF-based observability works differently. The Linux kernel already tracks every process, syscall, file operation, and network request. There's no code to change, no agents to install, no redeployment needed. You're tapping into execution data that already exists. "How do we make kernel data useful for data engineering?" Raw kernel events like syscalls, process spawns and file I/O are too low-level for day-to-day pipeline work. This is where semantic mapping becomes critical. You need a system that can translate kernel execution into pipeline concepts you already understand like Airflow tasks, Spark jobs, dbt transformations, data quality checks. When this mapping works well, you get high-level pipeline insights without needing to understand kernel internals. You investigate a failed Airflow task the same way you always have. The difference is the investigation is based on what actually executed, not what got logged. Introducing Tracer: Execution-level observability with built in pipeline mapping If execution-level observability is so valuable, why isn't everyone already doing it? The short answer is that until recently, the technology and tooling weren't there. The longer answer involves kernel access concerns, the gap between raw kernel events and pipeline context, and the perception that "kernel-level" means "hard to deploy." Tracer solves these problems through a four-layer architecture that goes from raw kernel data to actionable pipeline insights. Layer 1: eBPF extraction without the operational risk Tracer uses eBPF (extended Berkeley Packet Filter) to extract execution data directly from the Linux kernel. eBPF programs are verified before they load, run in a sandboxed environment, and can't crash your system. The overhead is roughly 2% of system resources - low enough that you won't notice it in production. Layer 2: The semantic filter - from kernel events to pipeline context The filter layer automatically recognizes pipeline-specific patterns. When a process spawns, Tracer doesn't just see PID 47291 - it recognizes this is a Spark executor, running as part of task "transform_user_data" in your Airflow DAG. 
Tracer maps kernel execution to the constructs you actually care about: orchestrator tasks, compute jobs, data transformations, quality checks. You get high-level pipeline visibility from low-level execution data.

Layer 3: Synthetic logs - filling the gaps

When Tracer observes execution that isn't being logged, it generates synthetic logs in OpenTelemetry format. If a subprocess gets killed by the OOM killer before it can report anything, Tracer saw it happen at the kernel level and creates a log entry for it. You end up with a complete timeline of what executed, even for components that are silent in traditional logging.

Layer 4: From data to insights

The insights layer runs automated root cause analysis on execution data. When an alert fires, Tracer has already:

- Built multiple hypotheses about what went wrong
- Tested them against the execution timeline
- Correlated signals across orchestrators, compute, storage, and data quality tools
- Generated a report with recommended fixes

You're not staring at dashboards trying to piece together what happened. The investigation is already done.

Execution-level observability without the operational overhead

What used to require specialized eBPF knowledge, custom instrumentation, and manual correlation is now a deployable system. You get:

- Kernel-level visibility without kernel-level complexity
- Pipeline semantics without framework lock-in
- Complete execution timelines without waiting for logs
- Automated investigations without building correlation logic

The technology barrier that kept execution-level observability niche no longer exists. What's left is deciding whether you want to keep debugging with incomplete information or move to observing what actually ran.

Moving past log archaeology to execution observation

We've been debugging pipelines the same way for twenty years. We check the logs, correlate timestamps, and fill in the blanks. We add more logging, wait for the issue to happen again, and repeat. It works, mostly. But it's slow, and it gets worse as your pipelines grow more complex.

Execution-level observability is different. Instead of asking what the application logged, you see what actually ran. You look at the complete execution timeline instead of piecing together fragments. And you're not waiting for incidents to recur so you can add better logging - the data is already there.

Execution-level observability also opens the door to automated incident investigation: AI agents that identify root causes, test hypotheses in parallel, and deliver full incident reports before you even get pinged.

Ready for a new approach to incident investigation? Get started for free at tracer.cloud.
