Execution-Level Observability for Data Pipelines
Go beyond traditional monitoring. Learn how execution-level observability reveals exactly what's happening inside your pipelines—step by step, run by run.
The complete guide to execution-level observability for data pipelines
Logs have been the default for application observability for decades. They're familiar, flexible, and work pretty well for traditional web apps. But data pipelines aren't web applications. They're
distributed systems where a single workflow spans multiple orchestrators, compute engines,
storage layers, and data quality tools. Each component logs differently. Many don't log at all.
When a pipeline breaks, you end up piecing together fragments of information from different
systems. It’s like a detective trying to solve a crime based on statements from unreliable
witnesses who only saw part of what happened.
But there is a better way. Execution-level observability captures what actually ran at the kernel
level, instead of relying on what applications choose to log. When you map that execution data
to your pipeline tasks, you can compare what happened against what should have happened,
automatically, every time an alert fires.
The problem with logs
Logs only show what we choose to log
When we write logging code, we’re making predictions about what will matter when things go
wrong. Sometimes these predictions are right, but sometimes they're not.
Imagine a simple data transformation task that fails. The application logs might tell you:
⚠ "Task failed with exit code 1"
⚠ "Out of memory error"
⚠ Timestamp of failure
But the logs won't give you the specific details you need, such as:
- Which specific subprocess consumed the memory
- What data volume triggered the failure
- How memory usage escalated over time
- Which competing processes were running
- What kernel-level resource constraints were hit
If you didn’t predict you'd need this context, you probably didn’t choose to log it. So when an
alert fires, the context you need to resolve the issue isn’t there.
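To make that concrete, here's a minimal sketch of the kind of logging most of us write (the run_transform wrapper and the command it runs are hypothetical). It records the one thing we predicted would matter, the exit code; the subprocess tree, the memory growth, and the competing workloads behind the failure are never captured, so they can never be logged.
```python
import logging
import subprocess

logger = logging.getLogger("transform")

def run_transform(cmd: list[str]) -> None:
    # Hypothetical wrapper around a transformation step.
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # This is everything the log will ever contain: the prediction we made
        # up front. Which subprocess used the memory, how usage escalated, and
        # what else was running are already gone by the time this line executes.
        logger.error("Task failed with exit code %s", result.returncode)

run_transform(["python", "transform.py"])  # an OOM-killed child would show up only as "exit code 137"
```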
Log gaps compound across distributed systems
Pipelines are distributed systems. A single DAG might involve:
- Airflow for orchestration
- Spark for compute
- S3 for storage
- dbt for transformations
- Great Expectations for quality checks
Each tool has its own logging format, dumps different amounts of data, and decides for itself what matters. Some log to stdout, others to files, and still others to specialized backends. When an incident spans multiple systems, you end up correlating error messages across different formats and filling in gaps where systems didn't log anything at all.
Logs are reactive, not observational
We've all been there. An alert fires, so you check the logs, only to find that they don't have what you need. So you add more logging, deploy, and wait for it to happen again.
Sometimes the new logs help. Often they don't, and you add even more logging and repeat.
You're debugging in production by asking past-you what they thought would matter. Past-you
didn't know. That's why you're here in the first place.
The problem is you're working backwards from incomplete information. Logs show you what
someone decided to capture, not what actually happened. And by the time you realize the logs
are missing something, the incident is already over.
Logs disappear when you need them most
A pipeline fails overnight. You check the logs in the morning only to find they're gone.
Logs are ephemeral. They get overwritten, or disappear when containers restart. In Kubernetes,
when a pod crashes, its logs often vanish with it.
Retention policies prioritize storage costs over investigative needs, so you often end up needing
to start over.
The future: Execution-level observability
Execution-level observability works differently. Instead of asking applications to report on
themselves, you observe the operating system's record of what actually ran.
The kernel sees everything
Every process, system call, file operation, network request, and resource allocation goes
through the Linux kernel. It's the single source of truth for what executed on a machine. With
eBPF (extended Berkeley Packet Filter), you can tap into this data without modifying application code and with negligible performance impact.
When a data pipeline task executes, kernel-level observability captures:
- Exact process tree (parent, children, subprocesses)
- System calls made by each process
- Files opened, read, written, closed
- Network connections established
- Memory allocated and freed
- CPU scheduling decisions
- Disk I/O patterns
You get a detailed record of what actually happened, without having to predict in advance which details would matter.
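As an illustration of how little setup that takes, here's a minimal sketch using the open-source bcc toolkit rather than Tracer itself; it assumes root privileges and the bcc Python bindings, and it prints a trace line every time any process on the machine calls execve, without touching the pipeline code being observed:
```python
# Minimal bcc sketch: observe every new program execution on the host.
# Assumes root privileges and the bcc Python bindings are installed.
from bcc import BPF

prog = """
int trace_exec(void *ctx) {
    bpf_trace_printk("execve observed\\n");
    return 0;
}
"""

b = BPF(text=prog)
# Attach to the execve syscall entry point; the traced applications are untouched.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
print("Tracing execve calls... Ctrl-C to stop")
b.trace_print()
```
The richer signals in the list above (file I/O, memory, network) come from attaching similar probes to other kernel entry points.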
Execution data maps to pipelines
Of course, raw kernel events aren't useful by themselves. The magic is mapping execution-level
signals to pipeline constructs you care about:
- This Spark job → these specific JVM processes → these system calls → this I/O pattern
- This Airflow task → these subprocess executions → these network requests → this storage access
- This data quality check → this Python interpreter → these file reads → this memory spike
When you can correlate kernel-level execution with pipeline tasks, you get a clear picture of what actually ran.
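What that mapping could look like in code: the sketch below is a hypothetical illustration (the rules, regexes, and PipelineTask type are invented for this example, not Tracer's actual mechanism), matching a kernel-observed command line against patterns for common frameworks.
```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineTask:
    framework: str
    pid: int
    cmdline: str

# Hypothetical recognition rules: identify pipeline frameworks from the command
# line the kernel saw when the process was spawned.
RULES = [
    ("airflow", re.compile(r"airflow tasks run")),
    ("spark", re.compile(r"org\.apache\.spark\.executor\.CoarseGrainedExecutorBackend")),
    ("dbt", re.compile(r"\bdbt (run|build)\b")),
]

def map_process(pid: int, cmdline: str) -> Optional[PipelineTask]:
    """Translate a raw kernel-level process observation into a pipeline construct."""
    for framework, pattern in RULES:
        if pattern.search(cmdline):
            return PipelineTask(framework=framework, pid=pid, cmdline=cmdline)
    return None  # unrecognized process: keep it as raw execution data

print(map_process(47291, "java org.apache.spark.executor.CoarseGrainedExecutorBackend ..."))
```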
Cross-system correlation without log parsing
Because execution observation happens at the kernel level, it's framework-agnostic. It doesn't
matter if you're running Airflow, Prefect, Dagster, or custom scripts. It doesn't matter if your
compute is Spark, Pandas, or dbt. The kernel sees all of it.
This means you can correlate signals across your entire stack without parsing different log
formats, aligning timestamps across systems, filling in gaps where tools don't log, or guessing at causation from incomplete data.
Execution-level observability in practice: Real incidents that logs miss (but execution data catches)
Let's look at specific examples where logs fail but execution-level observability succeeds.
Example 1: The silent OOM kill
What happened: A Spark job fails intermittently. Sometimes it succeeds. Sometimes it doesn't. There is no clear pattern to the failures.
What the logs say:
```
Task failed with exit code 137
Lost connection to executor
```
Exit code 137 suggests the process was killed, but why? The logs don't say. Was it OOM?
Resource limits? A kill signal? The Spark driver lost the executor but has no visibility into what
happened on that node.
What execution data shows:
The kernel recorded that:
- The JVM process allocated 15.2GB of memory
- The cgroup memory limit was 16GB
- Another process on the same node consumed an additional 2GB
- The kernel OOM killer terminated the Spark executor
- A specific data partition (partition_date=2024-01-15) triggered the memory spike
The incident wasn't random. A specific date partition was 3x larger than typical. The job worked
fine until hitting that partition, then exceeded memory limits and was killed. Logs never saw this
because the kernel killed the process before it could report anything.
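Even before kernel-level tooling, exit code 137 carries a hint: on Linux, 128 + N means the process died from signal N, and signal 9 is SIGKILL, the signal the OOM killer sends. A small decoding sketch follows (not Tracer's mechanism, just a quick way to read the clue); the dmesg check in the comment is the manual confirmation you would otherwise have to run on the right node, if its logs still exist.
```python
import signal

def explain_exit_code(code: int) -> str:
    """Decode shell-style exit codes: values above 128 mean 'killed by signal code - 128'."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited normally with status {code}"

print(explain_exit_code(137))  # -> "killed by SIGKILL", the signal the OOM killer uses

# Manual confirmation on the affected node (if you can still find it and its logs):
#   dmesg | grep -i "out of memory"
# The kernel log names the killed process and the cgroup whose limit was hit.
```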
Example 2: The cross-system authentication timeout
What happened: Pipeline tasks fail with "connection timeout" errors when reading from cloud storage.
What the logs say:
```
Failed to read s3://bucket/path/file.parquet
Connection timeout after 30s
Retrying (attempt 2/3)...
```
The application reports timeouts. Retries eventually succeed. But why are requests timing out in
the first place?
What execution data shows:
The kernel captured:
- Initial S3 request opened a TCP connection successfully
- Connection stalled waiting for response
- Meanwhile, a background credential refresh process was running
- The credential refresh made its own API calls to AWS STS
- Network connection limit was reached
- Original S3 request timed out waiting for available connection slot
- After credential refresh completed, connection slots freed up
- Retry succeeded
The root cause was a credential refresh process competing for network resources. Neither the
application nor S3 logs showed this because the timeout happened at the connection pool level,
not in the storage API. The kernel saw the complete picture: connection establishment, resource
contention, timeout, and eventual success.
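The pool being exhausted here is a client-side resource, not anything on the storage side. As a hedged illustration only (this shows the general mechanism, not necessarily the fix for this particular incident), boto3 exposes the limit through botocore's Config, and its default of 10 connections is easy to saturate when background processes share the client:
```python
# Illustration of the client-side connection pool involved in this kind of stall.
# Requires boto3; the bucket and key names below are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(
        max_pool_connections=50,      # default is 10; raise the ceiling if the pool is contended
        retries={"max_attempts": 3},  # retries mask the stall, they don't explain it
    ),
)
# s3.download_file("my-bucket", "path/file.parquet", "/tmp/file.parquet")
```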
Example 3: The cascading failure that logs blamed on the wrong component
What happened: Data quality checks start failing. The team investigates Great Expectations, assuming the issue is in validation logic.
What the logs say:
```
[Great Expectations] Validation failed: unexpected null values in column 'user_id'
[Airflow] Task 'validate_data' failed
```
Looks like a data quality issue, right? The validation logs clearly show null values where they
shouldn't exist.
What execution data shows:
The kernel revealed:
- The upstream dbt transformation completed successfully according to its logs
- But the dbt process was killed mid-write (SIGTERM)
- The partial output file was still written to disk
- Airflow marked the dbt task as "success" based on exit code
- Great Expectations validated the incomplete file
- Nulls were present because the file was truncated
The validation didn't fail because of bad data. It failed because it validated a partially written file
from a killed process. Application logs from three different tools all missed this because each
tool only saw its own narrow slice: dbt thought it succeeded, Airflow trusted the status code, and
Great Expectations accurately reported what it saw in the file. Only kernel-level execution data
showed the kill signal and incomplete write.
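One reason a killed process can still be recorded as a success is that the wrapper around it checks only a coarse status. As a hypothetical sketch (not how any of these tools actually invoke dbt), Python's subprocess module encodes "terminated by signal N" as a negative return code, which is exactly the evidence the kernel had all along:
```python
import signal
import subprocess

def run_and_check(cmd: list[str]) -> None:
    """Run a pipeline step and refuse to report success if it was killed mid-write."""
    result = subprocess.run(cmd)
    if result.returncode < 0:
        # Negative return codes mean the child was terminated by signal abs(returncode).
        sig = signal.Signals(-result.returncode)
        raise RuntimeError(f"{cmd[0]} was killed by {sig.name}; its output may be truncated")
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed with exit code {result.returncode}")

# run_and_check(["dbt", "run"])  # a SIGTERM mid-write would surface here instead of as "success"
```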
From log-driven to execution-driven investigation
Moving from logs to execution-level observability changes how you investigate incidents.
Old approach: Log archaeology
1. Incident fires
2. Check application logs for errors
3. Expand search to adjacent systems
4. Parse timestamps to correlate events
5. Fill in gaps with educated guesses
6. Deploy additional logging
7. Wait for recurrence
Timeline: Hours to days
Accuracy: Questionable
New approach: Execution replay
1. Incident fires
2. Query execution data for the exact time window
3. See complete process tree, system calls, resource usage
4. Map execution to pipeline tasks automatically
5. Identify root cause from observed behavior
Timeline: Minutes
Accuracy: What actually happened
The investigation you can't do with logs
With execution-level data, you can answer questions logs can't, like:
- "What else was running when this task failed?" Kernel shows all processes, not just the one that logged errors.
- "Did this process actually complete, or was it killed?" Exit codes and signals tell the real story, not just "task succeeded" logs.
- "Which subprocess consumed the resources?" Process tree attribution shows exactly where memory/CPU went.
- "What triggered this cascade of failures?" Execution timeline shows causation, not just correlation from timestamps.
- "Has this exact execution pattern failed before?" Compare kernel signatures across incidents to find repeat issues.
How teams can move beyond logs
If execution-level observability offers better visibility, what's holding teams back? Understanding
the common concerns can help you evaluate whether this approach makes sense for your
pipelines.
"This seems like a big shift from what we know"
Log-based monitoring has been the standard for decades. Most teams have built deep expertise
in log parsing, query languages, and aggregation tools. Moving to execution-level observability
means rethinking what observability can be.
Moving beyond logs doesn’t mean you need to replace everything you know. The concepts are
similar. You're still investigating incidents, correlating signals, and finding root causes. The
difference is the data source. Instead of application logs, you're working with execution data
that's already being captured by the kernel.
"Won't this require instrumenting our entire codebase?"
This is a common concern, and it's understandable. Traditional observability approaches
require:
- Adding logging statements to code
- Installing SDKs or agents
- Framework-specific instrumentation
- Redeploying applications
eBPF-based observability works differently. The Linux kernel already tracks every process,
syscall, file operation, and network request. There's no code to change, no agents to install, no
redeployment needed. You're tapping into execution data that already exists.
"How do we make kernel data useful for data engineering?"
Raw kernel events like syscalls, process spawns, and file I/O are too low-level for day-to-day pipeline work.
This is where semantic mapping becomes critical. You need a system that can translate kernel execution into pipeline concepts you already understand: Airflow tasks, Spark jobs, dbt transformations, data quality checks. When this mapping works well, you get high-level pipeline insights without needing to understand kernel internals.
You investigate a failed Airflow task the same way you always have. The difference is the
investigation is based on what actually executed, not what got logged.
Introducing Tracer: Execution-level observability with built-in pipeline mapping
If execution-level observability is so valuable, why isn't everyone already doing it? The short
answer is that until recently, the technology and tooling weren't there. The longer answer
involves kernel access concerns, the gap between raw kernel events and pipeline context, and
the perception that "kernel-level" means "hard to deploy."
Tracer solves these problems through a four-layer architecture that goes from raw kernel data to
actionable pipeline insights.
Layer 1: eBPF extraction without the operational risk
Tracer uses eBPF (extended Berkeley Packet Filter) to extract execution data directly from the
Linux kernel. eBPF programs are verified before they load, run in a sandboxed environment,
and can't crash your system. The overhead is roughly 2% of system resources, low enough that you won't notice it in production.
Layer 2: The semantic filter - from kernel events to pipeline context
The filter layer automatically recognizes pipeline-specific patterns. When a process spawns, Tracer doesn't just see PID 47291; it recognizes that this is a Spark executor running as part of the task "transform_user_data" in your Airflow DAG.
Tracer maps kernel execution to the constructs you actually care about: orchestrator tasks,
compute jobs, data transformations, quality checks. You get high-level pipeline visibility from
low-level execution data.
Layer 3: Synthetic logs - filling the gaps
When Tracer observes execution that isn't being logged, it generates synthetic logs in
OpenTelemetry format. If a subprocess gets killed by the OOM killer before it can report
anything, Tracer saw it happen at the kernel level and creates a log entry for it.
You end up with a complete timeline of what executed, even for components that are silent in
traditional logging.
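For illustration only, a synthetic entry might carry roughly this shape, following the OpenTelemetry log data model's field names (the attribute keys and values below are invented for the example, not Tracer's actual output):
```python
# Invented example of an OpenTelemetry-style log record for a process that
# never got the chance to log anything itself.
synthetic_log = {
    "timeUnixNano": 1705312800000000000,
    "severityText": "ERROR",
    "body": "Process terminated by the kernel OOM killer before emitting any logs",
    "attributes": {
        "process.pid": 47291,
        "process.command_line": "java ... CoarseGrainedExecutorBackend",  # placeholder
        "signal": "SIGKILL",
        "pipeline.task": "transform_user_data",  # from the semantic mapping layer
    },
}
print(synthetic_log)
```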
Layer 4: From data to insights
The insights layer runs automated root cause analysis on execution data. When an alert fires,
Tracer has already:
- Built multiple hypotheses about what went wrong
- Tested them against the execution timeline
- Correlated signals across orchestrators, compute, storage, and data quality tools
- Generated a report with recommended fixes
You're not staring at dashboards trying to piece together what happened. The investigation is
already done.
Execution-level observability without the operational overhead
What used to require specialized eBPF knowledge, custom instrumentation, and manual
correlation is now a deployable system. You get:
- Kernel-level visibility without kernel-level complexity
- Pipeline semantics without framework lock-in
- Complete execution timelines without waiting for logs
- Automated investigations without building correlation logic
The technology barrier that kept execution-level observability niche doesn't exist anymore.
What's left is deciding whether you want to keep debugging with incomplete information or move
to observing what actually ran.
Moving past log archaeology to execution observation
We've been debugging pipelines the same way for twenty years. We check the logs, correlate timestamps, and fill in the blanks. We add more logging, wait for the issue to happen again, and repeat.
It works, mostly. But it's slow, and it gets worse as your pipelines get more complex.
Execution-level observability is different. Instead of asking what the application logged, you're seeing what actually ran. You're looking at the complete execution timeline instead of piecing together fragments. And you're not waiting for incidents to recur so you can add better logging; the data's already there.
Execution-level observability also opens the door to automated incident investigation: AI agents that identify root causes, test hypotheses in parallel, and deliver full incident reports before you even get pinged.
Ready for a new approach to incident investigation?
Get started for free at tracer.cloud.