Visibility Is All You Need
An idea for a new high-compute monitoring system to improve productivity and efficiency in scientific computing
AI is now shifting from helping humans be more productive to discovering new science on its own. OpenAI has set concrete goals: routine small discoveries by 2026 and meaningful contributions to major breakthroughs by 2028. Hitting those milestones will require far more compute than today’s data-center capacity can realistically provide.
The bottleneck is compute, and there isn’t enough of it. A massive global buildout is beginning to unfold to meet that demand. China is building nuclear power plants expected to add [118 GW](https://globalenergymonitor.org/report/china-is-building-half-of-the-worlds-new-nuclear-power-despite-inland-plants-pause/) of capacity (roughly half of the world’s new nuclear power under construction), Google is exploring [space-based data centers](https://research.google/blog/exploring-a-space-based-scalable-ai-infrastructure-system-design/), and everyone is scrambling to unlock more compute. These efforts will take years to materialize, which raises a more immediate question: what can teams do today?
This is where a consistent pattern emerges. Across teams and environments, we keep seeing the same problem once pipelines scale beyond a laptop: a large share of available compute gets wasted simply because teams can’t easily see how their pipelines behave or where inefficiencies come from. Thirty to forty percent waste is not unusual, especially for heavy scientific workloads.
That’s why we’re building Tracer: a visibility layer that makes pipeline inefficiencies obvious and lets teams surface the biggest optimization opportunities in a couple of clicks.

Why Scaling a Pipeline from a Laptop to HPC/Cloud Is Hard
Modern life science AI depends on bioinformatics pipelines. Models like AlphaFold, Enformer, and gene-expression predictors only work because pipelines generate the sequences, alignments, annotations, and labels they learn from.
Prototype pipeline versions fit on a laptop, but real discovery involves thousands of samples on remote high-compute clusters, GPU farms, or cloud batch systems that process many samples in parallel. To get there, most teams follow the same progression:
- A researcher starts by developing the pipeline in a local laptop environment
- The pipeline is then formalized in a workflow engine such as Nextflow, Snakemake, or WDL
- The workflow is then deployed onto an HPC cluster or cloud batch environment (step three is sketched below)
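To make the jump from step one to step three concrete, here is a minimal, illustrative Python sketch: the same alignment command that runs as a plain local process on a laptop becomes a Slurm job with up-front resource requests once it reaches an HPC cluster. The tool arguments, file names, and resource numbers are placeholders, not recommendations.

```python
import subprocess

# Illustrative alignment step; the tool, its arguments, and the resource
# numbers below are placeholders.
cmd = "STAR --runThreadN 8 --genomeDir ref/ --readFilesIn sample_R1.fq sample_R2.fq"

def run_locally() -> None:
    # Step one: on a laptop, the step is just a local process.
    subprocess.run(cmd, shell=True, check=True)

def submit_to_slurm() -> None:
    # Step three: on an HPC cluster the same step becomes a scheduler job,
    # and CPU, memory, and walltime must be requested up front.
    subprocess.run(
        ["sbatch", "--cpus-per-task=8", "--mem=64G",
         "--time=04:00:00", f"--wrap={cmd}"],
        check=True,
    )
```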
The pain starts at step three. The common failure modes are well known:
- Pipelines that ran cleanly locally take days to debug in the cloud due to opaque permission, environment, or container issues.
- At scale, 5 to 10 percent of tasks fail with generic exit codes, and no one can tell whether the cause is a tool bug, a corrupted input file, or a transient node issue (see the sketch after this list).
- A frozen task can be killed mid-run while the workflow still reports success. The team discovers corrupted output weeks later.
- Tasks run longer than expected, so teams overprovision CPU, GPU, and memory to stay safe. Cost and planning become unpredictable.
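To illustrate the exit-code problem, here is a small Python sketch of everything a workflow engine typically has to go on when a task dies: an exit status. The task command below simulates a SIGKILL (as the OOM killer would deliver) so the sketch runs anywhere; the point is that the status alone cannot distinguish an out-of-memory kill from a tool bug or a bad input.

```python
import signal
import subprocess

# Placeholder task: simulate a process killed by SIGKILL, the way the
# Linux OOM killer terminates a memory-hungry step.
task_cmd = ["sh", "-c", "kill -9 $$"]

result = subprocess.run(task_cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("task succeeded")
elif result.returncode < 0:
    # A negative return code means the child was terminated by a signal.
    # SIGKILL often points at the OOM killer, but the exit status alone
    # cannot confirm that; you need OS-level evidence (dmesg, cgroup
    # memory events) captured at the moment of failure.
    sig = signal.Signals(-result.returncode).name
    print(f"task killed by {sig}; root cause unknown without host telemetry")
else:
    # A generic nonzero code (often just 1) could be a tool bug, a corrupted
    # input file, or a transient node issue; stderr may or may not say which.
    print(f"task failed with exit code {result.returncode}")
```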
These failures reflect a deeper problem: scientific workloads lack a consistent source of telemetry that connects pipeline steps to what is happening on the machines actually executing them.

The Telemetry Gap in Scientific Computing
Traditional software systems are easy to monitor because their telemetry follows shared standards like JSON, HTTP metadata, and OpenTelemetry. One tool can collect and correlate everything.
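As a point of contrast, this is roughly what “one standard, one collector” looks like in a conventional service, using the open-source opentelemetry-python packages (opentelemetry-api and opentelemetry-sdk). The service and attribute names are made up; what matters is that every span shares the same structure, so a single backend can ingest and correlate all of it.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One provider, one exporter, one shared wire format for every service.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# Every span carries the same structured fields (trace ID, timestamps,
# attributes), so any OpenTelemetry backend can correlate it.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("amount.usd", 49.99)
```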
Scientific computing has no such foundation. Its telemetry is scattered across independent systems that were never designed to interoperate:
- container stdout and stderr
- tool-specific logs from STAR, minimap2, GATK, Cell Ranger
- CPU, RAM, disk, and network metrics from the OS
- scheduler logs from Slurm, LSF, AWS Batch, Kubernetes
- workflow engine logs from Nextflow, Snakemake, WDL, Airflow
- storage and data transfer logs
None of these components share the formats or timestamps needed to debug at scale, so engineers end up doing manual correlation across layers with no common context. And critically, the failures that actually break pipelines live at the OS and tool level: memory pressure, I/O stalls, missing reference files, tool-internal deadlocks, GPU driver issues. Pipeline-level summaries like “task started” or “task completed” don’t expose any of this.
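Some of those signals are sitting in plain sight at the OS level. As one hedged example, on Linux kernels with pressure stall information (PSI) enabled, /proc/pressure/memory and /proc/pressure/io report how much of the recent past tasks spent stalled on memory or I/O; the sketch below simply parses those files. None of this shows up in any workflow engine’s task summary.

```python
from pathlib import Path

def read_pressure(resource: str) -> dict:
    """Parse /proc/pressure/<resource> ("memory", "io", or "cpu").

    Requires a Linux kernel with PSI enabled. Returns e.g.
    {"some": {"avg10": 1.2, "avg60": 0.8, ...}, "full": {...}}.
    """
    stats = {}
    for line in Path(f"/proc/pressure/{resource}").read_text().splitlines():
        kind, *fields = line.split()  # "some" or "full"
        stats[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return stats

# avg10 is the share of the last 10 seconds tasks spent stalled waiting on
# the resource; sustained high values mean the node, not the tool, is why
# a step is slow.
print("memory pressure:", read_pressure("memory"))
print("io pressure:", read_pressure("io"))
```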
This is the core problem in monitoring high-compute scientific workloads, and it is exactly what our work targets.
Why We Built From the Operating System Up
Every scientific pipeline, no matter how it is written, eventually reduces to a Linux process. This is the only universal layer across cloud, HPC, and on-prem environments.
Higher-level systems only see scheduling decisions or container output, not actual execution behavior. Workflow engines understand orchestration, not what happens inside a running task. Schedulers show queues, not failures. Containers expose stdout and stderr, not memory pressure or I/O stalls. Cloud metrics describe a node, but not the specific task that triggered a spike.
The operating system is the first place where all of these signals converge. OS-level visibility makes scientific compute debuggable, optimizable, and predictable at scale: something the field has never had.
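What “OS-level visibility” means in practice can be sketched in a few lines. The snippet below is illustrative, not our implementation: it uses the psutil library to sample CPU, memory, and I/O for one running task and its child processes, tagged with a task name that would come from the workflow engine. That join between a workflow task and its Linux processes is the correlation step everything above is missing.

```python
import os
import psutil

def sample_task(pid: int, task_name: str) -> dict:
    """Sample OS-level signals for one pipeline task and its children.

    `task_name` is a hypothetical label supplied by the workflow engine;
    attaching it to OS metrics is what makes the numbers attributable to
    a specific pipeline step.
    """
    proc = psutil.Process(pid)
    procs = [proc] + proc.children(recursive=True)
    return {
        "task": task_name,
        "pids": [p.pid for p in procs],
        # cpu_percent(interval=...) blocks briefly to measure real usage.
        "cpu_percent": sum(p.cpu_percent(interval=0.2) for p in procs),
        "rss_bytes": sum(p.memory_info().rss for p in procs),
        # io_counters() is available on Linux and exposes the read/write
        # volume behind I/O stalls and runaway temp-file writes.
        "read_bytes": sum(p.io_counters().read_bytes for p in procs),
        "write_bytes": sum(p.io_counters().write_bytes for p in procs),
    }

# Example: sample this process itself; in production the PID would come
# from the scheduler or workflow engine.
print(sample_task(os.getpid(), task_name="align/sample_001"))
```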
We believe OS monitoring will become the foundation for a new category of high-compute observability built for the scale of modern digital data processing.
The Opportunity In Front of Us
AI is beginning to generate real scientific discoveries, but that will only continue if the infrastructure underneath keeps up. Today there simply isn’t enough compute or electricity to meet AI’s demands, and efficiency is becoming the bottleneck in scientific computing.
Wasted CPU, GPU, and memory slow research more than any algorithmic limitation. And without observability, pipelines can’t be optimized, resources get overprovisioned, failures stay hidden, and teams lack the rightsizing tooling needed to run at full efficiency.
The next phase of AI will be defined by how well we use the compute that is available.
Tracer is our attempt to build an OS-level visibility layer so high-compute infrastructure can actually keep up with scientific AI.