
# Monitoring Airflow with Prometheus, StatsD, and Grafana

How to get real-time insights into Airflow using Prometheus, StatsD, and Grafana.

Monitoring Apache Airflow in production requires robust metrics and a unified view for quick debugging. Instead of manually checking the Airflow UI, log files, and disparate tools, a centralized monitoring stack can provide a real-time overview of Airflow's health and performance. By leveraging StatsD for metric emission, a Prometheus StatsD Exporter as a bridge, Prometheus for metric storage, and Grafana for visualization, an Airflow engineer can easily answer questions like: _Is the scheduler up? How many tasks are running or queued? Which DAGs are slow or failing?_ This guide outlines an Airflow monitoring setup using these tools, assuming the reader is already familiar with Airflow and basic monitoring concepts.

## Airflow Metrics Architecture and Components

### The Monitoring Process in a Nutshell

Apache Airflow has built-in support for emitting metrics to StatsD, a simple UDP-based metrics aggregation service. Its components (Scheduler, Webserver, Workers) send metrics in StatsD format. A Prometheus StatsD Exporter then receives those metrics and exposes them as Prometheus metrics for scraping. Prometheus continuously scrapes the exporter's HTTP endpoint and stores the time-series data. Finally, Grafana connects to Prometheus as a data source to query and display the Airflow metrics on dashboards. The data flow can be summarized as: Airflow (StatsD over UDP) → StatsD Exporter → Prometheus (scrape) → Grafana (dashboards).

Airflow cannot natively emit metrics to Prometheus. The StatsD Exporter is therefore needed to bridge the gap between Airflow and Prometheus by translating dot-separated StatsD metric names into a Prometheus format. By default, the exporter replaces dots with underscores and treats the entire metric name as a single Prometheus metric (without labels). For example, an Airflow metric named airflow.dag.my_dag.my_task.duration would become a Prometheus metric airflow_dag_my_dag_my_task_duration; such a long name, with the DAG and task identifiers encoded in it, is hard to filter or group by in queries. To make the data more usable, we configure the exporter with mapping rules that split such names into a base metric plus labels for the dynamic parts.

But why use metric mapping? Airflow's metrics are hierarchical (many include the DAG ID, task ID, etc., in their name), and Prometheus works best with labels for these dimensions. The StatsD Exporter supports a mapping configuration to transform metrics. For instance, Airflow emits a task duration metric airflow.dag.<dag_id>.<task_id>.duration (which records how long a task run took, in milliseconds). We can map this to a Prometheus metric like airflow_task_duration_seconds (converting milliseconds to seconds) with labels for the Airflow instance, DAG, and task. This way, instead of separate metrics for every task, we get one metric we can slice by labels (dag_id, task_id, etc.).

## Deploying StatsD Exporter with Mapping Configuration

Our first component to set up is the Prometheus StatsD Exporter, which will listen for Airflow's StatsD metrics and expose them for Prometheus. You can run the official exporter as a Docker container or binary. For example, using Docker:

```bash
docker run -d -p 9125:9125/udp -p 9102:9102 \
  -v $PWD/statsd_mapping.yml:/tmp/statsd_mapping.yml \
  prom/statsd-exporter --statsd.mapping-config=/tmp/statsd_mapping.yml
```

In this example, the container listens on UDP port 9125 for incoming StatsD packets (Airflow will be configured to send metrics here) and exposes an HTTP endpoint on port 9102 for Prometheus to scrape. We mount a StatsD mapping configuration file into the container and tell the exporter to use it via the --statsd.mapping-config flag. This mapping file is crucial for translating Airflow's dot-notated metrics into labeled Prometheus metrics.
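Before wiring up Airflow, you can give the exporter a quick smoke test. This is only a sketch, assuming the container above runs locally and `nc` is available; without a matching mapping rule, the test metric simply shows up under its default dot-to-underscore name.

```bash
# Send a fake StatsD counter to the exporter over UDP (port 9125 from the docker run above)
echo "airflow.smoke_test:1|c" | nc -u -w1 localhost 9125

# ...then confirm it appears on the Prometheus scrape endpoint (port 9102)
curl -s localhost:9102/metrics | grep smoke_test
```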
**Defining mapping rules:** By default, the exporter would convert dots to underscores with no labels. To improve this, we define mappings that capture portions of the metric name as Prometheus labels. The StatsD Exporter supports a simple glob-pattern mapping language: `*` wildcards match tokens in the metric name, and captured tokens ($1, $2, etc.) can be used in the resulting metric name or labels. Each mapping rule specifies a pattern, an output metric name, optional labels, and the StatsD metric type to match (counter, gauge, or timer). For example, to map the task duration metric:

```yaml
mappings:
  - match: "*.dag.*.*.duration"
    match_metric_type: observer   # matches StatsD timers
    name: "airflow_task_duration_seconds"
    labels:
      airflow_id: "$1"
      dag_id: "$2"
      task_id: "$3"
```

This rule matches any metric that fits `*.dag.*.*.duration`; for instance, the metric airflow.dag.example_dag.my_task.duration would match with $1="airflow", $2="example_dag", $3="my_task". It produces a Prometheus metric airflow_task_duration_seconds with the observed value (duration in seconds).

In practice, you will create multiple mapping rules to cover the various Airflow metrics that embed identifiers. Airflow emits metrics for DAG runs, task outcomes, scheduler status, and more, many of which include the DAG ID, task ID, or operator in their name. Each of these can be mapped to a labeled metric (for example, counters like ti_failures and ti_successes for task instance failures/successes can be mapped with labels for the DAG and task). Writing all mappings manually can be very tedious; luckily, the community has provided examples. The [Databand project](https://github.com/databand-ai/airflow-dashboards#:~:text=Grafana%20dashboards%20and%20StatsD,for%20Airflow%20monitoring) has open-sourced a complete StatsD mapping config covering most default Airflow metrics, which you can adapt for your setup.
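For illustration, here is a rough sketch of two further rules in the same style, covering DAG run durations and per-task outcome counters. The source metric names and patterns are assumptions based on common Airflow StatsD metrics; check what your Airflow version actually emits and adjust accordingly.

```yaml
# Additional rules appended under the `mappings:` key of statsd_mapping.yml
# (illustrative; verify the source metric names against your Airflow version)
- match: "*.dagrun.duration.success.*"
  match_metric_type: observer     # DAG run duration for successful runs
  name: "airflow_dagrun_duration_success_seconds"
  labels:
    airflow_id: "$1"
    dag_id: "$2"
- match: "*.ti.finish.*.*.*"
  match_metric_type: counter      # task instance completions by final state
  name: "airflow_task_finish_total"
  labels:
    airflow_id: "$1"
    dag_id: "$2"
    task_id: "$3"
    state: "$4"
```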
Once the StatsD Exporter is running and the mappings are in place, configure Prometheus to scrape the exporter's metrics. In your prometheus.yml, add a job for the StatsD Exporter, for example:

```yaml
scrape_configs:
  - job_name: 'airflow-metrics'
    static_configs:
      - targets: ['<exporter_host>:9102']
```

Replace `<exporter_host>` with the hostname (or service name) where the StatsD Exporter is running (use localhost:9102 if Prometheus runs on the same server/container as the exporter). This tells Prometheus to scrape the exporter's /metrics HTTP endpoint on port 9102 at its regular scrape interval. With this in place, Prometheus will begin ingesting the Airflow metrics exposed by the exporter.

## Enabling StatsD Metrics in Airflow

Next, Airflow must be configured to emit StatsD metrics. By default, Airflow's metric collection is off, so you need to enable it in airflow.cfg (or via environment variables) and ensure the StatsD client library is installed.

**Install the StatsD client:** Airflow uses the Python statsd package to send metrics. If you installed Airflow via pip, include the StatsD extra (e.g. `pip install 'apache-airflow[statsd]'`) to get the client installed. Many Airflow distributions already include this, but it's good to verify.

**Update airflow.cfg:** Set the StatsD settings under the [metrics] section of your config:

```ini
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 9125
statsd_prefix = airflow
```

These values tell Airflow to enable StatsD metrics and send them to statsd_host:statsd_port. Use the host and UDP port where your StatsD Exporter is listening. In our Docker example we used port 9125 for StatsD, so we specify that here (Airflow's default is 8125, which you would use if your exporter listens on 8125). The statsd_prefix (set to "airflow") will prefix all metric names; if you have multiple Airflow environments, you could give each a distinct prefix to differentiate their metrics (this prefix becomes the first token in the metric name, which our mapping captures as the airflow_id label). After updating the config, restart the Airflow components (Scheduler, Webserver, Workers) so the new metrics settings take effect.

**Important:** Ensure the StatsD Exporter service is running and reachable at the configured host/port _before_ Airflow starts. On startup, Airflow will send a test metric; if the StatsD endpoint is unreachable, Airflow may log an error and fall back to a no-op metrics client (meaning you'd get no metrics). Verifying connectivity at this stage prevents silent failures of metric collection.

Once Airflow is up and running with StatsD enabled, it will continuously emit metrics in the background. You can confirm metrics are flowing by checking the StatsD Exporter's logs for incoming data, or by querying Prometheus for a simple metric (for example, the Airflow scheduler heartbeat counter) after a few minutes.

## Grafana Dashboards for Airflow Metrics

Finally, with Prometheus collecting Airflow metrics, Grafana can be used to visualize the data. In Grafana, add Prometheus as a data source. The next step is to build dashboards that present the metrics in a meaningful way for Airflow monitoring. Typically, it's useful to create at least two kinds of dashboards: a cluster-wide overview and a more granular DAG-specific performance dashboard. Below we describe examples of each. (You can design these dashboards yourself with PromQL queries, or import pre-built ones such as those provided by the [Databand](https://medium.com/databand-ai/everyday-data-engineering-monitoring-airflow-with-prometheus-statsd-and-grafana-d9abef802699) project on [github.com](https://github.com/databand-ai/airflow-dashboards#:~:text=Grafana%20dashboards%20and%20StatsD,for%20Airflow%20monitoring).)

### Cluster Overview Dashboard

A typical cluster overview starts with scheduler health, for example a panel tracking the scheduler heartbeat counter. Another panel can compare running versus queued tasks. While Airflow does not expose a direct "running tasks" gauge, this can be inferred from executor metrics (for example, with the CeleryExecutor: running tasks = total slots minus executor.open_slots); similar calculations apply to other executors. The dashboard can also chart task success and failure rates over time and break them down by DAG or operator using metric labels. Scheduler backlog indicators are equally important: rising counts of queued or starving tasks (for example, scheduler.tasks.starving or scheduler.tasks.executable) can signal capacity issues. For Celery or Kubernetes executors, worker or pod counts and resource usage may be included from external monitoring. Overall, the cluster overview acts as a system heartbeat, making scheduler failures, task backlogs, or DAG import issues immediately visible.
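For concreteness, a few example panel queries are sketched below. The metric names are assumptions that depend on your statsd_prefix and mapping configuration (here, the default dot-to-underscore conversion of Airflow's scheduler and executor metrics), and the parallelism value of 32 is just a placeholder for your own setting.

```promql
# Scheduler liveness: the heartbeat counter should keep increasing
rate(airflow_scheduler_heartbeat[5m])

# Tasks currently queued in the executor
airflow_executor_queued_tasks

# Approximate running tasks (CeleryExecutor): total slots minus open slots,
# assuming parallelism = 32
32 - airflow_executor_open_slots
```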
### DAG Performance Dashboard

For deeper insights into specific workflows, a DAG-level dashboard lets you drill down into one DAG at a time. Such a dashboard uses a dropdown to select a particular dag_id, and all panels then update to show data for that DAG. Key metrics include:

- **DAG run duration:** how long each run of the DAG took to complete, often split by status (success vs. failed). This can be plotted using the metrics we mapped from Airflow's dagrun.duration.success.<dag_id> and dagrun.duration.failed.<dag_id>.
- **DAG schedule delay:** how late a DAG run started compared to its scheduled time (Airflow emits dagrun.schedule_delay.<dag_id> for this, typically in milliseconds). A growing schedule delay can indicate that the scheduler is falling behind.
- **Dependency-check time:** how long the scheduler spends evaluating the DAG's task dependencies before scheduling tasks (Airflow emits dagrun.dependency-check.<dag_id> for this). If this time is large or increasing, it might point to performance issues in scheduling or a very complex DAG structure.
- **Task failures vs. successes** within the DAG, to spot frequent errors (using the task outcome metrics filtered by dag_id).

With our metrics labeled by DAG and task, we could even add panels for individual task durations or failure rates within the DAG by filtering on both dag_id and a specific task_id. By focusing on one DAG, the dashboard helps pinpoint bottlenecks or reliability issues in that pipeline. For instance, if a particular DAG's runs are getting slower over time, the DAG dashboard makes it evident which metric is rising, guiding the engineer toward the root cause, whether that is a slow external service call in one task or growing data volumes.

**Dashboard implementation:** To build these dashboards, you will write PromQL queries to extract and aggregate the metrics. For example, a query for the scheduler heartbeat might simply display the latest value of the counter or its rate of increase. A query for DAG run duration might take the maximum or average of airflow_dagrun_duration_seconds over a time window, or plot individual run durations as a series. You can also use PromQL functions like increase() for counters to calculate rates, or histogram_quantile() if using histogram metrics.
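As a sketch of what such panel queries might look like, the examples below assume the hypothetical metric names from the mapping rules sketched earlier (a summary for DAG run duration and a counter for task outcomes) and a Grafana dashboard variable $dag_id bound to the DAG dropdown.

```promql
# Average duration of successful runs of the selected DAG over the last hour:
# the exporter exposes StatsD timers as summaries, so divide _sum by _count
rate(airflow_dagrun_duration_success_seconds_sum{dag_id="$dag_id"}[1h])
  / rate(airflow_dagrun_duration_success_seconds_count{dag_id="$dag_id"}[1h])

# Task failures within the selected DAG over the last hour
increase(airflow_task_finish_total{dag_id="$dag_id", state="failed"}[1h])
```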
If crafting these from scratch sounds daunting, you can import community dashboards. The [Databand.ai](https://github.com/databand-ai/airflow-dashboards#:~:text=Grafana%20dashboards%20and%20StatsD,for%20Airflow%20monitoring) example repository provides JSON definitions for a cluster overview and a DAG detail dashboard corresponding to the dashboards described above, which can be imported into Grafana and tailored to your environment. Using such templates can jump-start your monitoring setup and ensure you're not missing any important metrics.

## Conclusion

Integrating Apache Airflow with StatsD, Prometheus, and Grafana provides a cohesive, production-ready monitoring solution. Airflow's native StatsD metrics, once translated and labeled via the Prometheus StatsD Exporter, become highly queryable time-series data stored in Prometheus and visualized through Grafana dashboards. This stack delivers real-time visibility into scheduler health, task throughput, DAG performance, and executor behavior, replacing fragmented, manual monitoring with a unified observability layer.

Beyond day-to-day visibility, this setup accelerates troubleshooting. Performance regressions, frequent failures, scheduler slowdowns, and capacity constraints surface quickly through the cluster-level and DAG-specific dashboards, allowing engineers to pinpoint root causes with minimal effort. Once established, the metrics pipeline can be naturally extended with alerting (via Alertmanager or Grafana alerts), turning dashboards into a proactive early-warning system. Overall, this monitoring framework enables teams to operate and scale Airflow with confidence, backed by clear, actionable insights into their workflows.
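As a starting point for such alerting, a minimal Prometheus rule might page when the scheduler heartbeat stops increasing. This is only a sketch; the alert name is illustrative and the metric name again depends on your statsd_prefix and mapping configuration.

```yaml
# alert-rules.yml (sketch): fire if the scheduler heartbeat has stalled for 5 minutes
groups:
  - name: airflow
    rules:
      - alert: AirflowSchedulerHeartbeatStalled
        expr: rate(airflow_scheduler_heartbeat[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Airflow scheduler heartbeat has stopped increasing"
```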
## Other Sources

- [airflow.apache.org](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#:~:text=To%20use%20StatsD%20you%20must,first%20install%20the%20required%20packages)
- [redhat.com](https://www.redhat.com/en/blog/monitoring-apache-airflow-using-prometheus#:~:text=It%20turns%20out%20that%20the,Overall%2C%20the%20Airflow)
