Monitoring the *arr Stack on NixOS

The *arr applications (Radarr, Sonarr, Prowlarr) don’t make much noise when things go wrong. An indexer times out a few times and gets silently disabled. Downloads sit in a queue because of a permissions issue. You find out days later when you check why that show never appeared. I wanted alerts for these situations, so I set up monitoring with Prometheus and Loki.

Getting Metrics with Exportarr

The *arr apps have internal APIs but don’t expose Prometheus metrics natively. Exportarr bridges that gap - it queries each app’s API and exposes the data in Prometheus format.

On NixOS, I run exportarr as a systemd service for each app. It needs the app’s URL and an API key (which you can find in Settings > General in each app’s web UI):

{ config, pkgs, ... }:
{
  systemd.services.exportarr-radarr = {
    wantedBy = [ "multi-user.target" ];
    after = [ "radarr.service" ];
    serviceConfig = {
      ExecStart = ''
        ${pkgs.exportarr}/bin/exportarr radarr \
          --url http://localhost:7878 \
          --api-key-file ${config.sops.secrets."radarr/apikey".path} \
          --port 9707
      '';
      DynamicUser = true;
    };
  };
}

I use sops-nix for the API key, but you could also pass it via an environment variable or a plain file. The same pattern works for Sonarr (port 8989, exporter on 9708) and Prowlarr (port 9696, exporter on 9709).
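
Rather than copying that service block three times, the per-app differences boil down to two ports and a secret path, so the services can be generated from a single attribute set. A sketch, assuming sops-nix secrets named sonarr/apikey and prowlarr/apikey exist alongside the radarr one:

{ config, lib, pkgs, ... }:
let
  # One entry per app: the app's own port and the port the exporter
  # should listen on. The sops-nix secrets "<name>/apikey" are assumed
  # to exist for all three apps, not just Radarr.
  arrApps = {
    radarr   = { appPort = 7878; exporterPort = 9707; };
    sonarr   = { appPort = 8989; exporterPort = 9708; };
    prowlarr = { appPort = 9696; exporterPort = 9709; };
  };
in
{
  systemd.services = lib.mapAttrs' (name: ports:
    lib.nameValuePair "exportarr-${name}" {
      wantedBy = [ "multi-user.target" ];
      after = [ "${name}.service" ];
      serviceConfig = {
        ExecStart = ''
          ${pkgs.exportarr}/bin/exportarr ${name} \
            --url http://localhost:${toString ports.appPort} \
            --api-key-file ${config.sops.secrets."${name}/apikey".path} \
            --port ${toString ports.exporterPort}
        '';
        DynamicUser = true;
      };
    }) arrApps;
}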

Once the exporters are running, add them to your Prometheus scrape config:

services.prometheus.scrapeConfigs = [
  {
    job_name = "radarr";
    static_configs = [{ targets = [ "localhost:9707" ]; }];
  }
  {
    job_name = "sonarr";
    static_configs = [{ targets = [ "localhost:9708" ]; }];
  }
];
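
Prowlarr’s exporter (on 9709, as noted above) is one more entry of the same shape if you scrape it as well:

{
  job_name = "prowlarr";
  static_configs = [{ targets = [ "localhost:9709" ]; }];
}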

Writing Alerts

Exportarr gives you metrics like radarr_system_status (1 when healthy, 0 when the app isn’t responding) and radarr_system_health_issues (a count of the warnings shown on the System > Status page). There’s also radarr_queue_total with a download_state label that tells you whether items are stuck.

For basic availability monitoring, I alert when the system status drops to zero:

services.prometheus.rules = [
  (builtins.toJSON {
    groups = [{
      name = "arr";
      rules = [
        {
          alert = "RadarrDown";
          expr = ''radarr_system_status == 0'';
          for = "5m";
          labels.severity = "warning";
          annotations.summary = "Radarr is not responding.";
        }
        {
          alert = "SonarrDown";
          expr = ''sonarr_system_status == 0'';
          for = "5m";
          labels.severity = "warning";
          annotations.summary = "Sonarr is not responding.";
        }
      ];
    }];
  })
];

services.prometheus.rules expects strings of rule-file content, which is why the group is wrapped in builtins.toJSON - JSON is valid YAML, so Prometheus reads it as a normal rule file. The for = "5m" means the condition has to be true for five minutes before firing, which avoids alerts during brief restarts.

Import failures are worth watching too. When Radarr or Sonarr can’t move a downloaded file to the library (usually permissions or disk space), the item sits in the queue with download_state="importBlocked". I give this a longer duration since imports can legitimately take a while during post-processing:

{
  alert = "RadarrImportBlocked";
  expr = ''radarr_queue_total{download_state="importBlocked"} > 0'';
  for = "30m";
  labels.severity = "warning";
  annotations.summary = "Radarr has {{ $value }} items stuck in import queue.";
}

The health issues metric is a catch-all for problems that show up in the app’s status page - things like download client connectivity, indexer errors, or update notifications:

{
  alert = "RadarrHealthIssues";
  expr = ''radarr_system_health_issues > 0'';
  for = "15m";
  labels.severity = "warning";
  annotations.summary = "Radarr reports health issues - check System > Status.";
}

Log-Based Alerts with Loki

Some problems don’t surface in metrics. When Prowlarr can’t reach an indexer, it logs the connection failure and eventually disables that indexer. Radarr and Sonarr then log that the indexer has been disabled. These messages are useful for alerting.

If you’re already shipping systemd journal logs to Loki (via promtail or the journal scraper), you can write LogQL alert rules:

{
  alert = "ProwlarrIndexerDown";
  expr = ''
    count_over_time({unit="prowlarr.service"}
      |~ "Http request timed out|Unable to connect|Connection refused" [15m]) > 2
  '';
  "for" = "5m";
  labels.severity = "warning";
  annotations.summary = "Prowlarr cannot reach indexers.";
}

{
  alert = "RadarrIndexerDisabled";
  expr = ''
    count_over_time({unit="radarr.service"}
      |= "Indexer is disabled till" [15m]) > 0
  '';
  "for" = "5m";
  labels.severity = "warning";
  annotations.summary = "Radarr indexers disabled due to failures.";
}
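
These attrsets use the same alert/expr/for schema as the Prometheus rules, but Loki evaluates them through its ruler, which loads plain rule files rather than reading a NixOS option. One way to wire that up is to serialize the group to a file and point the ruler at it - a sketch, where the rule directory layout, the scratch path, and the Alertmanager address are my assumptions rather than part of the setup described here:

{ pkgs, ... }:
let
  # Single-tenant Loki (auth_enabled: false) reads rules from a "fake"
  # tenant directory, hence the fake/ prefix. JSON is valid YAML, so
  # builtins.toJSON produces a rule file Loki accepts.
  arrLogRules = pkgs.writeTextDir "fake/arr-logs.yml" (builtins.toJSON {
    groups = [{
      name = "arr-logs";
      rules = [
        {
          alert = "ProwlarrIndexerDown";
          expr = ''
            count_over_time({unit="prowlarr.service"}
              |~ "Http request timed out|Unable to connect|Connection refused" [15m]) > 2
          '';
          for = "5m";
          labels.severity = "warning";
          annotations.summary = "Prowlarr cannot reach indexers.";
        }
        # ...plus the RadarrIndexerDisabled rule from above, in the same form
      ];
    }];
  });
in
{
  services.loki.configuration.ruler = {
    storage = {
      type = "local";
      local.directory = toString arrLogRules;
    };
    rule_path = "/var/lib/loki/rules-temp";      # scratch dir the ruler can write to (assumed path)
    alertmanager_url = "http://localhost:9093";  # assumed Alertmanager address
    enable_api = true;
  };
}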

The ProwlarrIndexerDown rule looks for connection errors in Prowlarr logs and fires if it sees more than two in a 15-minute window. RadarrIndexerDisabled catches the specific message Radarr logs when it disables an indexer.
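
Both rules also assume the journal’s systemd unit name is exposed as a unit label. With promtail that usually comes from a relabel rule in the journal scrape config - again a sketch, with the job name and max_age chosen here for illustration:

{
  services.promtail.configuration.scrape_configs = [{
    job_name = "journal";
    journal = {
      max_age = "12h";
      labels.job = "systemd-journal";
    };
    # Copy the systemd unit name into a "unit" label so selectors
    # like {unit="prowlarr.service"} have something to match on.
    relabel_configs = [{
      source_labels = [ "__journal__systemd_unit" ];
      target_label = "unit";
    }];
  }];
}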

Investigating Issues

When an alert fires, I use logcli to dig into what happened:

logcli query '{unit="prowlarr.service"} |~ "(?i)error|timeout"' --limit=50 --since=24h

This searches Prowlarr’s logs for errors or timeouts in the last 24 hours. The |~ operator does regex matching. For exact substring matching, use |= instead.

You can also query across all the *arr services at once:

logcli query '{unit=~".*arr.service"} |= "error"' --since=1h

Between the Prometheus metrics for service health and Loki alerts for log patterns, I catch most problems before they become obvious. The *arr apps still don’t notify you proactively, but at least now something else does.
