Automatic WireGuard Failover with NixOS and Prometheus
When running services through a WireGuard VPN, a single server becomes a point of failure. If it goes down or starts dropping packets, everything depending on that tunnel breaks until you manually switch.
Here’s how I built automatic failover for multiple WireGuard endpoints using NixOS systemd services and Prometheus metrics.
Architecture Overview
The solution has three components:
- wg-select - Picks the initial VPN server on boot
- wg-failover - Switches to the best available server when triggered
- wg-health-check - Monitors packet loss and triggers failover when degraded
All metrics come from Prometheus via two smokeping probers with different jobs:
- External prober (on the hypervisor) - Continuously pings all VPN server endpoints. This gives you packet loss data for every server, even the ones you’re not connected to. Essential for making informed failover decisions.
- Internal prober (inside VPN namespace) - Pings root DNS servers through the active tunnel. This measures actual tunnel quality and triggers failover when degraded.
The external prober is what makes this work: you can’t measure tunnel quality for a server you’re not connected to, but you can measure basic reachability. When failover triggers, the system picks the server with the lowest packet loss from the external prober’s perspective.
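The selection rule itself is simple: skip the failing server, then take the minimum of the remaining loss figures. A standalone sketch of that decision, with made-up loss values in place of the Prometheus queries:

```shell
#!/usr/bin/env bash
# Hypothetical per-server loss fractions, as the external prober would
# report them. In the real script these come from Prometheus.
declare -A LOSS=(
  [server1]="0.42"   # the currently active, failing server
  [server2]="0.01"
  [server3]="0.05"
)
CURRENT="server1"

BEST_SERVER=""
BEST_LOSS="999"
for server in "${!LOSS[@]}"; do
  # Never fail over to the server that is already failing.
  [[ "$server" == "$CURRENT" ]] && continue
  if awk -v a="${LOSS[$server]}" -v b="$BEST_LOSS" 'BEGIN { exit !(a < b) }'; then
    BEST_LOSS="${LOSS[$server]}"
    BEST_SERVER="$server"
  fi
done
echo "$BEST_SERVER"   # server2
```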
Server Selection on Boot
The wg-select service runs before the WireGuard service starts. It either restores the previous selection from a state file or picks a random server on first boot:
```nix
systemd.services.wg-select = {
  description = "Select initial VPN server";
  wantedBy = [ "wg.service" ];
  before = [ "wg.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    set -euo pipefail
    STATE_FILE="/var/lib/wg-failover/current"
    CONF_LINK="/run/wg-active.conf"
    SERVERS=(server1 server2 server3)
    if [[ -f "$STATE_FILE" ]]; then
      CURRENT=$(cat "$STATE_FILE")
    else
      mkdir -p /var/lib/wg-failover
      CURRENT=''${SERVERS[$RANDOM % ''${#SERVERS[@]}]}
      echo "$CURRENT" > "$STATE_FILE"
    fi
    ln -sf "/run/secrets/wireguard/$CURRENT" "$CONF_LINK"
  '';
};
```
The WireGuard service uses /run/wg-active.conf as its config file, which is just a symlink to whichever server config is currently selected.
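The state-file and symlink behavior can be exercised in isolation. A throwaway sketch, with temp-dir paths standing in for the real /var/lib/wg-failover and /run locations:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Throwaway directory standing in for /var/lib/wg-failover and /run.
DIR=$(mktemp -d)
STATE_FILE="$DIR/current"
CONF_LINK="$DIR/wg-active.conf"
SERVERS=(server1 server2 server3)

# First boot: no state file, so pick a random server and remember it.
if [[ -f "$STATE_FILE" ]]; then
  CURRENT=$(cat "$STATE_FILE")
else
  CURRENT=${SERVERS[RANDOM % ${#SERVERS[@]}]}
  echo "$CURRENT" > "$STATE_FILE"
fi
ln -sf "$DIR/$CURRENT.conf" "$CONF_LINK"

# The link now points at the selected server's config; a later failover
# only has to repoint it and restart the WireGuard unit.
readlink "$CONF_LINK"
```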
Intelligent Failover
When failover triggers, the system queries Prometheus for packet loss metrics on each server and picks the one with the lowest loss:
```nix
systemd.services.wg-failover = {
  description = "VPN failover - switch to best available server";
  path = [ pkgs.curl pkgs.jq pkgs.gawk ];
  serviceConfig.Type = "oneshot";
  script = ''
    set -euo pipefail
    PROMETHEUS_URL="https://prometheus.example.com"
    declare -A SERVER_IPS=(
      [server1]="203.0.113.10"
      [server2]="203.0.113.20"
      [server3]="203.0.113.30"
    )
    CURRENT=$(cat /var/lib/wg-failover/current)
    BEST_SERVER=""
    BEST_LOSS="999"
    for server in "''${!SERVER_IPS[@]}"; do
      [[ "$server" == "$CURRENT" ]] && continue
      ip="''${SERVER_IPS[$server]}"
      loss=$(curl -sk --max-time 5 "$PROMETHEUS_URL/api/v1/query" \
        --data-urlencode "query=1 - (rate(smokeping_response_duration_seconds_count{host=\"$ip\"}[5m]) / rate(smokeping_requests_total{host=\"$ip\"}[5m]))" \
        | jq -r '.data.result[0].value[1] // "1"')
      if awk "BEGIN {exit !($loss < $BEST_LOSS)}"; then
        BEST_LOSS="$loss"
        BEST_SERVER="$server"
      fi
    done
    # Fallback to random if Prometheus unavailable
    if [[ -z "$BEST_SERVER" ]]; then
      SERVERS=(server1 server2 server3)
      AVAILABLE=()
      for s in "''${SERVERS[@]}"; do
        [[ "$s" != "$CURRENT" ]] && AVAILABLE+=("$s")
      done
      BEST_SERVER="''${AVAILABLE[$RANDOM % ''${#AVAILABLE[@]}]}"
    fi
    echo "$BEST_SERVER" > /var/lib/wg-failover/current
    ln -sf "/run/secrets/wireguard/$BEST_SERVER" /run/wg-active.conf
    systemctl restart wg.service
  '';
};
```
The failover service skips the currently failing server and picks the healthiest alternative. If Prometheus is unreachable, it falls back to random selection.
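One detail worth calling out: bash can only compare integers, so the script shells out to awk for the floating-point loss comparison. The `exit !(a < b)` idiom inverts the boolean into a shell-friendly exit status, so awk exits 0 (success) exactly when the comparison holds:

```shell
# awk exits with status 0 when the expression is true, because
# `exit !(expr)` turns a true expression into exit code 0.
float_lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'; }

float_lt 0.05 0.15 && echo "0.05 is less than 0.15"
float_lt 0.30 0.15 || echo "0.30 is not less than 0.15"
```

Passing the values with `-v` (rather than interpolating them into the awk program, as the failover script does) also avoids handing an unvalidated string to awk as code.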
Health Check Timer
A timer runs every minute to check the VPN’s actual packet loss. It uses consecutive failure counting to avoid flapping:
```nix
systemd.services.wg-health-check = {
  description = "VPN health check - trigger failover on sustained packet loss";
  after = [ "wg.service" ];
  path = [ pkgs.curl pkgs.jq pkgs.bc ];
  serviceConfig.Type = "oneshot";
  script = ''
    set -euo pipefail
    STATE_FILE="/var/lib/wg-failover/health-failures"
    PROMETHEUS_URL="https://prometheus.example.com"
    THRESHOLD="0.15"  # 15% packet loss
    MAX_FAILURES=3    # Trigger after 3 consecutive failures
    LOSS=$(curl -sk --max-time 5 "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=1 - (rate(smokeping_response_duration_seconds_count{job="vpn-prober"}[5m]) / rate(smokeping_requests_total{job="vpn-prober"}[5m]))' \
      | jq -r '.data.result[0].value[1] // "1"')
    if (( $(echo "$LOSS > $THRESHOLD" | bc -l) )); then
      FAILURES=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
      FAILURES=$((FAILURES + 1))
      echo "$FAILURES" > "$STATE_FILE"
      if [[ $FAILURES -ge $MAX_FAILURES ]]; then
        echo "0" > "$STATE_FILE"
        systemctl start wg-failover.service
      fi
    else
      echo "0" > "$STATE_FILE"
    fi
  '';
};

systemd.timers.wg-health-check = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnBootSec = "2min";
    OnUnitActiveSec = "1min";
    RandomizedDelaySec = "10s";
  };
};
```
Three consecutive checks above 15% packet loss trigger failover. A single successful check resets the counter.
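The debounce logic is easy to sanity-check on its own. A standalone sketch, with a temp state file and a list of loss values standing in for successive Prometheus queries:

```shell
#!/usr/bin/env bash
set -euo pipefail
STATE_FILE=$(mktemp)
echo 0 > "$STATE_FILE"
THRESHOLD="0.15"
MAX_FAILURES=3
TRIGGERED=0

check() {  # one health-check run with the given loss value
  local loss="$1"
  if awk -v l="$loss" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    local failures=$(( $(cat "$STATE_FILE") + 1 ))
    echo "$failures" > "$STATE_FILE"
    if (( failures >= MAX_FAILURES )); then
      echo 0 > "$STATE_FILE"
      TRIGGERED=$((TRIGGERED + 1))  # stands in for `systemctl start wg-failover`
    fi
  else
    echo 0 > "$STATE_FILE"
  fi
}

# Two bad checks, one good check (which resets the counter), then three
# bad checks in a row: only the final run triggers failover.
for loss in 0.30 0.25 0.02 0.30 0.30 0.30; do check "$loss"; done
echo "$TRIGGERED"   # 1
```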
Triggering Failover on Service Failure
The WireGuard service itself triggers failover when it fails to start (e.g., DNS resolution fails at boot). Note that with Restart=on-failure set, OnFailure= only fires once systemd exhausts its restart attempts (hits the unit’s start rate limit) and the unit enters the failed state:
```nix
systemd.services.wg.serviceConfig = {
  Restart = "on-failure";
  RestartSec = "30s";
};
systemd.services.wg.unitConfig.OnFailure = "wg-failover.service";
```
Monitoring All VPN Endpoints (External Prober)
The external prober runs on the hypervisor, outside any VPN namespace. It monitors all VPN server endpoints directly:
```nix
services.prometheus.exporters.smokeping = {
  enable = true;
  hosts = [
    # General internet connectivity
    "8.8.8.8"
    "1.1.1.1"
    # All VPN server endpoints
    "203.0.113.10" # server1
    "203.0.113.20" # server2
    "203.0.113.30" # server3
  ];
};
```
This gives you continuous packet loss metrics for every server. When the failover script queries Prometheus for smokeping_response_duration_seconds_count{host="203.0.113.20"}, it’s using data from this prober.
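The PromQL expression is just one minus the response rate divided by the request rate. With the counter deltas in hand, the arithmetic is trivial; a quick sketch with made-up values for a 5-minute window:

```shell
#!/usr/bin/env bash
# Made-up counter increases over the query window:
# 300 pings sent, 282 answered.
REQUESTS=300
RESPONSES=282
LOSS=$(awk -v s="$REQUESTS" -v r="$RESPONSES" 'BEGIN { printf "%.2f", 1 - r / s }')
echo "$LOSS"   # 0.06, i.e. 6% packet loss
```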
Measuring Tunnel Quality (Internal Prober)
The internal prober runs inside the VPN namespace and measures actual tunnel quality by pinging through it:
```nix
systemd.services.prometheus-smokeping-vpn = {
  after = [ "wg.service" ];
  wants = [ "wg.service" ];
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${pkgs.prometheus-smokeping-prober}/bin/smokeping_prober --privileged --web.listen-address=0.0.0.0:9374 198.41.0.4 192.33.4.12";
    NetworkNamespacePath = "/var/run/netns/wg";
    AmbientCapabilities = [ "CAP_NET_RAW" ];
    Restart = "always";
  };
};
```
It pings root DNS servers (a.root-servers.net, c.root-servers.net) through the active tunnel. The health check service queries this prober’s metrics to detect tunnel degradation.
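This assumes a wg network namespace already exists at /var/run/netns/wg. How that namespace gets created depends on your WireGuard setup; purely as an illustrative sketch (the unit name and ordering are assumptions, not part of the setup above), it could be a oneshot that runs before wg.service:

```nix
# Hypothetical sketch: adapt to however your WireGuard service
# actually creates and tears down its namespace.
systemd.services.wg-netns = {
  before = [ "wg.service" ];
  requiredBy = [ "wg.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
    # Leading "-" tells systemd to ignore failure, e.g. if the
    # namespace already exists from a previous run.
    ExecStart = "-${pkgs.iproute2}/bin/ip netns add wg";
    ExecStop = "${pkgs.iproute2}/bin/ip netns delete wg";
  };
};
```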
Monitoring the Current Server
For visibility in Grafana, export the current server as a Prometheus metric:
```nix
systemd.services.wg-metrics = {
  wantedBy = [ "wg.service" ];
  after = [ "wg-select.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    CURRENT=$(cat /var/lib/wg-failover/current)
    cat > /var/lib/prometheus-node-exporter/vpn_server.prom << EOF
    # HELP vpn_server_info Current VPN server
    # TYPE vpn_server_info gauge
    vpn_server_info{server="$CURRENT"} 1
    EOF
  '';
};
```
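One caveat with the textfile collector: node_exporter can scrape a half-written file. The usual fix is to write to a temporary file in the same directory and rename it into place, since rename is atomic on the same filesystem. A sketch of that pattern, using a throwaway directory in place of /var/lib/prometheus-node-exporter:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Throwaway stand-in for the node_exporter textfile collector directory.
TEXTFILE_DIR=$(mktemp -d)
CURRENT="server2"   # would be read from /var/lib/wg-failover/current

# Write to a temp file in the same directory, then rename: the collector
# never sees a partially written metrics file.
TMP=$(mktemp "$TEXTFILE_DIR/vpn_server.prom.XXXXXX")
cat > "$TMP" << EOF
# HELP vpn_server_info Current VPN server
# TYPE vpn_server_info gauge
vpn_server_info{server="$CURRENT"} 1
EOF
mv "$TMP" "$TEXTFILE_DIR/vpn_server.prom"

cat "$TEXTFILE_DIR/vpn_server.prom"
```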
Results
When a server becomes degraded, the system automatically switches within 3-4 minutes. The smokeping metrics in Grafana show exactly when failovers happen and why.
The key insight was using Prometheus as the source of truth for server health rather than trying to probe servers from the failover script itself. This means the failover decision is based on the same metrics used for alerting, and you can see the historical data that led to each switch.