Automatic WireGuard Failover with NixOS and Prometheus
When running services through a WireGuard VPN, a single server becomes a point of failure. If it goes down or starts dropping packets, everything depending on that tunnel breaks until you manually switch.
Here’s how I built automatic failover for multiple WireGuard endpoints using NixOS systemd services and Prometheus metrics.
Architecture Overview
The solution has three components:
- wg-select - Picks the initial VPN server on boot
- wg-failover - Switches to the best available server when triggered
- wg-health-check - Monitors packet loss and triggers failover when degraded
All metrics come from Prometheus via two smokeping probers with different jobs:
- External prober (on the hypervisor) - Continuously pings all VPN server endpoints. This gives you packet loss data for every server, even the ones you’re not connected to. Essential for making informed failover decisions.
- Internal prober (inside VPN namespace) - Pings root DNS servers through the active tunnel. This measures actual tunnel quality and triggers failover when degraded.
The external prober is what makes this work: you can’t measure tunnel quality for a server you’re not connected to, but you can measure basic reachability. When failover triggers, the system picks the server with the lowest packet loss from the external prober’s perspective.
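The selection rule itself is simple: skip the failing server, then take the minimum of the remaining loss figures. A standalone sketch of that decision, with made-up loss values in place of the Prometheus queries:

```shell
#!/usr/bin/env bash
# Hypothetical per-server loss fractions, as the external prober would
# report them. In the real script these come from Prometheus.
declare -A LOSS=(
  [server1]="0.42"   # the currently active, failing server
  [server2]="0.01"
  [server3]="0.05"
)
CURRENT="server1"

BEST_SERVER=""
BEST_LOSS="999"
for server in "${!LOSS[@]}"; do
  # Never fail over to the server that is already failing.
  [[ "$server" == "$CURRENT" ]] && continue
  if awk -v a="${LOSS[$server]}" -v b="$BEST_LOSS" 'BEGIN { exit !(a < b) }'; then
    BEST_LOSS="${LOSS[$server]}"
    BEST_SERVER="$server"
  fi
done
echo "$BEST_SERVER"   # server2
```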
Server Selection on Boot
The wg-select service runs before the WireGuard service starts. It either restores the previous selection from a state file or picks a random server on first boot:
```nix
systemd.services.wg-select = {
  description = "Select initial VPN server";
  wantedBy = [ "wg.service" ];
  before = [ "wg.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    set -euo pipefail
    STATE_FILE="/var/lib/wg-failover/current"
    CONF_LINK="/run/wg-active.conf"
    SERVERS=(server1 server2 server3)
    if [[ -f "$STATE_FILE" ]]; then
      CURRENT=$(cat "$STATE_FILE")
    else
      mkdir -p /var/lib/wg-failover
      CURRENT=''${SERVERS[$RANDOM % ''${#SERVERS[@]}]}
      echo "$CURRENT" > "$STATE_FILE"
    fi
    ln -sf "/run/secrets/wireguard/$CURRENT" "$CONF_LINK"
  '';
};
```
The WireGuard service uses /run/wg-active.conf as its config file, which is just a symlink to whichever server config is currently selected.
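The state-file and symlink behavior can be exercised in isolation. A throwaway sketch, with temp-dir paths standing in for the real /var/lib/wg-failover and /run locations:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Throwaway directory standing in for /var/lib/wg-failover and /run.
DIR=$(mktemp -d)
STATE_FILE="$DIR/current"
CONF_LINK="$DIR/wg-active.conf"
SERVERS=(server1 server2 server3)

# First boot: no state file, so pick a random server and remember it.
if [[ -f "$STATE_FILE" ]]; then
  CURRENT=$(cat "$STATE_FILE")
else
  CURRENT=${SERVERS[RANDOM % ${#SERVERS[@]}]}
  echo "$CURRENT" > "$STATE_FILE"
fi
ln -sf "$DIR/$CURRENT.conf" "$CONF_LINK"

# The link now points at the selected server's config; a later failover
# only has to repoint it and restart the WireGuard unit.
readlink "$CONF_LINK"
```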
Intelligent Failover
When failover triggers, the system queries Prometheus for packet loss metrics on each server and picks the one with the lowest loss:
```nix
systemd.services.wg-failover = {
  description = "VPN failover - switch to best available server";
  path = [ pkgs.curl pkgs.jq pkgs.gawk ];
  serviceConfig.Type = "oneshot";
  script = ''
    set -euo pipefail
    PROMETHEUS_URL="https://prometheus.example.com"
    declare -A SERVER_IPS=(
      [server1]="203.0.113.10"
      [server2]="203.0.113.20"
      [server3]="203.0.113.30"
    )
    CURRENT=$(cat /var/lib/wg-failover/current)
    BEST_SERVER=""
    BEST_LOSS="999"
    for server in "''${!SERVER_IPS[@]}"; do
      [[ "$server" == "$CURRENT" ]] && continue
      ip="''${SERVER_IPS[$server]}"
      loss=$(curl -sk --max-time 5 "$PROMETHEUS_URL/api/v1/query" \
        --data-urlencode "query=1 - (rate(smokeping_response_duration_seconds_count{host=\"$ip\"}[5m]) / rate(smokeping_requests_total{host=\"$ip\"}[5m]))" \
        | jq -r '.data.result[0].value[1] // "1"')
      if awk "BEGIN {exit !($loss < $BEST_LOSS)}"; then
        BEST_LOSS="$loss"
        BEST_SERVER="$server"
      fi
    done
    # Fallback to random if Prometheus unavailable
    if [[ -z "$BEST_SERVER" ]]; then
      SERVERS=(server1 server2 server3)
      AVAILABLE=()
      for s in "''${SERVERS[@]}"; do
        [[ "$s" != "$CURRENT" ]] && AVAILABLE+=("$s")
      done
      BEST_SERVER="''${AVAILABLE[$RANDOM % ''${#AVAILABLE[@]}]}"
    fi
    echo "$BEST_SERVER" > /var/lib/wg-failover/current
    ln -sf "/run/secrets/wireguard/$BEST_SERVER" /run/wg-active.conf
    systemctl restart wg.service
  '';
};
```
The failover service skips the currently failing server and picks the healthiest alternative. If Prometheus is unreachable, it falls back to random selection.
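One detail worth calling out: bash can only compare integers, so the script shells out to awk for the floating-point loss comparison. The `exit !(a < b)` idiom inverts the boolean into a shell-friendly exit status, so awk exits 0 (success) exactly when the comparison holds:

```shell
# awk exits with status 0 when the expression is true, because
# `exit !(expr)` turns a true expression into exit code 0.
float_lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'; }

float_lt 0.05 0.15 && echo "0.05 is less than 0.15"
float_lt 0.30 0.15 || echo "0.30 is not less than 0.15"
```

Passing the values with `-v` (rather than interpolating them into the awk program, as the failover script does) also avoids handing an unvalidated string to awk as code.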
Health Check Timer
A timer runs every minute to check the VPN’s actual packet loss. It uses consecutive failure counting to avoid flapping:
```nix
systemd.services.wg-health-check = {
  description = "VPN health check - trigger failover on sustained packet loss";
  after = [ "wg.service" ];
  path = [ pkgs.curl pkgs.jq pkgs.bc ];
  serviceConfig.Type = "oneshot";
  script = ''
    set -euo pipefail
    STATE_FILE="/var/lib/wg-failover/health-failures"
    PROMETHEUS_URL="https://prometheus.example.com"
    THRESHOLD="0.15"  # 15% packet loss
    MAX_FAILURES=3    # Trigger after 3 consecutive failures
    LOSS=$(curl -sk --max-time 5 "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=1 - (rate(smokeping_response_duration_seconds_count{job="vpn-prober"}[5m]) / rate(smokeping_requests_total{job="vpn-prober"}[5m]))' \
      | jq -r '.data.result[0].value[1] // "1"')
    if (( $(echo "$LOSS > $THRESHOLD" | bc -l) )); then
      FAILURES=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
      FAILURES=$((FAILURES + 1))
      echo "$FAILURES" > "$STATE_FILE"
      if [[ $FAILURES -ge $MAX_FAILURES ]]; then
        echo "0" > "$STATE_FILE"
        systemctl start wg-failover.service
      fi
    else
      echo "0" > "$STATE_FILE"
    fi
  '';
};

systemd.timers.wg-health-check = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnBootSec = "2min";
    OnUnitActiveSec = "1min";
    RandomizedDelaySec = "10s";
  };
};
```
Three consecutive checks above 15% packet loss trigger failover. A single successful check resets the counter.
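The debounce logic is easy to sanity-check on its own. A standalone sketch, with a temp state file and a list of loss values standing in for successive Prometheus queries:

```shell
#!/usr/bin/env bash
set -euo pipefail
STATE_FILE=$(mktemp)
echo 0 > "$STATE_FILE"
THRESHOLD="0.15"
MAX_FAILURES=3
TRIGGERED=0

check() {  # one health-check run with the given loss value
  local loss="$1"
  if awk -v l="$loss" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    local failures=$(( $(cat "$STATE_FILE") + 1 ))
    echo "$failures" > "$STATE_FILE"
    if (( failures >= MAX_FAILURES )); then
      echo 0 > "$STATE_FILE"
      TRIGGERED=$((TRIGGERED + 1))  # stands in for `systemctl start wg-failover`
    fi
  else
    echo 0 > "$STATE_FILE"
  fi
}

# Two bad checks, one good check (which resets the counter), then three
# bad checks in a row: only the final run triggers failover.
for loss in 0.30 0.25 0.02 0.30 0.30 0.30; do check "$loss"; done
echo "$TRIGGERED"   # 1
```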
Triggering Failover on Service Failure
The WireGuard service itself triggers failover when it fails to start (e.g., DNS resolution fails at boot). Note that with Restart=on-failure set, OnFailure= only fires once systemd exhausts its restart attempts (hits the unit’s start rate limit) and the unit enters the failed state:
```nix
systemd.services.wg.serviceConfig = {
  Restart = "on-failure";
  RestartSec = "30s";
};
systemd.services.wg.unitConfig.OnFailure = "wg-failover.service";
```
Monitoring All VPN Endpoints (External Prober)
The external prober runs on the hypervisor, outside any VPN namespace. It monitors all VPN server endpoints directly:
```nix
services.prometheus.exporters.smokeping = {
  enable = true;
  hosts = [
    # General internet connectivity
    "8.8.8.8"
    "1.1.1.1"
    # All VPN server endpoints
    "203.0.113.10" # server1
    "203.0.113.20" # server2
    "203.0.113.30" # server3
  ];
};
```
This gives you continuous packet loss metrics for every server. When the failover script queries Prometheus for smokeping_response_duration_seconds_count{host="203.0.113.20"}, it’s using data from this prober.
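The PromQL expression is just one minus the response rate divided by the request rate. With the counter deltas in hand, the arithmetic is trivial; a quick sketch with made-up values for a 5-minute window:

```shell
#!/usr/bin/env bash
# Made-up counter increases over the query window:
# 300 pings sent, 282 answered.
REQUESTS=300
RESPONSES=282
LOSS=$(awk -v s="$REQUESTS" -v r="$RESPONSES" 'BEGIN { printf "%.2f", 1 - r / s }')
echo "$LOSS"   # 0.06, i.e. 6% packet loss
```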
Measuring Tunnel Quality (Internal Prober)
The internal prober runs inside the VPN namespace and measures actual tunnel quality by pinging through it:
```nix
systemd.services.prometheus-smokeping-vpn = {
  after = [ "wg.service" ];
  wants = [ "wg.service" ];
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    ExecStart = "${pkgs.prometheus-smokeping-prober}/bin/smokeping_prober --privileged --web.listen-address=0.0.0.0:9374 198.41.0.4 192.33.4.12";
    NetworkNamespacePath = "/var/run/netns/wg";
    AmbientCapabilities = [ "CAP_NET_RAW" ];
    Restart = "always";
  };
};
```
It pings root DNS servers (a.root-servers.net, c.root-servers.net) through the active tunnel. The health check service queries this prober’s metrics to detect tunnel degradation.
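This assumes a wg network namespace already exists at /var/run/netns/wg. How that namespace gets created depends on your WireGuard setup; purely as an illustrative sketch (the unit name and ordering are assumptions, not part of the setup above), it could be a oneshot that runs before wg.service:

```nix
# Hypothetical sketch: adapt to however your WireGuard service
# actually creates and tears down its namespace.
systemd.services.wg-netns = {
  before = [ "wg.service" ];
  requiredBy = [ "wg.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
    # Leading "-" tells systemd to ignore failure, e.g. if the
    # namespace already exists from a previous run.
    ExecStart = "-${pkgs.iproute2}/bin/ip netns add wg";
    ExecStop = "${pkgs.iproute2}/bin/ip netns delete wg";
  };
};
```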
Monitoring the Current Server
For visibility in Grafana, export the current server as a Prometheus metric:
```nix
systemd.services.wg-metrics = {
  wantedBy = [ "wg.service" ];
  after = [ "wg-select.service" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    CURRENT=$(cat /var/lib/wg-failover/current)
    cat > /var/lib/prometheus-node-exporter/vpn_server.prom << EOF
    # HELP vpn_server_info Current VPN server
    # TYPE vpn_server_info gauge
    vpn_server_info{server="$CURRENT"} 1
    EOF
  '';
};
```
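One caveat with the textfile collector: node_exporter can scrape a half-written file. The usual fix is to write to a temporary file in the same directory and rename it into place, since rename is atomic on the same filesystem. A sketch of that pattern, using a throwaway directory in place of /var/lib/prometheus-node-exporter:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Throwaway stand-in for the node_exporter textfile collector directory.
TEXTFILE_DIR=$(mktemp -d)
CURRENT="server2"   # would be read from /var/lib/wg-failover/current

# Write to a temp file in the same directory, then rename: the collector
# never sees a partially written metrics file.
TMP=$(mktemp "$TEXTFILE_DIR/vpn_server.prom.XXXXXX")
cat > "$TMP" << EOF
# HELP vpn_server_info Current VPN server
# TYPE vpn_server_info gauge
vpn_server_info{server="$CURRENT"} 1
EOF
mv "$TMP" "$TEXTFILE_DIR/vpn_server.prom"

cat "$TEXTFILE_DIR/vpn_server.prom"
```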
Results
When a server becomes degraded, the system automatically switches within 3-4 minutes. The smokeping metrics in Grafana show exactly when failovers happen and why.
The key insight was using Prometheus as the source of truth for server health rather than trying to probe servers from the failover script itself. This means the failover decision is based on the same metrics used for alerting, and you can see the historical data that led to each switch.