Homelab

A single NixOS server handling routing, DNS, storage, and services for the household. Stats are live, pulled from Prometheus every 30 seconds.

Overview

This started as a Proxmox box for a media server and home automation, then grew into a proper router/firewall when I wanted more control over network segmentation. Today it's a single machine that does everything: terminates the ISP fiber via PPPoE, routes between VLANs, serves DHCP and DNS, runs VMs for various services, and acts as a NAS.

The entire system is configured declaratively with NixOS. Every service, firewall rule, VLAN, and DNS record is defined in Nix files and version-controlled. Deploying a change is nixos-rebuild switch, and rolling back is just as easy. The MikroTik switch config is also generated from the same Nix codebase and deployed atomically with automatic rollback if connectivity is lost.

I chose to consolidate everything on one box rather than separate router/NAS/hypervisor machines. It's simpler to manage, uses less power, and with proper VLAN segmentation the security tradeoffs are acceptable for a home environment. The server has enough headroom that I've never felt resource-constrained.

Live Stats

[Live stats render here, fetched from Prometheus by a client-side script.]

Hardware

The server is an ASRock Rack 1U4LW-X570/2L2T with a Ryzen 7 PRO, 64GB ECC RAM, and dual 10GbE. IPMI provides out-of-band management on a dedicated NIC. An Intel Arc A310 handles hardware transcoding, passed through to a media VM via VFIO.

Storage is all ZFS with native encryption: NVMe mirror for boot, SSD mirror for VMs, and a RAIDZ1 HDD pool with hot spare for bulk data. Secure Boot is enabled with custom keys. The MikroTik switch handles VLAN trunking with 10G uplinks to the server.

For detailed specs, see the homelab section on /uses.

Network

The server terminates a 1 Gbps / 600 Mbps fiber connection via PPPoE and handles all routing. Traffic shaping with CAKE keeps latency low even under load—bufferbloat is essentially eliminated. IPv6 runs dual-stack with both a stable ULA prefix for internal addressing and delegated public prefixes from the ISP.
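
The shaping itself boils down to one qdisc on the PPPoE interface. A minimal sketch of the egress side, assuming a oneshot systemd unit and an invented rate slightly under the upstream line speed (the ingress direction works the same way through an IFB device, omitted here):

  systemd.services.cake-uplink = {
    description = "CAKE qdisc on the PPPoE uplink";
    after = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];
    path = [ pkgs.iproute2 ];
    serviceConfig.Type = "oneshot";
    # Shape just below the 600 Mbps upstream rate so the queue builds
    # here, where CAKE can manage it, not in the ISP's equipment.
    script = ''
      tc qdisc replace dev ppp0 root cake bandwidth 560mbit diffserv4 nat
    '';
  };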

VLAN Segmentation

Zone      Purpose                      Policy
clients   Trusted devices              Full internet, access to services
guest     Guest WiFi                   Internet only, isolated
iot       Smart home devices           MQTT/home automation only, no internet*
infra     VMs and services             Internal only, via reverse proxy
public    Neighbor/shared access       VPN egress only
mgmt      Infrastructure management    IPMI, switches, admin access

* IoT devices that need cloud connectivity (firmware updates, etc.) get allowlisted per-device.

Each VLAN is a separate broadcast domain with its own IP range. The server acts as the gateway for all of them, with firewall rules controlling what can talk to what. The goal is defense in depth: even if an IoT device gets compromised, it can't reach anything except the services it needs.
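
In Nix terms, each VLAN is a couple of lines. A sketch with invented names, IDs, and ranges:

  networking.vlans = {
    vlan-iot   = { id = 30; interface = "lan0"; };
    vlan-guest = { id = 40; interface = "lan0"; };
  };
  networking.interfaces = {
    # The server is the gateway address on every VLAN.
    vlan-iot.ipv4.addresses   = [ { address = "10.0.30.1"; prefixLength = 24; } ];
    vlan-guest.ipv4.addresses = [ { address = "10.0.40.1"; prefixLength = 24; } ];
  };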

The "public" VLAN is interesting—it's for sharing internet access with neighbors without giving them access to my network. All traffic from that VLAN is forced through a commercial VPN, so it's completely isolated and doesn't use my public IP.

Firewall

The firewall is nftables with a zone-based policy. Each VLAN maps to a zone, and rules define allowed flows between zones. Everything is default-deny—traffic is dropped unless explicitly permitted. The ruleset is generated from the same config that defines VLANs, so they can't get out of sync.

Some key policies: IoT can only reach MQTT and home automation on specific ports. Guests can only reach the internet. The infra zone isn't directly accessible from clients—everything goes through the reverse proxy. Management is only reachable from the management VLAN or via Tailscale.
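
A condensed sketch of what the generated ruleset boils down to (zone names, addresses, and ports illustrative):

  networking.nftables.enable = true;
  networking.nftables.ruleset = ''
    table inet filter {
      chain forward {
        type filter hook forward priority 0; policy drop;
        ct state established,related accept
        # IoT: MQTT and Home Assistant only, no internet
        iifname "vlan-iot" oifname "vlan-infra" tcp dport { 1883, 8123 } accept
        # Guests: internet only
        iifname "vlan-guest" oifname "ppp0" accept
        # Clients: internet, plus the reverse proxy on the infra VLAN
        iifname "vlan-clients" oifname "ppp0" accept
        iifname "vlan-clients" oifname "vlan-infra" ip daddr 10.0.50.2 tcp dport { 80, 443 } accept
      }
    }
  '';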

IPv4 uses NAT for internet access. IPv6 is properly routed with stateful filtering—no NAT66, but unsolicited inbound traffic is dropped by default. Services that need to be reachable get explicit allow rules.

DNS & DHCP

[Diagram: clients on any VLAN get leases from Kea (DHCPv4/v6) and send DNS queries to Blocky, a proxy with per-VLAN ad blocking. Blocky forwards external queries upstream to Cloudflare and Quad9 over DoT, and internal-zone queries to Knot DNS, which Kea keeps updated via DDNS.]

Kea handles DHCP for both IPv4 and IPv6. Each VLAN gets its own pool with appropriate lease times: short for guests, long for IoT devices that handle frequent renewals poorly. Kea pushes hostname updates to the DNS server automatically, so devices are resolvable by name within minutes of connecting.
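
Roughly what the Kea side looks like (subnets and lifetimes invented):

  services.kea.dhcp4 = {
    enable = true;
    settings = {
      interfaces-config.interfaces = [ "vlan-clients" "vlan-guest" "vlan-iot" ];
      subnet4 = [
        {
          id = 40;
          subnet = "10.0.40.0/24";
          pools = [ { pool = "10.0.40.100 - 10.0.40.200"; } ];
          valid-lifetime = 3600;       # short leases for guests
        }
        {
          id = 30;
          subnet = "10.0.30.0/24";
          pools = [ { pool = "10.0.30.100 - 10.0.30.200"; } ];
          valid-lifetime = 86400;      # long leases for IoT
        }
      ];
      dhcp-ddns.enable-updates = true; # push hostnames to the DNS server
    };
  };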

Knot DNS is the authoritative server for the internal domain. It receives dynamic updates from Kea and serves both forward and reverse lookups. The zone files are stored on disk and survive reboots.

Blocky is what clients actually query. It's a DNS proxy with ad blocking (AdGuard lists, StevenBlack, etc.) and forwards to Cloudflare/Quad9 via DNS-over-TLS. Different VLANs get different blocking levels—full filtering for clients, malware-only for IoT. Internal domains are forwarded to Knot for resolution.
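
A trimmed sketch of the Blocky config (the internal domain, Knot's address, and the exact lists are placeholders):

  services.blocky = {
    enable = true;
    settings = {
      upstreams.groups.default = [
        "tcp-tls:1.1.1.1:853"          # Cloudflare, DNS-over-TLS
        "tcp-tls:9.9.9.9:853"          # Quad9, DNS-over-TLS
      ];
      # Internal zone goes to Knot instead of upstream.
      conditional.mapping."home.example" = "10.0.50.3";
      blocking = {
        denylists = {
          ads     = [ "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts" ];
          malware = [ "https://urlhaus.abuse.ch/downloads/hostfile/" ];
        };
        clientGroupsBlock = {
          default        = [ "ads" "malware" ];  # full filtering for clients
          "10.0.30.0/24" = [ "malware" ];        # IoT VLAN: malware-only
        };
      };
    };
  };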

Ingress

All HTTP/HTTPS traffic—local, remote, public—flows through a single reverse proxy. This centralizes TLS termination, access control, and routing decisions in one place.

Reverse Proxy

Caddy is the central ingress point for all services. Every HTTP/HTTPS request—whether from the local network, Tailscale, or the public internet—goes through Caddy. It handles TLS termination for everything, using automatic certificate management via Cloudflare DNS-01 challenges. This means even internal services get valid HTTPS certificates.

Services are accessed by proper hostnames, never IP addresses. DNS resolves these hostnames to either the internal reverse proxy IP (for LAN/Tailscale access) or the public IP of the external VM (for internet access). Because the homelab advertises its subnets via Tailscale, remote devices can resolve and reach the internal reverse proxy directly—traffic stays on the Tailscale mesh without hitting the public internet.

Each service is configured with fine-grained access control via subnet ACLs. The reverse proxy config specifies which VLANs can reach a service and whether it should be accessible via Tailscale, the public internet, or both. For example:

  • A media server might be accessible from the clients VLAN and Tailscale, but not the public internet
  • A monitoring dashboard might be public but require SSO authentication
  • An internal admin panel might only be reachable from the mgmt VLAN

This per-service configuration is defined declaratively in Nix. Adding a new service means specifying its backend address and which access methods should be allowed. Caddy's config is generated automatically from these declarations, including the appropriate ACL rules.
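
For illustration, roughly what one generated vhost comes out as (hostname, subnets, and backend invented): a remote_ip matcher encodes the ACL, and everything else is rejected before it reaches the backend.

  services.caddy.virtualHosts."media.home.example".extraConfig = ''
    # Allowed: clients VLAN and the Tailscale range; everyone else: 403.
    @denied not remote_ip 10.0.20.0/24 100.64.0.0/10
    respond @denied 403
    reverse_proxy 10.0.50.12:8096
  '';

Caddy orders respond before reverse_proxy, so denied requests never touch the backend.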

For services that need authentication but don't have their own, Caddy integrates with an SSO provider via forward auth. Requests are validated before being proxied to the backend.
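
That's one directive in front of the proxy. A sketch with a placeholder SSO address and verify path:

  services.caddy.virtualHosts."notes.home.example".extraConfig = ''
    forward_auth 10.0.50.8:9091 {
      uri /api/verify?rd=https://auth.home.example
      copy_headers Remote-User Remote-Email
    }
    reverse_proxy 10.0.50.9:3000
  '';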

Public Ingress

My ISP connection has dynamic IPv4 and a dynamic IPv6 prefix—not ideal for hosting anything publicly. Instead, I run a small external VM at a cloud provider that has stable public IPv4 and IPv6 addresses.

This VM connects to my homelab via Tailscale. It runs HAProxy to forward incoming HTTP/HTTPS traffic to Caddy on the homelab. HAProxy uses the PROXY protocol, which preserves the original client IP address through the tunnel. Without this, Caddy would only see the Tailscale IP of the external VM.
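
The forwarder config is short. A sketch with invented addresses (port 80 is analogous): TCP mode passes TLS through untouched, and send-proxy-v2 prepends the PROXY protocol header so Caddy still sees the real client IP.

  services.haproxy = {
    enable = true;
    config = ''
      defaults
        mode tcp
        timeout connect 5s
        timeout client  60s
        timeout server  60s

      frontend https-in
        bind :443
        default_backend homelab

      backend homelab
        server caddy 100.100.1.2:443 send-proxy-v2
    '';
  };

On the receiving side, Caddy has to be told to expect the header; it ships a proxy_protocol listener wrapper for exactly this.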

The external VM is intentionally minimal—it's just a forwarder. All the actual TLS termination, routing logic, and authentication happens on Caddy inside the homelab. The VM's firewall only allows inbound traffic on ports 80 and 443, plus Tailscale.

This setup gives me stable public endpoints without exposing my home IP or dealing with dynamic DNS. If the homelab goes down, the external VM just returns connection errors—it has no state or data of its own.

Tailscale

Tailscale provides secure access to my home network from anywhere. The server advertises routes to internal subnets, so my laptop or phone can reach any device on the network as if I were home. This includes the management VLAN, which isn't reachable any other way from outside.

The homelab also acts as an exit node—I can route all traffic through home when I'm on untrusted networks. It functions as a relay for other Tailscale nodes too, helping with connectivity when direct connections aren't possible.
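
In Nix, the whole thing is a few flags (routes abbreviated, ULA prefix invented):

  services.tailscale = {
    enable = true;
    useRoutingFeatures = "both";    # IP forwarding for subnet routing
    extraUpFlags = [
      "--advertise-routes=10.0.0.0/16,fd12:3456::/48"
      "--advertise-exit-node"
    ];
  };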

Tailscale's NAT traversal means connections usually go peer-to-peer instead of bouncing through its DERP relays. When I'm on a good network, latency to home services is typically just a few milliseconds more than local access.

The external VM also advertises Cloudflare's IP ranges via Tailscale subnet routing. My ISP has notoriously bad peering with Cloudflare, so routing that traffic through the VM bypasses the congestion entirely.

The boot process supports remote unlock via SSH into the initrd. If the server reboots while I'm away, I can SSH in over Tailscale and provide the ZFS encryption passphrase without physical access. The initrd brings up networking and Tailscale before prompting for the key.
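
The SSH-in-initrd part is stock NixOS. A sketch with keys elided (getting Tailscale itself into the initrd takes extra wiring not shown here):

  boot.initrd.network = {
    enable = true;
    ssh = {
      enable = true;
      port = 2222;                   # distinct from the real sshd
      hostKeys = [ "/etc/secrets/initrd/ssh_host_ed25519_key" ];
      authorizedKeys = [ "ssh-ed25519 AAAA… admin@laptop" ];
    };
  };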

Software

The entire stack is declarative and version-controlled. Configuration defines the system, not the other way around.

NixOS

The entire system is NixOS, which means every aspect of the configuration—packages, services, users, firewall rules, systemd units—is declared in Nix files. There's no imperative state to drift. If I want to know how something is configured, I read the config files. If I want to change something, I edit the files and rebuild.

Deployments are atomic. The system either fully switches to the new configuration or stays on the old one. There's no half-applied state. Every configuration is a "generation" that I can boot into from the bootloader, so rolling back a bad change is trivial.

The config lives in a git repo alongside the configs for my other machines. Changes go through the normal code review process in my head, and I can see the full history of what changed when. For risky changes, I test in a VM first: nixos-rebuild build-vm boots a throwaway QEMU VM from the same config.

Virtualization

Services run in MicroVMs—lightweight NixOS virtual machines using QEMU with virtio. Each VM shares the host's Nix store via virtiofs, so they don't need their own copy of packages. A typical VM uses maybe 100MB of disk for its unique state.

MicroVMs boot in seconds and are defined declaratively alongside the host config. Adding a new service means adding a new microvm block to the Nix config and rebuilding. The VMs get their own IP on the infra VLAN and are reachable via the reverse proxy.
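
With the microvm.nix flake providing the host module, a new service VM is roughly this (service, sizes, and share layout are illustrative):

  microvm.vms.grafana = {
    config = {
      microvm = {
        hypervisor = "qemu";
        mem = 1024;
        shares = [{
          proto = "virtiofs";        # share the host's /nix/store
          source = "/nix/store";
          mountPoint = "/nix/.ro-store";
          tag = "ro-store";
        }];
      };
      services.grafana.enable = true;
      # …networking on the infra VLAN, state volume, etc.
    };
  };

The virtiofs share is what keeps per-VM disk usage down to roughly the service's own state.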

For things that need more isolation or aren't NixOS-native, there's also a small k3s cluster running in one of the VMs. It handles experimental workloads and anything that ships as a container image.

Monitoring

Everything is instrumented. Metrics, logs, and alerts flow into a central stack that serves as the single source of truth for system state. This isn't just for dashboards—the APIs are queryable by automation and agents, which can correlate live metrics with the declarative Nix configs to understand both current and desired state.

Metrics & Logs

Prometheus scrapes metrics from everything: node exporters on each VM, ZFS stats, systemd service status, the UPS, network devices via SNMP, and application-specific exporters. Data is retained for several months.
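
The Prometheus side is ordinary NixOS config. An abridged sketch with invented targets:

  services.prometheus = {
    enable = true;
    retentionTime = "180d";          # several months of history
    scrapeConfigs = [{
      job_name = "node";
      static_configs = [{
        targets = [ "localhost:9100" "vm-media:9100" "vm-dns:9100" ];
      }];
    }];
  };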

Loki aggregates logs. Promtail ships systemd journal entries from every host, and the switch/APs send syslog. Logs are searchable in Grafana alongside metrics, which makes correlating issues much easier.

Grafana provides dashboards for everything. I have views for system overview, ZFS health, network traffic per VLAN, DNS query rates, and per-service dashboards. The live stats on this page come from a small Go service that queries Prometheus and exposes a public-safe subset.

Alerting

Alertmanager handles alert routing. Critical alerts (ZFS degraded, service down, UPS on battery) go to my phone immediately. Less urgent things (disk space warnings, high CPU) get batched into summaries.
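
The split is just routing config. A sketch with the receiver endpoints omitted:

  services.prometheus.alertmanager = {
    enable = true;
    configuration = {
      route = {
        receiver = "summary";        # default: batched, low urgency
        group_wait = "5m";
        routes = [{
          matchers = [ "severity=\"critical\"" ];
          receiver = "phone";        # pages immediately
          group_wait = "0s";
        }];
      };
      receivers = [
        { name = "phone"; }          # push notification config omitted
        { name = "summary"; }
      ];
    };
  };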

A status page monitors health checks for every service—HTTP endpoints, TCP connectivity, DNS resolution, and certificate expiry. If something's broken, I usually know before anyone complains.

Backups

zrepl handles ZFS replication. Critical datasets on the SSD pool replicate to the HDD pool, protecting against SSD failure. The HDDs have more space and can keep longer snapshot history.

My laptop also backs up to the server via zrepl over Tailscale. Incremental ZFS snapshots transfer only changed blocks, so even on slow connections backups complete quickly after the initial sync.

Snapshot retention is configured per-dataset based on how critical the data is. Important things keep more history, bulk data keeps less. The HDD pool has enough space to keep months of history for the critical stuff.
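
Condensed, a zrepl push job looks like this (pool names, intervals, and retention grids invented; the matching sink job on the HDD pool is omitted):

  services.zrepl = {
    enable = true;
    settings.jobs = [{
      name = "ssd-to-hdd";
      type = "push";
      connect = { type = "local"; listener_name = "hdd-sink"; client_identity = "local"; };
      filesystems."fast/vms<" = true;
      snapshotting = { type = "periodic"; prefix = "zrepl_"; interval = "15m"; };
      pruning = {
        keep_sender = [ { type = "last_n"; count = 48; } ];
        keep_receiver = [
          { type = "grid"; grid = "24x1h | 30x1d | 6x30d"; regex = "^zrepl_"; }
        ];
      };
    }];
  };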

The third leg of the backup strategy, offsite replication, is still on the TODO list. The plan is to replicate to a remote location, but the local redundancy handles the common failure modes for now.

The live stats above are served from the homelab via a small Go service that queries Prometheus. The rest of this page is static, hosted elsewhere. The infrastructure config is in a private git repo—not because there's anything secret in the Nix code, but because it includes hostnames and internal topology that I'd rather not publish.