Zevs @ zevs.gg

Building a Self-Hosted Infrastructure

Mar 6 · 12 min

There’s a moment every developer hits — you’re paying $50/month for a handful of small VPS instances, your services are scattered across three different cloud providers, and you realize you have no idea where half your stuff is running. That was me two years ago. Today, everything runs on hardware I own, in a setup I fully control.

This post walks through how I built my self-hosted infrastructure from the ground up — the decisions, the stack, and the lessons I picked up along the way.

Why Self-Host?

The obvious answer is cost. Running a few services on cloud VPS instances is cheap, but once you start stacking databases, queues, monitoring, storage, and multiple apps — costs add up fast. A single bare-metal server can replace several cloud instances for a fraction of the recurring cost.

But cost isn’t the real reason. Control is. When you self-host, you own the data, you own the network, and you decide the rules. No vendor lock-in, no surprise pricing changes, no arbitrary rate limits. If something breaks, it’s on you — but at least you can actually fix it.

There’s also the learning aspect. Managing your own infrastructure teaches you things that no tutorial or managed service ever will. DNS propagation, firewall rules, disk I/O bottlenecks, certificate renewal failures at 3 AM — these are the experiences that make you a better engineer.

The Hardware Layer

Everything starts with Proxmox, an open-source virtualization platform built on top of Debian. It gives you a clean web UI for managing virtual machines and containers, with support for clustering, live migration, and backups out of the box.

I run Proxmox on a dedicated server with enough RAM and cores to comfortably host 20+ services. The storage backend is ZFS — a filesystem that handles compression, snapshots, and data integrity verification natively. ZFS snapshots are incredibly useful for rollbacks. Before any risky upgrade, I snapshot the dataset, and if things go wrong, I can roll back in seconds.

# Create a snapshot before upgrading
zfs snapshot rpool/data/myservice@pre-upgrade

# Something went wrong? Roll back instantly
zfs rollback rpool/data/myservice@pre-upgrade
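Cleaning up is one more command. Once an upgrade is verified, the snapshot can be listed and destroyed — a quick sketch using the same dataset as above:

```shell
# List snapshots for the dataset
zfs list -t snapshot -r rpool/data/myservice

# Upgrade verified? Remove the safety net
zfs destroy rpool/data/myservice@pre-upgrade
```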

For lightweight services, I use LXC containers instead of full VMs. LXC gives you near-native performance with process-level isolation — perfect for running databases, reverse proxies, or any service that doesn’t need a full kernel. The resource overhead is minimal compared to a VM.

My general rule: LXC for infrastructure services, Docker for application workloads. This keeps things clean and separable.
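For a sense of what spinning up an LXC container on Proxmox looks like, here's a sketch using the `pct` CLI — the container ID, template name, and sizing are illustrative:

```shell
# Create an unprivileged LXC container from a Debian template
# (template filename and storage names are illustrative)
pct create 101 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname proxy \
  --cores 2 --memory 1024 \
  --rootfs local-zfs:8 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1

pct start 101
```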

Container Orchestration

Most of my application workloads run in Docker containers, orchestrated with Docker Compose. For the scale I operate at, Compose hits the sweet spot — declarative, version-controlled, and simple enough that I can understand exactly what’s running without consulting a dashboard.

A typical service looks like this:

# docker-compose.yml
services:
  app:
    image: ghcr.io/my-org/my-app:latest
    restart: unless-stopped
    environment:
      DATABASE_URL: postgres://user:pass@db:5432/app
      REDIS_URL: redis://redis:6379
    labels:
      - traefik.enable=true
      - traefik.http.routers.app.rule=Host(`app.example.com`)
      - traefik.http.routers.app.tls.certresolver=cloudflare
    networks:
      - traefik
      - internal

  db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      # Must match the credentials in DATABASE_URL above —
      # the postgres image won't start without a password set
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: app
    volumes:
      - pgdata:/var/lib/postgresql/data
    networks:
      - internal

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    networks:
      - internal

volumes:
  pgdata:

networks:
  traefik:
    external: true
  internal:

The pattern is consistent across all services: the app connects to an external Traefik network for ingress, while databases and caches live on an internal network that’s not exposed. Labels on the app container tell Traefik how to route traffic — no separate config files to maintain.
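One detail worth noting: an `external` network in Compose must already exist before any stack that references it comes up, so the shared ingress network is a one-time setup step:

```shell
# One-time setup: create the shared ingress network
docker network create traefik

# Then bring up each service stack as usual
docker compose up -d
```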

I also use Kubernetes for workloads that need horizontal scaling or more sophisticated scheduling. But honestly, for most self-hosted scenarios, Docker Compose is more than enough. K8s introduces significant operational complexity that’s only worth it when you genuinely need it.

Reverse Proxy & SSL

Traefik is the centerpiece of my ingress layer. It automatically discovers services via Docker labels, handles TLS termination, and manages certificate renewal — all without manual intervention.

The Traefik configuration is minimal:

# traefik.yml
entryPoints:
  web:
    address: ':80'
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ':443'

certificatesResolvers:
  cloudflare:
    acme:
      email: admin@example.com
      storage: /letsencrypt/acme.json
      dnsChallenge:
        provider: cloudflare

providers:
  docker:
    exposedByDefault: false
    network: traefik

api:
  dashboard: true

Every new service I deploy gets automatic HTTPS with zero extra configuration. I just add the Traefik labels to the Docker Compose file, and Traefik picks it up within seconds. The DNS challenge via Cloudflare means I can issue wildcard certificates and don’t need to expose port 80 for HTTP challenges.
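For the DNS challenge to work, Traefik needs a Cloudflare API token with DNS edit permissions. A sketch of what the Traefik service itself might look like in Compose — `CF_DNS_API_TOKEN` is the variable Traefik's Cloudflare provider reads, and the mounts mirror the config above:

```yaml
# docker-compose.yml for Traefik itself (paths and tag are illustrative)
services:
  traefik:
    image: traefik:v3.0
    restart: unless-stopped
    ports:
      - '80:80'
      - '443:443'
    environment:
      # Token supplied via .env or secrets, never committed
      CF_DNS_API_TOKEN: ${CF_DNS_API_TOKEN}
    volumes:
      - ./traefik.yml:/etc/traefik/traefik.yml:ro
      - ./letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik

networks:
  traefik:
    external: true
```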

Cloudflare also sits in front as a CDN and DDoS protection layer. DNS records point to Cloudflare, which proxies traffic to my server. This keeps my actual server IP hidden and adds an extra layer of caching for static assets.

Networking

This is where things get interesting. My network setup has evolved significantly over time.

At the edge, I run pfSense as my primary firewall. It handles VLAN segmentation, NAT, and firewall rules. I separate my network into multiple VLANs — management, servers, IoT devices, and guest traffic are all isolated from each other. A compromised IoT device shouldn’t be able to reach my server VLAN, and guest WiFi shouldn’t see anything on my internal network.

UniFi access points and switches handle the physical layer. The UniFi controller runs as an LXC container on Proxmox, managing all network hardware from a single interface. Say what you will about Ubiquiti’s pricing, but the management experience is hard to beat for a home/small office setup.

For remote access, Tailscale is a game-changer. It creates a WireGuard-based mesh VPN that connects all my devices — laptops, phones, servers — into a single private network, regardless of where they physically are. No port forwarding, no dynamic DNS, no VPN server to maintain.

# Access my home server from anywhere
ssh user@server  # Just works, over Tailscale

# Access internal services without exposing them publicly
curl http://grafana:3000  # Only accessible via Tailscale network

Services that don’t need to be public — like Grafana, Proxmox UI, or internal admin panels — are only accessible through Tailscale. This drastically reduces the attack surface. The only ports exposed to the internet are 80 and 443, both behind Cloudflare.

Monitoring & Observability

Running your own infrastructure without monitoring is like driving at night with the headlights off. Prometheus scrapes metrics from every service, and Grafana turns those metrics into dashboards I can actually understand.

Every Docker host runs node_exporter and cadvisor for system and container metrics. Application services expose custom Prometheus endpoints where relevant. Prometheus collects everything and stores it with configurable retention.
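As a sketch, the two exporters can live in a small Compose stack of their own — the bind mounts give them read-only visibility into the host (image tags are illustrative):

```yaml
# docker-compose.yml — host and container metrics exporters
services:
  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
```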

# prometheus.yml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 'node-exporter:9100'

  - job_name: cadvisor
    static_configs:
      - targets:
          - 'cadvisor:8080'

  - job_name: traefik
    static_configs:
      - targets:
          - 'traefik:8080'

I have Grafana dashboards for CPU/memory/disk usage, container health, network throughput, and Traefik request rates. AlertManager sends notifications to Telegram when something goes wrong — disk usage above 85%, a container restarting in a loop, or Traefik returning too many 5xx errors.
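The disk-usage alert mentioned above, written as a Prometheus rule — thresholds and labels here are my own illustration, built on the standard node_exporter filesystem metrics:

```yaml
# alerts.yml — example rule for the disk-usage alert
groups:
  - name: host
    rules:
      - alert: DiskUsageHigh
        expr: >
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Disk usage above 85% on {{ $labels.instance }}'
```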

The monitoring stack itself runs on a separate LXC container so it stays up even if the Docker host has issues. You don’t want your monitoring to go down at the same time as the thing it’s monitoring.

Backup Strategy

The one thing I’ve learned the hard way: backups that aren’t tested are not backups.

My strategy is layered:

  1. ZFS snapshots — automatic hourly snapshots with 7-day retention. Instant rollback for filesystem-level issues.
  2. Application-level backups — PostgreSQL pg_dump runs nightly via cron, compressed and stored on a separate ZFS dataset.
  3. Off-site replication — critical data is synced to MinIO on a separate machine using rclone, and the most important stuff goes to an off-site location.

# Nightly database backup via cron
0 3 * * * pg_dump -Fc mydb > /backups/mydb-$(date +\%Y\%m\%d).dump
# Sync to MinIO
0 4 * * * rclone sync /backups minio:backups --min-age 1h

I test restores quarterly. It’s tedious, but the one time you need a backup and it doesn’t work, you’ll wish you had tested it.
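A restore test doesn't need to be elaborate. Restoring the latest dump into a scratch database and running a sanity query is enough to catch a corrupt or empty backup — the file name and table here are illustrative:

```shell
# Restore the latest dump into a throwaway database
createdb restore_test
pg_restore -d restore_test /backups/mydb-20240301.dump

# Sanity check: the data is actually there (table name is illustrative)
psql -d restore_test -c 'SELECT count(*) FROM users;'

dropdb restore_test
```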

Provisioning with Ansible

When I started, I configured everything manually — SSH in, install packages, edit config files. That works for one server. It doesn’t work when you need to rebuild or replicate.

Ansible handles all server provisioning now. Every package, every config file, every cron job is defined in playbooks. If my server dies tomorrow, I can spin up a new Proxmox host and have everything running again by executing a single command.

ansible-playbook -i inventory site.yml

The playbooks cover base system setup, Docker installation, Traefik configuration, monitoring stack deployment, firewall rules, and user management. It’s not glamorous work, but it’s the difference between a one-hour recovery and a two-day scramble.
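As a taste of what those playbooks contain, here's a minimal sketch covering three of the areas above — package names, paths, and the cron job are illustrative, using stock Ansible modules:

```yaml
# site.yml — minimal sketch of the structure
- hosts: servers
  become: true
  tasks:
    - name: Install base packages
      apt:
        name: [curl, ufw, fail2ban]
        state: present
        update_cache: true

    - name: Deploy Traefik static config
      copy:
        src: files/traefik.yml
        dest: /opt/traefik/traefik.yml
        mode: '0644'

    - name: Nightly database backup cron job
      cron:
        name: pg_dump nightly
        minute: '0'
        hour: '3'
        job: 'pg_dump -Fc mydb > /backups/mydb-$(date +\%Y\%m\%d).dump'
```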

Lessons Learned

After running this setup for a while, a few things stand out:

Start simple. Don’t build the perfect infrastructure on day one. Start with Docker Compose on a single server. Add complexity only when you hit real limitations, not imagined ones.

Automate early. The second time you manually configure something, write an Ansible playbook for it. Your future self will thank you.

Network segmentation matters. VLANs and firewall rules feel like overkill until you have a security incident. It’s much easier to set up segmentation from the start than to retrofit it later.

Monitor everything, alert selectively. Collect all the metrics you can, but only alert on things that require immediate action. Alert fatigue is real and dangerous.

Document your setup. Not for others — for yourself in six months when you’ve forgotten why that one iptables rule exists. I keep a private wiki with network diagrams, service inventories, and runbooks for common operations.

What’s Next

The infrastructure is never really "done." I’m currently exploring moving more workloads to Kubernetes for better resource utilization and looking into GitOps workflows with Flux or ArgoCD for automated deployments. I’m also considering adding a secondary node for Proxmox clustering to enable live migration and high availability.

But for now, this setup handles everything I throw at it — multiple web apps, databases, queues, monitoring, and storage — all on hardware I own, running software I control. It’s not perfect, but it’s mine.

If you’re thinking about self-hosting, my advice is simple: just start. Pick one service you’re currently paying for in the cloud, spin up a cheap used server or a mini PC, and move it over. You’ll learn more in a weekend than in a month of reading documentation.

Thanks for reading!

> comment on bluesky / mastodon / twitter