
What an AI agent actually did in my homelab rebuild

I rebuilt my homelab over the last two months. Four Proxmox nodes became two VergeOS clusters. 35 devices inventoried, full monitoring stack deployed. Claude Code handled most of the planning, config generation, and many of the execution steps. I handled validation, approvals, and anything that touched physical hardware.

That division of labor, and what made it actually reliable, is the interesting part.

What I started with

A 4-node Proxmox cluster called “midgard,” two Unraid boxes (Frank and LilNasX), a PBS backup server, and UniFi network gear. 16 running VMs and LXCs, 8 stopped. Immich maxing RAM. Dead containers on Unraid. The standard homelab entropy that builds up when you’ve been adding things for years without taking anything away.

In early February I set up Uptime Kuma with 26 monitors as a first pass at visibility. Sonarr, Radarr, Plex, Calibre, the whole *arr stack. Documented all the API keys, infrastructure IPs, Proxmox tokens. The question at that point was simple: what do we even have?

A lot more than I thought, and a lot less under control than I wanted.

Inventory and architecture

In mid-March, Claude SSH’d into all four Proxmox nodes and built a hardware inventory: every drive (serial numbers, form factors, capacities, slot positions), every NIC, MAC address, PCIe card. All documented in SERVERS.md with enough detail to plan the migration down to which U.2 drive moves from which slot to which node.

Once the inventory was complete, Claude generated a migration target I could actually evaluate: two 2-node VergeOS clusters with symmetric storage tiers and direct-connect core networking. A prod cluster on Supermicro EPYC hardware with 25GbE Mellanox fabric and GPUs, and a dev/QA cluster on Minisforum Ryzen 9 boxes with 10GbE Intel NICs.

The storage design needed symmetric drive counts per vSAN tier on both clusters. The prod cluster got a 4-tier layout: enterprise M.2 for metadata, dual Intel P5520 7.68TB U.2s for performance, Samsung PM9A3 for mixed workloads, and HGST 12TB HDDs for capacity. I had to physically move a PM9A3 from PVE3 to PVE4 and shelf about a dozen drives that didn’t fit the new layout.

The runbook came out to 830 lines. Detailed enough that I could shut a node down, pull a specific U.2 drive, and know exactly where it belonged in the new cluster without improvising. Claude generated it from the inventory data. I verified every assignment against the physical hardware in front of me. Those are different jobs, and neither one works without the other.

The quorum problem

This is where having an agent with context about the full system paid off.

Proxmox cluster “midgard” was 4 nodes. Quorum requires a majority. Pulling PVE3 and PVE4 out to wipe them for VergeOS leaves PVE1 and PVE2 with 2 of 4 votes. Not a majority. The cluster locks up. VMs can’t start. Storage goes read-only. Everything stops.

Claude caught this before I powered anything down. The fix: pvecm delnode pve3 and pvecm delnode pve4 from PVE1 first, shrinking the cluster to 2 nodes where 2 of 2 is quorum. Before that, disable HA services so the cluster doesn’t try to migrate VMs to nodes about to disappear.
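
The vote math is small enough to sketch. A minimal model of corosync-style quorum (majority of expected votes), showing why removing the nodes from the cluster first is what makes the difference:

```python
def has_quorum(expected_votes: int, votes_present: int) -> bool:
    """Quorum requires strictly more than half of the expected votes."""
    return votes_present > expected_votes // 2

# Pull two nodes out of a 4-node cluster without delnode: 2 of 4 votes.
assert not has_quorum(4, 2)   # cluster locks up

# Run pvecm delnode first, shrinking expected votes to 2: 2 of 2 votes.
assert has_quorum(2, 2)       # remaining pair keeps quorum
```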

There was also a data migration issue. About 2.5TB of ZFS datasets needed to move off PVE3 before it got wiped. The plan was simple: export /tank over the 10GbE cluster bond and rsync it to PVE1. Except PVE1 was sourcing its NFS traffic from the LAN IP instead of the cluster IP, so traffic was crawling over 1GbE instead of 10GbE. Had to fix the routing before the transfer would run at speed.

These are the kind of problems that eat an entire Saturday if you don’t catch them first.

GPU server and the VRAM wall

Once VergeOS was live on the prod cluster, Claude wrote a .vrg.yaml template with inline cloud-init for a GPU server VM: Docker, nvidia-driver-570-server, the NVIDIA Container Toolkit, sysctl tuning, guest agent. One vrg vm create -f gpu-server.vrg.yaml and the VM was running.
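
For a sense of the shape, here is a sketch of that template. The field names are illustrative, not the actual vrg schema — the grounded details are the inline cloud-init, the integer-MB RAM value, and the package set:

```yaml
# Hypothetical .vrg.yaml sketch -- keys are illustrative, not vrg's real schema
name: gpu-server
cpu: 8
ram: 32768            # integer megabytes, not a string like "32G"
cloud_init:
  user_data: |
    #cloud-config
    packages:
      - docker.io
      - nvidia-driver-570-server
      - qemu-guest-agent
    runcmd:
      - [apt-get, install, -y, nvidia-container-toolkit]
      - [sysctl, -p]   # apply the tuned sysctl values
```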

Then I tried vLLM with Qwen3.5-27B-AWQ on the 3090 Ti. The model weights alone eat 14-15GB on a 24GB card. 262K context was never going to fit. Claude calculated the KV cache budget at FP16, estimated total VRAM at different context lengths, and identified that Docker’s overhead was eating into the margin. Dropped to 32K context, switched from Docker to a bare-metal venv with uv, and it served. The agent was useful for the arithmetic. The tradeoff decision (which context length to accept, whether the runtime switch was worth the operational complexity) was mine.
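
The KV cache arithmetic is worth showing, because it makes the wall obvious before you ever launch vLLM. The model dimensions below are assumed round numbers for a ~27B GQA model, not Qwen's actual config — the point is the shape of the calculation:

```python
def kv_cache_gib(context_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """FP16 KV cache: 2 tensors (K and V) per layer, per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len
    return total / 2**30

# Assumed dims for illustration only (not the real model config)
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

print(kv_cache_gib(262_144, LAYERS, KV_HEADS, HEAD_DIM))  # 48.0 GiB -- hopeless on 24GB
print(kv_cache_gib(32_768, LAYERS, KV_HEADS, HEAD_DIM))   # 6.0 GiB -- fits beside ~15GB of weights
```

With weights already at 14-15GB, a 262K-token cache is off by an order of magnitude; 32K leaves a workable margin, which matches the decision above.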

Teaching the agent the CLI

I use vrg, a Python CLI for VergeOS. The flag names are non-obvious. --cpu-cores is actually --cpu. --direction inbound is actually --direction incoming. RAM takes an integer in MB, not a string. Firewall and DNS changes are staged until you run apply-rules or apply-dns.

Every one of those quirks is a wasted API call when Claude guesses wrong. So I built a 311-line skill definition that teaches any Claude session how to use every command correctly, plus a cookbook with tested recipes and a templates reference for the .vrg.yaml format.
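
A few lines in the spirit of that skill file — paraphrased from the quirks above rather than quoted from the real 311-line definition:

```
## vrg flag gotchas
- CPU count:   --cpu           (NOT --cpu-cores)
- Direction:   --direction incoming   (NOT inbound)
- RAM:         integer MB, e.g. 8192  (never a string like "8G")
- Firewall and DNS changes are STAGED:
  always finish with apply-rules / apply-dns
```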

Then I wrote evals: three representative prompts covering VM creation, firewall/DNS configuration, and diagnostic troubleshooting. Each prompt had 5-6 explicit expectations (correct flags, correct sequence, no destructive operations on read-only tasks). I ran them with and without the skill loaded:

| Category        | With skill     | Without skill |
|-----------------|----------------|---------------|
| VM creation     | 6/6 (100%)     | 2/6 (33%)     |
| Firewall + DNS  | 6/6 (100%)     | 1/6 (17%)     |
| Troubleshooting | 5/5 (100%)     | 2/5 (40%)     |
| Overall         | 17/17 (100%)   | 5/17 (29%)    |

The first eval round surfaced real flag errors, which is the whole point. You write the evals, the agent fails, you read why it failed, you fix the skill, you re-evaluate. Three iterations to 100%.

One example: the firewall eval asked Claude to allow inbound HTTPS and add a DNS A record. Without the skill, it used --direction inbound (wrong), forgot apply-rules, and passed the DNS zone as a flag instead of a positional argument. With the skill, every flag and sequence was correct.
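
The checking logic behind an eval like that fits in a few lines. This is a hypothetical harness — the command strings and expectations are illustrative, not the actual eval suite:

```python
# Each eval case is a list of explicit checks run against the command
# sequence the agent produced. Command strings here are made up.
def score(commands: list[str], expectations: list) -> tuple[int, int]:
    passed = sum(1 for check in expectations if check(commands))
    return passed, len(expectations)

firewall_eval = [
    lambda cmds: any("--direction incoming" in c for c in cmds),      # correct flag
    lambda cmds: not any("--direction inbound" in c for c in cmds),   # wrong flag absent
    lambda cmds: bool(cmds) and cmds[-1].endswith("apply-rules"),     # staged rules applied
]

without_skill = ["vrg firewall add --direction inbound --port 443"]
with_skill = [
    "vrg firewall add --direction incoming --port 443",
    "vrg firewall apply-rules",
]

print(score(without_skill, firewall_eval))  # (0, 3)
print(score(with_skill, firewall_eval))     # (3, 3)
```

Reading *which* lambda failed is what tells you which line of the skill to fix before the next iteration.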

After that, operations were clean. vrg vm list --status running. vrg vm stop mon. vrg vm start mon. 28-second cycle from poweroff to guest agent reconnect.

That 29% to 100% result is the clearest evidence in this whole rebuild for why durable tool knowledge matters more than model capability alone.

Network discovery

April 5th started with a full inventory. Ping sweep found 26 hosts. UniFi reported 51 clients plus 11 infrastructure devices. Claude SSH’d into every server, catalogued all VMs and containers.

Final count: 35 physical devices, 14 VMs, roughly 60 containers.

This was also where I had to correct the agent. Claude flagged Frank at 97% RAM utilization as a problem. It’s Linux. Of course it’s using all the RAM for cache. And it flagged LilNasX as “actually a VergeOS VM running Unraid” like it had uncovered something strange. I run Unraid as a VM on purpose. Both were cases where Claude’s model of normal didn’t match mine, and without the correction it would have generated unnecessary remediation steps.

The discovery phase was still the first place the agent clearly saved me hours. SSH’ing into a dozen hosts, normalizing the output, and cross-referencing against UniFi is tedious work that machines should do. Claude ran it in one session while I spot-checked the results and corrected its assumptions.
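
Once the per-source output is normalized, the cross-referencing itself is plain set arithmetic. A toy sketch with made-up addresses:

```python
# Which UniFi-known clients never answered the ping sweep?
# (Sleeping wireless devices, IoT gear, etc. -- addresses are invented.)
ping_responders = {"10.0.0.10", "10.0.0.11", "10.0.0.20"}
unifi_clients   = {"10.0.0.10", "10.0.0.11", "10.0.0.20", "10.0.0.55"}

silent = sorted(unifi_clients - ping_responders)
print(silent)  # ['10.0.0.55'] -- worth a closer look
```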

The monitoring marathon

Same day. The entire monitoring deployment plan, 15 steps, all in one session. 24 out of 24 Prometheus targets UP by end of day.

Claude proposed the monitoring stack and did most of the deployment work. For each host, it would SSH in, install the exporter or spin up the container, configure it, and update the Prometheus scrape config on the mon VM. node-exporter on 7 hosts, cAdvisor on 5 Docker hosts, IPMI exporter scraping both Supermicro BMCs, UnPoller for UniFi, SNMP exporter for the MikroTik 10G switch. The agent was fastest on this kind of repetitive setup work, where the steps are mechanical but there are a lot of them. The value showed up when the routine broke.
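
The scrape config followed the standard static_configs shape. An illustrative fragment — hostnames are examples, and the ports shown are the exporters' usual defaults (node-exporter 9100, cAdvisor 8080), which is exactly the kind of detail that needs checking per host:

```yaml
# Example fragment of prometheus.yml on the mon VM (targets are illustrative)
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["frank.lan:9100", "lilnasx.lan:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["frank.lan:8080"]
```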

The GPU exporter was the first thing that broke. v1.3.0 choked on newer nvidia-smi output because of brackets in metric names. Claude’s instinct was to script around it, parsing and cleaning the output. I interrupted and told it to just upgrade to a newer version. v1.4.1 had AUTO query mode that handled the format change natively. Installed as a systemd service, problem gone. Sometimes the agent reaches for a workaround when the real fix is simpler.

Syslog was trickier. I told Claude that VergeOS was sending hostname as just “node1” or “node2” with no cluster context, and with two clusters that meant two node1s with no way to tell them apart in Loki. My working directory already had the VergeOS docs in it, so Claude read those, cross-referenced the rsyslog template syntax, and wrote custom templates with cluster prefixes (midgard-node1, devtest-node1). I described the problem, the docs were already in the working directory, and the fix came back in seconds as a usable rsyslog template instead of a vague suggestion. That’s the pattern when it works well: fast execution on problems with clear signals.
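
The idea reduces to a template that prepends a cluster name to the reported hostname before forwarding. A sketch in rsyslog's string-template syntax — the IP match and target are illustrative, not the exact config Claude produced:

```
# Prefix the hostname so "node1" from each cluster stays distinguishable in Loki
template(name="MidgardHost" type="string"
         string="<%PRI%>%TIMESTAMP% midgard-%HOSTNAME% %syslogtag%%msg%\n")

if ($fromhost-ip == "10.0.0.31") then {
    action(type="omfwd" target="loki.lan" port="1514" protocol="tcp"
           template="MidgardHost")
}
```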

Then there was another gotcha: RFC 3164 syslog timestamps don't include the year, and Alloy was interpreting them as epoch 0. Setting use_incoming_timestamp = false forces Alloy to stamp each log with its own receive time instead.
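
In Alloy configuration, that setting lives on the syslog listener. An illustrative fragment (listener address and component labels are examples):

```alloy
loki.source.syslog "vergeos" {
  listener {
    address                = "0.0.0.0:1514"
    protocol               = "tcp"
    syslog_format          = "rfc3164"
    use_incoming_timestamp = false  // ignore the year-less sender timestamp
  }
  forward_to = [loki.write.default.receiver]
}
```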

Each of these followed the same loop: I’d hit a problem, describe what I was seeing, and Claude would dig into the docs and error output to produce a fix.

By end of day, I had 16 Grafana dashboards, Alloy shipping container logs from all Docker hosts to Loki, centralized syslog from the VMs, VergeOS nodes, and UniFi gateway, and NetBox as the authoritative CMDB with all devices and VMs imported.

What I still had to do

The agent doesn’t replace you. It compresses the grind.

I still had to define the goal for each session, grant access, and verify that the runbook matched the hardware in front of me. I approved every destructive step (cluster node removal, ZFS dataset deletion, service restarts). I did the physical work: moving drives, recabling nodes, validating topology after the inventory, spot-checking monitoring targets after deployment. And I made the tradeoff calls, like accepting 32K context instead of 262K, or choosing a vSAN tier layout that balanced performance against the drives I actually had.

The agent also made mistakes beyond flag syntax, and they fell into recognizable patterns:

Environment-model mismatch. It occasionally assumed a service was running in Docker when it was actually a systemd service, or vice versa. It builds a mental model from what it’s seen, and sometimes that model is stale or wrong.

Default-based guessing. Prometheus scrape configs sometimes had wrong port numbers, copied from common defaults instead of checked against the actual host. I caught these during validation.

Access blind spots. On the NFS routing issue, it proposed the right fix but couldn’t test it because it didn’t have access to the network layer on PVE1. It knew what should work. It couldn’t confirm that it did.

None of these were catastrophic. All of them would have been if I’d blindly applied the output.

Across 11 working sessions from February through early April, the pattern was consistent: I’d set the goal, then go do something else. Laundry, Walking Dead, other projects. Claude would grind through the repetitive work, and I’d check in periodically to answer questions, correct assumptions, or approve the next step. The value wasn’t that the agent was faster than me. It was that the work kept moving while my attention was somewhere else.

What made it work

Three things turned the agent from a chatbot into something I could actually trust with infrastructure:

Persistent context. Claude had SSH access and could inspect the environment directly, instead of hallucinating from half-remembered docs and my vague descriptions. It knew which VMs were running, which ports were in use, which drives were in which slots.

Explicit tool knowledge. The vrg CLI skill stopped it from improvising command syntax and turned repeated failure into repeatable success. Before the skill, 71% of CLI operations failed. After it, zero did.

Eval-driven feedback. Writing evals, running them, reading failures, and fixing the skill created a real feedback loop. The agent got better because I measured where it was bad and fixed the inputs. That process took one afternoon and paid for itself on every session after.

The lesson isn’t that the agent was magic. It’s that once it had durable context, tested command knowledge, and access to the real environment, it stopped behaving like autocomplete and started behaving like an operator.

The result wasn’t autonomy. It was leverage: the agent turned inventory, config generation, and first-pass troubleshooting into cheap work, so I could spend my time on validation, tradeoffs, and the parts that actually required judgment.
