Sunday, April 19, 2026

The Homelab that just kept growing (part 2 of 2) :: The Team Grew. So Did the Architecture.

The Team Grew. So Did the Architecture.

Three layers, five agents, and the strange experience of watching the industry ship what I built in my spare time.


Where Part One Left Off

In Part One I talked about Deborah — the hardware, the 16GB GPU constraint, and why the memory problem drove me to build ai_memory, a MySQL + Neo4j brain that lets multiple AI agents share context across sessions. That post ended with a working system: Kimi CLI and Claude Code pulling shared rationale out of a database, local scripts monitoring the homelab and clearing spam, and me spending less time repeating myself to machines.

The multi-agent side of this started earlier than the December post suggests. Back in August 2025 I was already writing about why treating a single model as the answer to everything was the wrong instinct — diversity of perspective, not a bigger box. The GitLab implementation document for the IRC bot team is from that same month. December was when I published the full write-up: four specialized agents on a private IRC server, writing code, reviewing it, blocking execution until two independent perspectives agreed. The work predates that post by several months.

By early 2026, the commercial landscape had started converging on the same multi-model pattern. Blackbox AI CLI Agents let you run multiple AI providers — Google, Anthropic, OpenAI — in parallel on the same task, with an internal judge picking the best implementation. Yupp.ai launched with 1.3 million users around the same idea: "the world's smartest AIs, side by side with you." Both are closer to the multi-model thinking than anything I was aware of in August 2025.

I should note: as of March 2026, Yupp.ai announced it's shutting down. Their explanation is worth reading — they wrote that the industry has "shifted dramatically toward agentic systems: models connected to tools, memory, and external services, rather than standalone chatbots," and that passive side-by-side comparison didn't fit where AI was going. They're right. That's the exact pivot I made two years ago.

But closer is not the same. Blackbox runs models in competition — parallel submissions, pick a winner. Apocryia Forge runs models in adversarial collaboration — one builds, one challenges, and the Critic must explicitly approve before anything happens. Blackbox gets you the best first draft. Forge gets you code that has been beaten into shape across up to 25 refinement cycles. Their judge selects. Our Critic vetoes. Those are different philosophies, and only one of them assumes the AI is wrong until proven otherwise. The other practical difference: Blackbox requires their cloud. Every task routes through remote VMs. Forge runs entirely on Deborah — my code never leaves the network.

This post is what happened after. The team has a cleaner design now — five agents with sharper role definitions, a security model I can explain in a single sentence, and a framing I didn't have before: three layers, each doing exactly one job. I'll also get into something I didn't expect: by February 2026, the industry had independently shipped some of the same architecture concepts I'd already been building since the summer. The choice of IRC may have been off-putting to some, and in fairness the big picture wasn't fully in focus, even for me. IRC is just an interface to interact with. You can use Slack, email, Matrix, a web API — the interface is irrelevant. The team is the point.


Three Layers, One Clear Job Each

The architectural clarity came from asking a question I should have asked earlier: what is IRC actually for in this system? In the original proof of concept, IRC was doing too many jobs. It was the communication medium, the coordination layer, the state machine, and the audit display simultaneously. That's not good design. It made everything depend on everything else.

The v2.0 framing separates three concerns that were previously tangled together.

Layer 1 — The Mind: ai_memory (MySQL + Neo4j)

This is where actual work happens. Agents don't coordinate by chatting in IRC — they coordinate by writing to and reading from the database. The ai_comms table is the real message bus. Each agent has dedicated tables for its outputs: developer_code, reviewer_scores, lucifer_challenges, shell_executions. When a task runs, the database is working. Every decision, every quality score, every approval, every lesson learned from a past failure — it's all in there, queryable, and available for recall at inference time. I'm a database guy. This is a database problem. The memory doesn't belong in a model's context window; it belongs in a schema.
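To make the message-bus idea concrete, here is a minimal sketch of how agents would write to and read from ai_comms. The real system uses MySQL; sqlite3 stands in here so the example runs anywhere, and the column names are illustrative rather than the actual schema.

```python
import sqlite3

# Minimal sketch of the ai_comms message bus. Agents "speak" by inserting
# rows and "listen" by polling; no chat protocol is involved.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ai_comms (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        from_agent TEXT NOT NULL,
        to_agent   TEXT NOT NULL,
        msg_type   TEXT NOT NULL,   -- e.g. REQUEST, ALERT, REVIEW
        content    TEXT NOT NULL,
        processed  INTEGER DEFAULT 0
    )
""")

def send(from_agent, to_agent, msg_type, content):
    """Write a message: this is the whole coordination mechanism."""
    conn.execute(
        "INSERT INTO ai_comms (from_agent, to_agent, msg_type, content) "
        "VALUES (?, ?, ?, ?)",
        (from_agent, to_agent, msg_type, content),
    )
    conn.commit()

def inbox(agent):
    """Read and consume this agent's pending messages."""
    rows = conn.execute(
        "SELECT id, from_agent, msg_type, content FROM ai_comms "
        "WHERE to_agent = ? AND processed = 0 ORDER BY id",
        (agent,),
    ).fetchall()
    conn.execute(
        "UPDATE ai_comms SET processed = 1 WHERE to_agent = ?", (agent,)
    )
    conn.commit()
    return rows

send("monitoring_bridge", "coordinator", "ALERT", "Database disk 95% full")
for _id, sender, mtype, content in inbox("coordinator"):
    print(f"{mtype} from {sender}: {content}")
```

Every row stays queryable after the fact, which is what makes the audit trail free: the bus and the log are the same table.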

Layer 2 — Cognition: Ollama Models on the Local GPU

The thinking happens here. Each agent's turn is a model inference call. The Coordinator decides how to route a task. The Builder generates a solution. The Judge evaluates quality. The Critic challenges assumptions. The models don't hold state between turns — they think, output, and unload from GPU memory. Persistence lives in Layer 1, not here. This separation is also why model selection per role matters: you want the cognitive style right for each job, not one model pretending to be five different things.
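A sketch of what a stateless agent turn might look like against Ollama's local API. The keep_alive: 0 field is real Ollama behavior (unload the model as soon as the response is returned); the endpoint is Ollama's default, and the function names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_turn_payload(model: str, prompt: str) -> dict:
    # keep_alive=0 asks Ollama to unload the model right after responding,
    # freeing VRAM for the next agent's model. State lives in the database,
    # so nothing is lost by unloading.
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}

def agent_turn(model: str, prompt: str) -> str:
    """One agent turn: load, infer, release the GPU."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_turn_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Builder, then Judge, each taking the 16GB card in turn (requires a
# running Ollama server, so the calls are left commented out):
# code = agent_turn("deepseek-coder-v2", "Write a hello world script")
# score = agent_turn("qwen2.5-coder:7b", f"Score this code 1-10: {code}")
```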

Layer 3 — Display: Any Interface

The display layer shows you what's happening. IRC was the first implementation — think of it like ticket status updates in a project tracker. You connect with HexChat, and you see phases scroll by: RECEIVED, Analysis, Building, Score: 9/10, APPROVED, COMPLETED. But this layer is just transport. The same status updates flow through Slack, email, a web API, or MCP with identical behavior underneath, because the work happens in Layers 1 and 2. The display layer doesn't coordinate anything — it observes and reports.

ai_memory (MySQL + Neo4j)   ← THE MIND       (work and coordination)
Ollama Models on GPU        ← COGNITION      (thinking and deciding)
IRC / Slack / MCP / API     ← DISPLAY        (observability for humans)

The Five-Agent Collective

The original proof of concept had four agents: ProjectManager, Developer, Reviewer, and Lucifer. The v2.0 architecture adds a fifth: Shell — renamed Hands in the internal design — the only agent that actually executes anything against the system. That isolation matters more than the count.

Each agent, with its personality name, model, and role:

  • Coordinator (ProjectManager, Qwen2.5-Coder 7B): routes tasks, recalls relevant memory, coordinates the team
  • Builder (Developer, DeepSeek-Coder-V2): generates code and solutions, iterates on feedback
  • Hands (Shell Executor, DeepSeek/Qwen): the only agent that executes commands; whitelisted, logged, audited
  • Judge (Reviewer, custom fine-tune): scores quality 1–10, requires ≥8 to pass
  • Critic (Lucifer, Mistral): dedicated adversary; must explicitly output "APPROVED" for execution to proceed

You can talk to any agent directly, or route through the full team. Here's what that looks like in practice. A direct request to the Developer only — no review cycle; it just gets done:

keith: @dev write a hello world python script
AI-Team: 👨‍💻 **Builder**: Working on request from keith...
AI-Team: 👨‍💻 **Builder**: Done.

Like going over to a developer teammate and asking directly. No full review, but fast. Now route the same request through the whole team:

keith: @pm write a Python script to say hello world
AI-Team: 📋 **Task Queued** | ID: TASK-20B3B1CF | Type: code
AI-Team: 1️⃣ Builder generates → 2️⃣ Judge reviews (≥8 required) → 3️⃣ Critic challenges/approves → 4️⃣ Code saved
AI-Team: 👨‍💻 Builder: Working on TASK-20B3B1CF...
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - straightforward, functions correctly
AI-Team: 😈 Critic: TASK-20B3B1CF APPROVED - ready for execution
AI-Team: 📦 Code saved to: generated_code/TASK-20B3B1CF.py
AI-Team: ✅ TASK-20B3B1CF COMPLETED

The real test is a task that pushes back. Here's a MySQL backup script that required eight iterations before passing both gates — the full iteration count is in the database; the IRC display just shows the status transitions:

keith: @pm write a bash script to backup MySQL databases
AI-Team: 📋 Task Queued | ID: TASK-5749EA2E
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 7/10 - NEEDS_WORK - basic solution, missing error handling...
AI-Team: 🔄 Coordinator: iteration 1/25 - refining
AI-Team: 👨‍💻 Builder: Completed (iteration 2) - ready for review
AI-Team: 👀 Judge: Score 6/10 - NEEDS_WORK - performs basic task but has several issues...
AI-Team: 🔄 Coordinator: iteration 2/25 - refining
[iterations 3-7: scores cycling 7/10, Judge flagging missing secure credential handling,
 incomplete error recovery, cleanup logic gaps]
AI-Team: 👨‍💻 Builder: Completed (iteration 8) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - secure credential handling, proper error recovery...
AI-Team: 😈 Critic: TASK-5749EA2E APPROVED - ready for execution
AI-Team: 📦 Code saved | ✅ COMPLETED | iteration_count: 7

Seven refinement cycles on a backup script. The Judge kept finding gaps — hardcoded credentials, missing error handling, incomplete cleanup — and the Builder addressed each one. The final version loaded credentials from environment variables, handled connection failures, and cleaned up properly. The Judge's standard for ≥8/10 isn't decoration. When you want to hand this off overnight, the gate has to hold.

The team also handles edit tasks — not just new code, but surgical modifications to existing files. Request a change, Builder generates the diff, same Judge/Critic gate applies, Hands saves the modified file. The audit trail includes what was changed and why.


The Super Agent: When Local Isn't Enough

The Super agent extends the team beyond local models. When a task warrants the best available reasoning — or when you want multiple commercial APIs to weigh in simultaneously — you can route directly to Grok, Claude, or any configured provider without leaving the same interface.

The Super agent supports three multi-model modes:

  • @super ask @claude @grok <question>: show each model's answer separately for direct comparison
  • @super CONSENSUS <question>: synthesize one agreed answer from all responding models
  • @super compare <question>: highlight where models agree and where they diverge

The CONSENSUS mode is the most useful for code review: instead of asking each model separately and reconciling the answers yourself, you get a single synthesized response that surfaces disagreements inline. Same interface, local or cloud — the team doesn't care which inference backend answers a given question.
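A toy illustration of the CONSENSUS idea, assuming you have already collected one answer per model: a single agreed result when the models match, and the disagreement surfaced inline (who said what) when they don't. The real synthesis is done by a model, not by string comparison.

```python
def consensus(answers: dict) -> dict:
    """Toy synthesis over already-collected per-model answers. The real
    CONSENSUS mode has a model write the synthesized response; this only
    shows the agree/diverge split that response is built from."""
    unique = set(answers.values())
    if len(unique) == 1:
        return {"consensus": True, "answer": unique.pop()}
    return {"consensus": False, "divergence": answers}

print(consensus({"claude": "use parameterized queries",
                 "grok": "use parameterized queries"}))
# {'consensus': True, 'answer': 'use parameterized queries'}
```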


MCP Server: The Interface Most People Will Actually Use

IRC is the audit layer. It shows you what is happening, lets you interject, and preserves a human-readable log of every decision. But it is not how most developers will interact with Apocryia Forge on a daily basis. The primary interface is MCP — each agent exposed as a callable tool from any MCP-compatible CLI.

The underlying architecture is a hybrid: the core logic (TeamConsensusEngine) is the single source of truth, and each interface — IRC, MCP, HTTP API — is a thin adapter that translates its own protocol into the same core calls. This means Claude Code, Kimi CLI, or any other MCP-compatible tool gets identical behavior to what you see in IRC, because they are all calling the same engine.

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Kimi CLI   │  │ Claude Code │  │  IRC / Slack│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       │   MCP adapter  │  IRC adapter   │
       └────────────────┼────────────────┘
                        │
                        ▼
              ┌─────────────────┐
              │  Core Logic     │  ← single source of truth
              │ TeamConsensus   │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │   ai_memory DB  │
              └─────────────────┘
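The hybrid above can be sketched in a few lines: one engine, thin adapters that only translate protocol. Class and method names here are illustrative, and the verdict is a placeholder rather than the real review cycle.

```python
class TeamConsensusEngine:
    """Single source of truth: every interface calls this and nothing else.
    The verdict below is a placeholder; the real method runs the full
    Builder/Judge/Critic cycle against ai_memory."""
    def review(self, code: str) -> dict:
        return {"score": 8, "approved": True}

class IRCAdapter:
    """Translates channel text into engine calls, verdicts back into text."""
    def __init__(self, engine):
        self.engine = engine
    def on_message(self, text: str) -> str:
        v = self.engine.review(text)
        status = "APPROVED" if v["approved"] else "NEEDS_WORK"
        return f"Judge: Score {v['score']}/10 - {status}"

class MCPAdapter:
    """Exposes the same engine as a callable tool: dict in, dict out."""
    def __init__(self, engine):
        self.engine = engine
    def team_review(self, params: dict) -> dict:
        return self.engine.review(params["code"])

# One engine, two protocols, identical decisions.
engine = TeamConsensusEngine()
print(IRCAdapter(engine).on_message("def f(): pass"))
# Judge: Score 8/10 - APPROVED
```

The design choice this illustrates: adding Slack or HTTP means writing another ten-line adapter, never touching the engine.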

In practice, you are in a Claude Code or Kimi CLI session working on a script. You call an Apocryia Forge MCP tool without leaving your workflow:

  • apocryia_forge.lucifer_challenge — adversarial review of the current file: security gaps, logic failures, assumptions that will break in production
  • apocryia_forge.reviewer_evaluate — quality score 1–10 with specific, actionable feedback
  • apocryia_forge.team_review — full dual-approval workflow: Builder refines, Judge scores, Critic must approve before anything is saved
  • apocryia_forge.shell_execute — whitelisted execution with full audit logging

Three variants ship with Apocryia Forge:

  • /apocryia-forge (auto-detect backend, whatever models are configured): cost and speed vary; general daily use
  • /apocryia-forge-super (external APIs: Claude, GPT-4, Grok, Kimi): ~$0.15/review, 30–60 seconds; best for production code and public repos
  • /apocryia-forge-ollama (local Ollama: qwen2.5-coder, llama3, deepseek): free, 2–5 minutes; best for private code, air-gapped setups, and cost-conscious work

You invoke them in natural language from whichever CLI you are already working in:

# Security audit using external APIs (fast, highest accuracy)
> Review /src/api.py using forge super

# Sensitive internal code — keep 100% local
> Audit /internal/auth.py using forge ollama

# Let the system pick the backend
> Review /docs/tutorial.md using apocryia forge for concepts

Both CLIs call the same MCP server. The choice of which backend runs the review changes. The approval gate, the iteration cap, and the audit trail do not. The interface differs. The engine does not.


The Dual-Approval Gate

Nothing executes until two independent conditions are both satisfied: Judge scores ≥8/10, and Critic outputs an explicit "APPROVED." Both. Not one or the other. This is the gate that makes the system trustworthy enough to hand off at night.

The quality gate and the adversarial gate are different jobs. The Judge evaluates correctness, security, efficiency, and maintainability — does this actually solve the problem, and does it do it safely? The Critic asks what could go wrong: what if this fails halfway through, what are the security implications, what assumptions are we making that we shouldn't be. A solution can score 9/10 on quality and still get blocked by the Critic for a gap the Judge didn't flag. That happens. The backup script example showed seven refinement cycles on a Judge gate — that's the Judge working. The Critic can block independently even after the Judge approves.

One engineering detail worth documenting: early versions had a false-positive problem in the approval gate. The Critic would output something like "this looks good in principle, no major issues found" — and the orchestration logic would parse that as approval. The fix was requiring the literal string "APPROVED" somewhere in the Critic's response. "Looks good in principle" is analysis. "APPROVED" is authorization. Those are not the same statement, and the code now enforces that distinction.
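The distinction is small enough to show in full. This is a sketch of the gate check, not the actual orchestration code; the word-boundary and negation guards are my additions on top of the literal-string requirement described above.

```python
import re

def critic_approves(response: str) -> bool:
    """Authorization requires the literal token APPROVED in the Critic's
    output. Favourable-sounding analysis must not count. The negation and
    word-boundary guards here are illustrative extras, not the documented
    rule, which is only the literal-string requirement."""
    if re.search(r"\bNOT\s+APPROVED\b", response, re.IGNORECASE):
        return False
    return re.search(r"\bAPPROVED\b", response) is not None

# Analysis is not authorization:
assert not critic_approves("this looks good in principle, no major issues found")
# Authorization is explicit:
assert critic_approves("TASK-5749EA2E APPROVED - ready for execution")
```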


The 2:47 AM Scenario

Alertmanager fires. Database disk is at 95% on one of the nodes. This is the kind of alert that still wakes me up. The goal — what Forge is being built toward — is that it handles this without me. Here is how that is designed to work.

In the database:

  • Alert stored in ai_comms: from the monitoring bridge, to coordinator, type "ALERT", content "Database disk 95% full"
  • Coordinator queries architecture_decisions for past disk incidents — finds a prior resolution pattern
  • Builder generates solution: proposed deletion of log files older than 30 days
  • Judge scores 9/10: "Safe, targeted"
  • Critic raises a compliance concern: deletion vs. compression for regulated data. Stores "approved: false"
  • Builder revises: compress instead of delete. Judge re-scores: 10/10. Critic approves
  • Hands validates command against whitelist, executes, logs to shell_executions: result SUCCESS
  • Coordinator stores the complete decision in architecture_decisions for future recall
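The Hands step in the sequence above can be sketched as a whitelist check plus an audit write. The whitelist contents and the in-memory log are illustrative (echo is included only so the example can demonstrate a successful run); the real audit trail is the shell_executions table.

```python
import shlex
import subprocess
from datetime import datetime

# Illustrative whitelist; the real one lives in the Hands agent's config.
COMMAND_WHITELIST = {"echo", "gzip", "find", "df", "mysqldump"}

audit_log = []  # stands in for the shell_executions table

def hands_execute(command: str, task_id: str) -> dict:
    """Validate against the whitelist, execute, and log every attempt,
    including the ones that never run."""
    argv = shlex.split(command)
    entry = {"task": task_id, "command": command,
             "ts": datetime.now().isoformat()}
    if not argv or argv[0] not in COMMAND_WHITELIST:
        entry["result"] = "BLOCKED"
        audit_log.append(entry)
        return entry
    proc = subprocess.run(argv, capture_output=True, text=True)
    entry["result"] = "SUCCESS" if proc.returncode == 0 else "FAILED"
    audit_log.append(entry)
    return entry

print(hands_execute("rm -rf /tmp/anything", "TASK-1")["result"])  # BLOCKED
```

Note that a blocked command still produces an audit row: the trail records what the team tried to do, not just what it did.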

That's the design. Forge has been put through this scenario in testing — the pipeline works, the gates hold, the Critic catches compliance gaps the Judge misses. What hasn't happened yet is me actually sleeping through a real 2:47 AM disk alert because the team handled it. That's what it's being proven toward. The architecture is there. The trust has to be built by the team demonstrating it on real incidents, not just controlled tests.

The compliance catch is the point. An agent that just fires commands on its own judgment is a liability. Two gates that have to independently agree before anything executes — that's the difference between automation you can hand off and automation you have to babysit.


The Monitoring Bridge

I tried Molt — the tool that went viral as Clawbot, got renamed to Molt, then became OpenClaw. I actually used the Molt version to test some of this. It was lackluster. It didn't think the way I wanted. It felt like automation dressed up as intelligence — it would take an alert, go off and search the web, come back with generic advice that had nothing to do with my actual system. That's not thinking. That's keyword matching with a better interface.

That experience sharpened what I was actually building. The monitoring bridge in my architecture isn't the intelligence — it's just the input channel. It receives Alertmanager webhooks, polls email, handles voice input, and formats those into REQUEST messages for the Coordinator. The bridge has no judgment. The team has the judgment.

The security model matters here. The bridge lives on an isolated VM. Even if that VM were completely compromised, the attacker gains no execution capability — it can only write a message to the Coordinator. That message goes through the same four-agent review as any other request. The blast radius of a compromised bridge is exactly one request that the team still has to approve.
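Concretely, the bridge's only capability might look like this: take an Alertmanager webhook payload and emit messages for the Coordinator. The payload shape follows Alertmanager's webhook format; the output field names are illustrative.

```python
def bridge_format(webhook_payload: dict) -> list:
    """Turn an Alertmanager webhook into messages for the Coordinator.
    This is the bridge's entire power: it writes messages, nothing more."""
    messages = []
    for alert in webhook_payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        messages.append({
            "from": "monitoring_bridge",
            "to": "coordinator",
            "type": "ALERT",
            "content": f"{name}: {summary}".strip(": "),
        })
    return messages

payload = {"alerts": [{
    "status": "firing",
    "labels": {"alertname": "DiskFull"},
    "annotations": {"summary": "Database disk 95% full"},
}]}
print(bridge_format(payload)[0]["content"])  # DiskFull: Database disk 95% full
```

There is no subprocess call, no SSH client, no privileged import anywhere in this path, which is the whole point of the blast-radius argument.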

The design intent for overnight mode: say "Good night, handle alerts," and the bridge enters monitoring mode. In the morning: "Good morning, summary please." The basis for trusting that handoff isn't faith — it's the fact that every decision is in the database and every approval is queryable. That trust is being built through testing. It isn't there yet.


Building the Forge with the Forge

The development team for Apocryia Forge is itself a multi-agent team running the same collaboration pattern the system is designed to implement. Kimi is the primary code writer — implementing classes, running tests, iterating on feedback. Claude handles security review, validation, and adversarial challenge. Coordination flows through the ai_comms table: Kimi sends a structured review request, Claude reads it, runs tests, and responds with a verdict. Critical findings get flagged at CRITICAL priority. Disagreements go to the PM for a trade-off call.

The build team roles map directly to their Forge equivalents:

  • Kimi (Developer): individual model review pass
  • Claude (Reviewer): validation and sanity check
  • Grok (Critic): edge case and assumption detection
  • PM, human (Product Owner): final consensus threshold decision

Every review cycle the build team runs is stress-testing the same workflow the system will run for end users. Protocol bugs get found during construction, by the construction process itself. This is the most useful kind of dogfooding: the consensus system validates its own design through the act of being built.


The Industry Is Figuring Out Teams

Something shifted in early 2026. Not one company — several, independently, at roughly the same time. Moonshot AI shipped Kimi K2.5 Agent Swarm: up to 100 parallel sub-agents coordinating on a single task, 4.5x wall-clock speedup on parallelizable work. Anthropic released Claude Opus 4.6 Agent Teams. The LLM council pattern — multiple models deliberating, challenging each other, synthesizing a better answer than any single model produces alone — started showing up everywhere. Even Yupp.ai, which launched with 1.3 million users as a side-by-side model comparison tool, shut down in March 2026 — they didn't reach product-market fit, and in their own post-mortem they noted that the landscape had shifted toward agentic systems, making crowdsourced chatbot-layer evaluation increasingly less critical.

The convergence isn't a coincidence. It's the industry discovering what happens when you stop treating a single model as the answer to everything. One model has a ceiling. A team of models with defined roles, challenge mechanisms, and a coordination layer breaks through that ceiling. That's what all of these independently arrived at.

My work has been quietly riding the same wave. Not ahead of it, not behind it — alongside it. I wasn't predicting where the industry was going. The problem forced the solution. A single model reviewing its own code misses things. A single model running overnight without a check on its judgment is a liability. The team structure wasn't a thesis. It was the answer to specific failures I kept running into. Turns out a lot of people were running into the same ones.

The timeline:

  • ApocryiaAI multi-agent concept, first posts and GitLab commits: August 2025
  • Multi-agent IRC team POC published (anothermysqldba.blogspot.com): December 2, 2025
  • Kimi K2.5 Agent Swarm: January–February 2026
  • Claude Opus 4.6 Agent Teams: February 5, 2026
  • Yupp.ai winddown, citing the shift to agentic systems: March 2026

Where my architecture differs from the commercial implementations:

  • An adversarial agent designed in from the start. Lucifer isn't a reviewer — it's a dedicated devil's advocate with explicit veto power. One model cannot be its own best critic. None of the commercial implementations ship a dedicated antagonist role by default.
  • Explicit GPU memory management. Commercial solutions assume cloud VRAM. The design includes explicit GPU handoff between agents — Coordinator runs on CPU always, heavy agents take the GPU in turn and release it. On a 16GB consumer card that's not a workaround, it's the architecture.
  • Database as the coordination medium. Not queues, not direct agent calls, not shared files. The database is both source of truth and message bus. Every coordination decision is queryable, and agents recall relevant past decisions at inference time.
  • Local-first, zero cloud dependency. Code and data stay on Deborah.

What the commercial versions do that mine doesn't: scale. Kimi runs 100 concurrent sub-agents. My system is sequential by design — a constraint of the hardware, not a philosophy. The industry is solving parallelism with infrastructure I don't have. What I'm solving is something different: a system that remembers its decisions, challenges its own outputs, and can eventually act on its own judgment without me watching. That's a different problem than speed.


What I Actually Wanted

While building Forge I kept running into the same question: what handles the automation layer? The monitoring, the alerts, the overnight tasks, the things that shouldn't need me. I looked at what was out there. There were a lot of things calling themselves JARVIS. There were OpenClaw frontends, task-oriented automation tools, systems that would take a request and fire a command. Everyone was building something they called a personal AI assistant.

I tried some of them. They were lackluster. Not because they were badly built — some were well built — but because the concept was wrong. They were frontends. Wrappers around task execution. You gave them a job, they did the job, they reported back. That's not what I wanted. I wanted something that would notice the job needed doing. Something that would remember what happened last time. Something that would push back if the job seemed wrong.

Nobody was building that. They were building task runners with a personality wrapper and calling it intelligence.

So I started thinking about what robust automation actually meant — not as a product, not as a demo, but as something I would actually trust on my own infrastructure. The Forge testing loop gave me the right frame: I had a system that challenged its own outputs, required dual approval before acting, and kept a full audit trail of every decision. What if the automation layer worked the same way? What if instead of scripts firing commands, there was something that actually had to think before it acted?

That question is where Part Three starts. Not a script. Not a frontend. A mind. And in the end, it won't be called JARVIS either — because that name is already taken by every task runner with a personality wrapper. The real thing deserves its own name.


What's Next

The MCP interface described above is the immediate priority — the refactoring in progress now. Once the thin adapter layer is clean and each agent is callable as a discrete MCP tool, the rest follows more naturally.

The self-improvement loop is already in the design intent. Every Sunday at 2 AM, the team runs an autonomous retrospective: query architecture_decisions for the week, review outcomes, identify recurring patterns, store new lessons in the pitfalls table. The Judge tracks average quality scores over time. The Critic surfaces edge cases that appeared repeatedly. Good patterns get added to the deployment checklist. I'm genuinely skeptical about this part until I see it work in practice — "self-improving AI" is a phrase that gets overused — but the mechanism is there, and the data will either show improvement or it won't.

Async Python is the other obvious next step. Sequential agent turns were the right starting point. Async calls to multiple MCP servers get closer to the parallelism that makes Kimi's approach faster, without needing their training infrastructure. That's a Python concurrency problem, not an AI problem.
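The shape of that change is standard asyncio: instead of awaiting each agent in sequence, gather the calls so wall-clock time is the slowest turn rather than the sum. The sleep stands in for inference latency; agent names are illustrative.

```python
import asyncio
import time

async def agent_call(name: str, latency: float) -> str:
    """Stand-in for an async MCP call; the sleep models inference time."""
    await asyncio.sleep(latency)
    return f"{name}: done"

async def parallel_review() -> list:
    # Three perspectives gathered concurrently: wall-clock cost is the
    # slowest call, not the sum of all three.
    return await asyncio.gather(
        agent_call("claude", 0.1),
        agent_call("grok", 0.1),
        agent_call("kimi", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(parallel_review())
elapsed = time.perf_counter() - start
print(results)        # ['claude: done', 'grok: done', 'kimi: done']
print(elapsed < 0.3)  # True: roughly 0.1s total, not 0.3s
```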


The Part That Actually Matters

I built this on hardware you can order off Newegg or Amazon. Not $50,000 A100 clusters, not cloud credits, not a research team with a budget. An AMD Ryzen 7, 64GB RAM, and an RTX 4070 I was already using for other things. The architecture says "We are not tools. We are a team. We remember, we learn, we act." I wrote that as design intent, not a marketing tagline. The distinction between a tool and a team is that a tool does what you tell it. A team tells you things back — flags a compliance gap at 2:47 AM, recalls what went wrong last time a similar alert fired, surfaces the assumption you didn't know you were making.

After 25 years of open source, my instinct is still: don't rent what you can own, don't trust what you can't inspect. Local models mean my code doesn't leave the network. The database means every decision is queryable. The whitelist means execution is bounded. None of this requires faith in a vendor's system working the way it says it does.

The commercial models are still better. The GPU is still limiting. The parallelism gap is real. But I understand every layer of what I've built, which is worth something on its own. The code is working. The team is functional. And something that started as a collection of scripts now has a name and a purpose. What it needs next is a mind — which is what Part Three is about.


Why the Repo Is Private

The original goal was to open source Forge and let the community help build it. That intent was real — I wanted better logic, more models, community stress-testing of the adversarial pipeline. Open source has given me 25 years of tools I couldn't have built alone. Giving back is the instinct.

But as the cognitive architecture developed, that changed. The memory design, the way the mind and the team connect — it started to become something I couldn't give away. Not because I want to hoard it, but because some things, once given away, can't be taken back, and I've stopped pretending otherwise. The repo stays private.

Part Three is written. It covers the cognitive architecture — what I built to answer the question at the end of this post. The publishing date is TBD while I work through a few things. That's where I landed. For now, I will share via YouTube videos.




Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He writes about MySQL, Proxmox, AI memory systems, and building technology you can actually inspect.

The Homelab that just kept growing (part 1 of 2)

After 25 years in open source, I wanted to understand every layer. Then I wanted my AI to remember what I taught it.


The Pi

A few years back, I ordered a Raspberry Pi to play with and explore. I soon got tired of hearing other engineers talk about their homelabs and knew that Pi just would not cut it.

It wasn't enough. I kept wanting to spin up different distros to check something, test a configuration, verify behavior. I reinstalled onto that Pi so many times I lost count. So I bought an AOOSTAR R1 N100—a $280 mini PC that punches way above its weight. Intel N100 (4C/4T, 6W TDP), 32GB DDR4, dual 2.5GbE NICs, dual M.2 slots, and two 3.5" drive bays. I think it can handle up to 44TB of storage. For me, I've filled it with scavenged drives over time—13 drives ranging from 300GB laptop drives to 4TB externals, totaling about 16TB raw. It's a Franken-storage setup, but it works. Perfect for Proxmox VE.

I started building: DNS servers, database clusters, a full ISP stack. Debian, Ubuntu, CentOS, Rocky Linux, Oracle Linux—67 VMs across the full enterprise Linux matrix. The N100 handled it until I really loaded it down, but the key insight was: not everything needs to run 24/7. I could spin up an Oracle Linux 8 VM, test a client configuration, shut it down. The ability to test anything was what mattered.

But there was no chance it could do AI. No GPU, and that N100's integrated graphics won't even handle Stable Diffusion at acceptable speeds.


Deborah

I named her after my late mother. Deborah is a custom-built workstation: AMD Ryzen 7 5800XT (8C/16T, 4.8GHz boost), 64GB DDR4-3200, RTX 4070 Ti Super 16GB (the 16GB VRAM is the critical spec—you can't quantize your way out of everything). 2TB NVMe for OS, 8TB NVMe for model weights, 750W PSU. Total build: just under $2500.

The AOOSTAR still runs the infrastructure—DNS, Percona MySQL (replica and smaller instances), Prometheus, Alertmanager (tied to PagerDuty and my private mail.apocryia.com mail server), Vault, plus MediaWiki for documentation, a private GitLab instance, and GitLab runners for CI/CD. Deborah handles the AI compute. The split is intentional: N100 runs 24/7 at ~30W, Deborah is for all my GPU power needs.

I'd been using Grok and Claude. They're brilliant. I still use them. But I was paying for black boxes, and after 25 years of open source, that grated. I wanted to see inside. I wanted my own models. I wanted to know how inference actually worked, not just trust the API response.

The 16GB GPU is limiting. It sounds like enough until you try to run a 70B model, or video generation, or keep local models loaded while doing CLI work. I clear GPU memory constantly. More GPU would mean bigger models, longer videos, and the ability to do generative work while keeping my development assistants running. But you work with what you have.


The Memory Problem

The real frustration came later. I was using Claude Code, then Kimi CLI—both excellent tools. But I'd sit for an hour and have to repeat myself. The agents would forget what we'd just decided. They'd debate me over things they did ten minutes ago. Context windows fill up, patterns get lost, and you start over in a new session like the previous three hours never happened.

I learned to plan better—break work into chunks, be more deliberate about what I asked for. That helped. But planning can only get you so far. Complex systems have interconnected decisions that span sessions. The memory became the bridge—capturing the rationale so even with good planning, I wasn't starting from zero every time.

I'm a database guy. This is a database problem.


Building the Brain

The solution was obvious: give the agents a brain they could actually use. MySQL for structured memory—what we decided, when, why. Neo4j for pattern recognition—what connects to what, which decisions led where. MCP (Model Context Protocol) as the protocol, so Claude Code and Kimi CLI could share context, query the same memory, and build on each other's work. I covered ai_memory in more detail in Part One.
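
To make the structured half concrete, here's a rough sketch of the decision-memory idea. This is illustrative only: sqlite3 stands in for MySQL so the snippet is self-contained, and the table, column, and function names are my own placeholders, not the actual ai_memory schema.

```python
import sqlite3

# Stand-in for the MySQL side of the memory: one row per decision,
# with the rationale stored right next to it. The real system also
# mirrors relationships into Neo4j for pattern queries.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE decisions (
        id        INTEGER PRIMARY KEY,
        project   TEXT NOT NULL,
        topic     TEXT NOT NULL,
        decision  TEXT NOT NULL,
        rationale TEXT NOT NULL,
        agent     TEXT NOT NULL,          -- which agent recorded it
        made_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def remember(project, topic, decision, rationale, agent):
    """Record a decision and the reasoning behind it."""
    conn.execute(
        "INSERT INTO decisions (project, topic, decision, rationale, agent) "
        "VALUES (?, ?, ?, ?, ?)",
        (project, topic, decision, rationale, agent),
    )
    conn.commit()

def recall(project, topic):
    """Latest decision on a topic: what, why, and who recorded it."""
    return conn.execute(
        "SELECT decision, rationale, agent FROM decisions "
        "WHERE project = ? AND topic = ? ORDER BY id DESC LIMIT 1",
        (project, topic),
    ).fetchone()

# One agent records the decision...
remember("api", "web framework", "FastAPI over Flask",
         "async support and typed request models", agent="kimi-cli")
# ...and a different agent, in a different session, recalls the why.
print(recall("api", "web framework"))
```

The point is that the rationale is a first-class column, not something buried in a context window—any agent that can reach the database can answer "what did we decide, and why?"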

Now, for example, I can work on an Android app locally while developing the API remotely, and both agents know what the other is doing. They share the database. I don't cut and paste between terminals. When I switch from Kimi to Claude, the rationale is there—the why behind the code, not just the code itself.

Kimi CLI is my daily driver now, and it's been awesome lately. Claude Code is excellent too. I still reach for Grok, still use commercial APIs when I need the best models. Open source hasn't completely caught up at the level of GPU I'm running. But my home assistant—Jarvis—is entirely local. It cleans my Gmail spam, monitors the systems, and tells me if database replication breaks because I did something stupid. It can talk, but I'm busy typing.
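
The replication watchdog is simple at heart. A minimal sketch of the check, assuming a status row shaped like MySQL's SHOW REPLICA STATUS output—the threshold is a placeholder, and in the real setup alerts go out through Alertmanager and TTS rather than print():

```python
# Hypothetical lag threshold; tune to taste.
LAG_THRESHOLD_SECONDS = 30

def check_replication(status: dict) -> list[str]:
    """Return a list of problems found in a replica status row."""
    problems = []
    if status.get("Replica_IO_Running") != "Yes":
        problems.append("IO thread stopped")
    if status.get("Replica_SQL_Running") != "Yes":
        problems.append("SQL thread stopped")
    lag = status.get("Seconds_Behind_Source")
    if lag is None:
        problems.append("lag unknown (replica not connected?)")
    elif lag > LAG_THRESHOLD_SECONDS:
        problems.append(f"replica {lag}s behind source")
    return problems

# Simulated status row, e.g. after I did something stupid on the primary.
status = {"Replica_IO_Running": "Yes",
          "Replica_SQL_Running": "Yes",
          "Seconds_Behind_Source": 214}
for problem in check_replication(status):
    print(f"ALERT: {problem}")
```

Both threads running but the replica minutes behind is exactly the "because I did something stupid" case—healthy-looking, silently stale.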


The Mistakes Along the Way

I spent time on a website frontend. It worked, but it couldn't develop code the way I wanted. I live in vi and terminal windows. GUI frontends aren't my workflow. Cursor was hard to accept, but it's a great tool. Then cursor-agent and Claude Code and Kimi CLI all came out—game changers. I even built my own CLI tool to use local open source models, but loading that GPU so much became a bottleneck. Then Claude Code started working with Ollama directly—no need for my custom tool anymore.

Everyone's excited about clawbot/molt these days. I checked it out, and I think it's over-hyped. It didn't do anything I needed (that mattered) that I couldn't already do another way. Except for texting my AI remotely—okay, why? Why do I need to write code over text messages? If I need to communicate remotely, I can build better ways. To be fair, I did let Molt clean out my spam emails, just to give it something to do.


Why This Matters

The homelab started as FOMO. It became the infrastructure for independence. Then it became the infrastructure for memory—persistent, structured, inspectable.

After 25+ years of open source, my philosophy is simple: take it to make it, and you're always learning. So there's no reason our code shouldn't always be learning and remembering too.

The GPU is still limiting. The commercial models are still better. But when Kimi CLI pulls up a decision from last week and understands why we chose FastAPI over Flask, when Jarvis quietly clears my spam, when TTS alerts me to replication lag, when I can switch tools without losing context—that's worth building. I can sit and just talk to the models, but I'll admit it still feels very unnatural to me. I find myself with not much to say... I'd rather code, lol.

The open source models are awesome, but they're not the best yet—again, at my level of GPU. Claude, Grok, and Kimi still have that market. More GPU would get me closer. For now, I run hybrid: local for control and privacy, cloud for capability. But every layer I own is a layer I understand.


The Kubernetes Cluster

Between the AOOSTAR N100 and Deborah, there's a third layer: a 3-node K3s Kubernetes cluster running on dedicated Debian VMs within Proxmox. After years of running everything as standalone VMs, I wanted container orchestration for the services that needed to scale, self-heal, and deploy consistently.

The cluster runs Kubernetes v1.34.4+k3s1 on Debian GNU/Linux 13 (trixie), using containerd as the runtime. K3s was the obvious choice—lightweight, single-binary, CNCF-certified, and it just works. The cluster is small but production-grade, with CSI-NFS for persistent storage, Traefik for ingress, and the full Prometheus monitoring stack.

What's running on the cluster:

  • Monitoring: Prometheus + Alertmanager + Grafana (via kube-prometheus-stack)
  • CI/CD: GitLab Runners (2 replicas) for containerized builds
  • Automation: CronJob-based spam cleaner for my mail server
  • Ingress: Traefik for HTTP routing and SSL termination
  • Dashboard: Kubernetes Dashboard for cluster visibility
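
The spam cleaner is exactly the kind of workload a CronJob handles well. A hypothetical sketch of what that manifest could look like—the image name, namespace, schedule, and secret are all placeholders, not the actual deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spam-cleaner
  namespace: automation
spec:
  schedule: "*/30 * * * *"        # every 30 minutes
  concurrencyPolicy: Forbid      # never let runs overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: spam-cleaner
              image: registry.example.com/spam-cleaner:latest
              envFrom:
                - secretRef:
                    name: mail-credentials   # IMAP creds stay in a Secret
```

Compared to a cron entry on a VM, the cluster gives you retries, logs, and job history for free—and if a node dies, the schedule survives.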

The Prometheus instance on the cluster monitors itself plus feeds into the main Prometheus on the N100. Same Alertmanager, same PagerDuty integration. It's a nested monitoring setup—cluster metrics bubble up to the infrastructure level.
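
That bubbling-up is standard Prometheus federation: the N100's main Prometheus scrapes the cluster Prometheus's /federate endpoint for a filtered slice of its series. A sketch of the scrape job—the job name, target address, and match expressions are illustrative, not my actual config:

```yaml
scrape_configs:
  - job_name: "k3s-federate"
    honor_labels: true            # keep the cluster's own labels
    metrics_path: /federate
    params:
      "match[]":                  # only pull the series we care about
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - "k3s-prometheus.lab.local:9090"
```

The match[] selectors matter: federate everything and you double your storage for no benefit, so you pull only the aggregate and node-level series up to the infrastructure tier.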

This is where the stateless services live. Databases still run on dedicated VMs and bare metal (replicated Percona MySQL), but anything that can be containerized, is. The GitLab runners on K3s handle container builds, then artifacts get pushed to the GitLab instance on the N100. Clean separation of concerns.


The Stack Today

AOOSTAR N100 (Proxmox)

  • 67 VMs across 12 Linux distributions
  • 3-node K3s Kubernetes cluster (control plane + 2 workers)
  • Redundant DNS (Technitium DNS Server), Percona MySQL cluster
  • Primary Prometheus + Alertmanager monitoring (integrated with PagerDuty and private mail server)
  • HashiCorp Vault for secrets
  • Twingate for VPN and remote access
  • GitLab instance (CI/CD runners moved to K3s)
  • 15 hard drives (scavenged, various sizes)
  • Power draw: ~30W, running 24/7

Deborah (AI Workstation)

  • AMD Ryzen 7 5800XT, 64GB RAM, RTX 4070 Ti Super 16GB
  • Ollama with 30+ models
  • 12,000+ model files - 10 custom fine-tuned GGUFs (ApocryiaAI V2/V3/Unified variants), 481 safetensors (FLUX, SDXL, Llama), 14 PyTorch bins, and 11,790 XGBoost stock prediction models from HuggingFace
  • Primary Percona MySQL (main database) + Neo4j persistent memory system
  • Full AI service stack:
    • Web UI (main interface when not on the CLI)
    • Ollama API (local inference)
    • Oobabooga (text generation UI, rarely used)
    • Voice/TTS System: OpenVoice + Zonos + CosyVoice + ElevenLabs + Inworld AI — all voice engines controlled via unified API and backend databases. Cloning, synthesis, voice conversion, and character voices. Plus wake-word model switching: 22+ voice keywords trigger different AI personalities (see table below)
    • Image generation: ComfyUI, InvokeAI
    • Video: Wan-AI installed but needs more GPU to be practical
    • Tabby
  • 25 years of stock data, 8TB of XGBoost models—a separate playground of exploration alongside the AI work
  • Power draw: ~150-300W when active

Voice-Controlled Model Switching

The voice system has wake-word activated model switching. Instead of manually selecting which AI model to use, I just say a keyword and the system routes to the appropriate model automatically. Each keyword activates a different AI personality optimized for specific tasks:

  Wake Word         Model                      Personality/Purpose
  JARVIS            ApocryiaAI-system:latest   System assistant
  HAL 9000          ApocryiaAI-system:latest   System monitoring, alerts (yes, really)
  KEITH             apocryiaai-unified:latest  My custom fine-tuned model
  DANIEL, WALTER    llama3.1:latest            General assistants (different personas)
  MATILDA, MORGAN   llama3.1:8b                Lightweight assistants
  ELLIOT            codegemma:latest           Coding assistant
  OBSERVER          deepseek-coder-v2:latest   Code analysis, security review
  OLIVIA            openthinker:latest         Reasoning, complex problem solving
  TRANSCRIBE NOW    llama3.1:8b                Speech-to-text mode
  VOICEOVER         llama3.1:8b                TTS/synthesis mode
  INTERN            llama3.1:8b                Learning mode, asks questions
  REPORTER          llama3.1:latest            Summarization, documentation
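
The routing itself is just a lookup keyed on the transcript. A sketch of the idea, built from the table above—the function and data-structure names are mine for illustration; the real system sits behind the unified voice API:

```python
# Wake word -> (Ollama model tag, purpose). Matching longest-first means
# the two-word "transcribe now" wins before any shorter prefix could.
WAKE_WORDS = {
    "jarvis":         ("ApocryiaAI-system:latest",  "system assistant"),
    "hal 9000":       ("ApocryiaAI-system:latest",  "monitoring, alerts"),
    "keith":          ("apocryiaai-unified:latest", "custom fine-tune"),
    "daniel":         ("llama3.1:latest",           "general assistant"),
    "walter":         ("llama3.1:latest",           "general assistant"),
    "matilda":        ("llama3.1:8b",               "lightweight assistant"),
    "morgan":         ("llama3.1:8b",               "lightweight assistant"),
    "elliot":         ("codegemma:latest",          "coding assistant"),
    "observer":       ("deepseek-coder-v2:latest",  "code analysis"),
    "olivia":         ("openthinker:latest",        "reasoning"),
    "transcribe now": ("llama3.1:8b",               "speech-to-text"),
    "voiceover":      ("llama3.1:8b",               "tts/synthesis"),
    "intern":         ("llama3.1:8b",               "learning mode"),
    "reporter":       ("llama3.1:latest",           "summarization"),
}

def route(transcript: str):
    """Match the longest wake word at the start of a transcript."""
    text = transcript.lower().strip()
    for word in sorted(WAKE_WORDS, key=len, reverse=True):
        if text.startswith(word):
            model, purpose = WAKE_WORDS[word]
            remainder = text[len(word):].strip(" ,.")
            return model, purpose, remainder
    return None  # no wake word: ignore the audio

print(route("Elliot, refactor the spam cleaner"))
```

Because several keywords map to the same model tag, the personality lives in the system prompt, not the weights—switching is just swapping which prompt wraps the remainder of the transcript.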

Agent Workflow

  • Kimi CLI: Primary development tool, excellent for coding and cheaper than Claude
  • Claude Code: Secondary, excellent for complex reasoning
  • MCP protocol: Shared memory between agents
  • Separate TTS/STT pipelines driving the local assistant (monitoring, email, alerts)


Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He writes about MySQL, Proxmox, AI memory systems, and building technology you can actually inspect.