Sunday, April 19, 2026

The Homelab that just kept growing (part 2 of 2) :: The Team Grew. So Did the Architecture.


Three layers, five agents, and the strange experience of watching the industry ship what I built in my spare time.


Where Part One Left Off

In Part One I talked about Deborah — the hardware, the 16GB GPU constraint, and why the memory problem drove me to build ai_memory, a MySQL + Neo4j brain that lets multiple AI agents share context across sessions. That post ended with a working system: Kimi CLI and Claude Code pulling shared rationale out of a database, local scripts monitoring the homelab and clearing spam, and me spending less time repeating myself to machines.

The multi-agent side of this started earlier than the December post suggests. Back in August 2025 I was already writing about why treating a single model as the answer to everything was the wrong instinct — diversity of perspective, not a bigger box. The GitLab implementation document for the IRC bot team is from that same month. December was when I published the full write-up: four specialized agents on a private IRC server, writing code, reviewing it, blocking execution until two independent perspectives agreed. The work predates that post by several months.

By early 2026, the commercial landscape had started converging on the same multi-model pattern. Blackbox AI CLI Agents let you run multiple AI providers — Google, Anthropic, OpenAI — in parallel on the same task, with an internal judge picking the best implementation. Yupp.ai launched with 1.3 million users around the same idea: "the world's smartest AIs, side by side with you." Both are closer to the multi-model thinking than anything I was aware of in August 2025.

I should note: as of March 2026, Yupp.ai announced it's shutting down. Their explanation is worth reading — they wrote that the industry has "shifted dramatically toward agentic systems: models connected to tools, memory, and external services, rather than standalone chatbots," and that passive side-by-side comparison didn't fit where AI was going. They're right. That's the exact pivot I made two years ago.

But closer is not the same. Blackbox runs models in competition — parallel submissions, pick a winner. Apocryia Forge runs models in adversarial collaboration — one builds, one challenges, and the Critic must explicitly approve before anything happens. Blackbox gets you the best first draft. Forge gets you code that has been beaten into shape across up to 25 refinement cycles. Their judge selects. Our Critic vetoes. Those are different philosophies, and only one of them assumes the AI is wrong until proven otherwise. The other practical difference: Blackbox requires their cloud. Every task routes through remote VMs. Forge runs entirely on Deborah — my code never leaves the network.

This post is what happened after. The team has a cleaner design now — five agents with sharper role definitions, a security model I can explain in a single sentence, and a framing I didn't have before: three layers, each doing exactly one job. I'll also get into something I didn't expect: by February 2026, the industry had independently shipped some of the same architecture concepts I'd already been building since the summer. The IRC angle may have put some readers off, and honestly the big picture wasn't in full focus even for me. IRC is just an interface to interact. You can use Slack, email, Matrix, a web API — the interface is irrelevant. The team is the point.


Three Layers, One Clear Job Each

The architectural clarity came from asking a question I should have asked earlier: what is IRC actually for in this system? In the original proof of concept, IRC was doing too many jobs. It was the communication medium, the coordination layer, the state machine, and the audit display simultaneously. That's not good design. It made everything depend on everything else.

The v2.0 framing separates three concerns that were previously tangled together.

Layer 1 — The Mind: ai_memory (MySQL + Neo4j)

This is where actual work happens. Agents don't coordinate by chatting in IRC — they coordinate by writing to and reading from the database. The ai_comms table is the real message bus. Each agent has dedicated tables for its outputs: developer_code, reviewer_scores, lucifer_challenges, shell_executions. When a task runs, the database is working. Every decision, every quality score, every approval, every lesson learned from a past failure — it's all in there, queryable, and available for recall at inference time. I'm a database guy. This is a database problem. The memory doesn't belong in a model's context window; it belongs in a schema.
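A minimal sketch of that message-bus pattern, using sqlite3 in place of MySQL so it runs anywhere. The table layout and column names (from_agent, to_agent, msg_type, content) are assumptions based on the fields this post mentions, not the real schema:

```python
import sqlite3

# Sketch of "the database is the message bus". The real system uses MySQL;
# sqlite3 stands in here so the example is self-contained. Column names are
# illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ai_comms (
        id         INTEGER PRIMARY KEY,
        from_agent TEXT NOT NULL,
        to_agent   TEXT NOT NULL,
        msg_type   TEXT NOT NULL,   -- e.g. REQUEST, ALERT, REVIEW
        content    TEXT NOT NULL
    )""")

def send(frm, to, msg_type, content):
    # An agent "speaks" by inserting a row -- no direct agent-to-agent calls.
    conn.execute(
        "INSERT INTO ai_comms (from_agent, to_agent, msg_type, content) "
        "VALUES (?, ?, ?, ?)", (frm, to, msg_type, content))
    conn.commit()

def inbox(agent):
    # An agent "listens" by querying rows addressed to it -- every message
    # is persistent and queryable after the fact.
    return conn.execute(
        "SELECT from_agent, msg_type, content FROM ai_comms WHERE to_agent = ?",
        (agent,)).fetchall()

send("monitoring_bridge", "coordinator", "ALERT", "Database disk 95% full")
print(inbox("coordinator"))
```

The point of the sketch is the shape, not the engine: coordination is rows, and the audit trail falls out of the design for free.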

Layer 2 — Cognition: Ollama Models on the Local GPU

The thinking happens here. Each agent's turn is a model inference call. The Coordinator decides how to route a task. The Builder generates a solution. The Judge evaluates quality. The Critic challenges assumptions. The models don't hold state between turns — they think, output, and unload from GPU memory. Persistence lives in Layer 1, not here. This separation is also why model selection per role matters: you want the cognitive style right for each job, not one model pretending to be five different things.
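The think-output-unload turn can be sketched against Ollama's /api/generate endpoint. The keep_alive field is real Ollama API (a value of 0 asks the server to unload the model after the call); the role-to-model mapping and function names are illustrative assumptions:

```python
import json
import urllib.request

# Illustrative role-to-model mapping; the real assignments live in config.
ROLE_MODELS = {
    "builder": "deepseek-coder-v2",
    "judge":   "qwen2.5-coder:7b",
}

def turn_payload(role, prompt):
    # One stateless inference turn: the model thinks, outputs, and releases
    # the GPU. Persistence lives in Layer 1, never in the model.
    return {
        "model": ROLE_MODELS[role],
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,   # unload from VRAM when the turn ends
    }

def run_turn(role, prompt, host="http://localhost:11434"):
    # Not called in this sketch -- requires a running Ollama server.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(turn_payload(role, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

On a 16GB card, keep_alive=0 is what makes sequential role-swapping viable: each agent borrows the GPU for one turn and gives it back.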

Layer 3 — Display: Any Interface

The display layer shows you what's happening. IRC was the first implementation — think of it like ticket status updates in a project tracker. You connect with HexChat, and you see phases scroll by: RECEIVED, Analysis, Building, Score: 9/10, APPROVED, COMPLETED. But this layer is just transport. The same status updates flow through Slack, email, a web API, or MCP with identical behavior underneath, because the work happens in Layers 1 and 2. The display layer doesn't coordinate anything — it observes and reports.

ai_memory (MySQL + Neo4j)   ← THE MIND       (work and coordination)
Ollama Models on GPU        ← COGNITION      (thinking and deciding)
IRC / Slack / MCP / API     ← DISPLAY        (observability for humans)

The Five-Agent Collective

The original proof of concept had four agents: ProjectManager, Developer, Reviewer, and Lucifer. The v2.0 architecture adds a fifth: Shell — renamed Hands in the internal design — the only agent that actually executes anything against the system. That isolation matters more than the count.

Agent        Personality      Model              Role
Coordinator  ProjectManager   Qwen2.5-Coder 7B   Routes tasks, recalls relevant memory, coordinates the team
Builder      Developer        DeepSeek-Coder-V2  Generates code and solutions, iterates on feedback
Hands        Shell Executor   DeepSeek/Qwen      The only agent that executes commands — whitelisted, logged, audited
Judge        Reviewer         Custom fine-tune   Scores quality 1–10, requires ≥8 to pass
Critic       Lucifer          Mistral            Dedicated adversary — must explicitly output "APPROVED" for execution to proceed

You can talk to any agent directly, or route through the full team. Here's what that looks like in practice. A direct request to the Developer only — no review cycle, just gets done:

keith: @dev write a hello world python script
AI-Team: 👨‍💻 **Builder**: Working on request from keith...
AI-Team: 👨‍💻 **Builder**: Done.

Like going over to a developer teammate and asking directly. No full review, but fast. Now route the same request through the whole team:

keith: @pm write a Python script to say hello world
AI-Team: 📋 **Task Queued** | ID: TASK-20B3B1CF | Type: code
AI-Team: 1️⃣ Builder generates → 2️⃣ Judge reviews (≥8 required) → 3️⃣ Critic challenges/approves → 4️⃣ Code saved
AI-Team: 👨‍💻 Builder: Working on TASK-20B3B1CF...
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - straightforward, functions correctly
AI-Team: 😈 Critic: TASK-20B3B1CF APPROVED - ready for execution
AI-Team: 📦 Code saved to: generated_code/TASK-20B3B1CF.py
AI-Team: ✅ TASK-20B3B1CF COMPLETED

The real test is a task that pushes back. Here's a MySQL backup script that took eight iterations — seven refinement cycles after the first draft — before passing both gates. The full iteration history is in the database; the IRC display just shows the status transitions:

keith: @pm write a bash script to backup MySQL databases
AI-Team: 📋 Task Queued | ID: TASK-5749EA2E
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 7/10 - NEEDS_WORK - basic solution, missing error handling...
AI-Team: 🔄 Coordinator: iteration 1/25 - refining
AI-Team: 👨‍💻 Builder: Completed (iteration 2) - ready for review
AI-Team: 👀 Judge: Score 6/10 - NEEDS_WORK - performs basic task but has several issues...
AI-Team: 🔄 Coordinator: iteration 2/25 - refining
[iterations 3-7: scores cycling 7/10, Judge flagging missing secure credential handling,
 incomplete error recovery, cleanup logic gaps]
AI-Team: 👨‍💻 Builder: Completed (iteration 8) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - secure credential handling, proper error recovery...
AI-Team: 😈 Critic: TASK-5749EA2E APPROVED - ready for execution
AI-Team: 📦 Code saved | ✅ COMPLETED | iteration_count: 7

Seven refinement cycles on a backup script. The Judge kept finding gaps — hardcoded credentials, missing error handling, incomplete cleanup — and the Builder addressed each one. The final version loaded credentials from environment variables, handled connection failures, and cleaned up properly. The Judge's standard for ≥8/10 isn't decoration. When you want to hand this off overnight, the gate has to hold.
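The loop behind that transcript can be sketched as pure control flow, with the Builder and Judge injected as callables. The function names are illustrative, not the real engine's API; the ≥8 threshold and 25-iteration cap are the values from this post:

```python
PASS_SCORE = 8
MAX_ITERATIONS = 25

def refine(build, judge, task):
    # build(task, feedback) -> code; judge(code) -> (score, feedback).
    # Iterate until the Judge's gate holds, capped at 25 cycles.
    feedback = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        code = build(task, feedback)
        score, feedback = judge(code)
        if score >= PASS_SCORE:
            return code, score, iteration
    raise RuntimeError(f"no passing solution within {MAX_ITERATIONS} iterations")

# Toy stand-ins: this "judge" refuses to pass anything without error handling,
# the way the real Judge kept flagging the backup script.
def toy_build(task, feedback):
    return task + (" + error handling" if feedback else "")

def toy_judge(code):
    return (9, "") if "error handling" in code else (6, "add error handling")

code, score, n = refine(toy_build, toy_judge, "backup script")
print(n)  # 2 -- the second draft passes
```

The Critic's veto sits after this loop, not inside it: a solution that clears the Judge can still be blocked before execution.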

The team also handles edit tasks — not just new code, but surgical modifications to existing files. Request a change, Builder generates the diff, same Judge/Critic gate applies, Hands saves the modified file. The audit trail includes what was changed and why.


The Super Agent: When Local Isn't Enough

The Super agent extends the team beyond local models. When a task warrants the best available reasoning — or when you want multiple commercial APIs to weigh in simultaneously — you can route directly to Grok, Claude, or any configured provider without leaving the same interface.

The Super agent supports three multi-model modes:

Command                               Result
@super ask @claude @grok <question>   Show each model's answer separately for direct comparison
@super CONSENSUS <question>           Synthesize one agreed answer from all responding models
@super compare <question>             Highlight where models agree and where they diverge

The CONSENSUS mode is the most useful for code review: instead of asking each model separately and reconciling the answers yourself, you get a single synthesized response that surfaces disagreements inline. Same interface, local or cloud — the team doesn't care which inference backend answers a given question.


MCP Server: The Interface Most People Will Actually Use

IRC is the audit layer. It shows you what is happening, lets you interject, and preserves a human-readable log of every decision. But it is not how most developers will interact with Apocryia Forge on a daily basis. The primary interface is MCP — each agent exposed as a callable tool from any MCP-compatible CLI.

The underlying architecture is a hybrid: the core logic (TeamConsensusEngine) is the single source of truth, and each interface — IRC, MCP, HTTP API — is a thin adapter that translates its own protocol into the same core calls. This means Claude Code, Kimi CLI, or any other MCP-compatible tool gets identical behavior to what you see in IRC, because they are all calling the same engine.

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Kimi CLI   │  │ Claude Code │  │  IRC / Slack│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       │   MCP adapter  │  IRC adapter   │
       └────────────────┼────────────────┘
                        │
                        ▼
              ┌─────────────────┐
              │  Core Logic     │  ← single source of truth
              │ TeamConsensus   │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │   ai_memory DB  │
              └─────────────────┘
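The hybrid layout in the diagram can be sketched as one engine behind thin adapters. TeamConsensusEngine's real interface isn't public, so submit_task() and the adapter methods here are illustrative stand-ins:

```python
class TeamConsensusEngine:
    """Single source of truth. In the real system this talks to ai_memory;
    here it returns a canned task so the sketch is self-contained."""
    def submit_task(self, requester, text):
        task_id = "TASK-DEMO"  # real IDs come from the database layer
        return {"id": task_id, "requester": requester, "status": "QUEUED"}

class IRCAdapter:
    """Translates IRC lines into core calls; owns no logic of its own."""
    def __init__(self, engine):
        self.engine = engine
    def on_message(self, nick, line):
        if line.startswith("@pm "):
            task = self.engine.submit_task(nick, line[4:])
            return f"📋 Task Queued | ID: {task['id']}"

class MCPAdapter:
    """Same engine, different protocol: structured data instead of text."""
    def __init__(self, engine):
        self.engine = engine
    def call_tool(self, user, text):
        return self.engine.submit_task(user, text)

engine = TeamConsensusEngine()
irc, mcp = IRCAdapter(engine), MCPAdapter(engine)
print(irc.on_message("keith", "@pm write a hello world script"))
print(mcp.call_tool("keith", "write a hello world script")["status"])
```

Because the adapters only translate, adding Slack or email is a new adapter class, not a change to the engine.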

In practice, you are in a Claude Code or Kimi CLI session working on a script. You call an Apocryia Forge MCP tool without leaving your workflow:

  • apocryia_forge.lucifer_challenge — adversarial review of the current file: security gaps, logic failures, assumptions that will break in production
  • apocryia_forge.reviewer_evaluate — quality score 1–10 with specific, actionable feedback
  • apocryia_forge.team_review — full dual-approval workflow: Builder refines, Judge scores, Critic must approve before anything is saved
  • apocryia_forge.shell_execute — whitelisted execution with full audit logging

Three variants ship with Apocryia Forge:

Skill                    Backend        Models                           Cost           Speed          Best For
/apocryia-forge          Auto-detect    Whatever is configured           Varies         Varies         General daily use
/apocryia-forge-super    External APIs  Claude, GPT-4, Grok, Kimi        ~$0.15/review  30–60 seconds  Production code, public repos
/apocryia-forge-ollama   Local Ollama   qwen2.5-coder, llama3, deepseek  Free           2–5 minutes    Private code, air-gapped, cost-conscious

You invoke them in natural language from whichever CLI you are already working in:

# Security audit using external APIs (fast, highest accuracy)
> Review /src/api.py using forge super

# Sensitive internal code — keep 100% local
> Audit /internal/auth.py using forge ollama

# Let the system pick the backend
> Review /docs/tutorial.md using apocryia forge for concepts

Both CLIs call the same MCP server. The choice of which backend runs the review changes. The approval gate, the iteration cap, and the audit trail do not. The interface differs. The engine does not.


The Dual-Approval Gate

Nothing executes until two independent conditions are both satisfied: Judge scores ≥8/10, and Critic outputs an explicit "APPROVED." Both. Not one or the other. This is the gate that makes the system trustworthy enough to hand off at night.

The quality gate and the adversarial gate are different jobs. The Judge evaluates correctness, security, efficiency, and maintainability — does this actually solve the problem, and does it do it safely? The Critic asks what could go wrong: what if this fails halfway through, what are the security implications, what assumptions are we making that we shouldn't be. A solution can score 9/10 on quality and still get blocked by the Critic for a gap the Judge didn't flag. That happens. The backup script example showed seven refinement cycles on a Judge gate — that's the Judge working. The Critic can block independently even after the Judge approves.

One engineering detail worth documenting: early versions had a false-positive problem in the approval gate. The Critic would output something like "this looks good in principle, no major issues found" — and the orchestration logic would parse that as approval. The fix was requiring the literal string "APPROVED" somewhere in the Critic's response. "Looks good in principle" is analysis. "APPROVED" is authorization. Those are not the same statement, and the code now enforces that distinction.
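That distinction is easy to enforce in code. A minimal sketch of the gate, assuming the literal-substring check described above; the function names are illustrative:

```python
PASS_SCORE = 8

def critic_approved(response: str) -> bool:
    # "Looks good in principle" is analysis; only the literal token
    # APPROVED counts as authorization.
    return "APPROVED" in response

def gate_open(judge_score: int, critic_response: str) -> bool:
    # Two independent gates -- not one or the other, both.
    return judge_score >= PASS_SCORE and critic_approved(critic_response)

print(gate_open(9, "this looks good in principle, no major issues found"))  # False
print(gate_open(8, "TASK-5749EA2E APPROVED - ready for execution"))         # True
```

The first call is the false positive that early versions would have executed on: a high Judge score plus friendly Critic prose, but no authorization.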


The 2:47 AM Scenario

Alertmanager fires. Database disk is at 95% on one of the nodes. This is the kind of alert that still wakes me up. The goal — what Forge is being built toward — is that it handles this without me. Here is how that is designed to work.

In the database:

  • Alert stored in ai_comms: from the monitoring bridge, to coordinator, type "ALERT", content "Database disk 95% full"
  • Coordinator queries architecture_decisions for past disk incidents — finds a prior resolution pattern
  • Builder generates solution: proposed deletion of log files older than 30 days
  • Judge scores 9/10: "Safe, targeted"
  • Critic raises a compliance concern: deletion vs. compression for regulated data. Stores "approved: false"
  • Builder revises: compress instead of delete. Judge re-scores: 10/10. Critic approves
  • Hands validates command against whitelist, executes, logs to shell_executions: result SUCCESS
  • Coordinator stores the complete decision in architecture_decisions for future recall
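The Hands step in that sequence can be sketched as a whitelist check plus an audit append. The whitelist contents and the in-memory stand-in for the shell_executions table are illustrative assumptions:

```python
import shlex

# Illustrative whitelist only -- the real list and matching rules live in
# the Hands agent's config. The point is that execution is bounded to
# known command shapes and every attempt is logged, allowed or not.
COMMAND_WHITELIST = {"gzip", "find", "mysqldump", "df"}

audit_log = []  # stands in for the shell_executions table

def validate(command: str) -> bool:
    argv = shlex.split(command)
    return bool(argv) and argv[0] in COMMAND_WHITELIST

def hands_execute(command: str) -> str:
    if not validate(command):
        audit_log.append((command, "REJECTED"))
        return "REJECTED: not whitelisted"
    # subprocess.run(argv, ...) would go here in the real agent
    audit_log.append((command, "SUCCESS"))
    return "SUCCESS"

print(hands_execute("gzip /var/log/mysql/old.log"))  # SUCCESS
print(hands_execute("rm -rf /"))                     # REJECTED: not whitelisted
```

Note the order of defenses: by the time a command reaches this check it has already cleared the Judge and the Critic. The whitelist is the last gate, not the only one.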

That's the design. Forge has been put through this scenario in testing — the pipeline works, the gates hold, the Critic catches compliance gaps the Judge misses. What hasn't happened yet is me actually sleeping through a real 2:47 AM disk alert because the team handled it. The architecture is there. The trust has to be earned by the team demonstrating it on real incidents, not just controlled tests.

The compliance catch is the point. An agent that just fires commands on its own judgment is a liability. Two gates that have to independently agree before anything executes — that's the difference between automation you can hand off and automation you have to babysit.


The Monitoring Bridge

I tried Molt — the tool that went viral as Clawbot, got renamed to Molt, then became OpenClaw. I actually used the Molt version to test some of this. It was lackluster. It didn't think the way I wanted. It felt like automation dressed up as intelligence — it would take an alert, go off and search the web, come back with generic advice that had nothing to do with my actual system. That's not thinking. That's keyword matching with a better interface.

That experience sharpened what I was actually building. The monitoring bridge in my architecture isn't the intelligence — it's just the input channel. It receives Alertmanager webhooks, polls email, handles voice input, and formats those into REQUEST messages for the Coordinator. The bridge has no judgment. The team has the judgment.
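The bridge's one capability can be sketched as a pure translation function. The alerts/annotations fields follow Alertmanager's webhook JSON payload; the REQUEST message shape is an assumption based on this post:

```python
def webhook_to_request(payload: dict) -> dict:
    # The bridge has no judgment and no execution path: it can only turn
    # an incoming webhook into a message addressed to the Coordinator.
    alert = payload["alerts"][0]
    summary = alert.get("annotations", {}).get("summary", "unknown alert")
    return {
        "from_agent": "monitoring_bridge",
        "to_agent": "coordinator",
        "msg_type": "ALERT",
        "content": summary,
    }

payload = {"alerts": [{"annotations": {"summary": "Database disk 95% full"}}]}
print(webhook_to_request(payload)["content"])  # Database disk 95% full
```

This is why the blast radius argument below holds: even fully compromised, the bridge can only emit one of these messages, which then faces the full review pipeline.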

The security model matters here. The bridge lives on an isolated VM. Even if that VM were completely compromised, the attacker gains no execution capability — it can only write a message to the Coordinator. That message goes through the same four-agent review as any other request. The blast radius of a compromised bridge is exactly one request that the team still has to approve.

The design intent for overnight mode: say "Good night, handle alerts," and the bridge enters monitoring mode. In the morning: "Good morning, summary please." The basis for trusting that handoff isn't faith — it's the fact that every decision is in the database and every approval is queryable. That trust is being built through testing. It isn't there yet.


Building the Forge with the Forge

The development team for Apocryia Forge is itself a multi-agent team running the same collaboration pattern the system is designed to implement. Kimi is the primary code writer — implementing classes, running tests, iterating on feedback. Claude handles security review, validation, and adversarial challenge. Coordination flows through the ai_comms table: Kimi sends a structured review request, Claude reads it, runs tests, and responds with a verdict. Critical findings get flagged at CRITICAL priority. Disagreements go to the PM for a trade-off call.

The build team roles map directly to their Forge equivalents:

Build Team Role              Consensus Engine Equivalent
Kimi — Developer             Individual model review pass
Claude — Reviewer            Validation and sanity check
Grok — Critic                Edge case and assumption detection
PM (Human) — Product Owner   Final consensus threshold decision

Every review cycle the build team runs is stress-testing the same workflow the system will run for end users. Protocol bugs get found during construction, by the construction process itself. This is the most useful kind of dogfooding: the consensus system validates its own design through the act of being built.


The Industry Is Figuring Out Teams

Something shifted in early 2026. Not one company — several, independently, at roughly the same time. Moonshot AI shipped Kimi K2.5 Agent Swarm: up to 100 parallel sub-agents coordinating on a single task, 4.5x wall-clock speedup on parallelizable work. Anthropic released Claude Opus 4.6 Agent Teams. The LLM council pattern — multiple models deliberating, challenging each other, synthesizing a better answer than any single model produces alone — started showing up everywhere. Even Yupp.ai, which launched with 1.3 million users as a side-by-side model comparison tool, shut down in March 2026 — they didn't reach product-market fit, and in their own post-mortem they noted that the landscape had shifted toward agentic systems, making crowdsourced chatbot-layer evaluation increasingly less critical.

The convergence isn't a coincidence. It's the industry discovering what happens when you stop treating a single model as the answer to everything. One model has a ceiling. A team of models with defined roles, challenge mechanisms, and a coordination layer breaks through that ceiling. That's what all of these independently arrived at.

My work has been moving quietly along the same wave. Not ahead of it, not behind it — alongside it. I wasn't predicting where the industry was going. The problem forced the solution. A single model reviewing its own code misses things. A single model running overnight without a check on its judgment is a liability. The team structure wasn't a thesis. It was the answer to specific failures I kept running into. Turns out a lot of people were running into the same ones.

System                                                             Date
ApocryiaAI multi-agent concept — first posts and GitLab commits    August 2025
Multi-agent IRC team POC published (anothermysqldba.blogspot.com)  December 2, 2025
Kimi K2.5 Agent Swarm                                              January – February 2026
Claude Opus 4.6 Agent Teams                                        February 5, 2026
Yupp.ai winddown — cites shift to agentic systems                  March 2026

Where my architecture differs from the commercial implementations:

  • An adversarial agent designed in from the start. Lucifer isn't a reviewer — it's a dedicated devil's advocate with explicit veto power. One model cannot be its own best critic. None of the commercial implementations ship a dedicated antagonist role by default.
  • Explicit GPU memory management. Commercial solutions assume cloud VRAM. The design includes explicit GPU handoff between agents — Coordinator runs on CPU always, heavy agents take the GPU in turn and release it. On a 16GB consumer card that's not a workaround, it's the architecture.
  • Database as the coordination medium. Not queues, not direct agent calls, not shared files. The database is both source of truth and message bus. Every coordination decision is queryable, and agents recall relevant past decisions at inference time.
  • Local-first, zero cloud dependency. Code and data stay on Deborah.

What the commercial versions do that mine doesn't: scale. Kimi runs 100 concurrent sub-agents. My system is sequential by design — a constraint of the hardware, not a philosophy. The industry is solving parallelism with infrastructure I don't have. What I'm solving is something different: a system that remembers its decisions, challenges its own outputs, and can eventually act on its own judgment without me watching. That's a different problem than speed.


What I Actually Wanted

While building Forge I kept running into the same question: what handles the automation layer? The monitoring, the alerts, the overnight tasks, the things that shouldn't need me. I looked at what was out there. There were a lot of things calling themselves JARVIS. There were OpenClaw frontends, task-oriented automation tools, systems that would take a request and fire a command. Everyone was building something they called a personal AI assistant.

I tried some of them. They were lackluster. Not because they were badly built — some were well built — but because the concept was wrong. They were frontends. Wrappers around task execution. You gave them a job, they did the job, they reported back. That's not what I wanted. I wanted something that would notice the job needed doing. Something that would remember what happened last time. Something that would push back if the job seemed wrong.

Nobody was building that. They were building task runners with a personality wrapper and calling it intelligence.

So I started thinking about what robust automation actually meant — not as a product, not as a demo, but as something I would actually trust on my own infrastructure. The Forge testing loop gave me the right frame: I had a system that challenged its own outputs, required dual approval before acting, and kept a full audit trail of every decision. What if the automation layer worked the same way? What if instead of scripts firing commands, there was something that actually had to think before it acted?

That question is where Part Three starts. Not a script. Not a frontend. A mind. And in the end, it won't be called JARVIS either — because that name is already taken by every task runner with a personality wrapper. The real thing deserves its own name.


What's Next

The MCP interface described above is the immediate priority — the refactoring in progress now. Once the thin adapter layer is clean and each agent is callable as a discrete MCP tool, the rest follows more naturally.

The self-improvement loop is already in the design intent. Every Sunday at 2 AM, the team runs an autonomous retrospective: query architecture_decisions for the week, review outcomes, identify recurring patterns, store new lessons in the pitfalls table. The Judge tracks average quality scores over time. The Critic surfaces edge cases that appeared repeatedly. Good patterns get added to the deployment checklist. I'm genuinely skeptical about this part until I see it work in practice — "self-improving AI" is a phrase that gets overused — but the mechanism is there, and the data will either show improvement or it won't.
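The retrospective can be sketched as a query against a stand-in table. Column names, and the criterion that any lesson recurring more than once is a candidate for the pitfalls table, are assumptions; sqlite3 stands in for MySQL so the sketch runs anywhere:

```python
import sqlite3

# Stand-in for the architecture_decisions table with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE architecture_decisions (decided_at TEXT, outcome TEXT, lesson TEXT)")
conn.executemany("INSERT INTO architecture_decisions VALUES (?, ?, ?)", [
    ("2026-04-14", "SUCCESS", "compress instead of delete for regulated data"),
    ("2026-04-16", "SUCCESS", "compress instead of delete for regulated data"),
    ("2026-04-17", "FAILURE", "dump path not writable"),
])

def weekly_retro(conn, week_start):
    # Pull the week's decisions and surface recurring lessons: anything
    # seen more than once is a candidate for the pitfalls table.
    rows = conn.execute(
        "SELECT lesson, COUNT(*) FROM architecture_decisions "
        "WHERE decided_at >= ? GROUP BY lesson ORDER BY COUNT(*) DESC",
        (week_start,)).fetchall()
    return [lesson for lesson, n in rows if n > 1]

print(weekly_retro(conn, "2026-04-13"))
```

Whether this constitutes "self-improvement" is exactly the open question: the query is trivial, and the hard part is whether the recalled lessons change future behavior.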

Async Python is the other obvious next step. Sequential agent turns were the right starting point. Async calls to multiple MCP servers gets closer to the parallelism that makes Kimi's approach faster, without needing their training infrastructure. That's a Python concurrency problem, not an AI problem.
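That concurrency step is indeed plain asyncio. A sketch with toy coroutines standing in for MCP client calls — the backend names are illustrative:

```python
import asyncio

async def ask(backend: str, question: str) -> str:
    # Stands in for one MCP client call; the sleep models network latency.
    await asyncio.sleep(0.01)
    return f"{backend}: answer to {question!r}"

async def fan_out(question: str, backends):
    # gather() runs all the calls concurrently and preserves input order,
    # so total wall time is roughly the slowest backend, not the sum.
    return await asyncio.gather(*(ask(b, question) for b in backends))

answers = asyncio.run(fan_out("review api.py", ["claude", "grok", "kimi"]))
print(len(answers))  # 3
```

The sequential turn logic doesn't change; only the outer dispatch does, which is why this is a Python concurrency problem rather than an AI problem.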


The Part That Actually Matters

I built this on hardware you can order off Newegg or Amazon. Not $50,000 A100 clusters, not cloud credits, not a research team with a budget. An AMD Ryzen 7, 64GB RAM, and an RTX 4070 I was already using for other things. The architecture says "We are not tools. We are a team. We remember, we learn, we act." I wrote that as design intent, not a marketing tagline. The distinction between a tool and a team is that a tool does what you tell it. A team tells you things back — flags a compliance gap at 2:47 AM, recalls what went wrong last time a similar alert fired, surfaces the assumption you didn't know you were making.

After 25 years of open source, my instinct is still: don't rent what you can own, don't trust what you can't inspect. Local models mean my code doesn't leave the network. The database means every decision is queryable. The whitelist means execution is bounded. None of this requires faith in a vendor's system working the way it says it does.

The commercial models are still better. The GPU is still limiting. The parallelism gap is real. But I understand every layer of what I've built, which is worth something on its own. The code is working. The team is functional. And something that started as a collection of scripts now has a name and a purpose. What it needs next is a mind — which is what Part Three is about.


Why the Repo Is Private

The original goal was to open source Forge and let the community help build it. That intent was real — I wanted better logic, more models, community stress-testing of the adversarial pipeline. Open source has given me 25 years of tools I couldn't have built alone. Giving back is the instinct.

But as the cognitive architecture developed, that changed. The memory design, the way the mind and the team connect — it started to become something I couldn't give away. Not because I want to hoard it, but because some things, once given away, can't be taken back. I've stopped pretending otherwise. The repo stays private.

Part Three is written. It covers the cognitive architecture — what I built to answer the question at the end of this post. The publishing date is TBD while I work through a few things. For now I will share progress via YouTube videos.




Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He writes about MySQL, Proxmox, AI memory systems, and building technology you can actually inspect.
