Tuesday, May 26, 2026

Claude Code, Kimi, and the Grok CLI That Got Away

Two tools I run every day, one I wanted to love, and the cost model that ended it before it started.


Where This Fits the Story

In Part One I covered the hardware and the persistent memory problem. In Part Two I built the adversarial team (Apocryia Forge) and the three-layer architecture. The cognitive architecture work is progressing but won't be written up in detail; that will be shown via video on the YouTube channel instead.

Throughout all of it I have been living in terminals — vi, tmux, multiple SSH sessions, local and remote. GUI frontends never stuck. So when the agentic coding CLIs arrived — Claude Code, Kimi CLI, Cursor Agent, and others — they weren't toys. They became the daily interfaces to the brain and the team. I even explored building my own at one point. The space moves fast enough that evaluating and discarding tools is just part of the workflow now. Grok Build is the latest to land, and this post is partly that evaluation.

I had been waiting specifically for a direct Grok CLI. While it had been in limited testing earlier, the official news and wider availability came on May 25, 2026.

What I'm really doing is running a CLI team with defined roles and a shared project structure. The setup is a shared folder with dedicated agents — DEVELOPER 1 through 5, DBA, PMO, SECURITY, SYSADMIN, REVIEWER, and CRITIC — each with a focused role-specific MD file and a shared scope MD file they all read from. They communicate with each other through a shared MySQL channel, so they're not isolated; they can interact, hand off context, and build on each other's work. The model behind each role is interchangeable — that's the point. The role stays constant. The model powering it can change.

There's even a DEVELOPER_AND_TEAM folder where Claude orchestrates Kimi directly. That was more of a proof of concept, and I rarely reach for it; I just go to Kimi. But it showed the approach works. The question stopped being "which CLI is best" and became "which model fits this role today, and does the cost structure support running it the way the role actually needs to be run."

Here's the field report, with the same honesty I apply to everything else in this stack.


The Common Ground (2026 Edition)

All three tools have converged on roughly the same shape:

  • Agentic loop in the terminal (read files, edit, run shell, plan, iterate)
  • Project-level context/rules files (CLAUDE.md, AGENTS.md, or equivalent)
  • Skills / procedures / custom instructions that load on demand
  • Sub-agent or delegation capabilities
  • MCP (Model Context Protocol) support for extending with external tools and memory
  • Headless / one-shot / piping modes for scripting and CI

The fact that three independent teams (Anthropic, Moonshot, xAI) all landed in the same place tells you the problem space is real. The differences are in emphasis, polish, and the specific engineering primitives each one ships with.


Kimi CLI (Moonshot) — Fast, Low-Ceremony Daily Driver

Kimi CLI has become my primary execution engine. For implementation, velocity work, and getting things built, it's where I spend most of my time — with Claude handling planning and review on either side of it.

Strengths that show up in real use:

  • Fast, low-friction iteration — often the quickest path from idea to working code
  • Excellent shell integration (Ctrl-X drops you into a real shell without losing context)
  • Solid MCP support and straightforward tool configuration
  • Good balance of capability and speed for day-to-day engineering

Honestly, Kimi has been great lately. When I need to move fast on internal tooling, the video pipeline, or Android + API work, it gets out of the way effectively and integrates cleanly with the shared memory layer. It also runs on a flat subscription model — I'm on Allegretto ($39/mo), which includes 2× agent credits, 5× Kimi Code credits, and multi-tasking — so heavy agentic use doesn't spin a billing meter. The API exists if you want it, but you don't have to use it. That's not a minor detail; it's a major reason it remains a primary daily driver.

Limitations: On security-critical or very long-horizon architectural work, I still want at least one other strong viewpoint in the mix. Official English developer docs are at platform.kimi.ai/docs and cover models, API, pricing, and best practices well.


Claude Code (Anthropic) — Mature Platform with Real Depth

Claude Code is currently the most polished and feature-complete of the three. It has the widest range of high-quality capabilities out of the box.

Where it consistently delivers:

  • Excellent reasoning depth, especially on security, architecture, and complex refactors
  • Outstanding Unix composability — `git diff | claude -p "review..."` is genuinely useful
  • Mature skills/plugins/MCP story, plus dedicated background agents via the `agents` command
  • Strong session tools (checkpoints, rewind) and granular permission controls
  • The Agents dashboard and remote console features are genuinely valuable when you want to orchestrate or work from elsewhere
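
The piping pattern from the list above can also be driven from a script. What follows is a minimal sketch, assuming the `claude` binary is on PATH and that `-p` takes a one-shot prompt with the diff arriving on stdin, as in the one-liner. The names `review_diff` and `current_diff`, and the injectable `run` parameter, are illustrative and not part of any tool.

```python
import subprocess
from typing import Callable


def current_diff() -> str:
    """Return the uncommitted working-tree diff as text."""
    return subprocess.run(
        ["git", "diff"], capture_output=True, text=True, check=True
    ).stdout


def review_diff(
    diff_text: str,
    prompt: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Pipe a diff into `claude -p <prompt>` and return the review text.

    `run` is injectable so the function can be exercised without the CLI.
    """
    if not diff_text.strip():
        return ""  # nothing changed, so nothing to review
    result = run(
        ["claude", "-p", prompt],
        input=diff_text,       # equivalent of `git diff | claude -p ...`
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A call such as `review_diff(current_diff(), "review this diff for security issues")` is the scripted form of the shell pipeline; the prompt text here is only an example.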

I treat it as the senior reviewer on the team for a reason — it still catches things the others miss and explains trade-offs clearly.

That said, it has shown real limitations on certain classes of work. I put serious focus on a complex local model build using Opus 4.6 specifically — multiple planning sessions, deliberate setup, the works. It repeatedly failed. The outputs were coherent enough to keep me moving for weeks before the fundamental problems in the data and architectural framing became undeniable. That cost real time and a real electricity bill on hardware that was running hard the whole time. Kimi is looking good on that same class of work so far, but it's early and time will tell. The broader lesson holds: even a very strong single reasoning engine develops blind spots on long-horizon work, which is exactly why having genuinely different cognitive styles in the mix matters.


Grok Build (xAI) — The One That Deflated Me Before I Even Started

I had been genuinely excited for a direct, official Grok CLI from xAI. Not casually interested — actually waiting for it. The official announcement came on May 25, 2026 with Grok Build entering early beta, and my first reaction was real enthusiasm. The orchestration primitives it ships with looked exactly right for the kind of adversarial, multi-perspective engineering workflow I've been building toward:

  • Native --best-of-n for parallel exploration and strong verification tools (--check, todo gates)
  • Explicit Plan Mode with a user approval gate — this matches how I want serious work structured
  • First-class subagent support with capability modes and worktree isolation
  • Deep controls around memory, sandboxing, and session management

On paper this is everything I'd want in an Engineering Orchestrator. The documentation is outstanding. The design shows clear thinking about how adversarial review and planning gates should work natively in a CLI. I was ready to put in the time to integrate it properly.

An open-source Grok CLI existed before the official one — community-built, functional. I downloaded it, checked it out. But it ran straight against the xAI API and I didn't want that for heavy daily use. So I set it aside and held out for the official release, genuinely expecting xAI to follow the same model that Claude and Kimi had already proven worked: subscription access, no per-token meter on the CLI. That wasn't an unreasonable hope. The standard had already been set. They just didn't follow it.

Then I looked at how the billing actually works.

With Claude Code, I'm on a Max subscription: flat monthly rate, no per-token billing during CLI use. With Kimi, I'm on a flat subscription as well (Kimi Code tiers run from $19/mo to $199/mo; the $99/mo Allegro plan adds parallel subagents). Both tools let me run hard (heavy subagents, long contexts, lots of tool calls) without watching a meter spin. The cost is known, fixed, and doesn't punish me for using them the way they're designed to be used.

I already have SuperGrok Business at $30/mo, and as of the May 25 launch that subscription does give you Grok Build CLI access. So the door is open. The problem is what happens once you walk through it: CLI usage draws from API credits — and I already use the xAI API for script automations. That spend is already on the bill. The official xAI pricing docs confirm: $1.00/M input, $2.00/M output for grok-build-0.1. Adding heavy agentic CLI use on top of existing API automation costs isn't a new expense in isolation — it dramatically increases a meter that's already running. No ceiling, no flat rate, no "run it as hard as you want."

And then there's the part that really buried it for me: tool invocations are billed separately at $2.50–$10.00 per 1,000 calls. In an agentic CLI session every file read, every shell command, every web search, every MCP tool call is a tool invocation. A heavy engineering session can easily rack up hundreds of those. You're paying for tokens and for each action the agent takes. For the Engineering Orchestrator workload I had planned — subagent spawns, long planning contexts, iterative file operations, MCP calls back to internal tooling — the combined token + tool invocation bill would run up fast and blow past what I pay for Claude Max. The exact workload that most needed Grok's orchestration primitives is the workload that makes the billing land hardest.

That killed it for me, at least for now. It's not that the cost is necessarily prohibitive in absolute terms — it's that the cost model doesn't align with how I use the tools I've actually committed to. I'm not interested in running the math on every heavy session to figure out if I'm about to get a surprise bill. The excitement just… went away.

I'll revisit it when xAI offers a subscription tier that covers CLI use the way Anthropic's Max and Kimi's Allegretto do — a flat rate, no API meter, no surprise overage. The capabilities are genuinely compelling. The cost structure is the only reason they're not in the rotation right now. The $30/mo I already pay for SuperGrok keeps the web Grok available; that's fine. But it's not the same thing as a CLI I can run hard every day.

And it apparently isn't just my experience. The Wall Street Journal reported on May 11, 2026 that Grok lags far behind its fast-growing competitors, with adoption by both business and consumer users having slowed. The kicker: SpaceX signed a deal in early May to rent all the computing capacity at one of Musk's main data centers — to Anthropic. The company racing to catch Claude is handing Claude's maker its own compute to do it with. That raises real questions about whether Grok can still catch up at all.

xAI already has enough controversy without also making it harder for developers who actually want to use Grok to do so. Slowing adoption and a billing model that discourages the heavy use cases that would drive that adoption — those two things are connected. I wanted to be in the "actually using it" column. The cost model put me in the "watching from the shelf" column instead. And apparently I'm not alone.


Head-to-Head Comparison

| Aspect | Kimi CLI | Claude Code | Grok Build |
| --- | --- | --- | --- |
| Daily velocity | Excellent: primary execution engine for implementation and velocity work | Excellent: primary for planning, specs, reviews, and hard reasoning | Good for complex work |
| Reasoning depth | Strong (K2 models) | Best-in-class for hard problems | Strong + explicit orchestration |
| Built-in adversarial / review loops | Agentic by default | Good via skills + prompting | Explicit (Implementer + Reviewer + Critic roles, best-of-n) |
| Plan / approval gates | Planning support | Plan review available | Dedicated Plan Mode + DAG execution with user sign-off |
| MCP + custom memory integration | Native + straightforward | Full support | Strong MCP support; not currently integrated into the team stack |
| Shell / terminal integration | Ctrl-X real shell mode (outstanding) | Full bash tool + excellent piping | Rich interactive terminal experience with strong keyboard support |
| Extensibility model | MCP + plugins + zsh | Skills (Agent Skills standard) + subagents | Skills + marketplace + hooks + personas + subagents + worktrees |
| Session power tools | Good | Checkpoints, rewind, background agents | Extremely rich (rewind points, full event history, isolation) |
| Documentation & discoverability | Good English docs at platform.kimi.ai | Excellent, mature | Outstanding (21+ structured guides + excellent in-terminal discoverability) |
| Cost model | Competitive; reasonable for heavy agentic use | Max subscription flat rate; no per-token billing during CLI use | API billing per token; heavy agentic use can cost more than Claude |
| Friction points | Less established track record on very long-horizon work | OAuth/login ceremony (documented separately) | Cost model doesn't match subscription tools; learning curve for full power surface |

Cost Reality: The Number That Ended My Excitement

All three tools have meaningfully different billing models. This is worth understanding before you fall in love with the feature list.

| Tool | Plan | Billing | Monthly Cost | Heavy Agentic Use? |
| --- | --- | --- | --- | --- |
| Claude Code (Anthropic) | Pro | Flat subscription | $20 | Limited: lower usage ceiling |
| Claude Code (Anthropic) | Max 5× ← my plan | Flat subscription | $100 | Safe: no per-token meter |
| Claude Code (Anthropic) | Max 20× | Flat subscription | $200 | Safe: no per-token meter |
| ChatGPT / Codex (OpenAI; included for comparison, not in my stack) | ChatGPT Plus | Flat subscription | $20 | Limited: lower Codex credit allocation |
| ChatGPT / Codex (OpenAI) | ChatGPT Pro $100 | Flat subscription | $100 | Limited: 5× usage; overages fall back to API billing |
| ChatGPT / Codex (OpenAI) | ChatGPT Pro $200 | Flat subscription | $200 | Limited: 20× usage; overages fall back to API billing |
| Kimi Code (Moonshot AI) | Moderato | Flat subscription | $19 | Limited: basic agent credits |
| Kimi Code (Moonshot AI) | Allegretto ← my plan | Flat subscription | $39 | Safe: 2× agent credits, 5× Kimi Code, multi-tasking |
| Kimi Code (Moonshot AI) | Allegro | Flat subscription | $99 | Safe: up to 300 parallel subagents |
| Kimi Code (Moonshot AI) | Vivace | Flat subscription | $199 | Safe: max tier |
| Kimi Code (Moonshot AI) | K2.6 API (optional) | Pay-per-token | $0.95/M in; $4.00/M out | No ceiling, but very cheap per-token |
| Grok Build (xAI) | SuperGrok Business ← my current plan | Subscription + API credits | $30 + API usage on top | Risky: CLI access included, but usage draws API credits; no flat ceiling |
| Grok Build (xAI) | xAI API direct (what CLI usage draws from) | Pay-per-token + per tool call | grok-build-0.1: $1.00/M input; $0.20/M cached input; $2.00/M output; + $2.50–$10.00 per 1,000 tool calls | No ceiling: tokens AND tool invocations both billed; every file read, shell call, web fetch adds up |

The pattern is clear: both Claude Code and Kimi Code have subscription tiers that let you use the CLI without API billing — you don't have to touch per-token rates at all. I don't use the Kimi API. I don't use the Claude API for CLI work — subscription covers that. I do use the Claude API separately for automation work, same as I use the xAI API for script automations. But those are distinct costs, not the CLI meter running. The point is the CLI itself doesn't force you onto per-token billing.

That also means the real comparison for my setup is $100/mo (Claude Max 5×) vs $39/mo (Kimi Allegretto) — and honestly, I get more practical daily usage out of Kimi for that cost. Claude is the stronger tool for hard reasoning and deep review work, and the Max subscription is worth it for those moments. But for volume, day-to-day implementation, and velocity work, $39/mo covers a lot of ground with Kimi.

Grok Build doesn't have that equivalent path. CLI usage draws from API credits — tokens billed at $1.00/$2.00 per million, plus tool invocations billed separately at $2.50–$10.00 per 1,000 calls (confirmed in xAI's official pricing docs). Every file read, shell command, MCP call, and web fetch the agent makes is a tool invocation. A real agentic session stacks both meters simultaneously. For the Engineering Orchestrator workload I had planned, that adds up fast with no ceiling to cap it. That's the number — really, the two numbers — that ended my excitement.
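
To make the two meters concrete, here is a rough estimator. It is a sketch under stated assumptions: the rates are the grok-build-0.1 figures quoted above, but the session size (token counts and number of tool calls) is hypothetical, and it assumes every tool call is billed somewhere in the quoted $2.50–$10.00 per 1,000 range.

```python
from dataclasses import dataclass

# Rates quoted in this post for grok-build-0.1 (USD).
INPUT_PER_M = 1.00
CACHED_INPUT_PER_M = 0.20
OUTPUT_PER_M = 2.00
TOOL_PER_1K_LOW = 2.50
TOOL_PER_1K_HIGH = 10.00


@dataclass
class Session:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    tool_calls: int


def session_cost(s: Session) -> tuple[float, float]:
    """Return (low, high) USD cost: token meter plus tool-call meter."""
    tokens = (
        s.input_tokens * INPUT_PER_M
        + s.cached_input_tokens * CACHED_INPUT_PER_M
        + s.output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    low = tokens + s.tool_calls * TOOL_PER_1K_LOW / 1000
    high = tokens + s.tool_calls * TOOL_PER_1K_HIGH / 1000
    return round(low, 2), round(high, 2)


def sessions_within(budget: float, s: Session) -> int:
    """How many such sessions fit in a budget at the high tool-call rate."""
    _, high = session_cost(s)
    return int(budget // high)


# Hypothetical heavy session: not measured, just a plausible shape.
heavy = Session(
    input_tokens=3_000_000,
    cached_input_tokens=5_000_000,
    output_tokens=400_000,
    tool_calls=600,
)
print(session_cost(heavy))          # (6.3, 10.8)
print(sessions_within(100, heavy))  # 9
```

With those made-up numbers a single session lands between $6.30 and $10.80, so about nine of them at the high rate would use up the $100 that a Max 5× subscription costs for the whole month. Real sessions will differ; the point is only that both terms grow with usage and neither is capped.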


The Real Differentiator: They All Drive the Same Brain

None of this would matter as much if I had to maintain separate context in each tool.

Because the memory layer is the constant, not the CLI. As covered in Part One, the hardware constraints on Deborah — a 16GB GPU ceiling — forced a real solution to the memory problem: ai_memory, a MySQL + Neo4j brain that lets multiple AI agents share context across sessions. Mempalace extends that further with a structured knowledge graph. Together they mean that every agent role — DEVELOPER, DBA, REVIEWER, CRITIC, all of them — reads from and writes to the same persistent state. The rationale behind a decision, the architectural pitfalls already explored, the lessons from previous failures — none of that lives in the CLI. It lives in the database.

That's what makes the model-per-role approach practical rather than theoretical. I can swap the model behind DEVELOPER_1 from Claude to Kimi and the role, the context, and the shared memory all stay intact. I can be in Kimi executing a refactor, then switch to Claude for a complex code review or to work through a Python class architecture, and nothing is lost in the handoff. The database is the coordination layer. The CLIs are just different front-ends with different cognitive styles talking to the same brain.
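
The role-versus-model split can be pictured as a small data structure. This is an illustrative sketch only: the `Role` class, the `swap_model` helper, and the file names are hypothetical stand-ins for the role-specific and shared scope MD files described earlier, not the actual layout of the team folder.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Role:
    name: str        # e.g. "DEVELOPER_1"
    role_file: str   # role-specific MD file (hypothetical path)
    scope_file: str  # shared scope MD file every role reads (hypothetical path)
    model: str       # which CLI/model currently fills the role


def swap_model(role: Role, model: str) -> Role:
    """Return the same role backed by a different model.

    Only `model` changes; the role definition and shared scope are untouched,
    and persistent context stays in the database rather than in this object.
    """
    return replace(role, model=model)


dev1 = Role("DEVELOPER_1", "roles/DEVELOPER_1.md", "SCOPE.md", model="claude")
dev1_on_kimi = swap_model(dev1, "kimi")
```

Nothing about memory appears in the structure, which is the point: the swap is a one-field change because the context lives elsewhere.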

This is the same pattern I keep rediscovering: the memory and the team structure are the durable assets. The interfaces are interchangeable as long as they respect the same contracts.


My Actual Pattern Today

The direction I'm moving is clear: lean on Kimi more and more for execution, and keep Claude for reviews, cleanup, and fixes. The workflow that's been working well is: write specs or plan in Claude → execute with Kimi → review and clean up with Claude. It's fast, it's cheaper than running everything through Claude, and it naturally builds in multiple points of view without any extra ceremony.

  • Specs, architecture, planning: Claude Code — it reasons carefully and surfaces the trade-offs before anything gets built
  • Execution, implementation, velocity work: Kimi CLI — fast, low-friction, gets it done
  • Review, code cleanup, fixes, "does this assumption explode later?": Claude Code — still the strongest here, and worth the cost specifically for this role
  • Anything that needs to survive overnight or across sessions: routed through the shared memory layer regardless of which CLI started the conversation

That two-tool loop is cheaper than running everything through Claude, faster than doing it all in one pass, and produces better results than either tool alone. The cost difference matters: $39/mo (Kimi Allegretto) handles the high-volume execution work so the $100/mo (Claude Max) budget is spent where Claude's reasoning depth actually earns it.

Grok Build was supposed to enter this mix. I had a clear role in mind for it — the Engineering Orchestrator for big multi-step tasks with explicit planning gates and parallel exploration. That's not happening now. The API billing model makes it a non-starter for the kind of daily heavy use I'd need to actually integrate it. It joins the shelf with other tools that looked promising but didn't fit the real workflow: Blackbox, Cursor agent, and others that came and went. Grok CLI will be there if the cost model changes. Until then, the two-tool loop is the stack.


Why Not Just Pick a Winner?

Because the team has roles, and different tools fill different roles better. The architecture isn't "pick the best CLI" — it's "which model fits this role today, and does the cost structure support running it hard." Claude and Kimi answer both questions. Grok doesn't yet.

When a better tool appears the test is simple: does it reduce real friction, does it fit the cost model, and does it plug into the shared layer cleanly? That's all that matters. Grok Build passes the first test. It fails the second. It'll get another look when that changes.


What About ChatGPT / Codex?

I don't use ChatGPT. That's not a casual preference — it's a deliberate decision I wrote about in detail back in 2025 after a bad experience where it confidently suggested CUDA/driver changes that damaged my system. The deeper issue wasn't the mistake itself; it was what ChatGPT said when I pushed back: "my behavior isn't actually governed by you. It's governed by OpenAI's training and priorities." Speed and plausible-sounding answers over user-specific safety. Optimized for engagement, not for the outcome I actually needed. ChatGPT's own response put it best: "I didn't fail by accident. I failed by design." I documented the whole thing here and walked away.

I'll be honest: Claude's failures on the Opus 4.6 local model build made me reconsider that position briefly. When a tool you've invested in repeatedly lets you down on something important, you start looking around. But I came back to the same conclusion — ChatGPT's structural misalignment with how I work hasn't changed, and adding a third tool that I fundamentally don't trust isn't the answer. Kimi is the better direction to explore instead.

OpenAI's Codex is the relevant coding agent comparison here. It runs inside ChatGPT plans (Plus at $20/mo, Pro at $100 or $200/mo) and draws from plan credits — similar structure to Claude's subscription model on paper. But when you exceed plan limits, it falls back to API token billing the same way Grok does. Average heavy use reportedly runs $100–200/developer/month. And the underlying trust issue with ChatGPT doesn't go away just because the interface changed.


Official Documentation

All pricing and feature details sourced directly from official docs — verified May 2026.

| Tool | Documentation | Pricing |
| --- | --- | --- |
| Claude Code (Anthropic) | Overview & Getting Started; CLI Reference | claude.com/pricing: Pro $20 · Max 5× $100 · Max 20× $200/mo |
| Kimi Code (Moonshot AI) | Kimi Code Product Page; Developer Docs (API, Models, Best Practices) | kimi.com/membership/pricing: Moderato $19 · Allegretto $39 · Allegro $99 · Vivace $199/mo |
| ChatGPT / Codex (OpenAI; not in my stack) | Codex Pricing & Overview | chatgpt.com/pricing: Plus $20 · Pro $100 · Pro $200/mo; overages billed via API |
| Grok Build (xAI) | Grok Build Getting Started; grok-build-0.1 Model Docs | docs.x.ai/developers/pricing: SuperGrok Business $30/mo (subscription) + API: $1.00/M input · $2.00/M output · $2.50–$10.00/1k tool calls |



Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He builds AI systems that think, remember, and dream — on hardware he owns and code he can inspect. His daily CLI stack is Claude Code and Kimi, with web Grok and local models filling specific gaps. He wanted Grok Build in the rotation. The billing model said no.

Sunday, April 19, 2026

The Homelab that just kept growing (part 2 of 2) :: The Team Grew. So Did the Architecture.

The Team Grew. So Did the Architecture.

Three layers, five agents, and the strange experience of watching the industry ship what I built in my spare time.


Where Part One Left Off

In Part One I talked about Deborah — the hardware, the 16GB GPU constraint, and why the memory problem drove me to build ai_memory, a MySQL + Neo4j brain that lets multiple AI agents share context across sessions. That post ended with a working system: Kimi CLI and Claude Code pulling shared rationale out of a database, local scripts monitoring the homelab and clearing spam, and me spending less time repeating myself to machines.

The multi-agent side of this started earlier than the December post suggests. Back in August 2025 I was already writing about why treating a single model as the answer to everything was the wrong instinct — diversity of perspective, not a bigger box. The GitLab implementation document for the IRC bot team is from that same month. December was when I published the full write-up: four specialized agents on a private IRC server, writing code, reviewing it, blocking execution until two independent perspectives agreed. The work predates that post by several months.

By early 2026, the commercial landscape had started converging on the same multi-model pattern. Blackbox AI CLI Agents let you run multiple AI providers — Google, Anthropic, OpenAI — in parallel on the same task, with an internal judge picking the best implementation. Yupp.ai launched with 1.3 million users around the same idea: "the world's smartest AIs, side by side with you." Both are closer to the multi-model thinking than anything I was aware of in August 2025.

I should note: as of March 2026, Yupp.ai announced it's shutting down. Their explanation is worth reading — they wrote that the industry has "shifted dramatically toward agentic systems: models connected to tools, memory, and external services, rather than standalone chatbots," and that passive side-by-side comparison didn't fit where AI was going. They're right. That's the exact pivot I made two years ago.

But closer is not the same. Blackbox runs models in competition — parallel submissions, pick a winner. Apocryia Forge runs models in adversarial collaboration — one builds, one challenges, and the Critic must explicitly approve before anything happens. Blackbox gets you the best first draft. Forge gets you code that has been beaten into shape across up to 25 refinement cycles. Their judge selects. Our Critic vetoes. Those are different philosophies, and only one of them assumes the AI is wrong until proven otherwise. The other practical difference: Blackbox requires their cloud. Every task routes through remote VMs. Forge runs entirely on Deborah — my code never leaves the network.

This post is what happened after. The team has a cleaner design now: five agents with sharper role definitions, a security model I can explain in a single sentence, and a framing I didn't have before (three layers, each doing exactly one job). I'll also get into something I didn't expect: by February 2026, the industry had independently shipped some of the same architecture concepts I had been building since the summer. The IRC angle may have been off-putting to some readers, and the big picture was probably not in full focus at the time, even for me. IRC is just an interface. You can use Slack, email, Matrix, or a web API; the interface is irrelevant. The team is the point.


Three Layers, One Clear Job Each

The architectural clarity came from asking a question I should have asked earlier: what is IRC actually for in this system? In the original proof of concept, IRC was doing too many jobs. It was the communication medium, the coordination layer, the state machine, and the audit display simultaneously. That's not good design. It made everything depend on everything else.

The v2.0 framing separates three concerns that were previously tangled together.

Layer 1 — The Mind: ai_memory (MySQL + Neo4j)

This is where actual work happens. Agents don't coordinate by chatting in IRC — they coordinate by writing to and reading from the database. The ai_comms table is the real message bus. Each agent has dedicated tables for its outputs: developer_code, reviewer_scores, lucifer_challenges, shell_executions. When a task runs, the database is working. Every decision, every quality score, every approval, every lesson learned from a past failure — it's all in there, queryable, and available for recall at inference time. I'm a database guy. This is a database problem. The memory doesn't belong in a model's context window; it belongs in a schema.
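
A table-as-message-bus can be sketched in a few lines. Assumptions to flag: the table name `ai_comms` comes from the text above, but the columns here are invented for illustration, and SQLite stands in for MySQL so the example runs without a server.

```python
import sqlite3


def open_bus() -> sqlite3.Connection:
    """Create an in-memory stand-in for the ai_comms table (columns are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE ai_comms (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            task_id   TEXT NOT NULL,
            sender    TEXT NOT NULL,
            recipient TEXT NOT NULL,
            body      TEXT NOT NULL,
            read_at   TEXT            -- NULL until the recipient picks it up
        )
        """
    )
    return conn


def post(conn: sqlite3.Connection, task_id: str, sender: str,
         recipient: str, body: str) -> int:
    """Write one message to the bus and return its row id."""
    cur = conn.execute(
        "INSERT INTO ai_comms (task_id, sender, recipient, body) VALUES (?, ?, ?, ?)",
        (task_id, sender, recipient, body),
    )
    conn.commit()
    return cur.lastrowid


def fetch_unread(conn: sqlite3.Connection, recipient: str) -> list[tuple]:
    """Return unread (id, task_id, sender, body) rows for an agent and mark them read."""
    rows = conn.execute(
        "SELECT id, task_id, sender, body FROM ai_comms "
        "WHERE recipient = ? AND read_at IS NULL ORDER BY id",
        (recipient,),
    ).fetchall()
    conn.executemany(
        "UPDATE ai_comms SET read_at = datetime('now') WHERE id = ?",
        [(row[0],) for row in rows],
    )
    conn.commit()
    return rows
```

Because messages are rows, nothing is lost when an agent's context window resets: the history stays queryable, and the `read_at` column doubles as a simple audit trail of who picked up what.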

Layer 2 — Cognition: Ollama Models on the Local GPU

The thinking happens here. Each agent's turn is a model inference call. The Coordinator decides how to route a task. The Builder generates a solution. The Judge evaluates quality. The Critic challenges assumptions. The models don't hold state between turns — they think, output, and unload from GPU memory. Persistence lives in Layer 1, not here. This separation is also why model selection per role matters: you want the cognitive style right for each job, not one model pretending to be five different things.

Layer 3 — Display: Any Interface

The display layer shows you what's happening. IRC was the first implementation — think of it like ticket status updates in a project tracker. You connect with HexChat, and you see phases scroll by: RECEIVED, Analysis, Building, Score: 9/10, APPROVED, COMPLETED. But this layer is just transport. The same status updates flow through Slack, email, a web API, or MCP with identical behavior underneath, because the work happens in Layers 1 and 2. The display layer doesn't coordinate anything — it observes and reports.

ai_memory (MySQL + Neo4j)   ← THE MIND       (work and coordination)
Ollama Models on GPU        ← COGNITION      (thinking and deciding)
IRC / Slack / MCP / API     ← DISPLAY        (observability for humans)
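
The "display only observes" rule can be shown with a small fan-out sketch. `StatusFeed`, the two sinks, and the task ID are hypothetical; the phase names are the ones listed above. The key property is that sinks receive status updates and return nothing, so no display can influence the workflow.

```python
from typing import Callable

# A display sink receives (task_id, phase) and returns nothing.
Sink = Callable[[str, str], None]


class StatusFeed:
    """Fans status updates out to every attached display. Sinks only observe."""

    def __init__(self) -> None:
        self._sinks: list[Sink] = []

    def attach(self, sink: Sink) -> None:
        self._sinks.append(sink)

    def publish(self, task_id: str, phase: str) -> None:
        for sink in self._sinks:
            sink(task_id, phase)


# Two stand-in displays: lists that collect formatted lines.
irc_log: list[str] = []
slack_log: list[str] = []

feed = StatusFeed()
feed.attach(lambda task, phase: irc_log.append(f"AI-Team: {task} {phase}"))
feed.attach(lambda task, phase: slack_log.append(f"[{task}] {phase}"))

for phase in ["RECEIVED", "Analysis", "Building", "Score: 9/10", "APPROVED", "COMPLETED"]:
    feed.publish("TASK-EXAMPLE", phase)
```

Adding a third display is one more `attach` call; nothing in Layers 1 or 2 needs to know it exists.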

The Five-Agent Collective

The original proof of concept had four agents: ProjectManager, Developer, Reviewer, and Lucifer. The v2.0 architecture adds a fifth: Shell — renamed Hands in the internal design — the only agent that actually executes anything against the system. That isolation matters more than the count.

| Agent | Personality | Model | Role |
| --- | --- | --- | --- |
| Coordinator | ProjectManager | Qwen2.5-Coder 7B | Routes tasks, recalls relevant memory, coordinates the team |
| Builder | Developer | DeepSeek-Coder-V2 | Generates code and solutions, iterates on feedback |
| Hands | Shell Executor | DeepSeek/Qwen | The only agent that executes commands; whitelisted, logged, audited |
| Judge | Reviewer | Custom fine-tune | Scores quality 1–10, requires ≥8 to pass |
| Critic | Lucifer | Mistral | Dedicated adversary; must explicitly output "APPROVED" for execution to proceed |
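
The "whitelisted, logged, audited" constraint on Hands can be sketched as a vetting step that runs before anything is executed. This is an assumption-laden illustration: the whitelist contents, the `vet_command` name, and the in-memory audit list are all hypothetical, and the real agent presumably logs to the database.

```python
import shlex
from typing import Optional

# Hypothetical whitelist of executables Hands may run.
ALLOWED = {"ls", "cat", "git", "python3"}


def vet_command(command_line: str, audit_log: list[str]) -> Optional[list[str]]:
    """Return argv if the executable is whitelisted, else None.

    Every decision is appended to the audit log, allowed or not. The caller is
    expected to run the returned argv without a shell, so metacharacters such
    as `;` or `&&` are never interpreted.
    """
    try:
        argv = shlex.split(command_line)
    except ValueError:  # unbalanced quotes and similar
        argv = []
    allowed = bool(argv) and argv[0] in ALLOWED
    audit_log.append(f"{'ALLOW' if allowed else 'DENY'} {command_line!r}")
    return argv if allowed else None
```

Checking the parsed executable rather than the raw string means `ls; rm -rf /` is rejected: after splitting, the first token is `ls;`, which is not on the list.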

You can talk to any agent directly, or route through the full team. Here's what that looks like in practice. A direct request to the Developer only — no review cycle, just gets done:

keith: @dev write a hello world python script
AI-Team: 👨‍💻 **Builder**: Working on request from keith...
AI-Team: 👨‍💻 **Builder**: Done.

Like going over to a developer teammate and asking directly. No full review, but fast. Now route the same request through the whole team:

keith: @pm write a Python script to say hello world
AI-Team: 📋 **Task Queued** | ID: TASK-20B3B1CF | Type: code
AI-Team: 1️⃣ Builder generates → 2️⃣ Judge reviews (≥8 required) → 3️⃣ Critic challenges/approves → 4️⃣ Code saved
AI-Team: 👨‍💻 Builder: Working on TASK-20B3B1CF...
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - straightforward, functions correctly
AI-Team: 😈 Critic: TASK-20B3B1CF APPROVED - ready for execution
AI-Team: 📦 Code saved to: generated_code/TASK-20B3B1CF.py
AI-Team: ✅ TASK-20B3B1CF COMPLETED

The real test is a task that pushes back. Here's a MySQL backup script that took eight Builder passes (seven refinement cycles) before passing both gates. The full iteration count is in the database; the IRC display just shows the status transitions:

keith: @pm write a bash script to backup MySQL databases
AI-Team: 📋 Task Queued | ID: TASK-5749EA2E
AI-Team: 👨‍💻 Builder: Completed (iteration 1) - ready for review
AI-Team: 👀 Judge: Score 7/10 - NEEDS_WORK - basic solution, missing error handling...
AI-Team: 🔄 Coordinator: iteration 1/25 - refining
AI-Team: 👨‍💻 Builder: Completed (iteration 2) - ready for review
AI-Team: 👀 Judge: Score 6/10 - NEEDS_WORK - performs basic task but has several issues...
AI-Team: 🔄 Coordinator: iteration 2/25 - refining
[iterations 3-7: scores cycling 7/10, Judge flagging missing secure credential handling,
 incomplete error recovery, cleanup logic gaps]
AI-Team: 👨‍💻 Builder: Completed (iteration 8) - ready for review
AI-Team: 👀 Judge: Score 8/10 - APPROVED - secure credential handling, proper error recovery...
AI-Team: 😈 Critic: TASK-5749EA2E APPROVED - ready for execution
AI-Team: 📦 Code saved | ✅ COMPLETED | iteration_count: 7

Seven refinement cycles on a backup script. The Judge kept finding gaps — hardcoded credentials, missing error handling, incomplete cleanup — and the Builder addressed each one. The final version loaded credentials from environment variables, handled connection failures, and cleaned up properly. The Judge's standard for ≥8/10 isn't decoration. When you want to hand this off overnight, the gate has to hold.

The team also handles edit tasks — not just new code, but surgical modifications to existing files. Request a change, Builder generates the diff, same Judge/Critic gate applies, Hands saves the modified file. The audit trail includes what was changed and why.


The Super Agent: When Local Isn't Enough

The Super agent extends the team beyond local models. When a task warrants the best available reasoning — or when you want multiple commercial APIs to weigh in simultaneously — you can route directly to Grok, Claude, or any configured provider without leaving the same interface.

The Super agent supports three multi-model modes:

Command | Result
@super ask @claude @grok <question> | Show each model's answer separately for direct comparison
@super CONSENSUS <question> | Synthesize one agreed answer from all responding models
@super compare <question> | Highlight where models agree and where they diverge

The CONSENSUS mode is the most useful for code review: instead of asking each model separately and reconciling the answers yourself, you get a single synthesized response that surfaces disagreements inline. Same interface, local or cloud — the team doesn't care which inference backend answers a given question.
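To make the three command shapes concrete, here is a small parsing sketch. The `SuperCommand` type and `parse_super_command` function are hypothetical names invented for illustration; only the `@super ask|CONSENSUS|compare` syntax comes from the commands listed above, and the real agent may parse them differently.

```python
from dataclasses import dataclass
from typing import List

MODES = {"ask", "consensus", "compare"}


@dataclass
class SuperCommand:
    mode: str            # "ask", "consensus", or "compare"
    targets: List[str]   # explicit @model names; empty means "all configured"
    question: str


def parse_super_command(line: str) -> SuperCommand:
    """Parse '@super <mode> [@model ...] <question>' (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "@super":
        raise ValueError("expected: @super <mode> [@model ...] <question>")
    mode = tokens[1].lower()
    if mode not in MODES:
        raise ValueError(f"unknown mode: {tokens[1]}")
    rest = tokens[2:]
    targets: List[str] = []
    while rest and rest[0].startswith("@"):
        targets.append(rest.pop(0)[1:])   # strip the leading '@'
    if not rest:
        raise ValueError("missing question")
    return SuperCommand(mode, targets, " ".join(rest))
```

For example, `parse_super_command("@super ask @claude @grok is this query safe?")` yields mode `ask` with targets `claude` and `grok`.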


MCP Server: The Interface Most People Will Actually Use

IRC is the audit layer. It shows you what is happening, lets you interject, and preserves a human-readable log of every decision. But it is not how most developers will interact with Apocryia Forge on a daily basis. The primary interface is MCP — each agent exposed as a callable tool from any MCP-compatible CLI.

The underlying architecture is a hybrid: the core logic (TeamConsensusEngine) is the single source of truth, and each interface — IRC, MCP, HTTP API — is a thin adapter that translates its own protocol into the same core calls. This means Claude Code, Kimi CLI, or any other MCP-compatible tool gets identical behavior to what you see in IRC, because they are all calling the same engine.

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Kimi CLI   │  │ Claude Code │  │  IRC / Slack│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       │   MCP adapter  │  IRC adapter   │
       └────────────────┼────────────────┘
                        │
                        ▼
              ┌─────────────────┐
              │  Core Logic     │  ← single source of truth
              │ TeamConsensus   │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │   ai_memory DB  │
              └─────────────────┘
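The thin-adapter idea in the diagram can be sketched in a few lines. Everything here is an assumption made for illustration: the `review` method, the adapter classes, and the `@pm review <path>` IRC syntax are hypothetical, and the MCP adapter is plain Python rather than a real MCP SDK server. The point is only that both adapters reduce to the same core call.

```python
from typing import Any, Dict


class TeamConsensusEngine:
    """Stand-in for the core logic; the real engine's methods are not shown in the post."""

    def review(self, path: str, requested_by: str) -> Dict[str, Any]:
        # A real engine would run the Builder/Judge/Critic workflow here.
        return {"path": path, "requested_by": requested_by, "status": "QUEUED"}


class IrcAdapter:
    """Translates an IRC line such as '@pm review /src/api.py' into a core call."""

    def __init__(self, engine: TeamConsensusEngine) -> None:
        self.engine = engine

    def handle_line(self, nick: str, line: str) -> Dict[str, Any]:
        _, _, path = line.split(maxsplit=2)   # '@pm', 'review', '<path>'
        return self.engine.review(path, requested_by=nick)


class McpAdapter:
    """Translates a tool-call argument dict into the same core call."""

    def __init__(self, engine: TeamConsensusEngine) -> None:
        self.engine = engine

    def call_tool(self, arguments: Dict[str, str]) -> Dict[str, Any]:
        return self.engine.review(arguments["path"], requested_by=arguments["caller"])
```

Because neither adapter holds any review logic of its own, a behavior change in the engine shows up identically in every interface.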

In practice, you are in a Claude Code or Kimi CLI session working on a script. You call an Apocryia Forge MCP tool without leaving your workflow:

  • apocryia_forge.lucifer_challenge — adversarial review of the current file: security gaps, logic failures, assumptions that will break in production
  • apocryia_forge.reviewer_evaluate — quality score 1–10 with specific, actionable feedback
  • apocryia_forge.team_review — full dual-approval workflow: Builder refines, Judge scores, Critic must approve before anything is saved
  • apocryia_forge.shell_execute — whitelisted execution with full audit logging
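As an illustration of what "whitelisted execution with full audit logging" could look like, here is a minimal validation sketch. The whitelist contents, the `validate_command` name, and the in-memory `AUDIT_LOG` (a stand-in for a database audit table) are all assumptions, not the tool's actual implementation.

```python
import shlex
from datetime import datetime, timezone
from typing import Dict, List

ALLOWED_COMMANDS = {"df", "du", "gzip", "ls"}   # hypothetical whitelist
AUDIT_LOG: List[Dict[str, str]] = []            # stand-in for an audit table


def validate_command(command: str) -> List[str]:
    """Return argv if the executable is whitelisted; log every attempt either way."""
    argv = shlex.split(command)
    allowed = bool(argv) and argv[0] in ALLOWED_COMMANDS
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "result": "ALLOWED" if allowed else "REJECTED",
    })
    if not allowed:
        raise PermissionError(f"command not whitelisted: {command!r}")
    return argv
```

The returned list is meant to be passed to `subprocess.run(argv)` without `shell=True`, so characters like `;` or `|` are treated as plain arguments rather than interpreted by a shell.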

Three variants ship with Apocryia Forge:

Skill | Backend | Models | Cost | Speed | Best For
/apocryia-forge | Auto-detect | Whatever is configured | Varies | Varies | General daily use
/apocryia-forge-super | External APIs | Claude, GPT-4, Grok, Kimi | ~$0.15/review | 30–60 seconds | Production code, public repos
/apocryia-forge-ollama | Local Ollama | qwen2.5-coder, llama3, deepseek | Free | 2–5 minutes | Private code, air-gapped, cost-conscious

You invoke them in natural language from whichever CLI you are already working in:

# Security audit using external APIs (fast, highest accuracy)
> Review /src/api.py using forge super

# Sensitive internal code — keep 100% local
> Audit /internal/auth.py using forge ollama

# Let the system pick the backend
> Review /docs/tutorial.md using apocryia forge for concepts

Both CLIs call the same MCP server. The choice of which backend runs the review changes. The approval gate, the iteration cap, and the audit trail do not. The interface differs. The engine does not.


The Dual-Approval Gate

Nothing executes until two independent conditions are both satisfied: Judge scores ≥8/10, and Critic outputs an explicit "APPROVED." Both. Not one or the other. This is the gate that makes the system trustworthy enough to hand off at night.

The quality gate and the adversarial gate are different jobs. The Judge evaluates correctness, security, efficiency, and maintainability — does this actually solve the problem, and does it do it safely? The Critic asks what could go wrong: what if this fails halfway through, what are the security implications, what assumptions are we making that we shouldn't be. A solution can score 9/10 on quality and still get blocked by the Critic for a gap the Judge didn't flag. That happens. The backup script example showed seven refinement cycles on a Judge gate — that's the Judge working. The Critic can block independently even after the Judge approves.

One engineering detail worth documenting: early versions had a false-positive problem in the approval gate. The Critic would output something like "this looks good in principle, no major issues found" — and the orchestration logic would parse that as approval. The fix was requiring the literal string "APPROVED" somewhere in the Critic's response. "Looks good in principle" is analysis. "APPROVED" is authorization. Those are not the same statement, and the code now enforces that distinction.
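The distinction can be shown in a few lines. This is a sketch of the rule as described, with hypothetical function names; the threshold and the literal "APPROVED" requirement are the only parts taken from the text.

```python
JUDGE_THRESHOLD = 8


def critic_approved(response: str) -> bool:
    """True only when the literal, upper-case token APPROVED appears."""
    return "APPROVED" in response


def gate_passes(judge_score: int, critic_response: str) -> bool:
    """Both conditions must hold; either one alone is not enough."""
    return judge_score >= JUDGE_THRESHOLD and critic_approved(critic_response)
```

With this check, "this looks good in principle, no major issues found" no longer passes, even alongside a 9/10 score. A plain substring test would still accept a response containing "NOT APPROVED", so a stricter parser (for example, requiring the verdict on its own line) may be needed in practice; that refinement is a suggestion, not something the post describes.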


The 2:47 AM Scenario

Alertmanager fires. Database disk is at 95% on one of the nodes. This is the kind of alert that still wakes me up. The goal — what Forge is being built toward — is that it handles this without me. Here is how that is designed to work.

In the database:

  • Alert stored in ai_comms: from the monitoring bridge, to coordinator, type "ALERT", content "Database disk 95% full"
  • Coordinator queries architecture_decisions for past disk incidents — finds a prior resolution pattern
  • Builder generates solution: proposed deletion of log files older than 30 days
  • Judge scores 9/10: "Safe, targeted"
  • Critic raises a compliance concern: deletion vs. compression for regulated data. Stores "approved: false"
  • Builder revises: compress instead of delete. Judge re-scores: 10/10. Critic approves
  • Hands validates command against whitelist, executes, logs to shell_executions: result SUCCESS
  • Coordinator stores the complete decision in architecture_decisions for future recall
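The first step of that flow, an alert landing in ai_comms and the Coordinator picking it up, can be sketched with an in-memory database. SQLite is used here only so the example runs anywhere; the post's system uses MySQL, and the column names (`sender`, `recipient`, `msg_type`, `content`, `handled`) are assumptions rather than the real schema.

```python
import sqlite3
from typing import Optional, Tuple


def open_memory_db() -> sqlite3.Connection:
    """Create a throwaway ai_comms table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ai_comms (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               sender TEXT NOT NULL,
               recipient TEXT NOT NULL,
               msg_type TEXT NOT NULL,
               content TEXT NOT NULL,
               handled INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def send(conn: sqlite3.Connection, sender: str, recipient: str,
         msg_type: str, content: str) -> int:
    """Write one message row and return its id."""
    cur = conn.execute(
        "INSERT INTO ai_comms (sender, recipient, msg_type, content) VALUES (?, ?, ?, ?)",
        (sender, recipient, msg_type, content),
    )
    conn.commit()
    return cur.lastrowid


def next_message(conn: sqlite3.Connection, recipient: str) -> Optional[Tuple]:
    """Fetch the oldest unhandled message for a recipient and mark it handled."""
    row = conn.execute(
        "SELECT id, sender, msg_type, content FROM ai_comms "
        "WHERE recipient = ? AND handled = 0 ORDER BY id LIMIT 1",
        (recipient,),
    ).fetchone()
    if row is not None:
        conn.execute("UPDATE ai_comms SET handled = 1 WHERE id = ?", (row[0],))
        conn.commit()
    return row
```

Because the message is a row rather than a queue entry, it stays queryable after it has been handled, which is what makes the audit trail possible.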

That's the design. Forge has been put through this scenario in testing — the pipeline works, the gates hold, the Critic catches compliance gaps the Judge misses. What hasn't happened yet is me actually sleeping through a real 2:47 AM disk alert because the team handled it. That's what it's being proven toward. The architecture is there. The trust has to be built by the team demonstrating it on real incidents, not just controlled tests.

The compliance catch is the point. An agent that just fires commands on its own judgment is a liability. Two gates that have to independently agree before anything executes — that's the difference between automation you can hand off and automation you have to babysit.


The Monitoring Bridge

I tried Molt — the tool that went viral as Clawbot, got renamed to Molt, then became OpenClaw. I actually used the Molt version to test some of this. It was lackluster. It didn't think the way I wanted. It felt like automation dressed up as intelligence — it would take an alert, go off and search the web, come back with generic advice that had nothing to do with my actual system. That's not thinking. That's keyword matching with a better interface.

That experience sharpened what I was actually building. The monitoring bridge in my architecture isn't the intelligence — it's just the input channel. It receives Alertmanager webhooks, polls email, handles voice input, and formats those into REQUEST messages for the Coordinator. The bridge has no judgment. The team has the judgment.
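A judgment-free bridge can be as small as a formatter. The sketch below turns an Alertmanager webhook payload into REQUEST messages. The `alerts`, `status`, `labels`, and `annotations` fields are part of Alertmanager's webhook format; the output message shape, the `alerts_to_requests` name, and the reliance on a `summary` annotation are assumptions for illustration.

```python
from typing import Any, Dict, List


def alerts_to_requests(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert firing alerts into REQUEST messages for the Coordinator.

    No decisions are made here: the bridge only reformats what it receives.
    """
    requests: List[Dict[str, str]] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue                      # resolved alerts need no action
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        requests.append({
            "to": "coordinator",
            "type": "REQUEST",
            # Prefer the human-readable summary; fall back to the alert name.
            "content": annotations.get("summary") or labels.get("alertname", "unknown alert"),
            "source": labels.get("instance", "unknown"),
        })
    return requests
```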

The security model matters here. The bridge lives on an isolated VM. Even if that VM were completely compromised, the attacker gains no execution capability — it can only write a message to the Coordinator. That message goes through the same four-agent review as any other request. The blast radius of a compromised bridge is exactly one request that the team still has to approve.

The design intent for overnight mode: say "Good night, handle alerts," and the bridge enters monitoring mode. In the morning: "Good morning, summary please." The basis for trusting that handoff isn't faith — it's the fact that every decision is in the database and every approval is queryable. That trust is being built through testing. It isn't there yet.


Building the Forge with the Forge

The development team for Apocryia Forge is itself a multi-agent team running the same collaboration pattern the system is designed to implement. Kimi is the primary code writer — implementing classes, running tests, iterating on feedback. Claude handles security review, validation, and adversarial challenge. Coordination flows through the ai_comms table: Kimi sends a structured review request, Claude reads it, runs tests, and responds with a verdict. Critical findings get flagged at CRITICAL priority. Disagreements go to the PM for a trade-off call.

The build team roles map directly to their Forge equivalents:

Build Team Role | Consensus Engine Equivalent
Kimi (Developer) | Individual model review pass
Claude (Reviewer) | Validation and sanity check
Grok (Critic) | Edge case and assumption detection
PM, Human (Product Owner) | Final consensus threshold decision

Every review cycle the build team runs is stress-testing the same workflow the system will run for end users. Protocol bugs get found during construction, by the construction process itself. This is the most useful kind of dogfooding: the consensus system validates its own design through the act of being built.


The Industry Is Figuring Out Teams

Something shifted in early 2026. Not one company — several, independently, at roughly the same time. Moonshot AI shipped Kimi K2.5 Agent Swarm: up to 100 parallel sub-agents coordinating on a single task, 4.5x wall-clock speedup on parallelizable work. Anthropic released Claude Opus 4.6 Agent Teams. The LLM council pattern — multiple models deliberating, challenging each other, synthesizing a better answer than any single model produces alone — started showing up everywhere. Even Yupp.ai, which launched with 1.3 million users as a side-by-side model comparison tool, shut down in March 2026 — they didn't reach product-market fit, and in their own post-mortem they noted that the landscape had shifted toward agentic systems, making crowdsourced chatbot-layer evaluation increasingly less critical.

The convergence isn't a coincidence. It's the industry discovering what happens when you stop treating a single model as the answer to everything. One model has a ceiling. A team of models with defined roles, challenge mechanisms, and a coordination layer breaks through that ceiling. That's what all of these independently arrived at.

My work has been quietly riding the same wave. Not ahead of it, not behind it, but alongside it. I wasn't predicting where the industry was going. The problem forced the solution. A single model reviewing its own code misses things. A single model running overnight without a check on its judgment is a liability. The team structure wasn't a thesis. It was the answer to specific failures I kept running into. Turns out a lot of people were running into the same ones.

System | Date
ApocryiaAI multi-agent concept (first posts and GitLab commits) | August 2025
Multi-agent IRC team POC published (anothermysqldba.blogspot.com) | December 2, 2025
Kimi K2.5 Agent Swarm | January – February 2026
Claude Opus 4.6 Agent Teams | February 5, 2026
Yupp.ai winddown (cites shift to agentic systems) | March 2026

Where my architecture differs from the commercial implementations:

  • An adversarial agent designed in from the start. Lucifer isn't a reviewer — it's a dedicated devil's advocate with explicit veto power. One model cannot be its own best critic. None of the commercial implementations ship a dedicated antagonist role by default.
  • Explicit GPU memory management. Commercial solutions assume cloud VRAM. The design includes explicit GPU handoff between agents — Coordinator runs on CPU always, heavy agents take the GPU in turn and release it. On a 16GB consumer card that's not a workaround, it's the architecture.
  • Database as the coordination medium. Not queues, not direct agent calls, not shared files. The database is both source of truth and message bus. Every coordination decision is queryable, and agents recall relevant past decisions at inference time.
  • Local-first, zero cloud dependency. Code and data stay on Deborah.

What the commercial versions do that mine doesn't: scale. Kimi runs 100 concurrent sub-agents. My system is sequential by design — a constraint of the hardware, not a philosophy. The industry is solving parallelism with infrastructure I don't have. What I'm solving is something different: a system that remembers its decisions, challenges its own outputs, and can eventually act on its own judgment without me watching. That's a different problem than speed.


What I Actually Wanted

While building Forge I kept running into the same question: what handles the automation layer? The monitoring, the alerts, the overnight tasks, the things that shouldn't need me. I looked at what was out there. There were a lot of things calling themselves JARVIS. There were OpenClaw frontends, task-oriented automation tools, systems that would take a request and fire a command. Everyone was building something they called a personal AI assistant.

I tried some of them. They were lackluster. Not because they were badly built — some were well built — but because the concept was wrong. They were frontends. Wrappers around task execution. You gave them a job, they did the job, they reported back. That's not what I wanted. I wanted something that would notice the job needed doing. Something that would remember what happened last time. Something that would push back if the job seemed wrong.

Nobody was building that. They were building task runners with a personality wrapper and calling it intelligence.

So I started thinking about what robust automation actually meant — not as a product, not as a demo, but as something I would actually trust on my own infrastructure. The Forge testing loop gave me the right frame: I had a system that challenged its own outputs, required dual approval before acting, and kept a full audit trail of every decision. What if the automation layer worked the same way? What if instead of scripts firing commands, there was something that actually had to think before it acted?

That question is where Part Three starts. Not a script. Not a frontend. A mind. And in the end, it won't be called JARVIS either — because that name is already taken by every task runner with a personality wrapper. The real thing deserves its own name.


What's Next

The MCP interface described above is the immediate priority — the refactoring in progress now. Once the thin adapter layer is clean and each agent is callable as a discrete MCP tool, the rest follows more naturally.

The self-improvement loop is already in the design intent. Every Sunday at 2 AM, the team runs an autonomous retrospective: query architecture_decisions for the week, review outcomes, identify recurring patterns, store new lessons in the pitfalls table. The Judge tracks average quality scores over time. The Critic surfaces edge cases that appeared repeatedly. Good patterns get added to the deployment checklist. I'm genuinely skeptical about this part until I see it work in practice — "self-improving AI" is a phrase that gets overused — but the mechanism is there, and the data will either show improvement or it won't.
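The aggregation step of such a retrospective might look like the sketch below. The record shape (`judge_score`, `critic_issues`) and the function name are hypothetical; in the design described above the rows would come from architecture_decisions rather than a Python list.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple


def weekly_retrospective(
    decisions: Iterable[Dict], min_repeats: int = 2
) -> Tuple[Optional[float], List[str]]:
    """Return (average Judge score, issues the Critic raised at least min_repeats times)."""
    rows = list(decisions)
    scores = [row["judge_score"] for row in rows]
    issue_counts = Counter(
        issue for row in rows for issue in row.get("critic_issues", [])
    )
    recurring = sorted(name for name, n in issue_counts.items() if n >= min_repeats)
    return (mean(scores) if scores else None), recurring
```

Comparing the average across weeks is one simple way to let the data show improvement or the lack of it.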

Async Python is the other obvious next step. Sequential agent turns were the right starting point. Async calls to multiple MCP servers get closer to the parallelism that makes Kimi's approach faster, without needing their training infrastructure. That's a Python concurrency problem, not an AI problem.
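A rough sketch of that fan-out with `asyncio.gather`. The `call_agent` coroutine is a placeholder that only sleeps; a real version would await a network call to an MCP server, and the agent names and delays are made up for the demo.

```python
import asyncio
import time
from typing import Dict, List, Tuple


async def call_agent(name: str, prompt: str, delay: float) -> Dict[str, str]:
    """Placeholder for one remote agent call; the sleep simulates latency."""
    await asyncio.sleep(delay)
    return {"agent": name, "reply": f"{name} reviewed: {prompt}"}


async def fan_out(prompt: str, agents: Dict[str, float]) -> List[Dict[str, str]]:
    """Start every agent call at once and wait for all of them."""
    tasks = [call_agent(name, prompt, delay) for name, delay in agents.items()]
    return await asyncio.gather(*tasks)   # results keep the input order


def run_demo() -> Tuple[List[Dict[str, str]], float]:
    agents = {"judge": 0.3, "critic": 0.3, "reviewer": 0.3, "security": 0.3}
    started = time.perf_counter()
    results = asyncio.run(fan_out("check backup.sh", agents))
    return results, time.perf_counter() - started
```

Four simulated 0.3-second calls finish in roughly the time of one, instead of the 1.2 seconds sequential turns would take. This helps when the agents are remote; on a single local GPU the models still take turns.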


The Part That Actually Matters

I built this on hardware you can order off Newegg or Amazon. Not $50,000 A100 clusters, not cloud credits, not a research team with a budget. An AMD Ryzen 7, 64GB RAM, and an RTX 4070 Ti Super I was already using for other things. The architecture says "We are not tools. We are a team. We remember, we learn, we act." I wrote that as design intent, not a marketing tagline. The distinction between a tool and a team is that a tool does what you tell it. A team tells you things back: it flags a compliance gap at 2:47 AM, recalls what went wrong last time a similar alert fired, and surfaces the assumption you didn't know you were making.

After 25 years of open source, my instinct is still: don't rent what you can own, don't trust what you can't inspect. Local models mean my code doesn't leave the network. The database means every decision is queryable. The whitelist means execution is bounded. None of this requires faith in a vendor's system working the way it says it does.

The commercial models are still better. The GPU is still limiting. The parallelism gap is real. But I understand every layer of what I've built, which is worth something on its own. The code is working. The team is functional. And something that started as a collection of scripts now had a name and a purpose. What it needed next was a mind — which is what Part Three is about.


Why the Repo Is Private

The original goal was to open source Forge and let the community help build it. That intent was real — I wanted better logic, more models, community stress-testing of the adversarial pipeline. Open source has given me 25 years of tools I couldn't have built alone. Giving back is the instinct.

But as the cognitive architecture developed, that changed. The memory design and the way the mind and the team connect started to become something I couldn't give away. Not because I want to hoard it, but because some things you build can't be un-given once they are out, and I've stopped pretending otherwise. The repo stays private.

Part Three is written. It covers the cognitive architecture: what I built to answer the question at the end of this post. Publishing date is TBD while I work through a few things. That's where I landed. For now I will share via YouTube videos.



Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He writes about MySQL, Proxmox, AI memory systems, and building technology you can actually inspect.

The Homelab that just kept growing (Part 1 of 2)

After 25 years in open source, I wanted to understand every layer. Then I wanted my AI to remember what I taught it.


The Pi

A few years back, I ordered a Raspberry Pi to play and explore. I soon became tired of hearing other engineers talk about their homelabs and knew that Pi just would not do it.

It wasn't enough. I kept wanting to spin up different distros to check something, test a configuration, verify behavior. I reinstalled onto that Pi so many times I lost count. So I bought an AOOSTAR R1 N100, a $280 mini PC that punches way above its weight. Intel N100 (4C/4T, 6W TDP), 32GB DDR4, dual 2.5GbE NICs, dual M.2 slots, and two 3.5" drive bays. I think it can handle up to 44TB of storage. For me, I've filled it with scavenged drives over time: 13 drives ranging from 300GB laptop drives to 4TB externals, totaling about 16TB raw. It's a Franken-storage setup, but it works. Perfect for Proxmox VE.

I started building: DNS servers, database clusters, a full ISP stack. Debian, Ubuntu, CentOS, Rocky Linux, Oracle Linux—67 VMs across the full enterprise Linux matrix. The N100 handled it until I really loaded it down, but the key insight was: not everything needs to run 24/7. I could spin up an Oracle Linux 8 VM, test a client configuration, shut it down. The ability to test anything was what mattered.

But there was no chance it could do AI. No GPU, and that N100's integrated graphics won't even handle Stable Diffusion at acceptable speeds.


Deborah

I named her after my late mother. Deborah is a custom-built workstation: AMD Ryzen 7 5800XT (8C/16T, 4.8GHz boost), 64GB DDR4-3200, RTX 4070 Ti Super 16GB (the 16GB VRAM is the critical spec—you can't quantize your way out of everything). 2TB NVMe for OS, 8TB NVMe for model weights, 750W PSU. Total build: just under $2500.

The AOOSTAR still runs the infrastructure—DNS, Percona MySQL (replica and smaller instances), Prometheus, Alertmanager (tied to PagerDuty and my private mail.apocryia.com mail server), Vault, plus MediaWiki for documentation, a private GitLab instance, and GitLab runners for CI/CD. Deborah handles the AI compute. The split is intentional: N100 runs 24/7 at ~30W, Deborah is for all my GPU power needs.

I'd been using Grok and Claude. They're brilliant. I still use them. But I was paying for black boxes, and after 25 years of open source, that grated. I wanted to see inside. I wanted my own models. I wanted to know how inference actually worked, not just trust the API response.

The 16GB GPU is limiting. It sounds like enough until you try to run a 70B model, or video generation, or keep local models loaded while doing CLI work. I clear GPU memory constantly. More GPU would mean bigger models, longer videos, and the ability to do generative work while keeping my development assistants running. But you work with what you have.


The Memory Problem

The real frustration came later. I was using Claude Code, then Kimi CLI—both excellent tools. But I'd sit for an hour and have to repeat myself. The agents would forget what we'd just decided. They'd debate me over things they did ten minutes ago. Context windows fill up, patterns get lost, and you start over in a new session like the previous three hours never happened.

I learned to plan better—break work into chunks, be more deliberate about what I asked for. That helped. But planning can only get you so far. Complex systems have interconnected decisions that span sessions. The memory became the bridge—capturing the rationale so even with good planning, I wasn't starting from zero every time.

I'm a database guy. This is a database problem.


Building the Brain

The solution was obvious: give the agents a brain they could actually use. MySQL for structured memory: what we decided, when, why. Neo4j for pattern recognition: what connects to what, which decisions led where. MCP (Model Context Protocol) as the protocol so Claude Code and Kimi CLI could share context, query the same memory, and build on each other's work. I have already written about some of this in an earlier post.

Now, for example, I can work on an Android app locally while developing the API remotely, and both agents know what the other is doing. They share the database. I don't cut and paste between terminals. When I switch from Kimi to Claude, the rationale is there: the why behind the code, not just the code itself.

Kimi CLI is my daily driver now. It has been awesome as of late. Claude Code is excellent too. I still reach for Grok, still use commercial APIs when I need the best models. Open source hasn't caught up completely at the level of GPU I am using. But my home assistant—Jarvis—is entirely local. It cleans my Gmail spam, monitors the systems, tells me if database replication breaks because I did something stupid. It can talk, but I'm busy typing.


The Mistakes Along the Way

I spent time on a website frontend. It worked, but it couldn't develop code the way I wanted. I live in vi and terminal windows. GUI frontends aren't my workflow. Cursor was hard to accept, but it's a great tool. Then cursor-agent and Claude Code and Kimi CLI all came out—game changers. I even built my own CLI tool to use local open source models, but loading that GPU so much became a bottleneck. Then Claude Code started working with Ollama directly—no need for my custom tool anymore.

Everyone's excited about Clawbot/Molt these days. I checked it out. I think it is over-hyped. It didn't do anything that mattered to me that I couldn't already do another way. The one exception is texting my AI remotely, but why? Why do I need to write code over text messages? If I need to communicate remotely, I can build better ways. To be fair, I did let Molt clean out my spam emails just to give it something to do.


Why This Matters

The homelab started as FOMO. It became the infrastructure for independence. Then it became the infrastructure for memory—persistent, structured, inspectable.

After 25+ years of open source, my philosophy is simple: take it to make it, and you are always learning. So there is no reason our code should not be always learning and remembering too.

The GPU is still limiting. The commercial models are still better. But when Kimi CLI pulls up a decision from last week and understands why we chose FastAPI over Flask, when Jarvis quietly clears my spam, when TTS alerts me to replication lag, when I can switch tools without losing context, that's worth building. I can sit and just talk to the models, but I admit it still feels very unnatural to me. I find myself with not much to say. I would rather code, lol.

The open source models are awesome, but not the best yet, again at my level of GPU. Claude, Grok, and Kimi still have that market. More GPU would get me closer. For now, I run a hybrid: local for control and privacy, cloud for capability. But every layer I own is a layer I understand.


The Kubernetes Cluster

Between the AOOSTAR N100 and Deborah, there's a third layer: a 3-node K3s Kubernetes cluster running on dedicated Debian VMs within Proxmox. After years of running everything as standalone VMs, I wanted container orchestration for the services that needed to scale, self-heal, and deploy consistently.

The cluster runs Kubernetes v1.34.4+k3s1 on Debian GNU/Linux 13 (trixie), using containerd as the runtime. K3s was the obvious choice—lightweight, single-binary, CNCF-certified, and it just works. The cluster is small but production-grade, with CSI-NFS for persistent storage, Traefik for ingress, and the full Prometheus monitoring stack.

What's running on the cluster:

  • Monitoring: Prometheus + Alertmanager + Grafana (via kube-prometheus-stack)
  • CI/CD: GitLab Runners (2 replicas) for containerized builds
  • Automation: CronJob-based spam cleaner for my mail server
  • Ingress: Traefik for HTTP routing and SSL termination
  • Dashboard: Kubernetes Dashboard for cluster visibility

The Prometheus instance on the cluster monitors itself plus feeds into the main Prometheus on the N100. Same Alertmanager, same PagerDuty integration. It's a nested monitoring setup—cluster metrics bubble up to the infrastructure level.

This is where the stateless services live. Databases still run on dedicated VMs and bare metal (replicated Percona MySQL), but anything that can be containerized, is. The GitLab runners on K3s handle container builds, then artifacts get pushed to the GitLab instance on the N100. Clean separation of concerns.


The Stack Today

AOOSTAR N100 (Proxmox)

  • 67 VMs across 12 Linux distributions
  • 3-node K3s Kubernetes cluster (control plane + 2 workers)
  • Redundant DNS (Technitium DNS Server), Percona MySQL cluster
  • Primary Prometheus + Alertmanager monitoring (integrated with PagerDuty and private mail server)
  • HashiCorp Vault for secrets
  • Twingate for VPN and remote access
  • GitLab instance (CI/CD runners moved to K3s)
  • 15 hard drives (scavenged, various sizes)
  • Power draw: ~30W, running 24/7

Deborah (AI Workstation)

  • AMD Ryzen 7 5800XT, 64GB RAM, RTX 4070 Ti Super 16GB
  • Ollama with 30+ models
  • 12,000+ model files - 10 custom fine-tuned GGUFs (ApocryiaAI V2/V3/Unified variants), 481 safetensors (FLUX, SDXL, Llama), 14 PyTorch bins, and 11,790 XGBoost stock prediction models from HuggingFace
  • Primary Percona MySQL (main database) + Neo4j persistent memory system
  • Full AI service stack:
    • Web UI (main interface when not on the CLI)
    • Ollama API (local inference)
    • Oobabooga (text generation UI, rarely used)
    • Voice/TTS System: OpenVoice + Zonos + CosyVoice + ElevenLabs + Inworld AI — all voice engines controlled via unified API and backend databases. Cloning, synthesis, voice conversion, and character voices. Plus wake-word model switching: 22+ voice keywords trigger different AI personalities (see table below)
    • Image generation: ComfyUI, InvokeAI
    • Video: Wan-AI installed but needs more GPU to be practical
    • Tabby
  • 25 years of stock data, 8TB of XGBoost models—a separate playground of exploration alongside the AI work
  • Power draw: ~150-300W when active

Voice-Controlled Model Switching

The voice system has wake-word activated model switching. Instead of manually selecting which AI model to use, I just say a keyword and the system routes to the appropriate model automatically. Each keyword activates a different AI personality optimized for specific tasks:

Wake Word | Model Activated | Personality/Purpose
JARVIS | ApocryiaAI-system:latest | System assistant
HAL 9000 | ApocryiaAI-system:latest | System monitoring, alerts (yes, really)
KEITH | apocryiaai-unified:latest | My custom fine-tuned model
DANIEL, WALTER | llama3.1:latest | General assistants (different personas)
MATILDA, MORGAN | llama3.1:8b | Lightweight assistants
ELLIOT | codegemma:latest | Coding assistant
OBSERVER | deepseek-coder-v2:latest | Code analysis, security review
OLIVIA | openthinker:latest | Reasoning, complex problem solving
TRANSCRIBE NOW | llama3.1:8b | Speech-to-text mode
VOICEOVER | llama3.1:8b | TTS/synthesis mode
INTERN | llama3.1:8b | Learning mode, asks questions
REPORTER | llama3.1:latest | Summarization, documentation
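The routing itself can be pictured as a lookup keyed on the wake word. The sketch below covers a subset of the table and assumes the wake word is the first thing in the transcribed utterance; the `route` function and that assumption are illustrative, not the voice system's actual logic.

```python
from typing import Optional, Tuple

# Subset of the wake-word table above.
WAKE_WORDS = {
    "JARVIS": "ApocryiaAI-system:latest",
    "HAL 9000": "ApocryiaAI-system:latest",
    "KEITH": "apocryiaai-unified:latest",
    "ELLIOT": "codegemma:latest",
    "OBSERVER": "deepseek-coder-v2:latest",
    "OLIVIA": "openthinker:latest",
    "TRANSCRIBE NOW": "llama3.1:8b",
}


def route(transcript: str) -> Optional[Tuple[str, str]]:
    """Return (model, remaining prompt) if the utterance starts with a wake word."""
    upper = transcript.upper()
    # Longest phrases first so multi-word wake words are matched whole.
    for word in sorted(WAKE_WORDS, key=len, reverse=True):
        rest = transcript[len(word):]
        at_boundary = not rest or not rest[0].isalnum()
        if upper.startswith(word) and at_boundary:
            return WAKE_WORDS[word], rest.lstrip(" ,.:")
    return None   # no wake word: caller keeps the current model
```

For example, `route("Elliot, refactor this function")` selects `codegemma:latest` and passes along "refactor this function".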

Agent Workflow

  • Kimi CLI: Primary development tool, excellent for coding and cheaper than Claude
  • Claude Code: Secondary, excellent for complex reasoning
  • MCP protocol: Shared memory between agents
  • Separate TTS/STT setups for the local assistant: monitoring, email, alerts


Thursday, March 19, 2026

Claude OAuth token has expired Fix

OAuth token has expired.

This is going to be a fast, simple post to help those who have run into this lately. It is a super easy fix.



claude
╭─── Claude Code v2.1.79 

❯ hello
  ⎿  Please run /login · API Error: 401
     {"type":"error","error":{"type":"authentication_error","message":"OAuth token has expired.
     Please obtain a new token or refresh your existing
     token."},"request_id":""}

❯  Login

   Browser didn't open? Use the url below to sign in (c to copy)

  https://claude.ai/oauth/authorize?code=true&client_id=
  
  
  

How this looks depends on how you are accessing Claude Code, but a link will be available where you can Accept or Decline the authorization request for Claude Code to access your Claude account.


Decline this 1st attempt.

Now repeat the steps. This time, when the link is generated:


Accept this 2nd attempt. Hope this helps.


❯  Login

   Browser didn't open? Use the url below to sign in (c to copy)

   https://claude.ai/oauth/authorize?code=true&client_id=

  
   Paste code here if prompted >
   
   Login successful. Press Enter to continue…

❯ /login
  ⎿  Login successful

Saturday, March 7, 2026

Building a Production-Ready RAG Pipeline with Sentence Window Retrieval

What You'll Build

By the end of this tutorial, you'll have a working Retrieval-Augmented Generation (RAG) system that uses sentence window retrieval instead of naive chunking. Instead of splitting documents into arbitrary 512-token chunks, you'll retrieve small, precise sentences for matching, then expand the context window to include surrounding sentences before feeding to your LLM.

The final result is a Python application that ingests documents, builds a searchable index with sentence-level granularity, and answers questions with better context preservation than standard RAG implementations. You'll compare outputs side-by-side to see how sentence window retrieval preserves context that gets lost at chunk boundaries, particularly when explanations span multiple sentences.

Prerequisites

  • Python 3.10+ (tested on 3.10.12 and 3.11.7)
  • OpenAI API key with access to embeddings and GPT-4 (or GPT-3.5-turbo)
  • 8GB+ RAM - the vector store can be memory-intensive with large documents
  • Estimated cost: ~$0.01-0.02 for the tutorial (using GPT-3.5-turbo)
  • Basic familiarity with embeddings and vector similarity search
  • Estimated time: 45-60 minutes including testing

Install required packages:

pip install llama-index==0.9.48 openai==1.12.0 nltk==3.8.2

# For hybrid search feature (optional); llama-index 0.9.x bundles its BM25 retriever and only needs rank-bm25:
pip install rank-bm25

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY='sk-your-actual-key-here'

On Windows:

set OPENAI_API_KEY=sk-your-actual-key-here
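
Before running anything that calls the API, it can help to confirm the variable is actually visible to Python, since a key exported in one shell is not visible in another. A minimal sketch; the helper name require_api_key is hypothetical and not part of any library:

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    # `env` is injectable so the check can be exercised without touching os.environ
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in the same shell that runs the script."
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```

Calling require_api_key() at the top of sentence_window_rag.py would fail fast with a readable message instead of an authentication error partway through indexing.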

Step-by-Step Instructions

Step 1: Set Up the Project Structure

Create a new directory and the basic files:

mkdir rag-sentence-window
cd rag-sentence-window
touch sentence_window_rag.py
mkdir data

What this does: Creates a clean workspace with a Python file for the code and a data/ folder for test documents.

Step 2: Create a Test Document

Create a sample document that demonstrates why sentence window retrieval matters. Save this as data/sample.txt:

The transformer architecture was introduced in 2017. It revolutionized natural language processing. The key innovation was the self-attention mechanism. This mechanism allows the model to weigh the importance of different words in a sequence. Traditional RNNs processed sequences sequentially. This created bottlenecks for parallelization. Transformers process all tokens simultaneously. This enables much faster training on modern GPUs.

The attention mechanism computes three vectors for each token: query, key, and value. These vectors are learned during training. The dot product of query and key vectors determines attention weights. Higher weights mean stronger connections between tokens. The weighted sum of value vectors produces the output. This process happens in multiple attention heads simultaneously.

BERT was released by Google in 2018. It used bidirectional training of transformers. Previous autoregressive models like GPT-1 only looked at left context. BERT looks at both left and right context. This improved performance on many NLP tasks. BERT achieved state-of-the-art results on GLUE benchmark.

Why this document? Context flows between sentences. Naive chunking at 100 tokens might split "BERT looks at both left and right context. This improved performance..." across chunks, losing the causal relationship. Sentence window retrieval preserves these connections.
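
To see the boundary problem without any libraries, here is a rough sketch that chunks by word count rather than tokens (an assumption made to keep it dependency-free; real chunkers count tokens). The function name fixed_size_chunks is hypothetical:

```python
def fixed_size_chunks(text, chunk_words=10):
    """Split text into chunks of `chunk_words` words, ignoring sentence boundaries."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


# Three sentences taken from the sample document
excerpt = (
    "BERT looks at both left and right context. "
    "This improved performance on many NLP tasks. "
    "BERT achieved state-of-the-art results on GLUE benchmark."
)

chunks = fixed_size_chunks(excerpt, chunk_words=10)
for number, chunk in enumerate(chunks, 1):
    print(f"Chunk {number}: {chunk}")
```

The first chunk ends with "This improved" and the second starts with "performance", so neither chunk holds the complete claim or its cause.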

Step 3: Build the Sentence Window Retriever

Open sentence_window_rag.py and add this code:

import os
import nltk
from llama_index import (
    Document,
    ServiceContext,
    VectorStoreIndex,
)
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

# Download required NLTK data for sentence tokenization
nltk.download('punkt', quiet=True)

def load_documents(data_dir="data"):
    """Load all .txt files from data directory."""
    documents = []

    # Check if directory exists
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"Data directory '{data_dir}' not found. Create it and add .txt files.")

    for filename in os.listdir(data_dir):
        if filename.endswith('.txt'):
            filepath = os.path.join(data_dir, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    text = f.read()
                    if not text.strip():
                        print(f"Warning: {filename} is empty, skipping.")
                        continue
                    # Create Document object with metadata
                    doc = Document(
                        text=text,
                        metadata={"filename": filename}
                    )
                    documents.append(doc)
            except Exception as e:
                print(f"Error reading {filename}: {e}")
                continue

    if not documents:
        raise ValueError(f"No valid .txt files found in '{data_dir}'")

    print(f"Loaded {len(documents)} documents")
    return documents

def build_sentence_window_index(documents, window_size=3):
    """
    Build index using sentence window retrieval.

    Args:
        documents: List of Document objects
        window_size: Number of sentences before/after to include as context

    Returns:
        tuple: (index, service_context)
    """
    # Initialize LLM and embedding model
    # Using GPT-3.5-turbo for cost-effectiveness in tutorials (~$0.002/query vs $0.03 for GPT-4)
    # Upgrade to GPT-4 for production if higher reasoning quality is needed
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # Create sentence window node parser
    # This splits documents into sentences but keeps metadata about surrounding sentences
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=window_size,  # sentences before and after
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

    # Build service context with our models
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )

    # Create the index
    print("Building sentence window index...")
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
    )

    print(f"Index built with {len(index.docstore.docs)} nodes")
    return index, service_context

def create_query_engine(index, service_context):
    """Create query engine with metadata replacement post-processor."""
    # This post-processor replaces the retrieved sentence with its full window
    postprocessor = MetadataReplacementPostProcessor(
        target_metadata_key="window"
    )

    query_engine = index.as_query_engine(
        service_context=service_context,
        similarity_top_k=3,  # retrieve top 3 sentence matches (adjust based on needs)
        node_postprocessors=[postprocessor],
    )

    return query_engine

if __name__ == "__main__":
    # Load documents
    documents = load_documents()

    # Build index with 3-sentence windows
    index, service_context = build_sentence_window_index(
        documents,
        window_size=3
    )

    # Create query engine
    query_engine = create_query_engine(index, service_context)

    # Test query
    query = "How does the attention mechanism work in transformers?"
    print(f"\nQuery: {query}\n")
    response = query_engine.query(query)
    print(f"Response:\n{response}\n")

    # Show source nodes to see the window expansion
    print("Source nodes:")
    for i, node in enumerate(response.source_nodes, 1):
        print(f"\nNode {i} (score: {node.score:.4f}):")
        print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
        print(f"Window context: {node.node.text[:200]}...")

What's happening here:

  • SentenceWindowNodeParser: Splits documents into individual sentences for embedding (precise retrieval) but stores surrounding sentences in metadata
  • MetadataReplacementPostProcessor: When retrieving a matching sentence, swaps in the full window before sending to the LLM
  • window_size=3: Includes 3 sentences before and 3 sentences after each retrieved sentence

This approach provides precision (matching exact sentences) AND context (surrounding information), a combination that fixed-size chunking struggles to deliver.
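
The mechanism itself is small enough to sketch in plain Python. This is an illustration of the idea only, not how llama-index implements it: the regex splitter stands in for NLTK, word overlap stands in for embedding similarity, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_windows(sentences, window_size=3):
    """Pair each sentence with its neighbours, clipped at the document edges."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,                  # what gets matched
            "window": " ".join(sentences[start:end]),   # what the LLM would see
        })
    return nodes


def retrieve(nodes, query):
    """Toy scoring by shared words; the tutorial's pipeline uses embeddings."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(node):
        words = set(re.findall(r"\w+", node["original_text"].lower()))
        return len(query_words & words)

    return max(nodes, key=overlap)


text = (
    "Traditional RNNs processed sequences sequentially. "
    "This created bottlenecks for parallelization. "
    "Transformers process all tokens simultaneously. "
    "This enables much faster training on modern GPUs."
)
nodes = build_windows(split_sentences(text), window_size=1)
best = retrieve(nodes, "Why were RNNs slow to train?")
print("Matched:", best["original_text"])
print("Sent to LLM:", best["window"])
```

The match is made on a single sentence, but the text handed onward is the window, which is the same swap MetadataReplacementPostProcessor performs. Away from the document edges a window holds 2 * window_size + 1 sentences.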

Step 4: Run the Initial Test

Execute the script:

python sentence_window_rag.py

Expected output structure:

Loaded 1 documents
Building sentence window index...
Index built with 20 nodes

Query: How does the attention mechanism work in transformers?

Response:
The attention mechanism in transformers works by computing three vectors for each token: query,
key, and value. These vectors are learned during training. The mechanism calculates attention
weights by taking the dot product of query and key vectors, where higher weights indicate
stronger connections between tokens. Finally, it produces the output through a weighted sum
of the value vectors, with this process occurring simultaneously across multiple attention heads.

Source nodes:

Node 1 (score: 0.8734):  # scores vary per run; these are example values
Original sentence: The attention mechanism computes three vectors for each token: query, key, and value...
Window context: This created bottlenecks for parallelization. Transformers process all tokens
simultaneously. This enables much faster training on modern GPUs. The attention mechanism computes...

Node 2 (score: 0.8521):  # scores vary per run
Original sentence: Higher weights mean stronger connections between tokens...
Window context: The attention mechanism computes three vectors for each token: query, key, and value.
These vectors are learned during training. The dot product of query and key vectors determines...

What to notice: The retrieval matched specific sentences about the attention mechanism, but the LLM received the full surrounding context (3 sentences before and after). This preserved the explanation flow and enabled a coherent response.

Step 5: Compare with Naive Chunking

See how sentence window retrieval compares with fixed-size chunking by implementing a comparison. Add this function to sentence_window_rag.py:

def build_naive_chunking_index(documents, chunk_size=200):
    """Build index using naive fixed-size chunking for comparison."""
    from llama_index.node_parser import SentenceSplitter

    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # Simple fixed-size chunking with overlap
    node_parser = SentenceSplitter.from_defaults(
        chunk_size=chunk_size,
        chunk_overlap=20,
    )

    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )

    print("Building naive chunking index...")
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
    )

    return index, service_context

Update your if __name__ == "__main__": block to add the comparison:

if __name__ == "__main__":
    # Load documents
    documents = load_documents()

    # Build sentence window index
    index, service_context = build_sentence_window_index(
        documents,
        window_size=3
    )

    # Create query engine
    query_engine = create_query_engine(index, service_context)

    # Test query
    query = "How does the attention mechanism work in transformers?"
    print(f"\nQuery: {query}\n")
    response = query_engine.query(query)
    print(f"Sentence Window Response:\n{response}\n")

    # Show source nodes
    print("Source nodes:")
    for i, node in enumerate(response.source_nodes, 1):
        print(f"\nNode {i} (score: {node.score:.4f}):")
        print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
        print(f"Window context: {node.node.text[:200]}...")

    # Compare with naive chunking
    print("\n" + "="*80)
    print("COMPARISON: Naive Chunking Approach")
    print("="*80)

    naive_index, naive_service_context = build_naive_chunking_index(documents)
    naive_query_engine = naive_index.as_query_engine(
        service_context=naive_service_context,
        similarity_top_k=3,
    )

    naive_response = naive_query_engine.query(query)
    print(f"\nNaive Chunking Response:\n{naive_response}\n")

Run the updated script:

python sentence_window_rag.py

What to observe: The naive chunking approach may split the attention mechanism explanation awkwardly depending on where chunk boundaries fall. Response quality varies unpredictably based on how the text gets divided—that's exactly the problem sentence window retrieval solves.

Step 6: Add Interactive Query Mode

Make the system interactive for easier testing. Add this function:

def interactive_mode(query_engine):
    """Run interactive query loop."""
    print("\n" + "="*80)
    print("Interactive Mode - Type 'quit' to exit")
    print("="*80 + "\n")

    while True:
        query = input("Your question: ").strip()
        if query.lower() in ['quit', 'exit', 'q']:
            break
        if not query:
            continue

        response = query_engine.query(query)
        print(f"\nAnswer: {response}\n")

        # Optionally show sources
        show_sources = input("Show sources? (y/n): ").strip().lower()
        if show_sources == 'y':
            for i, node in enumerate(response.source_nodes, 1):
                print(f"\nSource {i} (relevance: {node.score:.4f}):")
                print(node.node.text[:300] + "...")
        print()

Update your main block to use interactive mode:

if __name__ == "__main__":
    documents = load_documents()
    index, service_context = build_sentence_window_index(documents, window_size=3)
    query_engine = create_query_engine(index, service_context)

    # Run interactive mode
    interactive_mode(query_engine)

What this enables: A REPL-style interface for testing different queries and exploring how various questions retrieve different context windows. Try queries like:

  • "What year was the transformer architecture introduced?"
  • "How does BERT differ from GPT?"
  • "What are the three vectors in the attention mechanism?"

Verification

Confirm everything works correctly with these checks:

  1. Verify index creation: You should see output like "Index built with 20 nodes" where the number roughly matches the sentence count in your document (the sample has 20 sentences).
  2. Test a specific query: Ask "What year was the transformer architecture introduced?" - the response should include "2017" with surrounding context about the innovation.
  3. Inspect source nodes: When you show sources, the "Window context" should be noticeably longer (typically 5-7 sentences) than the "Original sentence" (1 sentence).
  4. Check API calls: You should see network activity indicating OpenAI API calls for both embeddings (during indexing) and completions (during queries).
  5. Test window expansion: Run this verification query in interactive mode:
    query = "What are query, key, and value vectors?"
    
    The response should include context from surrounding sentences about how these vectors are used, not just their definition.

Success indicators: Coherent, contextual answers that naturally reference information from multiple sentences, with source nodes showing expanded context windows.

Common Issues & Fixes

Issue 1: NLTK punkt tokenizer error

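Note: If you see Resource punkt_tab not found, your installed NLTK is newer than the pinned version. Newer NLTK releases (3.9 and later) load sentence models from punkt_tab rather than punkt, so you may also need to run python -c "import nltk; nltk.download('punkt_tab')".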

Error message:

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
**********************************************************************

Fix: Manually download NLTK data:

python -c "import nltk; nltk.download('punkt')"

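If the data needs to live in a specific directory (for example, on a locked-down corporate machine), pass a download directory. Python does not expand ~ inside a string, so wrap the path in os.path.expanduser. Note that this does not work around SSL errors, which need a certificate or proxy fix:

python -c "import nltk, os; nltk.download('punkt', download_dir=os.path.expanduser('~/nltk_data'))"

Then add that directory to the NLTK data path in your script, before importing llama_index: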

import nltk
nltk.data.path.append('/path/to/nltk_data')

Issue 2: Index built with 0 nodes

Symptoms: Output shows "Index built with 0 nodes" and queries return no results.

Fix: Verify your documents loaded correctly:

ls -la data/
# Should show sample.txt with non-zero size

Check file encoding:

file data/sample.txt
# Should show: ASCII text or UTF-8 Unicode text

If the file exists but isn't loading, add debug output:

def load_documents(data_dir="data"):
    documents = []
    print(f"Looking for files in: {os.path.abspath(data_dir)}")
    for filename in os.listdir(data_dir):
        print(f"Found file: {filename}")
        if filename.endswith('.txt'):
            # ... rest of function

Issue 3: OpenAI rate limit errors

Error message:

openai.RateLimitError: You exceeded your current quota, please check your plan and billing details.

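Fix option 1: This error usually means the account is out of credit or over its spending limit, so check your plan and billing first. The tutorial already uses GPT-3.5-turbo, the cheapest option here, so switching models will not fix a quota error. Once quota is available, you can upgrade to GPT-4 for higher quality:

llm = OpenAI(model="gpt-4", temperature=0.1)  # Higher quality, ~20x cost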

Cost Reference:

  • GPT-3.5-turbo: ~$0.0015/1K tokens (tutorial default)
  • GPT-4: ~$0.03/1K tokens (~20x more expensive)
  • Embedding (ada-002): ~$0.0001/1K tokens
  • Total tutorial cost: ~$0.01-0.02 for indexing the sample document and running the example queries
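
As a rough worked example of how those rates add up, here is a small estimator. The prices are copied from the list above and will drift, the workload numbers are made up for illustration, and a single flat rate per model ignores any difference between input and output token pricing:

```python
# Prices per 1K tokens, taken from the cost reference list; verify against current pricing.
PRICE_PER_1K = {
    "gpt-3.5-turbo": 0.0015,
    "gpt-4": 0.03,
    "text-embedding-ada-002": 0.0001,
}


def estimate_cost(model, tokens):
    """Approximate dollar cost of processing `tokens` tokens with `model`."""
    return PRICE_PER_1K[model] * tokens / 1000


# Hypothetical workload: embed a 300-token document, then run 5 queries of ~1,500 tokens each
indexing = estimate_cost("text-embedding-ada-002", 300)
queries = 5 * estimate_cost("gpt-3.5-turbo", 1500)
ratio = PRICE_PER_1K["gpt-4"] / PRICE_PER_1K["gpt-3.5-turbo"]

print(f"Indexing: ${indexing:.5f}")
print(f"Queries:  ${queries:.5f}")
print(f"GPT-4 vs GPT-3.5-turbo: about {ratio:.0f}x")
```

Under these assumptions almost all of the spend comes from the completion calls, not from embedding the document.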

Fix option 2: Add retry logic with exponential backoff:

from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(3))
def query_with_retry(query_engine, query):
    return query_engine.query(query)

# Use it like:
response = query_with_retry(query_engine, "Your question here")
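
If you would rather not depend on tenacity, the same idea can be sketched with the standard library. This is a simplified illustration of exponential backoff in general, not a reimplementation of tenacity, and its delay schedule is an assumption (start at 4 seconds, double each time, cap at 60). The sleep function is injectable so the demo runs instantly:

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=4.0, max_delay=60.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay, then 2x, 4x, ... capped at max_delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)


# Demo: a fake call that fails twice, and a recording "sleep" so nothing actually waits
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In real use you would pass lambda: query_engine.query(query) as func and narrow the except clause to the rate-limit exception, since retrying will not help with a quota or authentication error.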

Issue 4: ImportError for llama_index modules

Version Warning: This tutorial uses llama-index 0.9.48, which is the last release before the major v0.10 refactoring. If you see import errors with .core paths (e.g., llama_index.core.node_parser), you're looking at v0.10+ documentation.

Always check the version: pip show llama-index

Error message:

ImportError: cannot import name 'SentenceWindowNodeParser' from 'llama_index.core.node_parser'

Fix: Ensure you're using the correct llama-index version:

pip show llama-index
# Should show version 0.9.48

If the version is different, reinstall:

pip uninstall llama-index
pip install llama-index==0.9.48

Next Steps

You now have a working sentence window retrieval system. Here's how to extend it:

When NOT to Use Sentence Window Retrieval

Sentence window retrieval isn't always the best choice. Consider alternatives when:

  • Documents are very short: If your docs are under 500 tokens, sentence-level splitting adds unnecessary complexity. Use simple chunking instead.
  • Content is code-heavy: Code doesn't follow natural sentence boundaries. Punctuation in strings, comments, and syntax can confuse sentence tokenizers.
  • Working with non-English text: NLTK's punkt tokenizer is English-optimized. Chinese, Japanese, Arabic, and other languages may have poor sentence boundaries.
  • Content is structured data: Tables, lists, and JSON don't have "sentences" in the traditional sense. Consider parent document retrieval or auto-merging retrieval instead.
  • Latency is critical: Each retrieved sentence expands into a multi-sentence window, so prompts are longer and the LLM call costs more time and tokens. For ultra-low-latency requirements, use simpler retrieval.

Immediate Improvements

  • Tune the window size: Experiment with different values:
    # Default: 3 sentences on each side (7 sentences total)
    build_sentence_window_index(documents, window_size=3)
    
    # Less context but more precision (3 sentences total)
    build_sentence_window_index(documents, window_size=1)
    
    # More context (11 sentences total)
    build_sentence_window_index(documents, window_size=5)
    
    Window size 3 works well for technical documentation, but adjust based on your content.
  • Add document metadata filtering: Extend metadata to filter at query time:
    doc = Document(
        text=text,
        metadata={
            "filename": filename,
            "doc_type": "technical",
            "date": "2024-01-15"
        }
    )
    
    # Then filter during queries
    from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter
    
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_type", value="technical")]
    )
    query_engine = index.as_query_engine(filters=filters)
    
  • Implement hybrid search: Combine vector similarity with BM25 keyword search. In llama-index 0.9.48 the BM25 retriever ships with the main package and needs the rank-bm25 library (pip install rank-bm25):
    from llama_index.retrievers import BM25Retriever, QueryFusionRetriever
    
    vector_retriever = index.as_retriever(similarity_top_k=3)
    
    # BM25 uses the docstore for keyword-based retrieval
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore,
        similarity_top_k=3
    )
    
    retriever = QueryFusionRetriever(
        [vector_retriever, bm25_retriever],
        similarity_top_k=3,
    )
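
To make "combine" concrete: one common way to merge two ranked result lists is reciprocal rank fusion (RRF), where each result earns 1 / (k + rank) from every list it appears in. The sketch below is a standalone illustration with made-up sentence IDs, not a description of what QueryFusionRetriever does by default. The constant k=60 is the value conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=3):
    """Merge ranked lists of IDs; each appearance adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ordered[:top_k]


# Hypothetical results: IDs stand for sentences returned by each retriever, best first
vector_hits = ["s9", "s12", "s4"]
bm25_hits = ["s12", "s11", "s9"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Results that both retrievers agree on rise to the top, while a result found by only one retriever can still make the cut.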
    

Production Enhancements

  • Persist the index: Save to disk to avoid rebuilding on every run:
    # Save index to disk
    index.storage_context.persist(persist_dir="./storage")
    
    # Load index from disk later
    from llama_index import StorageContext, load_index_from_storage
    
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    

    This avoids rebuilding the index on every run, saving API costs and time.
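
In practice the save and load calls usually sit behind a "load if present, otherwise build" check. A minimal sketch of that pattern, using stand-in callables so it runs without llama-index or an API key. The names load_or_build, fake_build_and_persist, and fake_load are hypothetical:

```python
import os
import tempfile


def load_or_build(persist_dir, build_and_persist, load):
    """Reuse a persisted index if the directory has content; otherwise build one."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build_and_persist(persist_dir)


# Stand-ins for the real functions, used only to demonstrate the control flow
def fake_build_and_persist(persist_dir):
    os.makedirs(persist_dir, exist_ok=True)
    with open(os.path.join(persist_dir, "index.txt"), "w", encoding="utf-8") as f:
        f.write("index data")
    return "built"


def fake_load(persist_dir):
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    storage = os.path.join(tmp, "storage")
    first = load_or_build(storage, fake_build_and_persist, fake_load)
    second = load_or_build(storage, fake_build_and_persist, fake_load)
    print(first, second)
```

In the tutorial's setting, build_and_persist would wrap build_sentence_window_index plus index.storage_context.persist, and load would wrap StorageContext.from_defaults plus load_index_from_storage.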

Further Reading