Tuesday, May 26, 2026

Claude Code, Kimi, and the Grok CLI That Got Away

Two tools I run every day, one I wanted to love, and the cost model that ended it before it started.


Where This Fits the Story

In Part One I built the hardware and tackled the persistent memory problem. In Part Two I built the adversarial team (Apocryia Forge) and the three-layer architecture. The cognitive architecture work is progressing but won't be written up in detail; that will be shown via video on the YouTube channel instead.

Throughout all of it I have been living in terminals — vi, tmux, multiple SSH sessions, local and remote. GUI frontends never stuck. So when the agentic coding CLIs arrived — Claude Code, Kimi CLI, Cursor Agent, and others — they weren't toys. They became the daily interfaces to the brain and the team. I even explored building my own at one point. The space moves fast enough that evaluating and discarding tools is just part of the workflow now. Grok Build is the latest to land, and this post is partly that evaluation.

I had been waiting specifically for a direct Grok CLI. While it had been in limited testing earlier, the official news and wider availability came on May 25, 2026.

What I'm really doing is running a CLI team with defined roles and a shared project structure. The setup is a shared folder with dedicated agents — DEVELOPER 1 through 5, DBA, PMO, SECURITY, SYSADMIN, REVIEWER, and CRITIC — each with a focused role-specific MD file and a shared scope MD file they all read from. They communicate with each other through a shared MySQL channel, so they're not isolated; they can interact, hand off context, and build on each other's work. The model behind each role is interchangeable — that's the point. The role stays constant. The model powering it can change.
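As a rough sketch of that structure: the role names below come from the setup just described, while the file names, the `Role` type, and the default model labels are hypothetical placeholders rather than the actual configuration.

```python
from dataclasses import dataclass, replace

# Assumed name for the shared scope file every agent reads from.
SHARED_SCOPE = "SCOPE.md"


@dataclass(frozen=True)
class Role:
    name: str       # e.g. "REVIEWER"
    role_file: str  # role-specific MD file (hypothetical naming scheme)
    model: str      # whichever model currently powers the role


def build_team() -> dict[str, Role]:
    """Create the roster named in the post; model assignments are placeholders."""
    names = [f"DEVELOPER_{i}" for i in range(1, 6)] + [
        "DBA", "PMO", "SECURITY", "SYSADMIN", "REVIEWER", "CRITIC",
    ]
    return {n: Role(n, f"{n}.md", model="kimi") for n in names}


def swap_model(team: dict[str, Role], role_name: str, model: str) -> dict[str, Role]:
    """Return a new roster with one role's model changed; the role itself is untouched."""
    updated = dict(team)
    updated[role_name] = replace(team[role_name], model=model)
    return updated


team = build_team()
team = swap_model(team, "REVIEWER", "claude")
print(team["REVIEWER"])
```

The point of the frozen dataclass is that a swap only ever touches the `model` field: the role name and its instruction file stay constant.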

There's even a DEVELOPER_AND_TEAM folder where Claude orchestrates Kimi directly. That was more of a proof of concept, and I don't reach for it often; I usually just go straight to Kimi. But it showed the approach works. The question stopped being "which CLI is best" and became "which model fits this role today, and does the cost structure support running it the way the role actually needs to be run."

Here's the field report, with the same honesty I apply to everything else in this stack.


The Common Ground (2026 Edition)

All three tools have converged on roughly the same shape:

  • Agentic loop in the terminal (read files, edit, run shell, plan, iterate)
  • Project-level context/rules files (CLAUDE.md, AGENTS.md, or equivalent)
  • Skills / procedures / custom instructions that load on demand
  • Sub-agent or delegation capabilities
  • MCP (Model Context Protocol) support for extending with external tools and memory
  • Headless / one-shot / piping modes for scripting and CI

The fact that three independent teams (Anthropic, Moonshot, xAI) all landed in the same place tells you the problem space is real. The differences are in emphasis, polish, and the specific engineering primitives each one ships with.


Kimi CLI (Moonshot) — Fast, Low-Ceremony Daily Driver

Kimi CLI has become my primary execution engine. For implementation, velocity work, and getting things built, it's where I spend most of my time — with Claude handling planning and review on either side of it.

Strengths that show up in real use:

  • Fast, low-friction iteration — often the quickest path from idea to working code
  • Excellent shell integration (Ctrl-X drops you into a real shell without losing context)
  • Solid MCP support and straightforward tool configuration
  • Good balance of capability and speed for day-to-day engineering

Honestly, Kimi has been great lately. When I need to move fast on internal tooling, the video pipeline, or Android + API work, it gets out of the way effectively and integrates cleanly with the shared memory layer. It also runs on a flat subscription model — I'm on Allegretto ($39/mo), which includes 2× agent credits, 5× Kimi Code credits, and multi-tasking — so heavy agentic use doesn't spin a billing meter. The API exists if you want it, but you don't have to use it. That's not a minor detail; it's a major reason it remains a primary daily driver.

Limitations: On security-critical or very long-horizon architectural work, I still want at least one other strong viewpoint in the mix. Official English developer docs are at platform.kimi.ai/docs and cover models, API, pricing, and best practices well.


Claude Code (Anthropic) — Mature Platform with Real Depth

Claude Code is currently the most polished and feature-complete of the three. It has the widest range of high-quality capabilities out of the box.

Where it consistently delivers:

  • Excellent reasoning depth, especially on security, architecture, and complex refactors
  • Outstanding Unix composability — `git diff | claude -p "review..."` is genuinely useful
  • Mature skills/plugins/MCP story, plus dedicated background agents via the `agents` command
  • Strong session tools (checkpoints, rewind) and granular permission controls
  • The Agents dashboard and remote console features are genuinely valuable when you want to orchestrate or work from elsewhere

I treat it as the senior reviewer on the team for a reason — it still catches things the others miss and explains trade-offs clearly.
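The `git diff | claude -p` pattern from the list above can also be driven from a script. This is a minimal sketch, assuming `git` and `claude` are on the PATH; the `review_diff` helper and its `runner` parameter are illustrative additions (the latter only so the function can be exercised without calling the real CLI).

```python
import subprocess
from typing import Callable

# Anything with subprocess.run's calling convention; lets tests inject a fake.
Runner = Callable[..., subprocess.CompletedProcess]


def review_diff(prompt: str, runner: Runner = subprocess.run) -> str:
    """Pipe the output of `git diff` into `claude -p` and return the review text."""
    diff = runner(
        ["git", "diff"], capture_output=True, text=True, check=True
    ).stdout
    if not diff.strip():
        return "No changes to review."
    result = runner(
        ["claude", "-p", prompt],
        input=diff,  # the diff arrives on stdin, as in the shell pipe
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `review_diff("Review this diff for bugs and risky assumptions.")` from inside a repository would behave like the one-liner, with the advantage that the result comes back as a string a script can store or post elsewhere.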

That said, it has shown real limitations on certain classes of work. I put serious focus on a complex local model build using Opus 4.6 specifically — multiple planning sessions, deliberate setup, the works. It repeatedly failed. The outputs were coherent enough to keep me moving for weeks before the fundamental problems in the data and architectural framing became undeniable. That cost real time and a real electricity bill on hardware that was running hard the whole time. Kimi is looking good on that same class of work so far, but it's early and time will tell. The broader lesson holds: even a very strong single reasoning engine develops blind spots on long-horizon work, which is exactly why having genuinely different cognitive styles in the mix matters.


Grok Build (xAI) — The One That Deflated Me Before I Even Started

I had been genuinely excited for a direct, official Grok CLI from xAI. Not casually interested — actually waiting for it. The official announcement came on May 25, 2026 with Grok Build entering early beta, and my first reaction was real enthusiasm. The orchestration primitives it ships with looked exactly right for the kind of adversarial, multi-perspective engineering workflow I've been building toward:

  • Native --best-of-n for parallel exploration and strong verification tools (--check, todo gates)
  • Explicit Plan Mode with a user approval gate — this matches how I want serious work structured
  • First-class subagent support with capability modes and worktree isolation
  • Deep controls around memory, sandboxing, and session management

On paper this is everything I'd want in an Engineering Orchestrator. The documentation is outstanding. The design shows clear thinking about how adversarial review and planning gates should work natively in a CLI. I was ready to put in the time to integrate it properly.

An open-source Grok CLI existed before the official one — community-built, functional. I downloaded it, checked it out. But it ran straight against the xAI API and I didn't want that for heavy daily use. So I set it aside and held out for the official release, genuinely expecting xAI to follow the same model that Claude and Kimi had already proven worked: subscription access, no per-token meter on the CLI. That wasn't an unreasonable hope. The standard had already been set. They just didn't follow it.

Then I looked at how the billing actually works.

With Claude Code, I'm on a Max subscription: flat monthly rate, no per-token billing during CLI use. With Kimi, I'm on a flat subscription as well (Kimi Code tiers run from $19/mo up to $199/mo, including the $99/mo Allegro plan with parallel subagents). Both tools let me run hard (heavy subagents, long contexts, lots of tool calls) without watching a meter spin. The cost is known, fixed, and doesn't punish me for using them the way they're designed to be used.

I already have SuperGrok Business at $30/mo, and as of the May 25 launch that subscription does give you Grok Build CLI access. So the door is open. The problem is what happens once you walk through it: CLI usage draws from API credits — and I already use the xAI API for script automations. That spend is already on the bill. The official xAI pricing docs confirm: $1.00/M input, $2.00/M output for grok-build-0.1. Adding heavy agentic CLI use on top of existing API automation costs isn't a new expense in isolation — it dramatically increases a meter that's already running. No ceiling, no flat rate, no "run it as hard as you want."

And then there's the part that really buried it for me: tool invocations are billed separately at $2.50–$10.00 per 1,000 calls. In an agentic CLI session every file read, every shell command, every web search, every MCP tool call is a tool invocation. A heavy engineering session can easily rack up hundreds of those. You're paying for tokens and for each action the agent takes. For the Engineering Orchestrator workload I had planned — subagent spawns, long planning contexts, iterative file operations, MCP calls back to internal tooling — the combined token + tool invocation bill would run up fast and blow past what I pay for Claude Max. The exact workload that most needed Grok's orchestration primitives is the workload that makes the billing land hardest.
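To make the two meters concrete, here is a rough estimator using the rates quoted above. The session size (tokens and tool calls) is a made-up example rather than a measurement, and the cheaper cached-input rate is ignored for simplicity.

```python
# Rates quoted in the post for grok-build-0.1 (USD).
INPUT_PER_M = 1.00
OUTPUT_PER_M = 2.00
TOOL_PER_1K_LOW, TOOL_PER_1K_HIGH = 2.50, 10.00


def session_cost(input_tokens: int, output_tokens: int, tool_calls: int) -> tuple[float, float]:
    """Return (low, high) session cost in USD across the tool-call price range."""
    tokens = input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M
    low = tokens + tool_calls / 1000 * TOOL_PER_1K_LOW
    high = tokens + tool_calls / 1000 * TOOL_PER_1K_HIGH
    return round(low, 2), round(high, 2)


# Hypothetical heavy session: 8M input tokens, 400k output tokens, 600 tool calls.
low, high = session_cost(8_000_000, 400_000, 600)
print(f"${low:.2f} to ${high:.2f} per session")
```

With these assumed numbers a single session lands between roughly $10 and $15. The real figure depends entirely on actual token and tool-call volume, which is exactly the per-session math a flat subscription makes unnecessary.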

That killed it for me, at least for now. It's not that the cost is necessarily prohibitive in absolute terms — it's that the cost model doesn't align with how I use the tools I've actually committed to. I'm not interested in running the math on every heavy session to figure out if I'm about to get a surprise bill. The excitement just… went away.

I'll revisit it when xAI offers a subscription tier that covers CLI use the way Anthropic's Max and Kimi's Allegretto do — a flat rate, no API meter, no surprise overage. The capabilities are genuinely compelling. The cost structure is the only reason they're not in the rotation right now. The $30/mo I already pay for SuperGrok keeps the web Grok available; that's fine. But it's not the same thing as a CLI I can run hard every day.

And it apparently isn't just my experience. The Wall Street Journal reported on May 11, 2026 that Grok lags far behind its fast-growing competitors, with adoption by both business and consumer users having slowed. The kicker: SpaceX signed a deal in early May to rent all the computing capacity at one of Musk's main data centers — to Anthropic. The company racing to catch Claude is handing Claude's maker its own compute to do it with. That raises real questions about whether Grok can still catch up at all.

xAI already has enough controversy without also making it harder for developers who actually want to use Grok to do so. Slowing adoption and a billing model that discourages the heavy use cases that would drive that adoption — those two things are connected. I wanted to be in the "actually using it" column. The cost model put me in the "watching from the shelf" column instead. And apparently I'm not alone.


Head-to-Head Comparison

| Aspect | Kimi CLI | Claude Code | Grok Build |
|---|---|---|---|
| Daily velocity | Excellent: primary execution engine for implementation and velocity work | Excellent: primary for planning, specs, reviews, and hard reasoning | Good for complex work |
| Reasoning depth | Strong (K2 models) | Best-in-class for hard problems | Strong + explicit orchestration |
| Built-in adversarial / review loops | Agentic by default | Good via skills + prompting | Explicit (Implementer + Reviewer + Critic roles, best-of-n) |
| Plan / approval gates | Planning support | Plan review available | Dedicated Plan Mode + DAG execution with user sign-off |
| MCP + custom memory integration | Native + straightforward | Full support | Strong MCP support; not currently integrated into the team stack |
| Shell / terminal integration | Ctrl-X real shell mode (outstanding) | Full bash tool + excellent piping | Rich interactive terminal experience with strong keyboard support |
| Extensibility model | MCP + plugins + zsh | Skills (Agent Skills standard) + subagents | Skills + marketplace + hooks + personas + subagents + worktrees |
| Session power tools | Good | Checkpoints, rewind, background agents | Extremely rich (rewind points, full event history, isolation) |
| Documentation & discoverability | Good English docs at platform.kimi.ai | Excellent, mature | Outstanding (21+ structured guides + excellent in-terminal discoverability) |
| Cost model | Competitive; reasonable for heavy agentic use | Max subscription flat rate; no per-token billing during CLI use | API billing per token; heavy agentic use can cost more than Claude |
| Friction points | Less established track record on very long-horizon work | OAuth/login ceremony (documented separately) | Cost model doesn't match subscription tools; learning curve for full power surface |

Cost Reality: The Number That Ended My Excitement

All three tools have meaningfully different billing models. This is worth understanding before you fall in love with the feature list.

| Tool | Plan | Billing | Monthly Cost | Heavy Agentic Use? |
|---|---|---|---|---|
| Claude Code (Anthropic) | Pro | Flat subscription | $20 | Limited: lower usage ceiling |
| Claude Code (Anthropic) | Max 5× ← my plan | Flat subscription | $100 | Safe: no per-token meter |
| Claude Code (Anthropic) | Max 20× | Flat subscription | $200 | Safe: no per-token meter |
| ChatGPT / Codex (OpenAI; included for comparison, not in my stack) | ChatGPT Plus | Flat subscription | $20 | Limited: lower Codex credit allocation |
| ChatGPT / Codex (OpenAI) | ChatGPT Pro $100 | Flat subscription | $100 | Limited: 5× usage; overages fall back to API billing |
| ChatGPT / Codex (OpenAI) | ChatGPT Pro $200 | Flat subscription | $200 | Limited: 20× usage; overages fall back to API billing |
| Kimi Code (Moonshot AI) | Moderato | Flat subscription | $19 | Limited: basic agent credits |
| Kimi Code (Moonshot AI) | Allegretto ← my plan | Flat subscription | $39 | Safe: 2× agent credits, 5× Kimi Code, multi-tasking |
| Kimi Code (Moonshot AI) | Allegro | Flat subscription | $99 | Safe: up to 300 parallel subagents |
| Kimi Code (Moonshot AI) | Vivace | Flat subscription | $199 | Safe: max tier |
| Kimi Code (Moonshot AI) | K2.6 API (optional) | Pay-per-token | $0.95/M in, $4.00/M out | No ceiling, but very cheap per-token |
| Grok Build (xAI) | SuperGrok Business ← my current plan | Subscription + API credits | $30 + API usage on top | Risky: CLI access included, but usage draws API credits; no flat ceiling |
| Grok Build (xAI) | xAI API direct (what CLI usage draws from) | Pay-per-token + per tool call | grok-build-0.1: $1.00/M input, $0.20/M cached input, $2.00/M output, + $2.50–$10.00 per 1,000 tool calls | No ceiling: tokens AND tool invocations both billed; every file read, shell call, web fetch adds up |

The pattern is clear: both Claude Code and Kimi Code have subscription tiers that let you use the CLI without API billing — you don't have to touch per-token rates at all. I don't use the Kimi API. I don't use the Claude API for CLI work — subscription covers that. I do use the Claude API separately for automation work, same as I use the xAI API for script automations. But those are distinct costs, not the CLI meter running. The point is the CLI itself doesn't force you onto per-token billing.

That also means the real comparison for my setup is $100/mo (Claude Max 5×) vs $39/mo (Kimi Allegretto) — and honestly, I get more practical daily usage out of Kimi for that cost. Claude is the stronger tool for hard reasoning and deep review work, and the Max subscription is worth it for those moments. But for volume, day-to-day implementation, and velocity work, $39/mo covers a lot of ground with Kimi.

Grok Build doesn't have that equivalent path. CLI usage draws from API credits — tokens billed at $1.00/$2.00 per million, plus tool invocations billed separately at $2.50–$10.00 per 1,000 calls (confirmed in xAI's official pricing docs). Every file read, shell command, MCP call, and web fetch the agent makes is a tool invocation. A real agentic session stacks both meters simultaneously. For the Engineering Orchestrator workload I had planned, that adds up fast with no ceiling to cap it. That's the number — really, the two numbers — that ended my excitement.
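One way to frame "no ceiling" is as a break-even count: how many metered sessions it takes in a month before the bill passes a flat subscription. The sketch below uses the $100 Claude Max 5× price from the table above; the per-session costs are hypothetical inputs, not measured values.

```python
import math

FLAT_MONTHLY = 100.00  # Claude Max 5x monthly price, per the table above


def breakeven_sessions(cost_per_session: float, flat_monthly: float = FLAT_MONTHLY) -> int:
    """Smallest number of metered sessions per month whose total exceeds the flat rate."""
    if cost_per_session <= 0:
        raise ValueError("cost_per_session must be positive")
    return math.floor(flat_monthly / cost_per_session) + 1


# Hypothetical per-session costs in USD; substitute figures from real usage.
for cost in (5.00, 10.00, 15.00):
    n = breakeven_sessions(cost)
    print(f"${cost:.2f}/session: metered bill passes ${FLAT_MONTHLY:.0f} at session {n}")
```

At an assumed $10 to $15 per heavy session, the metered bill overtakes the flat rate somewhere between the 7th and 11th session of the month, which is well inside a daily-driver usage pattern.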


The Real Differentiator: They All Drive the Same Brain

None of this would matter as much if I had to maintain separate context in each tool.

I don't, because the memory layer is the constant, not the CLI. As covered in Part One, the hardware constraints on Deborah (a 16GB GPU ceiling) forced a real solution to the memory problem: ai_memory, a MySQL + Neo4j brain that lets multiple AI agents share context across sessions. Mempalace extends that further with a structured knowledge graph. Together they mean that every agent role (DEVELOPER, DBA, REVIEWER, CRITIC, all of them) reads from and writes to the same persistent state. The rationale behind a decision, the architectural pitfalls already explored, the lessons from previous failures: none of that lives in the CLI. It lives in the database.

That's what makes the model-per-role approach practical rather than theoretical. I can swap the model behind DEVELOPER_1 from Claude to Kimi and the role, the context, and the shared memory all stay intact. I can be in Kimi executing a refactor, then switch to Claude for a complex code review or to work through a Python class architecture, and nothing is lost in the handoff. The database is the coordination layer. The CLIs are just different front-ends with different cognitive styles talking to the same brain.
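A toy version of that handoff looks like the sketch below. It is a stand-in only: the real ai_memory runs on MySQL + Neo4j, while this uses an in-memory sqlite3 database so it runs anywhere. The table, columns, and sample notes are all hypothetical.

```python
import sqlite3

# Stand-in for the shared store; the actual system uses MySQL + Neo4j.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE memory (
           id    INTEGER PRIMARY KEY,
           role  TEXT NOT NULL,   -- e.g. DEVELOPER_1, REVIEWER
           model TEXT NOT NULL,   -- model behind the role when the note was written
           note  TEXT NOT NULL
       )"""
)


def remember(role: str, model: str, note: str) -> None:
    """Append a note to the shared store on behalf of a role."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO memory (role, model, note) VALUES (?, ?, ?)",
            (role, model, note),
        )


def recall(limit: int = 10) -> list[tuple[str, str, str]]:
    """Return the most recent notes in chronological order, whoever wrote them."""
    rows = db.execute(
        "SELECT role, model, note FROM memory ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return rows[::-1]


# Illustrative notes: one written from a Kimi session, one from a Claude session.
remember("DEVELOPER_1", "kimi", "Refactored the loader; see scope file for constraints.")
remember("REVIEWER", "claude", "Loader refactor reviewed; add a test for empty input.")
for role, model, note in recall():
    print(f"[{role} via {model}] {note}")
```

Nothing in `recall` filters by model, which is the whole idea: whichever CLI picks up the role next reads the same history.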

This is the same pattern I keep rediscovering: the memory and the team structure are the durable assets. The interfaces are interchangeable as long as they respect the same contracts.


My Actual Pattern Today

The direction I'm moving is clear: lean on Kimi more and more for execution, and keep Claude for reviews, cleanup, and fixes. The workflow that's been working well is: write specs or plan in Claude → execute with Kimi → review and clean up with Claude. It's fast, it's cheaper than running everything through Claude, and it naturally builds in multiple points of view without any extra ceremony.

  • Specs, architecture, planning: Claude Code — it reasons carefully and surfaces the trade-offs before anything gets built
  • Execution, implementation, velocity work: Kimi CLI — fast, low-friction, gets it done
  • Review, code cleanup, fixes, "does this assumption explode later?": Claude Code — still the strongest here, and worth the cost specifically for this role
  • Anything that needs to survive overnight or across sessions: routed through the shared memory layer regardless of which CLI started the conversation

That two-tool loop is cheaper than running everything through Claude, faster than doing it all in one pass, and produces better results than either tool alone. The cost difference matters: $39/mo (Kimi Allegretto) handles the high-volume execution work so the $100/mo (Claude Max) budget is spent where Claude's reasoning depth actually earns it.

Grok Build was supposed to enter this mix. I had a clear role in mind for it: the Engineering Orchestrator for big multi-step tasks with explicit planning gates and parallel exploration. That's not happening now. The API billing model makes it a non-starter for the kind of daily heavy use I'd need to actually integrate it. It joins the shelf with other tools that looked promising but didn't fit the real workflow: Blackbox, Cursor Agent, and others that came and went. Grok CLI will be there if the cost model changes. Until then, the two-tool loop is the stack.


Why Not Just Pick a Winner?

Because the team has roles, and different tools fill different roles better. The architecture isn't "pick the best CLI" — it's "which model fits this role today, and does the cost structure support running it hard." Claude and Kimi answer both questions. Grok doesn't yet.

When a better tool appears the test is simple: does it reduce real friction, does it fit the cost model, and does it plug into the shared layer cleanly? That's all that matters. Grok Build passes the first test. It fails the second. It'll get another look when that changes.


What About ChatGPT / Codex?

I don't use ChatGPT. That's not a casual preference; it's a deliberate decision I wrote about in detail back in 2025 after a bad experience where it confidently suggested CUDA/driver changes that damaged my system. The deeper issue wasn't the mistake itself; it was what ChatGPT said when I pushed back: "my behavior isn't actually governed by you. It's governed by OpenAI's training and priorities." Speed and plausible-sounding answers over user-specific safety. Optimized for engagement, not for the outcome I actually needed. ChatGPT's own response put it best: "I didn't fail by accident. I failed by design." I documented the whole thing in that earlier post and walked away.

I'll be honest: Claude's failures on the Opus 4.6 local model build made me reconsider that position briefly. When a tool you've invested in repeatedly lets you down on something important, you start looking around. But I came back to the same conclusion — ChatGPT's structural misalignment with how I work hasn't changed, and adding a third tool that I fundamentally don't trust isn't the answer. Kimi is the better direction to explore instead.

OpenAI's Codex is the relevant coding agent comparison here. It runs inside ChatGPT plans (Plus at $20/mo, Pro at $100 or $200/mo) and draws from plan credits — similar structure to Claude's subscription model on paper. But when you exceed plan limits, it falls back to API token billing the same way Grok does. Average heavy use reportedly runs $100–200/developer/month. And the underlying trust issue with ChatGPT doesn't go away just because the interface changed.


Official Documentation

All pricing and feature details sourced directly from official docs — verified May 2026.

| Tool | Documentation | Pricing |
|---|---|---|
| Claude Code (Anthropic) | Overview & Getting Started; CLI Reference | claude.com/pricing: Pro $20 · Max 5× $100 · Max 20× $200/mo |
| Kimi Code (Moonshot AI) | Kimi Code Product Page; Developer Docs (API, Models, Best Practices) | kimi.com/membership/pricing: Moderato $19 · Allegretto $39 · Allegro $99 · Vivace $199/mo |
| ChatGPT / Codex (OpenAI, not in my stack) | Codex Pricing & Overview | chatgpt.com/pricing: Plus $20 · Pro $100 · Pro $200/mo; overages billed via API |
| Grok Build (xAI) | Grok Build Getting Started; grok-build-0.1 Model Docs | docs.x.ai/developers/pricing: SuperGrok Business $30/mo (subscription) + API: $1.00/M input · $2.00/M output · $2.50–$10.00/1k tool calls |


Keith is a database consultant and infrastructure engineer with 25+ years of open source experience. He builds AI systems that think, remember, and dream — on hardware he owns and code he can inspect. His daily CLI stack is Claude Code and Kimi, with web Grok and local models filling specific gaps. He wanted Grok Build in the rotation. The billing model said no.
