The questions get harder when the bill gets real.

We hit the monthly Claude cap again, and the conversation that followed was more interesting than the cap. The bill is still small relative to other line items, but it's a meaningful fraction of our non-prod GCP spend now, so the questions are starting to matter. A few we don't have answers to yet: Stay per-seat with overages, or move high-volume work to direct API consumption for better visibility? Should overages be equitable across the team, or should the people pushing hardest get more headroom by default? Are low caps actually a feature, in that they force a conversation about how someone's using their tokens? How do we get real visibility into what's driving consumption, when right now we mostly can't see it? And the one I keep coming back to: relative to the productivity gains we just earned, should we really be tight on costs at all? I don't think the answer is the same for any two of those. But it's worth being explicit that we're choosing, not optimizing.

More posts

Our best reviewer was already a prompt.

I built a code-review subagent and named it after Mauro. This is not a joke about Mauro — Mauro really is our best reviewer. He reads the code with his eyeballs. He suggests an enum every time he sees three magic numbers in a row. He reads the strings inside the code, notices when "Error fetching MauroAgent data" should have been a template literal, and tells you. He won't accept `eslint-disable-next-line` without a reason. When he sees a prompt he can't follow, he says "if I can't understand it, the LLM won't either."

I wrote those rules down. That was the agent. It took an afternoon.
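In reality this lives as a Claude Code subagent prompt, but a minimal sketch of the same idea as a direct API call makes the point: the agent is mostly the rules, written down. The model id, prompt wording, and function names below are placeholders of mine, not the actual agent.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// A sketch, not the real subagent: the rules from the post, written down as a system prompt.
const MAURO_RULES = `
You are a code reviewer. Apply these rules to the diff, in order:
1. Three or more related magic numbers: suggest an enum.
2. Read the string literals too; flag hardcoded values that should have been template literals.
3. Reject any eslint-disable-next-line that has no stated reason.
4. If a prompt in the diff is hard for you to follow, say so; the LLM won't follow it either.
Report findings as a list, quoting the relevant lines.
`;

async function review(diff: string): Promise<string> {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
  const message = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 2048,
    system: MAURO_RULES,
    messages: [{ role: "user", content: `Review this diff:\n\n${diff}` }],
  });

  // Collect the text blocks from the response.
  const parts: string[] = [];
  for (const block of message.content) {
    if (block.type === "text") parts.push(block.text);
  }
  return parts.join("\n");
}
```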

Mauro's reaction was that I replaced him because I got tired of waiting for his reviews. That part is also true. The interesting thing is how little of his review style I had to invent — most of it was already a list of habits he applies in the same order to every PR. The reviewers we trust most are the ones whose taste is the most legible. Turns out legible taste compiles.

The cheapest auto-publish architecture is a queue with a click.

I tried to make /fab-note auto-publish posts the moment a draft is confirmed. There were two paths to do that, and both required carving an exception out of our org's branch protection: either a bot identity in the bypass list or a PAT scoped to repository admin. Neither is wrong, but each is a small concession against the policy that says "every change to main gets reviewed." So I stopped pushing on auto-merge and made the workflow assign the PR to me instead. Total clock time from "ship it" to live: about two minutes, most of it waiting on CI. The clicks in between (one approval, one merge) take three seconds, and they preserve the property that a human approved the change. I was solving for the wrong thing. The friction of "wait for CI, click approve, click merge" is invisible to the author because they've moved on to the next thing by the time it's their turn. The friction of "build a policy exception around a fast publish path" is permanent and visible to anyone who later asks why this repo bypasses the rules. The cheapest version of automation is the one that lets the existing policy do its job.
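For concreteness, the "queue with a click" is roughly this, sketched with Octokit rather than the actual /fab-note workflow: open the PR for the confirmed draft and put a human on it, then let branch protection and CI do the rest. The org, repo, and branch names are placeholders.

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "example-org"; // placeholder
const repo = "example-blog"; // placeholder

// Open a PR for the draft and assign it to a human instead of auto-merging.
async function queueForClick(draftBranch: string, author: string): Promise<void> {
  const { data: pr } = await octokit.rest.pulls.create({
    owner,
    repo,
    title: `Publish: ${draftBranch}`,
    head: draftBranch,
    base: "main",
  });

  // No bypass list, no admin PAT: a human approves and merges once CI is green.
  await octokit.rest.issues.addAssignees({
    owner,
    repo,
    issue_number: pr.number,
    assignees: [author],
  });
}
```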

When your user is an agent, the CLI design changes.

I built a `mabl debug` command suite for investigating test failures, and the user isn't a human — it's an AI agent picking through a failure. That changes the design in ways I didn't expect. Pretty terminal output is wasted. The agent wants structured JSON and classification up front so it doesn't have to read 10K tokens to guess at root cause. Large artifacts (HAR captures, DOM snapshots, screenshots) go to disk so they don't blow the context window — the CLI prints a path, not the contents. The biggest shift was realizing the CLI should ship its own skill: an `install-skill` subcommand drops a Markdown tutorial into the agent's workspace so it learns how to use the tool from the tool itself. No docs site to find, no examples to dig up. The CLI is the tutorial. The lesson: when the consumer is an agent, the highest-leverage work is the analysis the agent can't do quickly itself — classification, fingerprinting, deployment correlation. A human debugging skims and pattern-matches. An agent needs you to do the pattern-match upstream and hand it the conclusion.
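None of the following is the actual mabl CLI; it's a hypothetical sketch of the output contract described above: the conclusion up front as structured JSON, and the heavy artifacts written to disk with only their paths in the payload.

```typescript
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical result shape: classification and fingerprint first,
// raw evidence referenced by path so it never lands in the context window.
interface FailureReport {
  classification: "selector_drift" | "backend_error" | "timeout" | "unknown";
  fingerprint: string; // stable hash, so the agent can ask "have we seen this before?"
  suspectedDeploy?: string; // deployment correlated with the first occurrence, if any
  artifacts: { har: string; dom: string; screenshot: string }; // paths, not contents
}

async function emitForAgent(har: string, dom: string, screenshot: Buffer): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), "debug-"));
  const artifacts = {
    har: join(dir, "capture.har"),
    dom: join(dir, "dom.html"),
    screenshot: join(dir, "failure.png"),
  };
  await writeFile(artifacts.har, har);
  await writeFile(artifacts.dom, dom);
  await writeFile(artifacts.screenshot, screenshot);

  const report: FailureReport = {
    classification: "selector_drift", // placeholder; real classification happens upstream
    fingerprint: "sha256:placeholder",
    artifacts,
  };
  // Structured JSON on stdout is the whole interface; no pretty terminal output.
  console.log(JSON.stringify(report, null, 2));
}
```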

Treat agentic capacity as a portfolio, not a per-seat allocation.

Our Anthropic premium seats each include $150 of overage per month. We're way past that for a lot of people, and we're going to be further past it next quarter. The real cost is the extra usage, not the seat: think of the seat as a promotional teaser and overage as the true price. So we're treating it like a portfolio. We pre-purchase 1,000 credits at a 30% discount. We're looking at moving high-volume workloads to direct API consumption for better visibility. By the end of the year I doubt any of us will stay within seat allocation for full-time work, at least not without giving up the productivity gains we just earned. The companies that figure out the cost structure of agentic work as a separate discipline from "give engineers more tools" are going to have a real edge. The ones that don't are going to be surprised by their bill.

Validation should be a sub-agent, not a checklist.

I've been thinking about where verification belongs in an agentic pipeline. The shape I keep coming back to is a quality validation sub-agent that runs *before* PR submission: its job is to come up with a validation plan for the change and carry it out, including running the relevant mabl tests, capturing evidence, and attaching that evidence to the PR. The PR review agent then enforces the existence of the validation plan, not the rules underneath it. The full mabl suite runs on merge, and when something breaks, a failure-analysis skill identifies which PR introduced it and suggests fixes.

The reason for that structure: at our throughput, "did the engineer remember to validate this" is the wrong question. The right question is "does the PR carry evidence that validation happened, and does the evidence hold up?" Sub-agents do the validation; the review agent checks the evidence; the merge gate trusts the chain. None of the layers is doing checklist work — each one has a specific decision to make. Most teams that try to add AI to their existing CI/CD end up with checklist agents because that's the shape of CI/CD. I don't think that's where this lands.
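To make the evidence idea concrete, here's a hypothetical shape for what the validation sub-agent might attach to a PR, and the kind of check the review agent would run against it. The field names are mine, not a spec.

```typescript
// Hypothetical evidence payload a validation sub-agent attaches to a PR.
interface ValidationEvidence {
  plan: string[]; // the validation plan, step by step
  runs: { testId: string; status: "passed" | "failed"; runUrl: string }[];
  headSha: string; // commit the evidence was gathered against
  capturedAt: string; // ISO timestamp
}

// The review agent doesn't re-run the plan; it checks that evidence exists and holds up.
function evidenceProblems(evidence: ValidationEvidence | undefined, prHeadSha: string): string[] {
  if (!evidence) return ["no validation evidence attached to the PR"];
  const problems: string[] = [];
  if (evidence.plan.length === 0) problems.push("validation plan is empty");
  if (evidence.runs.length === 0) problems.push("plan was never executed");
  if (evidence.headSha !== prHeadSha) problems.push("evidence is stale: gathered against a different commit");
  if (evidence.runs.some((run) => run.status === "failed")) problems.push("evidence includes failing runs");
  return problems; // empty list means the chain holds
}
```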