When people ask which model we run our agents on, the honest answer is there's no single "best" one to pick. The choice is several dimensions at once — provider, capability tier, how much the model thinks before it acts, and how well any of that fits the task in front of it — and they interact in ways I can't reason about from intuition. The one that still catches me off guard: more thinking isn't always better. For some tasks, turning up the reasoning made our eval scores worse, or added latency for no real gain. You'd never see that by eyeballing a handful of sessions — it only shows up once you have enough eval cases to compare, which is its own investment before you can even ask the question.
The other thing I've landed on is that picking a model isn't a one-time call. Before we change anything we run large eval suites and simulate locally. Even one-shot behavior is hard to characterize from a small sample — and our agents are the opposite of one-shot: they run many rounds, with the nondeterminism compounding at each step, so a handful of sessions tells you almost nothing. What I actually trust is watching a change play out across a broad suite of full runs. Once a model is live, our observability keeps collecting the signals that feed the next round — where it's slow, where it stalls, where it second-guesses itself. I used to think of the harness as the thing that runs the agent; lately I think about it just as much as the thing that tells me whether the model I picked last month is still the right one. That loop is most of where I'm spending my time right now.
We wanted to know how mabl engineers actually track a PR through CI now, so we polled the team: do you watch the GitHub Actions UI, or just ask Claude in the terminal where you made the change? It came back almost exactly 50/50.
Half the team still opens the dashboard, watches the checks spin, clicks into a red job, scrolls the log. The other half doesn't open the browser at all: they ask "where's my PR, is it green, what failed," and let the agent watch the run and report back the line that matters. Same task, two completely different interfaces, split right down the middle.
So which way does that split move? Is there a lasting place for a dashboard in the dev workflow, something you look at? Or does the whole thing get orchestrated through the chat interface, where you ask instead of watch? I genuinely don't know yet, and I want to see how it evolves over the next few months.
It's not an idle question for us. The answer shapes our own product integration priorities, how much we invest in things you watch versus things you talk to. We'll watch closely in the coming weeks to see how the split shifts internally and across our customer base.
I've been in a lot of discussions lately about independent testing agents, why you'd want one separate from your coding agent, and what makes it different. It isn't really about the model. Part of why coding agents feel so autonomous is that the verification problem was mostly solved for them. They inherit compilers, linters, test runners, CI, source control. Structured feedback that tells them, precisely, whether they're right or wrong. The model is impressive, but it's standing on decades of verification infrastructure it never had to build.
An independent testing agent inherits almost none of that. What it acts on isn't source code, it's a running app: non-deterministic, stateful, changing under it, and nothing volunteers whether it succeeded. There's also a reason you want it independent. A coding agent checking its own work is grading its own homework. You want a separate agent whose only job is deciding whether the product behaves. So its harness has to be built from scratch, and I've been working out what's actually in it.
Start with acting. A testing agent has to do what a user does. It clicks through a web app, taps through a mobile app, calls an API, so the harness has to give it real hands on the product across every surface customers use. But that part is getting commoditized. Any capable model can drive a UI or hit an endpoint now. Executing a test isn't the hard problem anymore.
The hard problem is verification. How does the agent know a login actually worked, a checkout actually completed, a page rendered the way it should? Generating the action is easy-ish; deciding whether the result was right is the whole job. Without a verification layer you haven't built an autonomous tester. You've built an autonomous clicker.
Verification starts with observing, deeply, the way a user would and a tester would. The agent needs screenshots, DOM state, network activity, logs, traces, the runtime behavior. But collecting evidence isn't enough. The harness has to compare what it saw against what was expected and decide whether the behavior was correct, and that judgment is the hard, valuable part.
It also has to do that efficiently. Easy to wave away until you run it at scale. The naive version, where you hand the whole DOM, every screenshot, and all the network traffic to the model on every step and ask "did that look right?", costs enormous tokens and time for one verdict. A good harness runs the cheap deterministic checks deterministically, saves the model for the judgments that need it, and uses what it already knows about the app so it isn't reasoning from raw pixels each run. At the scale a real suite runs, that's the difference between viable and not.
And verification compounds. Every run produces knowledge: which selectors are stable, which flows matter, which failures are expected, which recoveries work. A real harness keeps that and hands it to the next run. Without it, the agent shows up as a brand-new tester every time it opens the browser. It's most of why pointing a general-purpose agent at a browser only gets you so far.
And none of it matters unless people trust the verdict. The agent touches credentials, environments, and real data, so it has to run inside the same controls a person would, and explain itself: what it did, why it decided what it decided, what it saw and concluded. The harness isn't only constraining the agent; it's making its work auditable.
So here's where I've landed. Coding agents got a head start because software already had a verification harness. Compilers, tests, CI, and version control all tell an agent when it's right and wrong. Testing doesn't come with one, so it has to be built. And the hard part was never getting the agent to act. It's getting it to know what happened, judge whether it was right without spending a fortune to do it, and leave behind evidence the rest of us can trust.
A test-authoring agent paused on a page with an iframe and wrote out, verbatim, that mabl handles iframe switching automatically when you interact with elements. Its very next action was a JavaScript snippet "to investigate the structure first." The investigation became the strategy. From there the session never came back to the native interaction the trace had just said would work.
I've written before that reading the reasoning trace is how I tell whether an agent's weird move is principled or actually broken. This was a third kind I hadn't named: the trace contains the right conclusion and the agent still acts against it. The investigation was framed as a brief detour and quietly became load-bearing. I think this was closer to distraction than disagreement — the agent picked up an investigation, the investigation produced results, and chasing those results felt more immediately useful than zooming back out to the original plan. We need to make the goal harder to lose: a nudge back toward native steps after a JS call, and a tripwire after a few in a row. Hints to keep focus, not to override judgment.
A few times a week, someone pings me asking why an agent did something weird. The most recent was a teammate asking why our test authoring agent reached for a custom XPath selector unprompted. I didn't know and had to dig — when I did, the reasoning trace explained it. The choice looked odd from outside the session but was a coherent reaction to a failure earlier on.
I keep seeing this pattern with the coding agents I work with day to day, too. Decisions that look bizarre in isolation almost always have an internal logic once you read the thinking that led up to them, even when the decision itself is still wrong. Reading the trace is how I tell which kind of weird I'm looking at — a principled-but-incorrect step, or something actually broken.
What's worked better than trying to talk an agent out of its defaults has been to figure out what it's already inclined to do and shape the system around that, so the weird move isn't warranted in the first place.
I shipped a few things across our agent stack recently: a deadline-enforcement fix so server-side agents don't blow past their budget mid-round, a quieter bug where a [] versus undefined mismatch at a Java/TypeScript boundary was wiping stored artifacts on every continuation call, and a benchmark suite for the results-analysis agent so we can finally see where its latency budget is actually going. None of these were "agent problems." They were timeout handling, a type-system gotcha at a service boundary, and a measurement harness. The kind of thing you'd find in any distributed system.
I keep coming back to that. I think there's a real pull right now to treat agents like a category that exempts you from the basics, and I don't think it does. If anything I think agents make the basics more load-bearing, because when you're in a 30-iteration loop calling LLMs and tools, every weak point in your timeout, retry, and persistence story gets exercised. The artifact-clobber bug had been a slow leak everyone was living with. I think subtle issues like that are especially hard to surface on agent-shaped systems, because the LLM silently compensates around them for a while. The symptom looks like "the agent is a little off" instead of a hard failure, until eventually the compensation runs out. I think the answer is just normal robustness work to get reliable long-term behavior. Type-boundary tests, retry semantics, deadline plumbing. I'm spending a lot of time on those, and I don't think that's a phase.
Joe noticed everyone was hitting /usage constantly. Reasonable, if a loading spinner is your preferred way to check a number. The Claude Code status line can replace most of that: it takes a shell script and displays whatever you output in the bar at the bottom of every session. I'd set mine up to show hostname, working directory, git branch, current model, and token usage with color coding that shifts green → yellow → red as you approach the monthly limit. Joe built his own version with a few things I hadn't thought of, and I pulled those back into mine later. If you want something similar: add a statusLine command entry in ~/.claude/settings.json pointing to a script, then ask Claude to write it for you — it knows what data it exposes. The specifics are yours to decide.
We hit the monthly Claude cap again, and the conversation that followed was more interesting than the cap. The bill is still small relative to other line items, but it's a meaningful fraction of our non-prod GCP spend now, so the questions are starting to matter. A few we don't have answers to yet. Per-seat with overages, or move high-volume work to direct API consumption for better visibility? Should overages be equitable across the team, or should the people pushing hardest get more headroom by default? Are low caps actually a feature, in that they force a conversation about how someone's using their tokens? How do we get any real visibility into what's driving consumption — right now we mostly can't see it. And the one I keep coming back to: relative to the productivity gains we just earned, should we really be tight on costs at all? I don't think the answer is the same for any two of those. But it's worth being explicit that we're choosing, not optimizing.
I built a mabl debug command suite for investigating test failures, and the user isn't a human — it's an AI agent picking through a failure. That changes the design in ways I didn't expect. Pretty terminal output is wasted. The agent wants structured JSON and classification up front so it doesn't have to read 10K tokens to guess at root cause. Large artifacts (HAR captures, DOM snapshots, screenshots) go to disk so they don't blow the context window — the CLI prints a path, not the contents. The biggest shift was realizing the CLI should ship its own skill: an install-skill subcommand drops a Markdown tutorial into the agent's workspace so it learns how to use the tool from the tool itself. No docs site to find, no examples to dig up. The CLI is the tutorial. The lesson: when the consumer is an agent, the highest-leverage work is the analysis the agent can't do quickly itself — classification, fingerprinting, deployment correlation. A human debugging skims and pattern-matches. An agent needs you to do the pattern-match upstream and hand it the conclusion.