Lauren Leidal

agent platform

Why I've started channeling agent defaults instead of fighting them

A few times a week, someone pings me asking why an agent did something weird. The most recent was a teammate asking why our test authoring agent reached for a custom XPath selector unprompted. I didn't know and had to dig — when I did, the reasoning trace explained it. The choice looked odd from outside the session but was a coherent reaction to a failure earlier on.

I keep seeing this pattern with the coding agents I work with day to day, too. Decisions that look bizarre in isolation almost always have an internal logic once you read the thinking that led up to them, even when the decision itself is still wrong. Reading the trace is how I tell which kind of weird I'm looking at — a principled-but-incorrect step, or something actually broken.

What's worked better than trying to talk an agent out of its defaults has been to figure out what it's already inclined to do and shape the system around that, so the weird move isn't warranted in the first place.

Asking the test authoring agent what tools it wanted

There's a growing recognition in the industry that designing for agents-as-users is a different problem than designing for human users. We've been working through this on my team for a while, and I think it's worth naming what we've actually been doing.

When I was scoping the recent refactor of our test authoring agent's tool surface, I had the agent itself analyze its own tools and write me a report — what surprised it, what felt redundant, what it wanted that wasn't there. That report shaped the tool list in a recent upgrade that made the authoring agent a lot more capable. My teammate, Anja, did something similar with the new results analysis tool, asking a coding agent to explain why it kept reaching for one tool over another, so the design would hold up through MCP.

The pattern across both: when you're building something for an agent to use, the agent itself is the closest user-research subject you have. I want to reach for this approach earlier next time.

Why I think we need to be picky about our agent configuration

We hit a wave of 400s on Gemini 2.5 yesterday that turned out to be a useful kick. The short version is that in required tool-calling mode, Gemini 2.5 has trouble with the size of our test authoring agent's tool library, while Gemini 3.1 doesn't. The fix is to switch to a less strict mode, which is mostly what I'm doing — but it isn't a free flip. With required mode we've been quietly assuming the model always responds with a tool call, and the looser mode means it sometimes won't, which today would corrupt the generation session. So the fix carries its own risk that I have to handle deliberately.
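Concretely, the handling looks something like the guard below. This is a minimal sketch with made-up names for our session loop (generateTurn, runTool, SessionState), not the real code; the point is just that a turn without a tool call has to be a first-class case instead of an assumed impossibility.

```typescript
// Minimal sketch of the guard the looser tool-calling mode requires.
// generateTurn, runTool, and SessionState are stand-ins for our session
// loop, not a real SDK surface.

interface ModelTurn {
  toolCall?: { name: string; args: Record<string, unknown> };
  text?: string;
}

interface SessionState {
  transcript: string[];
  pendingToolResult?: unknown;
}

async function step(
  state: SessionState,
  generateTurn: (state: SessionState) => Promise<ModelTurn>,
  runTool: (name: string, args: Record<string, unknown>) => Promise<unknown>,
): Promise<SessionState> {
  const turn = await generateTurn(state);

  // In required mode this branch never fires, so we never wrote it. In the
  // looser mode the model may answer with plain text, and treating that
  // text as a tool call is what would corrupt the generation session.
  if (!turn.toolCall) {
    return {
      ...state,
      transcript: [...state.transcript, turn.text ?? ""],
      pendingToolResult: undefined,
    };
  }

  const result = await runTool(turn.toolCall.name, turn.toolCall.args);
  return { ...state, pendingToolResult: result };
}
```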

The bigger thing is that the strictness of the tool-calling mode and the size of the tool library both have real costs, and they compound. Each new tool widens the state space the model has to compile in required mode, and it widens the assumptions our own session loop has to maintain. None of that was visible until yesterday — and it's easy to keep adding "just one more" tool, or to leave the mode set to required, when each addition feels small. I think we need to be more intentional about how we configure our agents — both which tools earn a slot, and what we're asking the model to commit to in return. Some of those tools could probably be lazy-loaded the way our skill instructions already are.
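For the lazy-loading idea, here is roughly the shape I have in mind. This is a hypothetical sketch with invented tool names, not something we've built: only a small core surface plus a loader is declared up front, and a toolset's full definitions enter the session when the agent asks for them.

```typescript
// Hypothetical shape for lazy-loading tools, mirroring how our skill
// instructions are already pulled in on demand. All names are invented.

type ToolDef = { name: string; description: string };

const coreTools: ToolDef[] = [
  { name: "read_test_case", description: "Fetch an existing test case by id" },
  { name: "load_toolset", description: "Load extra tools for a named area, e.g. 'selectors'" },
];

const lazyToolsets: Record<string, ToolDef[]> = {
  selectors: [{ name: "suggest_selector", description: "Propose a stable selector for an element" }],
  assertions: [{ name: "draft_assertion", description: "Draft an assertion for an expected page state" }],
};

// The surface the model has to reason over (and that required mode has to
// compile) stays small until the agent explicitly asks for a toolset.
function activeTools(loadedToolsets: Set<string>): ToolDef[] {
  const extras = [...loadedToolsets].flatMap((key) => lazyToolsets[key] ?? []);
  return [...coreTools, ...extras];
}
```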

Three agent bugs that turned out to be timeouts, type boundaries, and missing benchmarks

I shipped a few things across our agent stack recently: a deadline-enforcement fix so server-side agents don't blow past their budget mid-round, a fix for a quieter bug where a [] versus undefined mismatch at a Java/TypeScript boundary was wiping stored artifacts on every continuation call, and a benchmark suite for the results-analysis agent so we can finally see where its latency budget is actually going. None of these were "agent problems." They were timeout handling, a type-system gotcha at a service boundary, and a measurement harness. The kind of thing you'd find in any distributed system.
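To make the quiet one concrete, here is a stripped-down reconstruction of that kind of [] versus undefined mismatch. The field names and merge logic are invented for illustration, not the actual code at our boundary.

```typescript
// Stripped-down reconstruction of an [] versus undefined mismatch at a
// serialization boundary. Field names and merge logic are invented for
// illustration; this is not the actual code.

interface StoredSession {
  artifacts: string[];
}

interface ContinuationPayload {
  // In this reconstruction, the Java side serializes "nothing new this
  // round" as an empty list rather than omitting the field.
  artifacts?: string[];
}

// Buggy merge: ?? only falls back on null/undefined, so an empty list
// sails through and replaces the stored artifacts on every continuation.
function mergeBuggy(stored: StoredSession, incoming: ContinuationPayload): StoredSession {
  return { artifacts: incoming.artifacts ?? stored.artifacts };
}

// One possible fix: treat the payload as additive, so a missing or empty
// list leaves the stored artifacts alone.
function mergeFixed(stored: StoredSession, incoming: ContinuationPayload): StoredSession {
  return { artifacts: [...stored.artifacts, ...(incoming.artifacts ?? [])] };
}
```

Whichever default the boundary picks, the durable fix is being explicit about whether a continuation payload appends or replaces, instead of inferring intent from an empty value.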

I keep coming back to that. I think there's a real pull right now to treat agents like a category that exempts you from the basics, and I don't think it does. If anything I think agents make the basics more load-bearing, because when you're in a 30-iteration loop calling LLMs and tools, every weak point in your timeout, retry, and persistence story gets exercised. The artifact-clobber bug had been a slow leak everyone was living with. I think subtle issues like that are especially hard to surface on agent-shaped systems, because the LLM silently compensates around them for a while. The symptom looks like "the agent is a little off" instead of a hard failure, until eventually the compensation runs out. I think the answer is just normal robustness work to get reliable long-term behavior. Type-boundary tests, retry semantics, deadline plumbing. I'm spending a lot of time on those, and I don't think that's a phase.
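As a sketch of what the deadline plumbing amounts to (hypothetical names, simplified shape): check the remaining budget before a round starts, not after it has already run long.

```typescript
// Minimal sketch of deadline plumbing through an agent loop, with
// hypothetical names and a simplified shape.

interface Round {
  done: boolean;
}

async function runAgentLoop(
  budgetMs: number,
  estimatedRoundMs: number,
  runRound: (remainingMs: number) => Promise<Round>,
): Promise<"completed" | "deadline" | "max_rounds"> {
  const deadline = Date.now() + budgetMs;

  for (let i = 0; i < 30; i++) {
    const remainingMs = deadline - Date.now();

    // Check before the round starts: the difference between a clean
    // "deadline" outcome and an agent that blows past its budget mid-round.
    if (remainingMs < estimatedRoundMs) {
      return "deadline";
    }

    const round = await runRound(remainingMs);
    if (round.done) {
      return "completed";
    }
  }
  return "max_rounds";
}
```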