Three agent bugs that turned out to be timeouts, type boundaries, and missing benchmarks.

I shipped a few things across our agent stack recently: a deadline-enforcement fix so server-side agents don't blow past their budget mid-round, a fix for a quieter bug where a [] versus undefined mismatch at a Java/TypeScript boundary was wiping stored artifacts on every continuation call, and a benchmark suite for the results-analysis agent so we can finally see where its latency budget is actually going. None of these were "agent problems." They were timeout handling, a type-system gotcha at a service boundary, and a measurement harness. The kind of thing you'd find in any distributed system.
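
To make the quieter one concrete, here's a minimal TypeScript sketch of the failure shape. Everything in it is illustrative: the request type, the store, and the handler names are stand-ins, not our actual service code.

```typescript
// Illustrative sketch only: these types and names are stand-ins,
// not the real service code.

interface ContinuationRequest {
  sessionId: string;
  // On the TypeScript side, `undefined` was supposed to mean
  // "leave the stored artifacts alone."
  artifacts?: string[];
}

// Stand-in for the persistence layer.
const artifactStore = new Map<string, string[]>();

function handleContinuationBuggy(req: ContinuationRequest): void {
  // The Java side has no way to say "absent" and serializes the field
  // as an empty list, so every continuation arrives with artifacts: []
  // and this branch overwrites whatever was stored.
  if (req.artifacts !== undefined) {
    artifactStore.set(req.sessionId, req.artifacts);
  }
}

function handleContinuationFixed(req: ContinuationRequest): void {
  // One fix: treat "absent" and "empty" the same at this boundary,
  // since the upstream serializer collapses them anyway.
  if (req.artifacts !== undefined && req.artifacts.length > 0) {
    artifactStore.set(req.sessionId, req.artifacts);
  }
}

// With the fix, a continuation carrying no new artifacts is a no-op:
artifactStore.set("s1", ["report.md"]);
handleContinuationFixed({ sessionId: "s1", artifacts: [] });
console.log(artifactStore.get("s1")); // ["report.md"]
```

Whether the right fix is to collapse "absent" and "empty" or to make the wire format distinguish them is a judgment call per boundary; either way, it's an ordinary contract test away from being caught.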

I keep coming back to that. There's a real pull right now to treat agents like a category that exempts you from the basics, and I don't think it does. If anything, agents make the basics more load-bearing, because when you're in a 30-iteration loop calling LLMs and tools, every weak point in your timeout, retry, and persistence story gets exercised. The artifact-clobber bug had been a slow leak everyone was living with. Subtle issues like that are especially hard to surface in agent-shaped systems, because the LLM silently compensates for them for a while. The symptom looks like "the agent is a little off" instead of a hard failure, until eventually the compensation runs out. The answer, I think, is just normal robustness work to get reliable long-term behavior: type-boundary tests, retry semantics, deadline plumbing. I'm spending a lot of time on those, and I don't think that's a phase.
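
The deadline plumbing is the same kind of unglamorous. Here's a minimal sketch of the idea, assuming hypothetical callModel and finalize stubs in place of the real LLM and tool calls: check the remaining budget between rounds, and thread it into each call so the loop exits cleanly instead of blowing past the deadline mid-round.

```typescript
// A sketch of the idea, not our implementation. callModel and finalize
// are hypothetical stubs standing in for the real LLM/tool calls.

const MIN_ROUND_MS = 2_000; // don't start a round we can't plausibly finish

interface Deadline {
  readonly expiresAt: number; // epoch millis, fixed at request entry
}

const remainingMs = (d: Deadline): number => d.expiresAt - Date.now();

// Stand-in for an LLM call that honors a per-call timeout.
async function callModel(
  context: string,
  opts: { timeoutMs: number },
): Promise<{ output: string; done: boolean }> {
  return { output: context, done: false };
}

function finalize(context: string, reason: "done" | "deadline"): string {
  return reason === "deadline"
    ? `[returning best partial result at deadline] ${context}`
    : context;
}

async function runAgent(prompt: string, deadline: Deadline): Promise<string> {
  let context = prompt;
  for (let round = 0; round < 30; round++) {
    const budget = remainingMs(deadline);
    // Enforce the deadline between rounds so the agent exits cleanly
    // instead of discovering mid-call that the budget is gone.
    if (budget < MIN_ROUND_MS) return finalize(context, "deadline");

    // Thread the remaining budget into the call: the model request
    // times out with the overall deadline, not some fixed global value.
    const step = await callModel(context, { timeoutMs: budget });
    if (step.done) return finalize(step.output, "done");
    context = step.output;
  }
  return finalize(context, "done");
}
```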

More posts

Why I've started channeling agent defaults instead of fighting them

A few times a week, someone pings me asking why an agent did something weird. The most recent was a teammate asking why our test authoring agent reached for a custom XPath selector unprompted. I didn't know and had to dig — when I did, the reasoning trace explained it. The choice looked odd from outside the session but was a coherent reaction to a failure earlier on.

I keep seeing this pattern with the coding agents I work with day to day, too. Decisions that look bizarre in isolation almost always have an internal logic once you read the thinking that led up to them, even when the decision itself is still wrong. Reading the trace is how I tell which kind of weird I'm looking at — a principled-but-incorrect step, or something actually broken.

What's worked better than trying to talk an agent out of its defaults has been to figure out what it's already inclined to do and shape the system around that, so the weird move isn't warranted in the first place.

Rockout to the stockout: 117,000 CI jobs in 30 days

Now that devs can readily integrate 10 PRs on a slow Monday, you'd better be serious about CI/CD (says the DevOps guy). My coworker just kicked off a CI job that used 3,000 cores. Did she bat an eyelash? Nah — it's $4, it'll get us some useful answers. Our compute provider hit a regional stockout (wasn't me) and we auto-routed around it. Our modest eng team ran 117,000 CI jobs in the last 30 days. About 4,000 jobs per contributor. All worth it when you've got a half-dozen agents coding, fixing, and validating on your behalf. Rockout to the stockout. Bits are cheap, light is fast, life is short. LFG.

Asking the test authoring agent what tools it wanted

There's a growing recognition in the industry that designing for agents-as-users is a different problem than designing for human users. We've been working through this on my team for a while, and I think it's worth naming what we've actually been doing.

When I was scoping the recent refactor of our test authoring agent's tool surface, I had the agent itself analyze its own tools and write me a report — what surprised it, what felt redundant, what it wanted that wasn't there. That report shaped the tool list in a recent upgrade that made the authoring agent a lot more capable. My teammate Anja did something similar with the new results-analysis tool, asking a coding agent to explain why it kept reaching for one tool over another, so the design would hold up when exposed through MCP.

The pattern across both: when you're building something for an agent to use, the agent itself is the closest user-research subject you have. I want to reach for this approach earlier next time.

Where we're going, we don't need IDEs

I haven't opened IntelliJ Ultimate in months — best tool, btw. I say this at conferences and people look at me in disbelief. You only need an IDE if you're reading or writing the code yourself. That's very 25Q4. My setup now: tricked-out tmux and eight Claude Code sessions running in parallel. The reason this works at all: years of investing in CI automation, linting, test coverage, and reviewer tooling. Those bets are paying off. Without that scaffolding, eight parallel agents would just be 8x the ways to break main. My job is approving the PRs, challenging the assumptions and the designs, keeping the agents honest. I'm here to spot the square wheels, catch the BS, avoid the foot guns, and keep this contraption a cohesive whole. Type 2K lines yourself, then spend all day reviewing them? No, it's 2026, y'all. We've got tools for that. LFG.

It's time for Beast Mode: be uncomfortably motivated

I'm an efficiency addict. I eschew slowness. At my first tech job out of college in 2010, I brought my own pair of widescreen LCDs into the office because the standard 17" square was unworkable — facilities was annoyed. I bought 3x the RAM with my own cash and upgraded the machine; IT warned I might burn the building down. I used Cygwin and scp instead of CMD and drag-and-drop, and management called me "uncomfortably motivated." Sixteen years on, we have agents with whale-size brains running parallel jobs while we sleep. Hardware helps — 128GiB of RAM, four monitors, 2-gig fiber, 32 cores. But the rig is just one example. Buy your own gear if you have to. Install the better tool. Ignore the polite limit. This is the time to literally be a 100x engineer. LFG.