Posts tagged "automation"

14 posts · all topics

The Cost of Efficiency

AI has made me an incredibly productive worker. I don't need to ask a coworker when I get stuck or brainstorm with teammates. I get answers instantly and (mostly) correctly. While it's fun to be this autonomous, asking for help was exactly how I built work relationships in the first place. I'm getting more done in 40 hours than ever before, but somehow it feels like less. In engineering away our bottlenecks, we've quietly engineered away each other.

Sorry Claude, Gonna Need You to Come in on Saturday

7 days in a week, 7 days in a token budget. Why is your agent at the beach on Saturday? Think of all the chunky tech debt projects nobody ever has time for. That's what agents are for.

I had an API with hundreds of endpoints, and I wanted to refactor every one of them to a more modern, robust, faster framework. Who has time to rotely refactor controllers and re-validate that _nothing_ broke? Claude does, with a /goal.

The whole thing is unlocked by tests we wrote years ago. Thousands of API-level validation tests and end-to-end suites for the web apps that consume the API — that's the feedback signal a /goal actually needs. "Get the suites green without changing the clients or the interfaces, only the server implementation." That's it. From there our CI does the rest: every PR spins up a deploy preview, fires the full cloud regression suite, and reports back. The agent runs permutations across branches in parallel and validates each one on its own.

While you were at the beach worrying about how much sand your kids would track into the car, Claude burned down a major chunk of the tech debt backlog. LFG.

What makes a testing agent different from a coding agent?

I've been in a lot of discussions lately about independent testing agents, why you'd want one separate from your coding agent, and what makes it different. It isn't really about the model. Part of why coding agents feel so autonomous is that the verification problem was mostly solved for them. They inherit compilers, linters, test runners, CI, source control. Structured feedback that tells them, precisely, whether they're right or wrong. The model is impressive, but it's standing on decades of verification infrastructure it never had to build.

An independent testing agent inherits almost none of that. What it acts on isn't source code, it's a running app: non-deterministic, stateful, and changing under it, with nothing volunteering whether an action succeeded. There's also a reason you want it independent. A coding agent checking its own work is grading its own homework. You want a separate agent whose only job is deciding whether the product behaves. So its harness has to be built from scratch, and I've been working out what's actually in it.

Start with acting. A testing agent has to do what a user does. It clicks through a web app, taps through a mobile app, calls an API, so the harness has to give it real hands on the product across every surface customers use. But that part is getting commoditized. Any capable model can drive a UI or hit an endpoint now. Executing a test isn't the hard problem anymore.

The hard problem is verification. How does the agent know a login actually worked, a checkout actually completed, a page rendered the way it should? Generating the action is easy-ish; deciding whether the result was right is the whole job. Without a verification layer you haven't built an autonomous tester. You've built an autonomous clicker.

Verification starts with observing, deeply, the way a user would and a tester would. The agent needs screenshots, DOM state, network activity, logs, traces, the runtime behavior. But collecting evidence isn't enough. The harness has to compare what it saw against what was expected and decide whether the behavior was correct, and that judgment is the hard, valuable part.

It also has to do that efficiently. That's easy to wave away until you run it at scale. The naive version, where you hand the whole DOM, every screenshot, and all the network traffic to the model on every step and ask "did that look right?", costs an enormous amount of tokens and time for one verdict. A good harness runs the cheap deterministic checks deterministically, saves the model for the judgments that need it, and uses what it already knows about the app so it isn't reasoning from raw pixels each run. At the scale a real suite runs, that's the difference between viable and not.
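Here's a minimal sketch of that tiering, to make the idea concrete. Everything in it is hypothetical: the `Observation` fields, the "Sign out" marker, and the `ask_model` callback are stand-ins I made up for illustration, not how any real harness is built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    """Evidence captured after one step (all fields are illustrative)."""
    status_code: int
    url: str
    dom_text: str


@dataclass
class Verdict:
    passed: bool
    decided_by: str  # "deterministic" or "model"
    reason: str


def verify_login(obs: Observation, ask_model: Callable[[str], bool]) -> Verdict:
    # Tier 1: cheap, deterministic checks. No tokens spent.
    if obs.status_code >= 400:
        return Verdict(False, "deterministic", f"HTTP {obs.status_code}")
    if "/login" in obs.url:
        return Verdict(False, "deterministic", "still on the login page")
    if "Sign out" in obs.dom_text:
        return Verdict(True, "deterministic", "found the signed-in marker")

    # Tier 2: the evidence is ambiguous, so only now pay for a model
    # judgment, and send a trimmed excerpt rather than the whole DOM.
    excerpt = obs.dom_text[:500]
    ok = ask_model(f"Does this page show a signed-in user?\n{excerpt}")
    return Verdict(ok, "model", "model judged a trimmed excerpt")
```

The design choice worth noticing is the ordering: the model is the last resort, and each verdict records which tier produced it, so you can later measure how often you're actually paying for judgment.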

And verification compounds. Every run produces knowledge: which selectors are stable, which flows matter, which failures are expected, which recoveries work. A real harness keeps that and hands it to the next run. Without it, the agent shows up as a brand-new tester every time it opens the browser. It's most of why pointing a general-purpose agent at a browser only gets you so far.
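One way to picture "keeps that and hands it to the next run" is a small store that survives between sessions. This is a sketch under my own assumptions: the `RunMemory` class, the JSON file, and the pass/fail counting are illustrative, and a real harness would track far more than selectors.

```python
import json
from pathlib import Path


class RunMemory:
    """Keeps per-selector pass/fail counts in a JSON file between runs."""

    def __init__(self, path: Path):
        self.path = path
        # Load what earlier runs learned, or start empty on the first run.
        self.stats = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    def record(self, selector: str, worked: bool) -> None:
        entry = self.stats.setdefault(selector, {"ok": 0, "fail": 0})
        entry["ok" if worked else "fail"] += 1

    def stability(self, selector: str) -> float:
        entry = self.stats.get(selector)
        if not entry:
            return 0.0  # never seen: treat as unknown, not as stable
        return entry["ok"] / (entry["ok"] + entry["fail"])

    def best(self, candidates: list[str]) -> str:
        # Prefer the selector with the best track record; ties keep input order.
        return max(candidates, key=self.stability)

    def save(self) -> None:
        self.path.write_text(json.dumps(self.stats, indent=2), encoding="utf-8")
```

A run would call `record` as it goes and `save` at the end; the next run constructs `RunMemory` from the same path and starts with everything the last one learned.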

And none of it matters unless people trust the verdict. The agent touches credentials, environments, and real data, so it has to run inside the same controls a person would, and explain itself: what it did, why it decided what it decided, what it saw and concluded. The harness isn't only constraining the agent; it's making its work auditable.

So here's where I've landed. Coding agents got a head start because software already had a verification harness. Compilers, tests, CI, and version control all tell an agent when it's right or wrong. Testing doesn't come with one, so it has to be built. And the hard part was never getting the agent to act. It's getting it to know what happened, judge whether it was right without spending a fortune to do it, and leave behind evidence the rest of us can trust.

When a build fails, an agent reads the GitHub logs and runs /mabl-debug for you

A build fails. Now what? You go to GitHub, stare at a wall of actions, hunt for the red one, open it, and dig around to figure out what actually happened. And if it turns out to be a mabl deployment, you click the link and start all over again inside mabl. Forget about it. Nobody wants to do that.

So I made it stop. I built a github-build-explorer agent that runs on Haiku — cheap and fast, exactly right for digging through logs. It finds the red build and tells you what broke. And if it's a mabl deployment, /mabl-debug kicks in: it pulls the deployment, the failure analysis, the recovery sessions, goes hunting for the cause in the code, and reproduces the failure in a real browser with our new local debugger for agents.

It gets better. You can ask Claude to open the PR, wait for the red, and fix it — all on its own. And the best part: before it pushes anything back, the debugger re-runs the test to confirm it actually fixed the issue. Not "probably fixed." Fixed.

Why I've started channeling agent defaults instead of fighting them

A few times a week, someone pings me asking why an agent did something weird. The most recent was a teammate asking why our test authoring agent reached for a custom XPath selector unprompted. I didn't know and had to dig — when I did, the reasoning trace explained it. The choice looked odd from outside the session but was a coherent reaction to a failure earlier on.

I keep seeing this pattern with the coding agents I work with day to day, too. Decisions that look bizarre in isolation almost always have an internal logic once you read the thinking that led up to them, even when the decision itself is still wrong. Reading the trace is how I tell which kind of weird I'm looking at — a principled-but-incorrect step, or something actually broken.

What's worked better than trying to talk an agent out of its defaults has been to figure out what it's already inclined to do and shape the system around that, so the weird move isn't warranted in the first place.

Where we're going, we don't need IDEs

I haven't opened IntelliJ Ultimate in months (best tool, btw). I say this at conferences and people look at me in disbelief. You only need an IDE if you're reading or writing the code yourself. That's very 25Q4. My setup now: tricked-out tmux and eight Claude Code sessions running in parallel. The reason this works at all: years of investing in CI automation, linting, test coverage, and reviewer tooling. Those bets are paying off. Without that scaffolding, eight parallel agents would just be 8x the ways to break main. My job is approving the PRs, challenging the assumptions and the designs, keeping the agents honest. I'm here to spot the square wheels, catch the BS, avoid the footguns, and keep this machine a cohesive whole. Type 2K lines yourself, then spend all day reviewing them? No, it's 2026, y'all. We've got tools for that. LFG.

It's time for Beast Mode: be uncomfortably motivated

I'm an efficiency addict. I eschew slowness. First tech job out of college in 2010, I brought my own pair of widescreen LCDs into the office because the standard 17" square was unworkable — facilities was annoyed. I bought 3x the RAM with my own cash and upgraded the machine; IT warned I might burn the building down. I used Cygwin and scp instead of CMD and drag-and-drop, and management called me "uncomfortably motivated." Sixteen years on, we have agents with whale-size brains running parallel jobs while we sleep. Hardware helps — 128GiB of RAM, four monitors, 2-gig fiber, 32 cores. But the rig is just one example. Buy your own gear if you have to. Install the better tool. Ignore the polite limit. This is the time to literally be a 100x engineer. LFG.

Recovering human data vacuum: scheduled agents on alerts, opex, logs, weather

I'm a recovering human data vacuum. There's never enough time to watch dashboards, sift the overnight 5xx spike, scroll service logs, eyeball opex, and then go build something. So I stopped doing it. I have scheduled Claude agents running against monitoring alerts, opex, service logs, and yes, today's weather (boots for my tot?). They run on a cron, do the boring analysis, and only ping me if something is actually sliding sideways. Slack DMs and @mentions land in my client; email is for dinosaurs. The point is to protect my own context window: every minute I spend triaging a chart that turned out to be fine is a minute I'm not building the next thing. Agents are unreasonably good at the "skim a wall of telemetry, surface the one weird thing" job. So I let them. LFG.
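The "only ping me if something is actually sliding sideways" part can start with a gate as dumb as this one. It's a sketch, not my actual setup: the `should_ping` name, the 3x factor, and the floor of 20 errors are all assumptions picked for illustration.

```python
from statistics import median


def should_ping(hourly_5xx: list[int], latest: int,
                factor: float = 3.0, floor: int = 20) -> bool:
    """Return True only when the latest hour looks like a real spike.

    hourly_5xx: recent hourly 5xx counts used as the baseline.
    factor, floor: illustrative thresholds, not tuned values.
    """
    if not hourly_5xx:
        # No history yet: fall back to the absolute floor alone.
        return latest >= floor
    baseline = median(hourly_5xx)
    # Require both an absolute floor and a jump relative to the baseline,
    # so a quiet service going from 1 to 4 errors doesn't page anyone.
    return latest >= floor and latest > factor * baseline
```

Anything that passes the gate is what gets handed to the agent for the actual "what is this and does it matter" analysis.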

The last 90 days at mabl: 2.5x the PRs, almost 2x the code, nearly the same number of people.

I compared engineering output from the last 90 days against the 90 days before it. PRs went from 811 to 2,068, about 2.5 times as many. Lines of code shipped almost doubled. The number of active authors barely moved, from 31 to 35 (about 13% more), so almost all of the lift is per-engineer throughput, not headcount. PR count also grew faster than LOC, which means PRs are getting smaller on average: more, tighter changes instead of bigger ones. Reviews almost doubled too, which tracks. The thing I'm looking for now is the bottleneck. What's actually holding people back at this throughput: human review, testing, scoping, something else? That's where I want to spend the next month.
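For anyone who wants to check the arithmetic, here it is spelled out. The PR and author counts are the ones quoted above. The LOC figure is not: the post only says "almost doubled", so the 2.0 below is an upper bound I'm assuming, not a measured number.

```python
# Figures quoted in the post.
prs_before, prs_after = 811, 2068
authors_before, authors_after = 31, 35

pr_growth = prs_after / prs_before              # roughly 2.55x
author_growth = authors_after / authors_before  # roughly 1.13x

per_author_before = prs_before / authors_before           # roughly 26 PRs per author
per_author_after = prs_after / authors_after              # roughly 59 PRs per author
per_author_growth = per_author_after / per_author_before  # roughly 2.26x

# ASSUMPTION: "almost doubled" is treated as an upper bound of 2.0x.
loc_growth_upper_bound = 2.0
# Average lines per PR, relative to the earlier window (at most ~0.78).
loc_per_pr_ratio = loc_growth_upper_bound / pr_growth

print(f"PR growth:            {pr_growth:.2f}x")
print(f"Author growth:        {author_growth:.2f}x")
print(f"PRs per author:       {per_author_before:.1f} -> {per_author_after:.1f}")
print(f"Per-author growth:    {per_author_growth:.2f}x")
print(f"LOC per PR (ceiling): {loc_per_pr_ratio:.2f}x of before")
```

So even after adjusting for the four extra authors, per-engineer PR throughput is still up by well over 2x, and under the assumed ceiling the average PR is at least about a fifth smaller than it was.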

Why I stopped trying to make /fab-note auto-merge to main.

I tried to make /fab-note auto-publish posts the moment a draft is confirmed. Two paths to do that, and both required carving an exception out of our org's branch protection — either a bot identity in the bypass list or a PAT scoped to repository admin. Neither is wrong, but each is a small concession against the policy that says "every change to main gets reviewed." So I stopped pushing on auto-merge and made the workflow assign the PR to me instead. Total clock time from "ship it" to live: about two minutes — most of it CI checking. The clicks in between (one approval, one merge) take three seconds and they preserve the property that a human approved the change. I was solving for the wrong thing. The friction of "wait for CI, click approve, click merge" is invisible to the author because they've moved on to the next thing by the time it's their turn. The friction of "build a policy exception around a fast publish path" is permanent and visible to anyone who later asks why this repo bypasses the rules. The cheapest version of automation is the one that lets the existing policy do its job.

Building mabl debug for an AI agent reader instead of a human one.

I built a mabl debug command suite for investigating test failures, and the user isn't a human — it's an AI agent picking through a failure. That changes the design in ways I didn't expect. Pretty terminal output is wasted. The agent wants structured JSON and classification up front so it doesn't have to read 10K tokens to guess at root cause. Large artifacts (HAR captures, DOM snapshots, screenshots) go to disk so they don't blow the context window — the CLI prints a path, not the contents. The biggest shift was realizing the CLI should ship its own skill: an install-skill subcommand drops a Markdown tutorial into the agent's workspace so it learns how to use the tool from the tool itself. No docs site to find, no examples to dig up. The CLI is the tutorial. The lesson: when the consumer is an agent, the highest-leverage work is the analysis the agent can't do quickly itself — classification, fingerprinting, deployment correlation. A human debugging skims and pattern-matches. An agent needs you to do the pattern-match upstream and hand it the conclusion.
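To show the shape of output I mean, here's a toy version of the pattern. None of this is the real mabl CLI: `classify`, `build_report`, the category names, and the JSON fields are all hypothetical, written only to illustrate "conclusion first, big artifacts on disk, path in the output".

```python
import json
from pathlib import Path


def classify(error_message: str) -> str:
    """Toy classifier; the categories are made up for illustration."""
    text = error_message.lower()
    if "timeout" in text:
        return "timeout"
    if "element not found" in text or "no such element" in text:
        return "locator"
    if any(code in text for code in ("500", "502", "503")):
        return "server_error"
    return "unknown"


def build_report(test_id: str, error_message: str,
                 dom_snapshot: str, out_dir: Path) -> str:
    # The large artifact goes to disk; the report carries only path and size.
    out_dir.mkdir(parents=True, exist_ok=True)
    dom_path = out_dir / f"{test_id}-dom.html"
    dom_path.write_text(dom_snapshot, encoding="utf-8")

    report = {
        "test_id": test_id,
        "classification": classify(error_message),  # conclusion up front
        "error": error_message[:200],               # short excerpt, not the full log
        "artifacts": {
            "dom_snapshot": {
                "path": str(dom_path),
                "bytes": dom_path.stat().st_size,
            }
        },
    }
    return json.dumps(report, indent=2)
```

The agent reads a few hundred tokens of JSON, gets a classification to start from, and only opens the snapshot file if the classification isn't enough.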

When to write a script instead of letting the agent reason: a shared scripts directory for cross-repo ops.

I keep watching Claude reinvent the same shell pipeline three different ways across sessions. Routine cross-repo operations like dependency bumps are the canonical example: an engineer asks Claude to do it, and Claude figures out a slightly different approach each time — usually right, sometimes wrong, always slow.

What I've been pushing for is a shared scripts directory for the things that are deterministic. Bump a CLI version into a downstream repo? Script. Generate a new connector skeleton? Script. Snapshot a runner config? Script. When the work has a known shape, the agent shouldn't be reasoning it out — it should be calling the script. We pay for the agent's reasoning when we need reasoning. We shouldn't pay for it when we just need the right command in the right order. The cleaner the line we draw between "this is a deterministic operation" and "this needs the model," the better the system gets at both.
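As a sketch of what one of those scripts might look like, here's a dependency bump reduced to its deterministic core. The assumptions are mine: a package.json-style manifest with a `dependencies` map, exact version pins, and a hypothetical `bump_dependency` helper. The real scripts will differ per repo.

```python
import json
import re
from pathlib import Path

# Plain MAJOR.MINOR.PATCH only; ranges and pre-release tags are out of scope here.
SEMVER = re.compile(r"\d+\.\d+\.\d+")


def bump_dependency(manifest: Path, name: str, version: str) -> bool:
    """Pin `name` to `version` in a package.json-style manifest.

    Returns True if the file changed, False if it was already up to date.
    """
    if not SEMVER.fullmatch(version):
        raise ValueError(f"not a plain semver version: {version!r}")
    data = json.loads(manifest.read_text(encoding="utf-8"))
    deps = data.get("dependencies", {})
    if name not in deps:
        # Fail loudly instead of silently adding a new dependency.
        raise KeyError(f"{name} is not a dependency in {manifest}")
    if deps[name] == version:
        return False
    deps[name] = version
    manifest.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return True
```

The agent's job shrinks to calling this with the right arguments and reading the boolean. Same inputs, same result, every session.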

Why fab.mabl.com's authoring interface is a Claude skill, not a CMS or a PR-per-post.

I wanted to stand up an engineering log at fab.mabl.com — short posts from people on the delivery pipeline, public, in our voice. The first instinct was the obvious one: a CMS, or markdown files in a repo with a PR per post. Both are wrong for us. A CMS adds a vendor for ten posts a year. A PR per post means Joey doesn't write the post, because forking and branching to publish a paragraph isn't how anyone gets a paragraph out of their head. So we did the third thing: posts are markdown files in the repo, and the authoring interface is a Claude skill called /fab-note that drafts a post in the established voice, confirms it with the author, and commits the file directly to main. The author never sees a PR. Git history is preserved. The skill is the CMS. The point that surprised me writing this: the right interface for a publishing system isn't a form or a PR, it's the tool the authors are already inside. Our engineers spend their day in Claude Code. Meeting them there cost less than building anything else.

Wiring /codex-review across 16 repos so we stop reviewing every PR.

At our current PR throughput, I'm going to burn out on reviews alone, never mind actual work. We can't move at this pace while also having a human read every change. The model I want us to move toward: ask for human review when you're truly unsure about something, and let the agent reviewers handle the rest.

The piece I shipped to make this real was getting /codex-review wired up across every repo as an on-demand second opinion. Different model family from Claude, uncorrelated blind spots, one comment away when you want it. The work was sixteen PRs across sixteen repos to enable the GitHub Actions trigger, plus a change to our shared workflows repo to make it a first-class slash command. The conceptual work was harder: deciding that "two agent reviews and a human glance" is now an acceptable pre-merge state for routine changes, and reserving real human attention for the changes that genuinely need it. We're not all the way there. But the trajectory is clear, and I'd rather build the routing now than burn out catching up to it later.
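The routing rule I have in my head looks roughly like this. It's a sketch of the policy as described above, not something we've wired into CI: the `PullRequest` fields and the "sensitive paths" flag are assumptions I'm using to stand in for "changes that genuinely need it".

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    agent_approvals: int           # e.g. reviews from two different model families
    author_unsure: bool            # author explicitly asked for a human
    touches_sensitive_paths: bool  # hypothetical flag for high-risk changes


def human_review_level(pr: PullRequest) -> str:
    """Return "full" for a real human read, "glance" for a quick human look."""
    # Real human attention is reserved for uncertainty and high-risk changes.
    if pr.author_unsure or pr.touches_sensitive_paths:
        return "full"
    # Routine change with two agent reviews: a human glance is enough.
    if pr.agent_approvals >= 2:
        return "glance"
    # Otherwise fall back to the old default of a full human review.
    return "full"
```

Note that a human still touches every PR in this version; what changes is how much attention each one gets.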