Posts tagged "ci"

Sorry Claude, Gonna Need You to Come in on Saturday

7 days in a week, 7 days in a token budget. Why is your agent at the beach on Saturday? Think of all the chunky tech debt projects nobody ever has time for. That's what agents are for.

I had an API with hundreds of endpoints, and I wanted to refactor every one of them to a more modern, more robust, faster framework. Who has time to refactor controllers by rote and re-validate that _nothing_ broke? Claude does, with a /goal.

The whole thing is unlocked by tests we wrote years ago. Thousands of API-level validation tests and end-to-end suites for the web apps that consume the API — that's the feedback signal a /goal actually needs. "Get the suites green without changing the clients or the interfaces, only the server implementation." That's it. From there our CI does the rest: every PR spins up a deploy preview, fires the full cloud regression suite, and reports back. The agent runs permutations across branches in parallel and validates each one on its own.

While you were at the beach worrying about how much sand your kids would track into the car, Claude burned down a major chunk of the tech debt backlog. LFG.

Tracking a PR through the pipeline: do I watch GitHub, or just ask Claude?

We wanted to know how mabl engineers actually track a PR through CI now, so we polled the team: do you watch the GitHub Actions UI, or just ask Claude in the terminal where you made the change? It came back almost exactly 50/50.

[Poll chart: "Tracking a PR through the pipeline: watch or chat?" GitHub vs. Claude]

Half the team still opens the dashboard, watches the checks spin, clicks into a red job, scrolls the log. The other half doesn't open the browser at all: they ask "where's my PR, is it green, what failed," and let the agent watch the run and report back the line that matters. Same task, two completely different interfaces, split right down the middle.

So which way does that split move? Is there a lasting place for a dashboard in the dev workflow, something you look at? Or does the whole thing get orchestrated through the chat interface, where you ask instead of watch? I genuinely don't know yet, and I want to see how it evolves over the next few months.

It's not an idle question for us. The answer shapes our own product integration priorities, how much we invest in things you watch versus things you talk to. We'll watch closely in the coming weeks to see how the split shifts internally and across our customer base.

When a build fails, an agent reads the GitHub logs and runs /mabl-debug for you

A build fails. Now what? You go to GitHub, stare at a wall of actions, hunt for the red one, open it, and dig around to figure out what actually happened. And if it turns out to be a mabl deployment, you click the link and start all over again inside mabl. Forget about it. Nobody wants to do that.

So I made it stop. I built a github-build-explorer agent that runs on Haiku — cheap and fast, exactly right for digging through logs. It finds the red build and tells you what broke. And if it's a mabl deployment, /mabl-debug kicks in: it pulls the deployment, the failure analysis, the recovery sessions, goes hunting for the cause in the code, and reproduces the failure in a real browser with our new local debugger for agents.

It gets better. You can ask Claude to open the PR, wait for the red, and fix it — all on its own. And the best part: before it pushes anything back, the debugger re-runs the test to confirm it actually fixed the issue. Not "probably fixed." Fixed.
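The first step, finding the red job, is simple enough to sketch. This is a minimal Python illustration, not the agent's actual code: it assumes the check runs have already been fetched as dictionaries shaped like GitHub's check-run objects (`name`, `status`, `conclusion`, `completed_at`, `html_url`). The function names and the choice of which conclusions count as "red" are assumptions made for the example.

```python
from typing import Iterable

# Conclusions treated as "red". The values are ones GitHub reports for
# check runs; which of them count as failures is a choice made here.
RED_CONCLUSIONS = {"failure", "timed_out", "cancelled"}


def find_red_jobs(check_runs: Iterable[dict]) -> list[dict]:
    """Return completed check runs with a red conclusion, oldest first."""
    red = [
        run for run in check_runs
        if run.get("status") == "completed"
        and run.get("conclusion") in RED_CONCLUSIONS
    ]
    # ISO 8601 timestamps sort correctly as plain strings.
    return sorted(red, key=lambda run: run.get("completed_at") or "")


def summarize(run: dict) -> str:
    """One line per red job: name, conclusion, and a link if present."""
    return f"{run['name']}: {run['conclusion']} ({run.get('html_url', 'no url')})"
```

Sorting oldest first puts the earliest failure at the top, which is often the one that caused the rest.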

Eliminate the small stuff: automatic code styling FTW

Millions of PR comments a year get burned on style — tabs, spaces, imports, line breaks. That's a closed-form, solved problem. Why are you wasting your keystrokes and valuable context on style? We don't. No need to fill our CLAUDE.md files with 10 pages of format rules, no need for a new hire to spend a week learning our way to type. Don't fill your physical and virtual context windows with rules a CPU can apply.

So Spotless landed across our Java codebase this week, with a pre-commit hook and a CI check. I picked it specifically because it auto-fixes. Other tools will scold you about wildcard imports ("thou shalt not!"); Spotless will simply fix them, auto-magically. A tool that produces a report is the wrong pattern — at scale, we're shipping code, not reports. Calling "fix" is table stakes. The pre-commit hook means both human and agent operator styles are fixed before the PR opens. Neither has to grok the rules.
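For a sense of how little a fix-first hook needs, here is a minimal Python sketch. It assumes a Gradle build with the Spotless plugin applied (`spotlessApply` is the plugin's fix task; a Maven build would run `mvn spotless:apply` instead). The helper names and hook layout are illustrative rather than a copy of the real hook.

```python
import subprocess

# Assumption: Gradle wrapper at the repo root with Spotless configured.
FORMAT_CMD = ["./gradlew", "spotlessApply", "--quiet"]
STAGED_CMD = ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"]


def java_files(git_output: str) -> list[str]:
    """Pick the .java paths out of `git diff --name-only` output."""
    return [line for line in git_output.splitlines() if line.endswith(".java")]


def run(cmd: list[str]) -> str:
    """Run a command, raise if it fails, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def main(run=run) -> int:
    staged = java_files(run(STAGED_CMD))
    if not staged:
        return 0  # no Java files in this commit, nothing to format
    run(FORMAT_CMD)                     # fix in place instead of reporting
    run(["git", "add", "--", *staged])  # re-stage the formatted files
    return 0

# In the hook script itself, finish with: raise SystemExit(main())
```

Two caveats with a sketch this small: re-staging whole files also stages any unstaged edits in them, and the CI side would run the check task (`spotlessCheck`) so anything that bypassed the hook still gets caught.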

The goal isn't to sweat the small stuff. It's to eliminate it. I don't care how you use tabs or spaces. I care that your feature does what the customer needs. Leave the rest to the CPU. LFG.

Rockout to the stockout: 117,000 CI jobs in 30 days

Now that devs can readily integrate 10 PRs on a slow Monday, you'd better be serious about CI/CD (says the DevOps guy). My coworker just kicked off a CI job that used 3,000 cores. Did she bat an eyelash? Nah — it's $4, it'll get us some useful answers. Our compute provider hit a regional stockout (wasn't me) and we auto-routed around it. Our modest eng team ran 117,000 CI jobs in the last 30 days. About 4,000 jobs per contributor. All worth it when you've got a half-dozen agents coding, fixing, and validating on your behalf. Rockout to the stockout. Bits are cheap, light is fast, life is short. LFG.

Why our merge queue stops: a four-way race between matrix builds, runner names, GCP quota, and Pub/Sub

Our productivity is up roughly 4x quarter over quarter. The thing I keep working on is making sure the build infrastructure can actually keep up. CLI builds intermittently failing on a datastore emulator issue. Self-hosted runners missing Docker. JDK downloads from Adoptium failing at random. The worst was a pernicious interplay between simultaneous matrix builds, GitHub's pre-registered runner names, GCP's 5-VM-per-call rate limit, and Pub/Sub retries: when all four collide, the merge queue stops moving. Each time it breaks, the merge queue alone has cost us a day.

None of this work is glamorous. Cache config, runner audits, log diving, hunting down Pub/Sub retry semantics. But the math is straightforward: every minute the merge queue is broken is a minute the rest of the agentic pipeline is sitting idle. AI-native throughput needs reliable builds underneath it for the upstream gains to compound. Agents can write code as fast as they want — if main can't merge, none of it ships. So I keep investing here. It's the least visible work and the highest-leverage.
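The post above doesn't spell out the fix for the four-way collision, so treat this as a sketch of one defensive pattern rather than what actually shipped. It assumes runner requests arrive as messages that may be delivered more than once, and that VM creation has to be split into calls of at most five. `RunnerProvisioner`, `create_vms`, and the in-memory `_seen` set are all hypothetical; a real handler would keep that state somewhere durable.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

MAX_VMS_PER_CALL = 5  # the per-call limit named in the post


def batches(names: Iterable[str], size: int = MAX_VMS_PER_CALL) -> Iterator[list[str]]:
    """Yield successive chunks of at most `size` names."""
    it = iter(names)
    while chunk := list(islice(it, size)):
        yield chunk


class RunnerProvisioner:
    """Hypothetical handler for 'start runner' messages delivered at least once."""

    def __init__(self, create_vms: Callable[[list[str]], None]):
        self._create_vms = create_vms  # stand-in for the real cloud call
        self._seen: set[str] = set()   # runner names already requested

    def handle(self, runner_names: Iterable[str]) -> int:
        """Provision unseen runners in batches; return the number of calls made."""
        # De-duplicate within the message and against earlier deliveries,
        # so a retried message becomes a no-op instead of a second VM.
        fresh = [n for n in dict.fromkeys(runner_names) if n not in self._seen]
        calls = 0
        for chunk in batches(fresh):
            self._create_vms(chunk)
            self._seen.update(chunk)
            calls += 1
        return calls
```

Keying on the runner name is what makes a retry harmless: a matrix build that asks for twelve runners turns into three calls, and a redelivered copy of the same message turns into zero.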