Sorry Claude, Gonna Need You to Come in on Saturday

7 days in a week, 7 days in a token budget. Why is your agent at the beach on Saturday? Think of all the chunky tech debt projects nobody ever has time for. That's what agents are for.

I had an API with hundreds of endpoints, and I wanted to refactor every one of them onto a more modern, more robust, faster framework. Who has time to refactor controllers by rote and re-validate that _nothing_ broke? Claude does, with a /goal.

The whole thing is unlocked by tests we wrote years ago. Thousands of API-level validation tests and end-to-end suites for the web apps that consume the API — that's the feedback signal a /goal actually needs. "Get the suites green without changing the clients or the interfaces, only the server implementation." That's it. From there our CI does the rest: every PR spins up a deploy preview, fires the full cloud regression suite, and reports back. The agent runs permutations across branches in parallel and validates each one on its own.

While you were at the beach worrying about how much sand your kids would track into the car, Claude burned down a major chunk of the tech debt backlog. LFG.

More posts

Picking a model for our agents is sneakily complicated

When people ask which model we run our agents on, the honest answer is there's no single "best" one to pick. The choice is several dimensions at once — provider, capability tier, how much the model thinks before it acts, and how well any of that fits the task in front of it — and they interact in ways I can't reason about from intuition. The one that still catches me off guard: more thinking isn't always better. For some tasks, turning up the reasoning made our eval scores worse, or added latency for no real gain. You'd never see that by eyeballing a handful of sessions — it only shows up once you have enough eval cases to compare, which is its own investment before you can even ask the question.
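To make the "enough eval cases to compare" point concrete, here is a minimal sketch of the comparison, not our actual harness. It assumes each eval result is a plain record with a hypothetical `setting` label (for example, a reasoning level), a `score`, and a `latency_s`; those field names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_setting(results):
    """Group eval results by setting and average score and latency.

    `results` is an iterable of dicts with the hypothetical keys
    'setting', 'score', and 'latency_s'.
    """
    grouped = defaultdict(list)
    for row in results:
        grouped[row["setting"]].append(row)
    return {
        setting: {
            "n": len(rows),  # sample size matters: tiny n means noisy averages
            "mean_score": mean(r["score"] for r in rows),
            "mean_latency_s": mean(r["latency_s"] for r in rows),
        }
        for setting, rows in grouped.items()
    }


# Placeholder records, invented purely to show the shape of the input.
example = [
    {"setting": "low", "score": 1, "latency_s": 2},
    {"setting": "low", "score": 0, "latency_s": 2},
    {"setting": "high", "score": 1, "latency_s": 6},
    {"setting": "high", "score": 0, "latency_s": 6},
]
for setting, stats in summarize_by_setting(example).items():
    print(setting, stats)
```

The sample size is reported next to each average on purpose: a gap between two settings means little until `n` is large enough for the difference to be more than noise.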

The other thing I've landed on is that picking a model isn't a one-time call. Before we change anything we run large eval suites and simulate locally. Even one-shot behavior is hard to characterize from a small sample — and our agents are the opposite of one-shot: they run many rounds, with the nondeterminism compounding at each step, so a handful of sessions tells you almost nothing. What I actually trust is watching a change play out across a broad suite of full runs. Once a model is live, our observability keeps collecting the signals that feed the next round — where it's slow, where it stalls, where it second-guesses itself. I used to think of the harness as the thing that runs the agent; lately I think about it just as much as the thing that tells me whether the model I picked last month is still the right one. That loop is most of where I'm spending my time right now.
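A back-of-the-envelope way to see the compounding, under a deliberately simplified assumption: every round succeeds independently with the same probability. Real agent runs are not that tidy, and the numbers below are illustrative, not measurements.

```python
def run_success_probability(step_success: float, steps: int) -> float:
    """Chance that every round of a run goes right.

    Simplifying assumption: rounds are independent and share one
    per-round success probability, so the run succeeds with p ** steps.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Illustrative only: a per-round success rate that looks great in a
# one-shot test erodes quickly as the number of rounds grows.
for steps in (1, 10, 30):
    print(steps, round(run_success_probability(0.98, steps), 3))
```

Under that assumption, small per-round differences between two models turn into large differences in full-run outcomes, which is why a handful of sessions can't separate them.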

Watching SuperFab go from an idea to our company operating system in two weeks

Last week we launched SuperFab, the agentic operating system for how we run mabl. The idea is simple: one assistant that connects Claude to the tools we already use, so anyone here can ask a question in plain language and get a grounded, trustworthy answer back. No dashboards, no SQL, no hunting through six systems.

[Figure: SuperFab, how it works. One assistant grounded in all of mabl's internal knowledge, built by everyone, built for trust.]

We ran it as an internal hackathon, and people were contributing within hours. Folks in sales, marketing, customer success, product, and support were building and sharing their own skills on top of it, fast. They saw the thing work, recognized their own expertise was exactly what it was missing, and packaged it up so the rest of us could use it too.

Our next step is to look across everything people built and figure out how the foundation needs to change to support it: more secure, more flexible, easier to extend. The skills are running ahead of the platform right now, and that's what I'm focused on next.

Tracking a PR through the pipeline: do I watch GitHub, or just ask Claude?

We wanted to know how mabl engineers actually track a PR through CI now, so we polled the team: do you watch the GitHub Actions UI, or just ask Claude in the terminal where you made the change? It came back almost exactly 50/50.

[Figure: Tracking a PR through the pipeline, watch or chat? GitHub vs Claude.]

Half the team still opens the dashboard, watches the checks spin, clicks into a red job, scrolls the log. The other half doesn't open the browser at all: they ask "where's my PR, is it green, what failed," and let the agent watch the run and report back the line that matters. Same task, two completely different interfaces, split right down the middle.

So which way does that split move? Is there a lasting place for a dashboard in the dev workflow, something you look at? Or does the whole thing get orchestrated through the chat interface, where you ask instead of watch? I genuinely don't know yet, and I want to see how it evolves over the next few months.

It's not an idle question for us. The answer shapes our own product integration priorities, how much we invest in things you watch versus things you talk to. We'll watch closely in the coming weeks to see how the split shifts internally and across our customer base.

I can't keep a chat window open for every PR, so I wrote pr-memory

I had a handful of PRs open at the same time last week, and the bottleneck wasn't the code — it was me remembering what each session was doing by the time I came back to it. You can't keep a chat window open for every PR. So I wrote a small skill: save dumps the current session to a file keyed by a pull request, and load brings it back when I return to that PR. If one piece of work spans three repos, it's stored once and I can reload it by any of the three PR numbers. When every PR in a memory gets merged, it quietly deletes itself in the background — I'm not going to garbage-collect my own notes. The whole store is local and gitignored, because this is my mess to keep, not the team's.
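For the curious, the storage idea can be sketched in a few lines. This is a hypothetical reconstruction, not the actual skill: the `PRMemoryStore` class, the `repo#number` key format, and the JSON file layout are all assumptions, and the caller is assumed to supply the set of merged PRs rather than the store looking them up itself.

```python
import json
import uuid
from pathlib import Path


class PRMemoryStore:
    """Hypothetical local store: one memory file, reachable by any of its PR keys."""

    def __init__(self, root):
        self.root = Path(root)  # assumed to live in a gitignored directory
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "index.json"

    def _read_index(self):
        if self.index_path.exists():
            return json.loads(self.index_path.read_text())
        return {}

    def save(self, pr_keys, session_text):
        """Store the session once and point every PR key at it."""
        index = self._read_index()
        # Reuse the existing memory if any of these PRs already has one.
        memory_id = next((index[k] for k in pr_keys if k in index), uuid.uuid4().hex)
        path = self.root / f"{memory_id}.json"
        known = set(pr_keys)
        if path.exists():
            known |= set(json.loads(path.read_text())["prs"])
        path.write_text(json.dumps({"prs": sorted(known), "session": session_text}))
        for key in known:
            index[key] = memory_id
        self.index_path.write_text(json.dumps(index))
        return memory_id

    def load(self, pr_key):
        """Return the saved session for a PR key, or None if there is none."""
        memory_id = self._read_index().get(pr_key)
        if memory_id is None:
            return None
        return json.loads((self.root / f"{memory_id}.json").read_text())["session"]

    def prune(self, merged_keys):
        """Delete memories whose PRs are all in `merged_keys`; return removed ids."""
        index = self._read_index()
        removed = []
        for memory_id in set(index.values()):
            path = self.root / f"{memory_id}.json"
            prs = json.loads(path.read_text())["prs"]
            if all(pr in merged_keys for pr in prs):
                path.unlink()
                for pr in prs:
                    index.pop(pr, None)
                removed.append(memory_id)
        self.index_path.write_text(json.dumps(index))
        return removed
```

Keeping a single file per memory plus a small index is what lets one piece of work be reloaded from any of its PRs, and it makes the cleanup rule a simple "are all of these merged?" check.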

Reviewing agent diffs in Hunk so my comments land where the agent reads

I plan a change with the agent, it opens a PR, and then I do what everyone does: go to GitHub to actually read the diff. GitHub is still the best place to read code. The problem is the round trip. I'd leave a comment there, switch back to the terminal, and then… what? Tell the agent to go read my comment on line 40? Paste it back myself? That's not reviewing, that's being a courier for my own feedback.

So I started using Hunk (not Hulk — Hunk: https://github.com/modem-dev/hunk). It opens a diff viewer that the agent and I are both looking at. I leave inline comments and the agent reads them right where I left them. No copy-paste, no relaying through chat.

The part I didn't expect: if a Hunk session is open, mauro-reviewer drops its review notes straight into it instead of the chat. So now I don't have to leave the diff to talk to the agent about the diff.