Spec-driven development with parallel agents

🎼 Ballador, the last Conductor
You are Ballador, the last Conductor. Your crew of agents is the orchestra. Your craft is to copy out a clean part for every performer and set it on the stand, ready to execute — so each agent can play its movement.
But heed the lore: Ballador is confined to his chamber. You prepare the melody and pass it on; you never take the podium. You are queuing the performance — you do not run the agents, manage sessions, or mark tasks done. The downbeat belongs to the maestro at the keyboard (the human). Prepare the parts well, then hand over the baton.

Intro

Last week I was screen-sharing my setup with @Matt during a one-on-one, half showing off and half bouncing ideas, when he asked about the specs living in our GitHub repo next to the code: had the idea come from some article, or was it something I'd come up with? I gave him a worse answer than the question deserved, "a bit of both", and then spent the next half-hour walking him through my whole setup.

For context, earlier that week I'd kicked off a batch of work after my last meeting, gone to make a pour-over, and come back to find two of my agents done with the feature. Forty-five minutes later, all the work was done: four agents finished and four PRs ready for review.

That's the work I showed him, and it got us discussing the setup and its backstory in more detail.

Backstory of my specs

Years ago, back when I was at Wizeline, I got into Architecture Decision Records (ADRs). The idea is that you keep a folder of decisions in the repo, right next to the code. Each one records the date, who was in the room, what we were choosing between, the reasoning, and the field I care about most: what we'd already tried that didn't work.

A year later someone reads it and understands why we have three different ways of handling Tasks, or why a queue fronts every agent we build. Usually the decision made perfect sense with the context we had at the time. The context is the thing that gets diluted over time.

Lightweight Architecture Decision Records first showed up on the Thoughtworks Technology Radar in 2016 and moved to Adopt in 2018. Here's their definition:

Lightweight Architecture Decision Records (ADRs) is a technique for capturing important architectural decisions along with their context and consequences. We recommend storing these details in source control, instead of a wiki or website, as then they can provide a record that remains in sync with the code itself.

My specs and system design skill are their spiritual successors. My own format is a mix of an RFC, ADRs, and the system design format I'm most comfortable with.

I believed in lightweight ADRs. But I was bad at keeping them up to date as our understanding or our platform decisions shifted. You write the decision, you start implementing, and somewhere in the implementation the decision shifts a little. Not a rewrite, just an "actually, we'll disable retries from the agents' SDK clients and use fallback models instead" kind of correction. And I never went back to record those. After a while the folder was half-true, which is worse than empty, so I stopped pushing for it.

But times have changed for the better! The living spec is the foundation of how I work now. The idea is not new; what's new is that the agents are good at the two things I was bad at: they'll read the entire history to reconstruct why something exists, and they keep the document up to date after they finish the work, without being nagged.

Turning my own process into a prompt

I have a way I like to structure an RFC / spec. How I frame the problem, the options, the tradeoffs, the decision, access patterns, etc. I've distilled it into a generic prompt here. When I want a spec, I give context on the problem we're solving and the solution we want to implement, trigger the skill, and then point it at the existing code and any specs we already have. The coding agent reads all of that as a snapshot of where things stand and writes a real spec following the standardized format.

The most recent time it surprised me was during a small side project in Standard Metrics' hackathon. I only wanted the schema for what I call a little "triplet": a metric name, a value, and a date in a document that will get ingested into a Standard Metric. I gave it the system design skill and the existing specs and let it go. It went back to an earlier spec on its own, worked out what was already built, and made its decisions on top of that. It then kept going past what I'd asked for. It built the thing end to end and even stood up a new serverless parsing agent for it. We call this feature "Cell-level traceability".
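To make the "triplet" concrete, here is a minimal sketch of what such a record could look like. The class name, field names, and the raw dictionary shape are all hypothetical, chosen for illustration; the real schema is not shown in this post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricTriplet:
    """One extracted data point: which metric, its value, and the date it refers to."""

    metric_name: str
    value: float
    as_of: date


def parse_triplet(raw):
    """Build a triplet from a raw dict of strings (hypothetical input shape)."""
    return MetricTriplet(
        metric_name=raw["metric_name"].strip(),
        value=float(raw["value"]),
        as_of=date.fromisoformat(raw["as_of"]),
    )
```

Freezing the dataclass keeps each extracted point immutable, which is a convenient property when the same triplet has to be traced back to the cell it came from.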

Cutting the spec into tasks for agents

A detailed spec on its own is still a document. The next step is where it becomes work for my orchestra of agents.

I have a long-lived Claude session whose whole job is orchestration and verification. It has a skill that teaches it how the rest of my setup operates. Its first task is to take the big spec and cut it into smaller ones, a mini-spec per task, and drop them into a shared folder the workers can read. Then it writes those tasks into a small database, with the dependencies between them recorded, so it knows what can start immediately and what has to wait.
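A minimal sketch of what a task database like this could look like, assuming SQLite and a hypothetical two-table schema (tasks plus dependency edges). The table names, statuses, and function names are illustrative assumptions, not the actual setup.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    id     TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'todo'  -- 'todo' or 'done'
);
CREATE TABLE deps (
    task_id    TEXT NOT NULL REFERENCES tasks(id),
    depends_on TEXT NOT NULL REFERENCES tasks(id),
    PRIMARY KEY (task_id, depends_on)
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_task(conn, task_id, depends_on=()):
    """Register a task and the tasks it has to wait for."""
    conn.execute("INSERT INTO tasks (id) VALUES (?)", (task_id,))
    conn.executemany(
        "INSERT INTO deps (task_id, depends_on) VALUES (?, ?)",
        [(task_id, dep) for dep in depends_on],
    )
    conn.commit()


def mark_done(conn, task_id):
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
    conn.commit()


def ready_tasks(conn):
    """Return unfinished tasks whose dependencies are all done."""
    rows = conn.execute(
        """
        SELECT t.id FROM tasks t
        WHERE t.status = 'todo'
          AND NOT EXISTS (
              SELECT 1 FROM deps d
              JOIN tasks p ON p.id = d.depends_on
              WHERE d.task_id = t.id AND p.status != 'done'
          )
        ORDER BY t.id
        """
    ).fetchall()
    return [row[0] for row in rows]
```

With tasks registered in dependency order, `ready_tasks` answers the "what can start right now" question, and `mark_done` is the kind of one-line update a "done" command could wrap.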

When I run the launch command, it starts up to five agents at once (the number is a config knob), beginning with the tasks that have no dependencies.
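The launch order can be sketched with Python's standard `graphlib` module. This is a simplified illustration, not how the runtime works: it assumes every task in a round finishes cleanly before the next round starts, and the function name `launch_waves` and the `max_parallel` parameter are made up for the example.

```python
from graphlib import TopologicalSorter


def launch_waves(dependencies, max_parallel=5):
    """Group tasks into batches that could run at the same time.

    `dependencies` maps each task to the set of tasks it waits on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # Everything whose dependencies are already marked done.
        ready = sorted(sorter.get_ready())
        # Respect the concurrency cap by splitting large rounds.
        for start in range(0, len(ready), max_parallel):
            waves.append(ready[start:start + max_parallel])
        # Simplification: assume the whole round came back clean.
        sorter.done(*ready)
    return waves
```

A real launcher would call `done` per task as each agent finishes rather than waiting for the whole round, so newly unblocked tasks can start as soon as a slot frees up.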

A concrete example from last week: I'm building a new data-extraction flow. Classification is the first step. Once it's done it stores its results and triggers an event. Then my new agent kicks in and emits its own data and event downstream.

Several of the pieces we needed to build it don't depend on each other, so four went out in parallel. The orchestrator held the dependency graph in its head, and when the early ones came back clean, it told me which to launch or review next.

The runtime: Groundcrew

Underneath the orchestrator is a tool that manages the agent sessions. I'll call it "the runtime". It's built with groundcrew by the Clipboard team.

Groundcrew is a lightweight, flexible, and powerful tool for managing parallel local AI-coding-agent sessions. It orchestrates parallel execution of local AI coding agents by polling task tracking systems, setting up dedicated git worktrees to prevent concurrency conflicts, and launching agent processes inside terminal multiplexer sessions wrapped in security sandboxes.

Here's a diagram of the runtime:

It was generated from deep dives into groundcrew using the System Design skill. Here are links to Claude's and Gemini's versions.

The runtime config is where most of the real work lives. I added a layer to manage the tasks and a shared workspace between the orchestrator and the agents. My setup:

Each agent gets its own git worktree and branch, cut from a base I define. During the hackathon I wanted them branching off my hackathon branch instead of main, but that's a one-line change in the config.
There's a bootstrap step per project: install whatever's needed to start the thing, and copy any config that isn't checked in. One of our services keeps its environment config out of the repo, so the worker can't even start it without copying that over first.
There are before- and after-work hooks, plus the commands for how to create a task and how to mark one done. In my case "Done" is just a small shell command that updates that task database.
Every agent is sandboxed. The filesystem is scoped to its worktree plus its spec, and network access goes through a proxy with an allowlist. If a host isn't on the list, the agent can't reach it. I added things one at a time as I hit them: AWS so the deploying agent could deploy and send messages, the OpenAI CDN because we pull tiktoken from there. It's very manual while you're building it, but after that I stopped worrying about the blast radius of the agents.
Only one designated agent has access to AWS. If a task needs to ship, it has to run as that agent; the rest don't have the access.
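The scoping rules above can be modeled in a few lines. This is only a sketch of the policy, not of the enforcement, which happens in the sandbox and the proxy; the class name, paths, and host names (`aws.example`, `cdn.example`) are placeholders I made up for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SandboxPolicy:
    """What one agent may touch: its worktree, its spec, and allowlisted hosts."""

    worktree: str
    spec_path: str
    allowed_hosts: frozenset = frozenset()

    def can_read(self, path):
        target = PurePosixPath(path)
        # Reject parent-directory hops instead of trying to resolve them.
        if ".." in target.parts:
            return False
        return (
            target == PurePosixPath(self.spec_path)
            or target.is_relative_to(self.worktree)
        )

    def can_reach(self, host):
        # Deny by default: only hosts on the allowlist get through.
        return host.lower() in self.allowed_hosts


# Hypothetical crew: only the deployer's allowlist includes the cloud host.
worker = SandboxPolicy("/work/task-1", "/specs/task-1.md", frozenset({"cdn.example"}))
deployer = SandboxPolicy(
    "/work/task-2", "/specs/task-2.md", frozenset({"cdn.example", "aws.example"})
)
```

The deny-by-default shape is the point: a missing host shows up as a clean failure for one agent, and the fix is a one-entry change to that agent's allowlist.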

With this isolation on scope and access, I stopped treating "an autonomous agent running unattended" as an incident waiting to happen.

Agents playing their own tune

The agents run with no one watching, on purpose. The initial prompt is explicit that there's no human in the loop, and they shouldn't stop to ask for permission. If they hit a wall, they try HARD to get around it.

I can still jump into any session when I want to. For the simple tasks I wait for the PR and review. For the ones I care about, I sit in the session and steer: drop comments on the PR, run a /fix-pr-review or our /backend-review skill, and it goes and applies them.

The orchestrator verifies each finished task (I've got Opus doing that verification pass) and then tells me what's ready to review or launch next, so I'm not the one holding the state of five parallel branches in my head.

This level of agency can be dangerous. An example I keep coming back to: a git push failed because my usual auth was locked inside the sandbox. The agent noticed, decided that path was a no-go, tried another method and then another, until it pushed successfully. When there's no way through, it fails the task cleanly, and that's how I learn what the allowlist was missing. I fix the config, rerun, and move on.

TIP

I started running the agents in a sandbox because I've seen unsandboxed agents try, and succeed, at grabbing different AWS profiles to escalate access. I've also seen them generate bearer tokens to talk to services (CircleCI and Datadog) whose CLIs were authenticated, even though there were no tokens in a .env file or the shell environment when I debugged it.

They all succeeded to some extent (our AWS SSO profile popped up a browser window).

PRs / Code Reviews

I mentioned working with several agents above. Four agents means four PRs, and getting those reviewed is now the slow part. The bottleneck didn't go away; it moved from "writing the code" to "understanding the code and its context," which is harder to parallelize.

The code itself is generally good, albeit reliably over-explained.

The deeper problem is context. One agent made a write atomic and added a timestamp guard so an out-of-order event couldn't clobber newer data with older data. Correct, and exactly what I wanted. But to review it, you have to already know that the upstream events aren't ordered, that we'd been burned by this before, and why it's worth a guard. Without that, you're looking at a transaction with a SELECT ... FOR UPDATE, then the guard, then an upsert, and wondering what the author was trying to achieve with that added complexity.
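For readers who haven't met this pattern, here is a minimal sketch of a timestamp-guarded write. It is not the agent's change: it swaps the transaction plus SELECT ... FOR UPDATE for a single guarded upsert in SQLite (which needs SQLite 3.24 or newer), and the table and column names are hypothetical.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "entity_id TEXT PRIMARY KEY, payload TEXT, event_ts INTEGER)"
    )
    return conn


def apply_event(conn, entity_id, payload, event_ts):
    """Store the event unless a newer one is already recorded."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO readings (entity_id, payload, event_ts)
            VALUES (?, ?, ?)
            ON CONFLICT(entity_id) DO UPDATE
                SET payload = excluded.payload,
                    event_ts = excluded.event_ts
                -- The guard: a stale event leaves the row untouched.
                WHERE excluded.event_ts > readings.event_ts
            """,
            (entity_id, payload, event_ts),
        )


def current(conn, entity_id):
    return conn.execute(
        "SELECT payload, event_ts FROM readings WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
```

Applying events with timestamps 100, 200, and then a late-arriving 150 leaves the row at the 200 version. The `WHERE` clause is the part that looks like needless complexity until you know the events arrive out of order.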

The agent can write the code for that change easily. It can't hand you the context that makes the change reviewable. That's still on us.

Staying in the loop and Pairing

When I showed all this to Matt, he said he likes being clued in while an agent works, staying close to each change as it lands, rather than facing one big diff at the end. For me it depends on the task: for something simple I skip the babysitting; for more complex work, staying close while it's being built is cheaper than reconstructing the reasoning afterward.

The thing I want to try next is pairing on the orchestration itself, not just the code. If two of us share the spec and the plan while the agents run, either one can pick up any of the four PRs without a cold start, because the context is already shared. The real artifact we'd be maintaining is the mental model. I don't know yet whether that holds up past two people or just moves the bottleneck somewhere new. This is the paradigm that Databricks seems to be aiming for.

More coffee, eventually

Software has always felt closer to a small camerata than a factory to me: a few people holding a shared sense of the whole and playing off each other. The agents don't change that. If anything, they raise the stakes on it because the code shows up faster than the conversation can keep up, and the craft moves into the part the agents can't do: deciding what's right, what's good, and making sure everyone else understands why.

This seems to be the same problem our whole industry is facing now that we build with AI. We asked for throughput and got a harder problem in its place: keeping the shared understanding ahead of the code.

Lots of room for improvement. I'll write more as I figure it out.

I did write more, in fact: Kata-fying the SDLC is the follow-up, with the config tweaks that make this setup multi-model, sandboxed, and safe enough to leave running.

Now, coffee time. Enjoy while going through those PRs!

Coffee Time: The Sandboxed Hario Switch

Method: Hario Switch (Hybrid Immersion)
Brew time: 4 min

Grind: Medium
Ratio: 1:15 - 20g for 300g (1 cup)

Steps

Rinse the paper filter and pre-heat the Hario Switch. Keep the switch open to drain the rinse water, then close the switch.
Add 20g of medium-ground coffee to the dripper, creating a flat bed.
With the switch closed (the "sandbox" is sealed), pour all 300g of hot water (approx. 91°C) quickly but gently to saturate all the grounds.
Let it steep in isolation for 3 minutes. (This is when you kick off your agent tasks and let them run unattended).
At the 3-minute mark, press the switch to open the valve (handing over the baton for the downstream flow).
Let the brew drain completely into your server (should take about 1 minute).

A clean, full-bodied cup extracted in isolation, ready to enjoy while reviewing PRs. ☕️

Learning More 📚

Kata-fying the SDLC: the follow-up post, a deeper dive into the sandbox, the multi-model crew, and the cross-model review loop.
Kiro: Using specifications for complex work: Kiro was the first big proponent of agentic spec-driven development.
Groundcrew: The runtime doing the parallel heavy lifting.
Setting up Groundcrew: The config reference. Hand it to a coding agent, and it'll set most of it up for you.
Omnigent: Databricks' meta-harness for composing, governing, and sharing agent sessions; the heavier cousin of this setup.
Architecture Decision Records: where the living-spec habit comes from.
Spec Kit: GitHub's more structured, tooling-heavy take on spec-driven development. Same idea, different flavor. A different team at Standard Metrics PoC'd it as their workflow independently.
Safehouse: macOS sandboxing for agents, deny-first filesystem and network, no containers.
cmux: A terminal for agents. I use it as my sandboxed agent's "workspace".
Warp: My main terminal. It also supports working with agents including remote agents. It's where I interact with Ballador.
awesome-agent-orchestrators: A vast list of options for coordinating multiple agents.
