Autonomous development workflows with Claude Code

22.6.2026 | 16 minutes reading time

Most developers today use AI tools as faster autocomplete. Over the past few months, on a client project, I took a different path: multi-agent setups with Claude Code, where specialized agents work in parallel, review one another, and coordinate on their own. In this article I describe the journey there — from the initial frustration with missing context handoff, through gradually building reproducible Blueprints, to workflows that produce code I no longer write line by line myself. An experience report with concrete learnings for anyone who wants to take the next step toward autonomous development.


Five levels, one mirror

In early 2026, Dan Shapiro published an article that quickly made the rounds in the industry: The Five Levels from Spicy Autocomplete to the Software Factory. The idea behind it isn't especially original: essentially, it transfers the familiar autonomy levels of self-driving cars to AI software development. But since the industry had no vocabulary of its own for what many of us are living through right now, it soaked up the framework like a sponge.

Here's the overview in brief:

Level | What the AI does | What the human does
0 – Manual | nothing | writes every line themselves
1 – Assisted | discrete individual tasks (tests, docs) | controls the entire process
2 – Collaborative development | takes over routine tasks in the flow | leads: owns structure, context & responsibility
3 – Human oversight | becomes the primary developer | reviews diffs instead of writing code
4 – Autonomous execution | implements specifications over longer periods | writes specs, checks results after the fact
5 – Dark factory | turns specs directly into software | barely involved anymore

The crux is at level 2: this is where most "AI-native" developers work today — and that's exactly the problem. Level 2 feels complete. It isn't.

Shapiro estimates that most teams stagnate at level 3. Only a few small teams have reached level 5 so far.

I'd place myself between level 3 and 4 — with a clear lean toward 4. What got me there is less a particular tool than a changed way of thinking about how I work with AI.


The lure of level 2

Level 2 is comfortable — you write code, the AI helps you do it faster and with less typing, sometimes with surprisingly good ideas. The flow is right, and you feel productive.

This perception is also reflected in the numbers: according to a study by the Weizenbaum Institute (Krzywdzinski & Wotschack, 2026), the central goals of using AI remain automating work steps and boosting efficiency. 57 percent of the surveyed companies use AI assistants to support their employees. The freed-up time flows into quality improvements — not into a change in the way of working itself. Doing the same thing faster, not working differently.

But level 2 has a subtle limit: the AI reacts, the human leads — and responsibility for structure, context, and decisions rests entirely with you.

That means: as complexity grows, so does your own effort. At level 2, AI support doesn't automatically scale with the problem.


The turning point

My path to autonomous workflows wasn't a straight line.

The first step was online tools like ChatGPT or Gemini. Helpful, but with a fundamental problem: the project context is missing. You can't know for sure what the model needs to truly understand a question. And even if you knew, it simply wouldn't be feasible to manually copy everything relevant into the chat. The result: generic answers to complex questions.

GitHub Copilot was another attempt. Right in the editor, always available — but also only with the current file as context. What wasn't visible didn't exist for Copilot. Personally, the tool got in my way more often than it helped.

The real breakthrough came with agent tools like Claude Code: agents that read project context on their own, including files, structures, and dependencies. Suddenly I could ask a question without deciding myself what the model needs to know to answer it; the agent handles that. The available context is thus much larger than with any online tool, but it only really comes together when you deliberately give the agent additional information: Rules (fixed instructions the agent follows automatically), project descriptions, and good problem descriptions.

The concrete trigger

A few months ago, a new client project started. The team was asked to evaluate whether Claude Code was usable. My part is primarily the Rust backend — a set of Lambdas and libraries, the largest of them an API based on Axum — and the cloud infrastructure.

At the start it was frustrating. It was hard to give the agent the necessary context — which was a big learning experience on our side. And it was hard to get it to look at a problem from different angles: security, testing, maintainability. You had to initiate everything yourself.

That changed with the release of Opus 4.6 and the experimental "Agent Teams" feature. From that moment on, we could not only deploy specialized agents but also interact with them directly. Before that, they ran but could not be addressed directly, which made debugging harder. By then we had grasped how the pieces fit together, and I began building my own multi-agent setups with Agent Teams: configurations in which different agents, each with their own knowledge and perspective, work together.

Before, I worked with Claude in dialogue — question, answer, correction, next question — useful, but at its core not much different from pair programming.

With Agent Teams, the dialogue became a process. A Lead coordinates, Teammates work in parallel, a shared task list keeps the overview. The decisive difference from Subagents — helper agents that only report back to the Lead: Teammates communicate directly with one another. They exchange intermediate results, question each other, and coordinate without my involvement. This is no longer an assistant, but an autonomous development team.


The shift in thinking

The step to level 3/4 isn't just technical — above all, it's mental.

For a long time I thought: "I develop software." With Agent Teams that shifts: "I configure Agent Teams." I write less code — but I decide how the team is set up, which rules apply, which quality criteria take effect. The work becomes more abstract, the impact greater.

Letting go is hard for many, and that's understandable. Developing is fun: solving problems, designing systems, writing elegant code. These aren't activities you gladly hand off.

Still, I believe: those who don't take this step won't be able to keep up in the long run. For now, you still need to understand how software works under the hood — that remains important for steering agents sensibly and judging their results. But the better the models get, the less you'll be confronted with the details.

The real value shifts: from the ability to write code to the ability to design good systems, and to set up Agent Teams so that they reliably produce good code. A strong IT background remains, in my view, more necessary than ever: the faster software and systems change, the more important it is that people devote their attention to them and keep them maintained, albeit on a different level.


The craft: what a setup looks like

The jump to level 3/4 doesn't happen through more prompting — it happens through structure.

Before I describe the individual building blocks, a quick look at how a typical work cycle feels.

I describe what I want to implement. The Lead asks clarifying questions, few or many depending on complexity. For tasks in areas familiar to me, that's settled in minutes. For topics where I lack detailed knowledge myself (like the Language Server Protocol for rlsp, a project of my own), I ask the agent questions too, so that I can make decisions at all. Then a plan emerges that is reviewed internally before I see it. I read it, check it, and ask about anything that is unclear to me. Once I approve the plan, the agent team starts implementing.

What I do during execution depends on the task. On unfamiliar terrain I watch — sometimes I have the Lead stop between tasks to make sure everything is going according to plan. For routine tasks I look at the commits afterward, sometimes not even that. If something goes fundamentally wrong, I adjust the Blueprint (the copyable setup template) and run the plan from the start again — which at the same time shows whether the change to the setup takes effect.

At first I intervened far more often. By now, even more complex tasks run through reliably. The work shifts: less intervening, more deciding up front.

The Lead — orchestrator, not developer

In the case of Claude Code, the Lead is the central coordinator — the initial session you start via the CLI. It communicates with the user, clarifies tasks, proposes workflows, and coordinates the team. What it explicitly does not do: develop itself.

That sounds simpler than it is. The Lead is configured via <project-directory>/.claude/CLAUDE.md — and unlike the specialized agents, whose permissions can be restricted via frontmatter, it always runs with full permissions. That makes it especially prone to leaving prescribed paths and doing things itself instead of delegating. My experience: it's best to use the Lead consistently as an orchestrator and communication hub. Clear instructions in the CLAUDE.md help with that — including explicit prohibitions.
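To illustrate what "restricted via frontmatter" can look like, here is a minimal Python sketch. It assumes an agent definition is a Markdown file whose frontmatter carries a comma-separated tools field. The agent name reviewer, the tool list, and both helper functions are hypothetical, and the parser only understands flat key: value lines rather than full YAML.

```python
# Illustrative only: the agent file content and the helper names are
# assumptions, not taken from the author's Blueprint.
AGENT_FILE = """\
---
name: reviewer
description: Reviews changes for correctness, tests, and linting.
tools: Read, Grep, Glob
---
You review code. You never modify files, because a reviewer that also
edits code stops being an independent check.
"""


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def allowed_tools(text: str) -> set[str]:
    """Return the tools an agent may use, per its `tools` frontmatter field."""
    raw = parse_frontmatter(text).get("tools", "")
    return {tool.strip() for tool in raw.split(",") if tool.strip()}


tools = allowed_tools(AGENT_FILE)
print(sorted(tools))
```

The Lead has no equivalent restriction, which is why its boundaries have to be spelled out in prose in the CLAUDE.md instead.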

Blueprints — reproducible setups

A Blueprint is a copyable template: a .claude/ directory with everything a project needs for autonomous agent workflows. You copy it into the project, adjust it, and the setup is ready to use.
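The copy step itself is trivial, which is the point. The following Python sketch builds a miniature Blueprint in a temporary directory and installs it into an empty project. The individual file names (agents/developer.md, rules/rust.md, skills/project-init/SKILL.md) and all file contents are placeholders assumed for illustration, not the contents of the actual repository.

```python
import shutil
import tempfile
from pathlib import Path

# Placeholder files standing in for a real Blueprint (assumed names).
BLUEPRINT_FILES = {
    "CLAUDE.md": "# Lead instructions (placeholder)\n",
    "agents/developer.md": "# Developer agent (placeholder)\n",
    "rules/rust.md": "# Rust rules (placeholder)\n",
    "skills/project-init/SKILL.md": "# project-init skill (placeholder)\n",
    "settings.json": "{}\n",
}


def write_blueprint(root: Path) -> Path:
    """Create a miniature Blueprint: a `.claude/` directory with placeholders."""
    claude_dir = root / ".claude"
    for relative, content in BLUEPRINT_FILES.items():
        target = claude_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return claude_dir


def install_blueprint(blueprint_claude_dir: Path, project: Path) -> Path:
    """Copy the Blueprint's `.claude/` directory into a project.

    Refuses to overwrite an existing setup so local adjustments are not lost.
    """
    destination = project / ".claude"
    if destination.exists():
        raise FileExistsError(f"{destination} already exists")
    shutil.copytree(blueprint_claude_dir, destination)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    blueprint = write_blueprint(Path(tmp) / "blueprint")
    project = Path(tmp) / "my-project"
    project.mkdir()
    installed = install_blueprint(blueprint, project)
    copied = sorted(
        p.relative_to(installed).as_posix()
        for p in installed.rglob("*")
        if p.is_file()
    )
    print(copied)
```

Refusing to overwrite an existing .claude/ directory is a design choice of this sketch: the adjustments made after copying are project knowledge and should not be lost to a second copy.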

The building blocks that Claude Code ships with out of the box and that a Blueprint uses:

  • CLAUDE.md (Lead instructions): role, startup behavior, clarification process — the central configuration file the Lead loads on start
  • agents/ (specialized agents): e.g. architect, developer, test engineer, security engineer, reviewer — each with a clearly defined role and restrictions
  • rules/ (modular Rules): topic-specific instructions loaded automatically for all agents — unconditional (always active) or conditional (path-dependent, e.g. only for .rs files)
  • skills/ (reusable Skills): project-specific commands like /project-init, which scans a project and generates a first project CLAUDE.md from languages, frameworks, and test tools
  • settings.json (configuration): enables Agent Teams, defines directories for plans, controls permissions

/project-init only fills in what can be detected automatically; intent fields like architecture and anti-patterns deliberately remain as TODOs — the boundary where human judgment takes over.
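The split between detected facts and human intent can be sketched in Python. This is not the implementation of the /project-init Skill; the marker-file table, the function names, and the layout of the generated file are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical marker files; a real scan would look at far more signals.
MARKERS = {
    "Cargo.toml": "Rust",
    "package.json": "JavaScript/TypeScript",
    "pyproject.toml": "Python",
    "go.mod": "Go",
}


def detect_languages(project: Path) -> list[str]:
    """Return the languages whose marker file exists in the project root."""
    return sorted(
        {lang for marker, lang in MARKERS.items() if (project / marker).is_file()}
    )


def render_claude_md(project: Path) -> str:
    """Fill in detected facts and leave intent fields as explicit TODOs."""
    languages = detect_languages(project) or ["(none detected)"]
    lines = ["# Project", "", "## Languages (detected)"]
    lines += [f"- {lang}" for lang in languages]
    lines += [
        "",
        "## Architecture",
        "TODO: describe the intended architecture.",
        "",
        "## Anti-patterns",
        "TODO: list patterns the agents must avoid, with reasons.",
    ]
    return "\n".join(lines) + "\n"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Cargo.toml").write_text('[package]\nname = "demo"\n', encoding="utf-8")
    detected = detect_languages(root)
    generated = render_claude_md(root)
    print(generated)
```

The TODO markers are deliberate output, not a gap in the sketch: they make visible which parts of the file a person still has to write.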

Beyond that, you can add your own structures. In my Workflow-Blueprint, for example, I define workflow files that describe different degrees of autonomy, from "develop-review with manual approval per commit" to "fully autonomous implementation after plan approval." The Lead reads these files and offers them to the user as a choice. This isn't a Claude Code feature, but a convention controlled via the CLAUDE.md.

Blueprints mature — not all at once

My first Blueprint was an experiment: five agents with fixed roles, a rigid sequence, and extensive hooks — automatically executed scripts — for pre-commit validation. With it, I rebuilt rlsp from scratch over and over — at first purely throwaway attempts to understand how the agents work under different conditions.

Out of that grew the Workflow-Blueprint. Instead of rigid sequences, users choose between different workflows depending on the task — develop-review with manual approval, a fully autonomous variant, etc. The hooks fell away — instead, conditional Rules handle quality enforcement. This is the Blueprint we use as a team on the client project today.

After it had proven itself in practice, the Autonomous-Blueprint emerged — specifically for hands-off programming. The Lead takes over planning and task decomposition itself and controls when advisory agents are brought in. Optimized for throughput with minimal human interaction.

After several attempts, something serious came out of it: rlsp (Rust Language Server Project) — lean language servers in Rust, currently one for YAML. The majority of the code in it came about with the Autonomous-Blueprint; by now rlsp is standalone, with extensions for VS Code and Zed. I designed the architecture and the rules — the agents wrote the code. Not perfect, but the code quality is in no way inferior to the hand-written parts.

But how do Blueprints actually get better?

At first this was still manual work: observe a bug in the test project, go through the cause with Claude, have the improvement implemented, and copy the .claude/ directory back into the project. That worked as long as the setups were manageable.

With growing complexity, that was no longer enough. Individual bug descriptions via prompt fell short — you had to understand the entire flow: which agents were involved? In what order? Which Rule took effect, which didn't? Improving a Blueprint thus became a structured task in its own right.

In parallel, a new way of working on error analysis emerged: when an agent team produces an error, I have the team itself conduct a retrospective. The agents analyze their own flow — which decisions were made, where the process failed, what the cause was. The result lands as a report in the project. From this report I then derive the concrete improvements to the Blueprint.

An example: the reviewer agent checked code quality — correctness, tests, linting. What it didn't check: whether the code fully implements the plan. In one case, an entire feature infrastructure was built, tested, and waved through review — but never integrated into the server. Dead code, cleanly written. The retrospective identified the gap: the reviewer knew the task, but not the plan. Out of that came the plan reviewer — an agent that checks before approval whether the implementation actually fulfills the plan.
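The gap the retrospective found can be stated as a small check. The Python sketch below is a hypothetical model of the two viewpoints, not the plan reviewer itself, which is an agent and not a script; the PlanStep type and both functions are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    """One step of an approved plan and what the team has done about it."""
    description: str
    implemented: bool = False
    tested: bool = False


def code_review_passes(steps: list[PlanStep]) -> bool:
    """Task-level view: everything that was written is also tested."""
    written = [step for step in steps if step.implemented]
    return all(step.tested for step in written)


def unfulfilled_steps(steps: list[PlanStep]) -> list[str]:
    """Plan-level view: every planned step must be implemented and tested."""
    return [
        step.description
        for step in steps
        if not (step.implemented and step.tested)
    ]


plan = [
    PlanStep("Build feature infrastructure", implemented=True, tested=True),
    PlanStep("Integrate feature into the server"),  # never started
]

print(code_review_passes(plan))
print(unfulfilled_steps(plan))
```

The code review passes because it only looks at what exists, while the plan-level check reports the integration step that nobody picked up.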

This cycle — error, retrospective, structural improvement — is the actual development process. You don't build the perfect setup right away — you build a system that learns from its own mistakes.

Conditional Rules — quality without hooks

Claude Code offers hooks as a mechanism to run shell commands on certain events — e.g. tests before a commit or validation on completing a task. In my early setups I made heavy use of them.

In practice, a different approach has proven more effective: conditional Rules. These are modular instruction files that activate automatically as soon as an agent touches files of a certain type. If an agent opens a .rs file, the Rust-specific Rules load. If it touches TypeScript, the TypeScript conventions apply. Without manual intervention, without configuration overhead.

The advantage over hooks: Rules work preventively — the agent knows the rules before it writes code. Hooks only act reactively, once the code is already written.
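How path-dependent activation behaves can be shown with a simplified Python model. It is not Claude Code's actual matching logic; the rule names and glob patterns are assumptions, and a rule without patterns stands for an unconditional rule.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    """A rule file; an empty `paths` tuple means the rule is unconditional."""
    name: str
    paths: tuple[str, ...] = ()


def active_rules(rules: list[Rule], touched_file: str) -> list[str]:
    """Return the names of rules that apply when an agent touches a file.

    Note: with fnmatch, `*` also matches `/`, so `*.rs` covers nested paths.
    """
    return [
        rule.name
        for rule in rules
        if not rule.paths
        or any(fnmatchcase(touched_file, pattern) for pattern in rule.paths)
    ]


RULES = [
    Rule("general-conventions"),                  # always active
    Rule("rust", paths=("*.rs",)),                # only for Rust sources
    Rule("typescript", paths=("*.ts", "*.tsx")),  # only for TypeScript sources
]

print(active_rules(RULES, "src/api/handler.rs"))
print(active_rules(RULES, "web/app.tsx"))
print(active_rules(RULES, "README.md"))
```

An agent working only on documentation never sees the Rust conventions, which keeps its context small; an agent touching both Rust and TypeScript files gets both sets.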

Write rationales — even for agents

One of the most effective insights from working with Blueprints: instructions without a rationale produce brittle obedience. The agent follows the rule — but misunderstands its spirit.

An example from practice:

Weak: "Every test must run in isolation."

Good: "Every test must run in isolation — shared state between tests creates order-dependent failures that only appear in CI and are expensive to debug."

The second sentence gives the agent a basis for trade-offs: if full isolation is expensive in a concrete case, it can judge whether the trade-off is acceptable. Without a rationale, that judgment is missing.

This applies to everything: CLAUDE.md, agent definitions, Rules, workflow descriptions. Those who give reasons get more reliable results.

Static analysis as a guardrail

One pattern runs through all my projects: the stricter the tools, the better the code the agents produce. Static code analysis — compiler, type checking, linters — should therefore be used as broadly and as strictly as possible. Every rule a tool enforces automatically is a guardrail: it works independently of the agent and catches errors before they even reach review.

This effect can be amplified in any language. Strict linter configurations, treating compiler warnings as errors, additional static checks — and in the Blueprint, language-specific Rules (e.g. for Rust, Python, TypeScript, Go) that complement project-specific conventions. That way the agent knows not just the language, but the rules that apply in exactly this project.
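As a sketch of such a guardrail, the following Python script runs a list of checks and reports every command that exits with a non-zero status. The Rust command list is one possible strict configuration and an assumption about the toolchain, not the setup from the client project; the demo uses stand-in commands so that it runs without cargo installed.

```python
import subprocess
import sys

# One possible strict gate for a Rust project (assumed, adjust as needed).
RUST_GUARDRAILS = [
    ["cargo", "fmt", "--check"],
    ["cargo", "clippy", "--all-targets", "--", "-D", "warnings"],
    ["cargo", "test"],
]


def run_guardrails(commands: list[list[str]]) -> list[str]:
    """Run each command and return the ones that failed (non-zero exit)."""
    failed = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(" ".join(command))
    return failed


# Stand-in commands: the first succeeds, the second fails.
demo = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(1)"],
]
failures = run_guardrails(demo)
print(f"{len(failures)} of {len(demo)} checks failed")
```

In a real project, run_guardrails(RUST_GUARDRAILS) would take the place of the demo list. The -D warnings flag turns every Clippy warning into an error, so a warning cannot pass the gate unnoticed.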

Sandboxed execution — the devcontainer as a safety net

Anyone running autonomous agents doesn't want them accidentally writing into their own development environment. A devcontainer template solves this: agents run in an isolated container with their own Claude configuration directory. The host system stays untouched. Your own settings are mounted read-only as a template.
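A devcontainer definition along these lines could look as follows, written here as a Python dictionary that serializes to devcontainer.json. Everything specific is an assumption for illustration and not taken from the author's template: the container name, the base image, the target paths, and the use of the CLAUDE_CONFIG_DIR environment variable to give the container its own configuration directory.

```python
import json

devcontainer = {
    "name": "agent-sandbox",  # hypothetical name
    "image": "mcr.microsoft.com/devcontainers/rust:1",  # assumed base image
    "containerEnv": {
        # Assumption: the agents use a config directory inside the container.
        "CLAUDE_CONFIG_DIR": "/home/vscode/.claude",
    },
    "mounts": [
        # Host settings are visible only as a read-only template.
        "source=${localEnv:HOME}/.claude,"
        "target=/opt/claude-template,type=bind,readonly",
    ],
}

print(json.dumps(devcontainer, indent=2))
```

The read-only mount is what keeps the host untouched: the container can read the template and copy from it, but anything the agents write ends up in the container's own directory.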


Learnings

What I learned along the way — in brief:

Context is everything. That was our biggest learning experience at the start of the project. Agents work with what they know — a weak CLAUDE.md or a missing project description produces generic results, regardless of the model.

Give reasons, not just rules. Instructions with a rationale are followed more reliably and transfer better to new situations.

Scope tasks correctly. Tasks too small: the coordination overhead outweighs the benefit. Tasks too large: agents work too long without a check-in, errors accumulate. The sweet spot lies with tasks that have a clear, self-contained result — one function, one test file, one review.

Use the Lead consistently as an orchestrator. As soon as the Lead writes code itself instead of delegating, you lose the quality gates of the specialized agents.

Level 2 is hard to let go of. The impulse to step in and do things "quickly yourself" is strong. Autonomy comes from letting go — and from trust in the structure you've built.

Agents take shortcuts. If a process step looks optional, an agent will skip it — even when it shouldn't. Multi-step flows need explicit markers: "always," "without exception," "even if the previous step already succeeded." That sounds redundant — but it's the difference between a process that holds under pressure and one that quietly falls apart.

Quality gates from the start. Conditional Rules, plan-approval workflows, and an independent reviewer are not a nice-to-have. Without them you get results that require costly correction afterward.


Conclusion

Most developers stay at level 2 — AI as faster autocomplete. My path over the past few months led elsewhere: first on the client project, then in rlsp, I built multi-agent setups with Claude Code, bundled into reproducible Blueprints. The decisive lever was never more prompting, but structure — clear roles, built-in guardrails, and a process that learns from its own mistakes. The result is workflows that produce code I no longer write line by line myself — and a change of role: from developing toward designing systems that reliably deliver good code.


Getting started & outlook

Anyone who wants to start with autonomous workflows doesn't have to build a complete Blueprint setup right away. The first step is simpler:

  1. Write a CLAUDE.md — for every project. Even if you're still working alone with Claude. It forces you to make goals, constraints, and quality criteria explicit — including rationales.
  2. Create Rules — define language-specific or project-specific rules in .claude/rules/. They load automatically and improve code quality immediately.
  3. First delegation — fully delegate a clearly scoped task and don't intervene until it's finished.

Agent Teams come once these basics are in place.

My long-term vision: a complete cycle in which users write issues and feature requests, and after an approval, a workflow automatically takes over the implementation. I'm still some way from that. But the building blocks for it already exist.

The repository with my Blueprint setups is available on GitHub: chdalski/claude_orchestration. It's a living repository — the Blueprints keep evolving with every new experience. Anyone who wants to follow the state described in this article will find it under the tag blog-2026-06.
