Autonomous development workflows with Claude Code

Most developers today use AI tools as faster autocomplete. Over the past few months, on a client project, I took a different path: multi-agent setups with Claude Code, where specialized agents work in parallel, review one another, and coordinate on their own. In this article I describe the journey there — from the initial frustration with missing context handoff, through gradually building reproducible Blueprints, to workflows that produce code I no longer write line by line myself. An experience report with concrete learnings for anyone who wants to take the next step toward autonomous development.

Five levels, one mirror

In early 2026, Dan Shapiro published an article that quickly made the rounds in the industry: The Five Levels from Spicy Autocomplete to the Software Factory. The idea behind it isn't especially original: essentially, it transfers the familiar autonomy levels of self-driving cars to AI-assisted software development. But since the industry had no vocabulary of its own for what many of us are living through right now, it soaked up the framework like a sponge.

Here's the overview in brief:

Level	What the AI does	What the human does
0 – Manual	nothing	writes every line itself
1 – Assisted	discrete individual tasks (tests, docs)	controls the entire process
2 – Collaborative development	takes over routine tasks in the flow	leads — owns structure, context & responsibility
3 – Human oversight	becomes the primary developer	reviews diffs instead of writing code
4 – Autonomous execution	implements specifications over longer periods	writes specs, checks results after the fact
5 – Dark factory	turns specs directly into software	barely involved anymore

The crux is at level 2: this is where most "AI-native" developers work today — and that's exactly the problem. Level 2 feels complete. It isn't.

Shapiro estimates that most teams stagnate at level 3. Only a few small teams have reached level 5 so far.

I'd place myself between level 3 and 4 — with a clear lean toward 4. What got me there is less a particular tool than a changed way of thinking about how I work with AI.

The lure of level 2

Level 2 is comfortable — you write code, the AI helps you do it faster and with less typing, sometimes with surprisingly good ideas. The flow is right, and you feel productive.

This perception is also reflected in the numbers: according to a study by the Weizenbaum Institute (Krzywdzinski & Wotschack, 2026), the central goals of using AI remain automating work steps and boosting efficiency. 57 percent of the surveyed companies use AI assistants to support their employees. The freed-up time flows into quality improvements — not into a change in the way of working itself. Doing the same thing faster, not working differently.

But level 2 has a subtle limit: the AI reacts, the human leads — and responsibility for structure, context, and decisions rests entirely with you.

That means: as complexity grows, so does your own effort. At level 2, AI support doesn't automatically scale with the problem.

The turning point

My path to autonomous workflows wasn't a straight line.

The first step was online tools like ChatGPT or Gemini. Helpful, but with a fundamental problem: the project context is missing. You can't know for sure what the model needs in order to truly understand a question. And even if you knew, manually copying everything relevant into the chat simply wouldn't be feasible. The result: generic answers to complex questions.

GitHub Copilot was another attempt. Right in the editor, always available, but with only the current file as context. What wasn't visible didn't exist for Copilot. Personally, I found that the tool got in my way more often than it helped.

The real breakthrough came with agent tools like Claude Code: agents that read project context on their own, including files, structures, and dependencies. Suddenly I could ask a question without having to decide myself what the model needs to know to answer it; the agent handles that. The available context is thus much larger than with any online tool, but it only really comes together when you deliberately give the agent additional information: Rules (fixed rules the agent follows automatically), project descriptions, and good problem descriptions.

The concrete trigger

A few months ago, a new client project started. The team was asked to evaluate whether Claude Code was usable. My part is primarily the Rust backend — a set of Lambdas and libraries, the largest of them an API based on Axum — and the cloud infrastructure.

At the start it was frustrating. It was hard to give the agent the necessary context — which was a big learning experience on our side. And it was hard to get it to look at a problem from different angles: security, testing, maintainability. You had to initiate everything yourself.

That changed with the release of Opus 4.6 and the experimental "Agent Teams" feature. From that moment on, we could not only deploy specialized agents but interact with them directly — previously they were already running but couldn't be addressed directly, which made debugging harder. By then we had grasped how the pieces fit together, and I began building my own multi-agent setups with Agent Teams — configurations in which different agents, each with their own knowledge and their own perspective, work together.

Before, I worked with Claude in dialogue — question, answer, correction, next question — useful, but at its core not much different from pair programming.

With Agent Teams, the dialogue became a process. A Lead coordinates, Teammates work in parallel, a shared task list keeps the overview. The decisive difference from Subagents — helper agents that only report back to the Lead: Teammates communicate directly with one another. They exchange intermediate results, question each other, and coordinate without my involvement. This is no longer an assistant, but an autonomous development team.
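As a rough mental model of the shared task list (a toy sketch, not how Claude Code implements Agent Teams), one can picture a structure from which teammates claim work atomically, so that no task is picked up twice. All class, field, and teammate names below are hypothetical.

```python
from dataclasses import dataclass, field
from threading import Lock
from typing import Optional


@dataclass
class Task:
    title: str
    owner: Optional[str] = None  # name of the teammate that claimed the task
    done: bool = False


@dataclass
class SharedTaskList:
    """Toy model of a task list shared by a Lead and its teammates."""

    tasks: list[Task] = field(default_factory=list)
    _lock: Lock = field(default_factory=Lock, repr=False)

    def add(self, title: str) -> None:
        with self._lock:
            self.tasks.append(Task(title))

    def claim(self, teammate: str) -> Optional[Task]:
        """Hand the next unclaimed task to a teammate, or None if none is left."""
        with self._lock:
            for task in self.tasks:
                if task.owner is None:
                    task.owner = teammate
                    return task
        return None

    def open_titles(self) -> list[str]:
        """Titles of all tasks that are not finished yet."""
        with self._lock:
            return [t.title for t in self.tasks if not t.done]


board = SharedTaskList()
board.add("implement endpoint")
board.add("write integration tests")

first = board.claim("developer")
second = board.claim("test-engineer")
third = board.claim("reviewer")  # both tasks are already claimed
```

The lock only matters when teammates really run in parallel; it is there to show that claiming a task has to be a single, indivisible step.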

The shift in thinking

The step to level 3/4 isn't just technical — above all, it's mental.

For a long time I thought: "I develop software." With Agent Teams that shifts: "I configure Agent Teams." I write less code — but I decide how the team is set up, which rules apply, which quality criteria take effect. The work becomes more abstract, the impact greater.

Letting go is hard for many, and that's understandable. Developing is fun: solving problems, designing systems, and writing elegant code aren't activities you gladly hand off.

Still, I believe: those who don't take this step won't be able to keep up in the long run. For now, you still need to understand how software works under the hood — that remains important for steering agents sensibly and judging their results. But the better the models get, the less you'll be confronted with the details.

The real value shifts: from the ability to write code to the ability to design good systems, and to set up Agent Teams so that they reliably produce good code. A strong IT background remains, in my view, more necessary than ever: the more fast-moving software and systems become, the more important it is that people devote their attention to them and keep them maintained, albeit on a different level.

The craft: what a setup looks like

The jump to level 3/4 doesn't happen through more prompting — it happens through structure.

Before I describe the individual building blocks, a quick look at how a typical work cycle feels.

I describe what I want to implement. The Lead asks clarifying questions, few or many depending on complexity. For tasks in areas familiar to me, that's settled in minutes. For topics where I have no detailed knowledge myself (like the Language Server Protocol for rlsp, a project of my own), I ask the agent questions too, so that I can make decisions at all. Then a plan emerges that is reviewed internally before I see it. I read it, check it, and raise questions about anything that is unclear to me. Once I approve the plan, the agent team starts implementing.

What I do during execution depends on the task. On unfamiliar terrain I watch — sometimes I have the Lead stop between tasks to make sure everything is going according to plan. For routine tasks I look at the commits afterward, sometimes not even that. If something goes fundamentally wrong, I adjust the Blueprint (the copyable setup template) and run the plan from the start again — which at the same time shows whether the change to the setup takes effect.

At first I intervened far more often. By now, even more complex tasks run through reliably. The work shifts: less intervening, more deciding up front.

The Lead — orchestrator, not developer

In the case of Claude Code, the Lead is the central coordinator — the initial session you start via the CLI. It communicates with the user, clarifies tasks, proposes workflows, and coordinates the team. What it explicitly does not do: develop itself.

That sounds simpler than it is. The Lead is configured via <project-directory>/.claude/CLAUDE.md — and unlike the specialized agents, whose permissions can be restricted via frontmatter, it always runs with full permissions. That makes it especially prone to leaving prescribed paths and doing things itself instead of delegating. My experience: it's best to use the Lead consistently as an orchestrator and communication hub. Clear instructions in the CLAUDE.md help with that — including explicit prohibitions.

Blueprints — reproducible setups

A Blueprint is a copyable template: a .claude/ directory with everything a project needs for autonomous agent workflows. You copy it into the project, adjust it, and the setup is ready to use.
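The copy step itself can be scripted. The following is a minimal sketch, assuming the Blueprint lives in a directory that contains a .claude/ folder; the function name install_blueprint and the refusal to overwrite an existing setup are illustrative choices, not part of the setup described here.

```python
import shutil
from pathlib import Path


def install_blueprint(blueprint_dir: Path, project_dir: Path) -> Path:
    """Copy a Blueprint's .claude/ directory into a project.

    Refuses to overwrite an existing .claude/ directory so that local
    adjustments are not silently lost.
    """
    source = blueprint_dir / ".claude"
    target = project_dir / ".claude"
    if not source.is_dir():
        raise FileNotFoundError(f"no .claude/ directory in {blueprint_dir}")
    if target.exists():
        raise FileExistsError(f"{target} already exists; merge manually")
    shutil.copytree(source, target)
    return target
```

Failing loudly on an existing directory is deliberate: a project that already carries adjusted Rules or agent definitions should be merged by hand rather than replaced.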

The building blocks that Claude Code ships with out of the box and that a Blueprint uses:

CLAUDE.md (Lead instructions): role, startup behavior, clarification process — the central configuration file the Lead loads on start
agents/ (specialized agents): e.g. architect, developer, test engineer, security engineer, reviewer — each with a clearly defined role and restrictions
rules/ (modular Rules): topic-specific instructions loaded automatically for all agents — unconditional (always active) or conditional (path-dependent, e.g. only for .rs files)
skills/ (reusable Skills): project-specific commands like /project-init, which scans a project and generates a first project CLAUDE.md from languages, frameworks, and test tools
settings.json (configuration): enables Agent Teams, defines directories for plans, controls permissions

/project-init only fills in what can be detected automatically; intent fields like architecture and anti-patterns deliberately remain as TODOs — the boundary where human judgment takes over.
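To make that boundary concrete, here is a small sketch of the idea behind such a scan. It is not the actual /project-init skill: the marker files, the section names, and both function names are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical marker files; a real scan would cover far more cases.
MARKERS = {
    "Cargo.toml": "Rust",
    "package.json": "JavaScript/TypeScript",
    "pyproject.toml": "Python",
    "go.mod": "Go",
}


def detect_languages(project_dir: Path) -> list[str]:
    """Return the languages whose marker file exists in the project root."""
    found = {lang for name, lang in MARKERS.items() if (project_dir / name).is_file()}
    return sorted(found)


def draft_claude_md(project_dir: Path) -> str:
    """Build a first CLAUDE.md draft: detected facts filled in, intent left as TODO."""
    languages = detect_languages(project_dir) or ["TODO: no language detected"]
    lines = [
        "# Project",
        "",
        "## Languages",
        *[f"- {lang}" for lang in languages],
        "",
        "## Architecture",
        "TODO: describe the intended architecture",
        "",
        "## Anti-patterns",
        "TODO: list patterns the agents must avoid",
    ]
    return "\n".join(lines) + "\n"
```

Everything the scan can observe ends up in the draft; everything that expresses intent stays a visible TODO for a human to fill in.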

Beyond that, you can add your own structures. In my Workflow-Blueprint, for example, we define workflow files that describe different degrees of autonomy — from "develop-review with manual approval per commit" to "fully autonomous implementation after plan approval." The Lead reads these files and offers them to the user as a choice. This isn't a Claude Code feature, but a convention controlled via the CLAUDE.md.
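The degrees of autonomy can be thought of as sets of approval gates. The sketch below is only a model of that idea: in the setup described here the workflows are prose files that the Lead reads, and the names Step, Workflow, and needs_human are hypothetical. It also assumes that both workflows require plan approval.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    PLAN = "plan"
    COMMIT = "commit"


@dataclass(frozen=True)
class Workflow:
    name: str
    approval_required: frozenset[Step]  # steps at which a human must sign off

    def needs_human(self, step: Step) -> bool:
        return step in self.approval_required


# Assumption: both workflows gate on the plan; only develop-review gates each commit.
DEVELOP_REVIEW = Workflow("develop-review", frozenset({Step.PLAN, Step.COMMIT}))
AUTONOMOUS = Workflow("autonomous", frozenset({Step.PLAN}))
```

Seen this way, choosing a workflow means choosing where a human has to say yes before the team continues.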

Blueprints mature — not all at once

My first Blueprint was an experiment: five agents with fixed roles, a rigid sequence, and extensive hooks — automatically executed scripts — for pre-commit validation. With it, I rebuilt rlsp from scratch over and over — at first purely throwaway attempts to understand how the agents work under different conditions.

Out of that grew the Workflow-Blueprint. Instead of rigid sequences, users choose between different workflows depending on the task — develop-review with manual approval, a fully autonomous variant, etc. The hooks fell away — instead, conditional Rules handle quality enforcement. This is the Blueprint we use as a team on the client project today.

After it had proven itself in practice, the Autonomous-Blueprint emerged — specifically for hands-off programming. The Lead takes over planning and task decomposition itself and controls when advisory agents are brought in. Optimized for throughput with minimal human interaction.

After several attempts, something serious came out of it: rlsp (Rust Language Server Project) — lean language servers in Rust, currently one for YAML. The majority of the code in it came about with the Autonomous-Blueprint; by now rlsp is standalone, with extensions for VS Code and Zed. I designed the architecture and the rules — the agents wrote the code. Not perfect, but the code quality is in no way inferior to the hand-written parts.

But how do Blueprints actually get better?

At first this was still manual work: observe a bug in the test project, go through the cause with Claude, have the improvement implemented, and copy the .claude/ directory back into the project. That worked as long as the setups were manageable.

With growing complexity, that was no longer enough. Individual bug descriptions via prompt fell short — you had to understand the entire flow: which agents were involved? In what order? Which Rule took effect, which didn't? Improving a Blueprint thus became a structured task in its own right.

In parallel, a new way of working on error analysis emerged: when an agent team produces an error, I have the team itself conduct a retrospective. The agents analyze their own flow — which decisions were made, where the process failed, what the cause was. The result lands as a report in the project. From this report I then derive the concrete improvements to the Blueprint.

An example: the reviewer agent checked code quality — correctness, tests, linting. What it didn't check: whether the code fully implements the plan. In one case, an entire feature infrastructure was built, tested, and waved through review — but never integrated into the server. Dead code, cleanly written. The retrospective identified the gap: the reviewer knew the task, but not the plan. Out of that came the plan reviewer — an agent that checks before approval whether the implementation actually fulfills the plan.

This cycle — error, retrospective, structural improvement — is the actual development process. You don't build the perfect setup right away — you build a system that learns from its own mistakes.

Conditional Rules — quality without hooks

Claude Code offers hooks as a mechanism to run shell commands on certain events — e.g. tests before a commit or validation on completing a task. In my early setups I made heavy use of them.

In practice, a different approach has proven more effective: conditional Rules. These are modular instruction files that activate automatically as soon as an agent touches files of a certain type. If an agent opens a .rs file, the Rust-specific Rules load. If it touches TypeScript, the TypeScript conventions apply. Without manual intervention, without configuration overhead.

The advantage over hooks: Rules work preventively — the agent knows the rules before it writes code. Hooks only act reactively, once the code is already written.
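The selection logic behind conditional Rules can be sketched in a few lines. This is a conceptual model, not Claude Code's implementation: the rule names are invented, and Python's fnmatch is a simplified stand-in for whatever path matching the tool actually applies.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Rule:
    name: str
    patterns: tuple[str, ...] = ()  # empty means unconditional (always active)

    def applies_to(self, path: str) -> bool:
        if not self.patterns:
            return True
        return any(fnmatch(path, pattern) for pattern in self.patterns)


# Hypothetical rule set: one unconditional rule and two path-dependent ones.
RULES = [
    Rule("general-conventions"),
    Rule("rust", ("*.rs",)),
    Rule("typescript", ("*.ts", "*.tsx")),
]


def active_rules(path: str) -> list[str]:
    """Names of all rules that apply to the file an agent is touching."""
    return [rule.name for rule in RULES if rule.applies_to(path)]
```

Under these assumptions, active_rules("src/api/handler.rs") yields the general conventions plus the Rust rule, while a file like README.md only picks up the unconditional one.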

Write rationales — even for agents

One of the most effective insights from working with Blueprints: instructions without a rationale produce brittle obedience. The agent follows the rule — but misunderstands its spirit.

An example from practice:

Weak: "Every test must run in isolation."

Good: "Every test must run in isolation — shared state between tests creates order-dependent failures that only appear in CI and are expensive to debug."

The second sentence gives the agent a basis for trade-offs: if full isolation is expensive in a concrete case, it can judge whether the trade-off is acceptable. Without a rationale, that judgment is missing.

This applies to everything: CLAUDE.md, agent definitions, Rules, workflow descriptions. Those who give reasons get more reliable results.

Static analysis as a guardrail

One pattern runs through all my projects: the stricter the tools, the better the code the agents produce. Static code analysis — compiler, type checking, linters — should therefore be used as broadly and as strictly as possible. Every rule a tool enforces automatically is a guardrail: it works independently of the agent and catches errors before they even reach review.

This effect can be amplified in any language. Strict linter configurations, treating compiler warnings as errors, additional static checks — and in the Blueprint, language-specific Rules (e.g. for Rust, Python, TypeScript, Go) that complement project-specific conventions. That way the agent knows not just the language, but the rules that apply in exactly this project.
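A guardrail of this kind can be as simple as a script that runs every check and reports the ones that fail. The sketch below uses small Python one-liners as stand-ins for real tools so that it runs anywhere; the function name run_checks and the choice of commands are assumptions.

```python
import subprocess
import sys
from typing import Sequence


def run_checks(checks: Sequence[Sequence[str]]) -> list[str]:
    """Run each command; return the ones that exited with a non-zero status."""
    failed = []
    for command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(" ".join(command))
    return failed


# Stand-ins for real tools. In a Rust project these could be, for example,
# ["cargo", "clippy", "--", "-D", "warnings"] and ["cargo", "test"].
checks = [
    [sys.executable, "-c", "raise SystemExit(0)"],  # a passing check
    [sys.executable, "-c", "raise SystemExit(1)"],  # a failing check
]
failed = run_checks(checks)
```

The point is that the verdict comes from exit codes rather than from the agent's own assessment, so it holds regardless of which agent produced the code.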

Sandboxed execution — the devcontainer as a safety net

Anyone running autonomous agents doesn't want them accidentally writing into the host development environment. A devcontainer template solves this: agents run in an isolated container with their own Claude configuration directory. The host system stays untouched. Your own settings are mounted read-only as a template.

Learnings

What I learned along the way — in brief:

Context is everything. That was our biggest learning experience at the start of the project. Agents work with what they know — a weak CLAUDE.md or a missing project description produces generic results, regardless of the model.

Give reasons, not just rules. Instructions with a rationale are followed more reliably and transfer better to new situations.

Scope tasks correctly. Tasks too small: the coordination overhead outweighs the benefit. Tasks too large: agents work too long without a check-in, errors accumulate. The sweet spot lies with tasks that have a clear, self-contained result — one function, one test file, one review.

Use the Lead consistently as an orchestrator. As soon as the Lead writes code itself instead of delegating, you lose the quality gates of the specialized agents.

Level 2 is hard to let go of. The impulse to step in and do things "quickly yourself" is strong. Autonomy comes from letting go — and from trust in the structure you've built.

Agents take shortcuts. If a process step looks optional, an agent will skip it — even when it shouldn't. Multi-step flows need explicit markers: "always," "without exception," "even if the previous step already succeeded." That sounds redundant — but it's the difference between a process that holds under pressure and one that quietly falls apart.

Quality gates from the start. Conditional Rules, plan-approval workflows, and an independent reviewer are not a nice-to-have. Without them you get results that require costly correction afterward.

Conclusion

Most developers stay at level 2 — AI as faster autocomplete. My path over the past few months led elsewhere: first on the client project, then in rlsp, I built multi-agent setups with Claude Code, bundled into reproducible Blueprints. The decisive lever was never more prompting, but structure — clear roles, built-in guardrails, and a process that learns from its own mistakes. The result is workflows that produce code I no longer write line by line myself — and a change of role: from developing toward designing systems that reliably deliver good code.

Getting started & outlook

Anyone who wants to start with autonomous workflows doesn't have to build a complete Blueprint setup right away. The first step is simpler:

Write a CLAUDE.md — for every project. Even if you're still working alone with Claude. It forces you to make goals, constraints, and quality criteria explicit — including rationales.
Create Rules — define language-specific or project-specific rules in .claude/rules/. They load automatically and improve code quality immediately.
First delegation — fully delegate a clearly scoped task and don't intervene until it's finished.

Agent Teams come once these basics are in place.

My long-term vision: a complete cycle in which users write issues and feature requests and, after an approval, a workflow automatically takes over the implementation. I'm still some way from that. But the building blocks for it already exist.

The repository with my Blueprint setups is available on GitHub: chdalski/claude_orchestration. It's a living repository — the Blueprints keep evolving with every new experience. Anyone who wants to follow the state described in this article will find it under the tag blog-2026-06.

Blog author

Christoph Dalski
