The Starting Point
When we at codecentric recently took over a codebase from a previous service provider for a client, it quickly became clear that this would be no ordinary challenge. Backends, frontends, batch jobs, services: an application landscape that had grown layer by layer over the years. The central backend alone comprised more than 70 projects and nearly 350,000 lines of production code. On top of that came numerous additional services that completed the overall picture. What awaited us was something many development teams know from their own experience: a codebase that has grown over years, and a test landscape that didn't always keep pace with that growth.
The state of the existing tests was anything but satisfactory. Tests existed, but a significant portion of them no longer ran without errors. Even more problematic was that the tests were neither automated nor regularly executed in the CI/CD pipeline. As a result, the decay of the tests went unnoticed. Our first step was to clean up this situation, repair all failing tests, and establish a stable baseline. The outcome of that work: a line coverage of around 58% with 6,646 tests. Our target was 80%. The goal wasn't to validate the correctness of the existing production code, but first and foremost to lock in the status quo with tests and to better detect unexpected changes during later modifications to the production code.
With several hundred thousand lines of production code, that means writing even more lines of test code. Since we were already using Claude Code as an AI-assisted tool in our day-to-day project work, it was a natural fit to tackle test generation with it.
We applied this approach to a .NET project. Apart from a few minor .NET-specific challenges, our methodology is fundamentally technology-agnostic.
The First Attempt: Thinking Too Big
Our first prompt was kept deliberately simple: Claude should write unit tests for all projects, extend existing test classes, create new ones where none exist yet, and aim for 90% line coverage. We intentionally set the target at 90% to have a buffer and be able to delete poor tests. What happened next was instructive, if in an uncomfortable way.
Prompt:
Write unit tests for all projects in the solution. The goal is to achieve 90% line coverage. Look at the existing projects and project structure and write the tests analogously to the existing unit tests. If tests already exist for the classes being tested, extend the corresponding test class; otherwise create new test files analogously to the other test files. The production code must not be modified.
Claude started by writing its own code to measure line coverage, even though the project already had a method for that. We initially ignored this and assumed the result wouldn't differ significantly. That turned out to be a mistake. Claude generated tests in iterations, each producing less gain than the one before. When a new round produced less than 0.1% additional coverage, we stopped. After several hours of runtime and enormous token consumption, the result was an increase of just around 2%, alongside thousands of newly created tests. The reason wasn't immediately obvious.
The subsequent analysis revealed two root causes. First, Claude had not recognized which code paths were already covered by the existing tests. Instead of closing coverage gaps, it was massively retesting scenarios that were already covered. Second, the context was simply too large. Trying to process a codebase of this dimension in a single run means the AI can neither keep the big picture in view nor write precise tests at the class level. Both factors together led to the sobering result.
Focused and Iterative
We drew our conclusions from these mistakes. Instead of tackling the entire solution at once, we chose a single project with no existing tests at all as our starting point. From there, we created a skill using the skill builder built into Claude Code. Since the conventions we had captured up to that point were either incomplete or not consistently followed, we instructed the skill builder to analyze the existing test projects, derive the corresponding test conventions and patterns, and store them directly in the skill. It independently found our base classes, the frameworks in use (xUnit, FluentAssertions, NSubstitute), the test structure, and the naming conventions. Within this clearly defined context, we refined the skill until the first project had achieved good test coverage. Compared to the first attempt, the result was significantly better: the individual project reached the target line coverage of 80%.
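To illustrate the conventions the skill derived, a generated test might look roughly like this. This is a minimal sketch; the domain types and all names are invented for this example and included only to keep it self-contained.

using System.Collections.Generic;
using FluentAssertions;
using NSubstitute;
using Xunit;

// Invented production types, included only to keep the sketch compilable.
public record Invoice(int Id, decimal Amount);

public interface IInvoiceRepository
{
    IReadOnlyList<Invoice> FindOpen();
}

public class InvoiceService
{
    private readonly IInvoiceRepository _repository;
    public InvoiceService(IInvoiceRepository repository) => _repository = repository;
    public IReadOnlyList<Invoice> GetOpenInvoices() => _repository.FindOpen();
}

// Test in the derived style: xUnit as the runner, NSubstitute for mocking,
// FluentAssertions for assertions, and the naming pattern
// MethodName_WhenCondition_ReturnsSpecificResult.
public class InvoiceServiceTests
{
    private readonly IInvoiceRepository _repository = Substitute.For<IInvoiceRepository>();
    private readonly InvoiceService _sut;

    public InvoiceServiceTests() => _sut = new InvoiceService(_repository);

    [Fact]
    public void GetOpenInvoices_WhenRepositoryIsEmpty_ReturnsEmptyList()
    {
        _repository.FindOpen().Returns(new List<Invoice>());

        var result = _sut.GetOpenInvoices();

        result.Should().BeEmpty();
    }
}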
What followed were four intensive working days with three people and around 15 commits that incrementally improved the skill — not through theoretical deliberation up front, but through real mistakes during real use. Each run revealed something that wasn't working. Every insight immediately flowed back into the skill as a new rule or convention.
Prompt:
Extract all insights from the current session about writing unit tests to achieve the test coverage. Evaluate the insights by relevance for the skill. If the relevance is high enough, add the insight to the skill.
One example is how line coverage is measured in the project and how Claude can determine which lines are not yet covered. The skill became more precise with each application. This is the crucial difference from a one-off prompt: a skill accumulates knowledge. In the end, we had a tool with which we could reliably and reproducibly reach the 80% mark for each new sub-project.
The .NET solution comprised 72 projects in total. Initially the three of us equipped projects with tests one by one, spending roughly 2 to 3 hours per project per person. This time-consuming approach, where we spent most of our time waiting, quickly led us to run multiple agents in parallel.
Parallel Operation: 48 Agents Simultaneously
Once the first projects had been augmented with tests using the skill, we wanted to process the test generation across all projects as efficiently as possible. Since unit tests are by nature independent of other unit tests, this lends itself very well to parallelization. The solution was to use sub-agents from Claude Code. The principle is that there is a main agent that orchestrates a certain number of sub-agents. Using our skill, we instructed the main agent to spawn a separate sub-agent for each test class. Each of these agents had the task of creating unit tests for a specific class. We capped the number of parallel sub-agents at 8. This brought a significant boost in speed.
From the skill:
Always assign individual agents to write tests (one agent per class, model: "Sonnet"). Do NOT assign multiple classes to one agent. Start up to 8 agents in parallel.
Still, we saw further potential. The hardware resources of our laptops were barely being utilized, so why not squeeze out more? Instead of simply adding more agents, we decided to scale the skill itself.
The solution: Git worktrees. Each of our three developers worked simultaneously on two or more projects in separate worktrees, so the processes didn't block each other. In each of these six worktrees, the skill ran with up to 8 parallel agents. That adds up to a peak of 48 agents writing tests simultaneously.
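Setting this up needs nothing beyond standard Git. A sketch with invented branch and directory names:

# One worktree per sub-project, each with its own branch and working
# directory, so parallel Claude Code sessions don't block each other.
git worktree add -b tests/project-a ../codebase-project-a
git worktree add -b tests/project-b ../codebase-project-b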
The Result
What would have kept us occupied manually for months, we achieved as a team of three in four working days: we raised line coverage from 58% to 82% and generated over 16,000 new tests in the process. The path there was not a simple "put in a prompt, get tests out" — it was an iterative process of mistakes, insights, and continuous improvement.
Lessons Learned
Manage Context Deliberately
With large codebases, a single agent accumulates so much context over many iterations that the quality of its output noticeably degrades. The solution: a separate sub-agent is started for each class to be tested, knowing only that one class and terminating after the work is done. The main agent takes on exclusively the controlling role — measuring coverage, selecting the next target class, starting sub-agents, collecting results. This keeps each agent focused, the context clean, and the test quality consistently high.
Controlling Coverage Results
For Claude to review its results and plan the next iteration meaningfully, it needs precise information about where coverage is missing. In the first attempt, we had Claude parse the Cobertura files directly, and we ran into an unexpected problem: the C# compiler internally translates async methods into state machines that are compiled as their own classes. The report generator couldn't handle this and marked large portions of the code as "not coverable." Claude helped us write a small script that cleans the Cobertura files before report generation.
Prompt:
We have a problem with our code coverage measurement: In .runsettings, CompilerGeneratedAttribute is configured as an exclusion. This causes all async method bodies to be completely absent from the coverage report — they are declared as untestable and don't appear at all. The coverage numbers are artificially high as a result, because async code is counted neither as covered nor as uncovered.
The cause: The C# compiler generates state machine classes for async methods (e.g. MyClass.&lt;MyMethod&gt;d__5) and display classes for lambdas (e.g. MyClass.&lt;&gt;c__DisplayClass3_0). These carry the CompilerGeneratedAttribute and are therefore completely excluded.
If you simply remove the exclusion, the lines do appear in the report, but as separate class entries — not under the actual class. This makes the report unusable.
Create a PowerShell script to fix the cobertura files so that async methods appear correctly in the coverage of their respective class.
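For context, this is what that transformation looks like from the developer's side. A minimal sketch with invented names:

using System.Threading.Tasks;

public class ReportService
{
    // The compiler rewrites this method into a nested state machine class,
    // roughly ReportService.<LoadReportAsync>d__0, which carries the
    // CompilerGeneratedAttribute. With that attribute excluded, the lines of
    // the method body disappear from the coverage report entirely.
    public async Task<string> LoadReportAsync()
    {
        await Task.Delay(10);
        return "report";
    }
}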
For context management, we also needed a compact analysis tool: it should parse the generated Cobertura files and output, per C# file, the current coverage and the missing lines, sorted so that the files with the most uncovered lines come first and Claude immediately knows where the greatest leverage lies. Since we couldn't find a suitable tool, we had Claude create this one too.
Prompt:
Create a PowerShell script for better detection of coverage gaps that reads information from the existing Cobertura files beneath the /test-results directory and outputs it in the following compact example format to standard output:
File          Coverage   Lines         Uncovered
Service.cs    8.0%       154 / 1916    133-138, 206-211, ...
Handler.cs    16.9%      284 / 1682    190-196, ...

The files should be sorted descending by number of uncovered lines.
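The tool we actually used was the PowerShell script this prompt produced. To show the idea, here is a minimal sketch of the same logic in C#; the report location and the exact output format are simplified assumptions:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

// Aggregates the <line> entries from every Cobertura report per source file
// and prints coverage plus the uncovered line ranges, biggest gaps first.
class CoverageGaps
{
    static void Main()
    {
        // Assumed layout: Coverlet-style reports beneath ./test-results.
        var reports = Directory.EnumerateFiles("test-results", "*.cobertura.xml", SearchOption.AllDirectories);

        // file -> line number -> covered by at least one report?
        var files = new Dictionary<string, Dictionary<int, bool>>();
        foreach (var cls in reports.SelectMany(r => XDocument.Load(r).Descendants("class")))
        {
            var file = (string)cls.Attribute("filename")!;
            if (!files.TryGetValue(file, out var lines))
                files[file] = lines = new Dictionary<int, bool>();

            foreach (var line in cls.Descendants("line"))
            {
                var number = (int)line.Attribute("number")!;
                var hit = (long)line.Attribute("hits")! > 0;
                lines[number] = lines.TryGetValue(number, out var seen) ? seen || hit : hit;
            }
        }

        foreach (var (file, lines) in files.OrderByDescending(f => f.Value.Count(l => !l.Value)))
        {
            if (lines.Count == 0) continue;
            var covered = lines.Count(l => l.Value);
            var uncovered = lines.Where(l => !l.Value).Select(l => l.Key).OrderBy(n => n).ToList();
            Console.WriteLine($"{Path.GetFileName(file),-20} {100.0 * covered / lines.Count,5:F1}%  {covered} / {lines.Count}  {AsRanges(uncovered)}");
        }
    }

    // Collapses sorted line numbers into compact ranges, e.g. "133-138, 206-211".
    static string AsRanges(List<int> numbers)
    {
        var parts = new List<string>();
        for (var i = 0; i < numbers.Count;)
        {
            var j = i;
            while (j + 1 < numbers.Count && numbers[j + 1] == numbers[j] + 1) j++;
            parts.Add(i == j ? $"{numbers[i]}" : $"{numbers[i]}-{numbers[j]}");
            i = j + 1;
        }
        return string.Join(", ", parts);
    }
}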
Controlling the Tests Themselves
Not all generated tests were immediately convincing in terms of content. Tests — whether written manually or generated — are of little use if they increase line coverage but contain no meaningful assertions. We therefore equipped the skill with additional rules and capabilities, such as using snapshot testing via Verify for verifying method results.
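A minimal sketch of such a snapshot test, with invented names (depending on the Verify version, the test class may additionally need the [UsesVerify] attribute):

using System.Threading.Tasks;
using VerifyXunit;
using Xunit;

public class PriceCalculationTests
{
    [Fact]
    public Task CalculatePrice_ForStandardOrder_MatchesApprovedSnapshot()
    {
        // Stand-in for a real method result; Verify serializes the object and
        // compares it against the approved *.verified.txt file on disk.
        var result = new { Net = 100m, Tax = 19m, Gross = 119m };

        return Verifier.Verify(result);
    }
}

A failing comparison surfaces every deviation in the serialized result, which makes such tests far more meaningful than a bare "no exception was thrown" assertion.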
These rules also cover having Claude recognize "untestable" code and exclude it from test generation. Unit tests should not, for example, test access to infrastructure (DB queries, network calls, etc.), since that is handled through integration tests. Instead, untestable code should be logged for later review.
Choosing the Right Model
Initially we used the Haiku model to generate the tests. It consumed significantly fewer tokens and seemed adequate to start with. The results, however, were not reliable enough. The savings simply didn't justify the additional correction effort. We switched to model: "sonnet", and the difference was immediately noticeable. The quality and consistency of the generated code improved substantially.
Keeping an Eye on Costs
The speed at which tests were generated was impressive — as was, unfortunately, the token consumption. In the end, the API costs alone came to around 1,500 USD. Anyone considering this approach should factor that in early and weigh whether the balance of speed, quality, and cost makes sense for their own project.
Sandboxing
Using Claude in sandboxing mode eliminated the constant need to grant permissions. This considerably simplified the parallel work with multiple sub-agents.
Conclusion
Developing a dedicated skill was the decisive step. It enabled us to reach a coverage target that would simply not have been achievable manually in that timeframe — and to do so reproducibly across a large number of sub-projects.
That said, it would be dishonest to gloss over the weaknesses:
The generated test names are often long and technically heavy (MethodName_WhenCondition_ReturnsSpecificResult) and closely follow the logic of the code under test, which was frequently implemented without regard for readability or testability. To be fair, developers without deep system knowledge and domain expertise would likely have struggled just as much.
Without more precise guidance, Claude tends to generate tests with little meaningful content. Often these tests are limited to verifying that an endpoint can be called without error. The corresponding negative test merely ensures that some exception is thrown, without validating the actual result. Such tests improve coverage on paper but don't contribute to genuine quality assurance. To prevent this, it is necessary to guide Claude with precise rules and to review the generated tests.
A large portion of the tests is nonetheless useful and reliably captures business logic and edge cases. To be able to distinguish generated tests from manually written ones, we annotated them with a custom [AiGeneratedTest] attribute. This makes it possible, during later refactorings, to deliberately decide how to evaluate the tests and what to do with them. A small measure with great practical value.
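The attribute itself is trivial. A minimal sketch of how such a marker can look; the project's actual definition may differ:

using System;

// Marks tests as AI-generated so they can be found and re-evaluated
// during later refactorings.
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class)]
public sealed class AiGeneratedTestAttribute : Attribute
{
}

// Usage on a generated test:
// [Fact]
// [AiGeneratedTest]
// public void GetOpenInvoices_WhenRepositoryIsEmpty_ReturnsEmptyList() { ... }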
The AI did not reach the targeted 80% code coverage autonomously. Manual scripts for cleanup and gap identification were necessary, which puts the degree of automation provided by Claude Code in perspective. The project's success thus depended not only on the AI-generated test suite, but also on manual debugging of the infrastructure and effective context management.
AI-assisted test generation is no replacement for a well-thought-out testing concept. But as a tool for building a solid test baseline for a large codebase in a short time, it has clearly proven its worth — provided you approach it in a focused and structured way.