The Starting Point
When we at codecentric recently took over a codebase from a previous service provider for a client, it quickly became clear that this would be no ordinary challenge. Backends, frontends, batch jobs, services: an application landscape that had grown layer by layer over the years. The central backend alone comprised more than 70 projects and nearly 350,000 lines of production code. On top of that came numerous additional services that completed the overall picture. What awaited us was something many development teams know from their own experience: a codebase that has grown over years, and a test landscape that didn't always keep pace with that growth.
The state of the existing tests was anything but satisfactory. Tests existed, but a significant portion of them no longer ran without errors. Even more problematic was that the tests were neither automated nor regularly executed in the CI/CD pipeline. As a result, the decay of the tests went unnoticed. Our first step was to clean up this situation, repair all failing tests, and establish a stable baseline. The outcome of that work: a line coverage of around 58% with 6,646 tests. Our target was 80%. The goal wasn't to validate the correctness of the existing production code, but first and foremost to lock in the status quo with tests and to better detect unexpected changes during later modifications to the production code.
With several hundred thousand lines of production code, that means writing even more lines of test code. Since we were already using Claude Code as an AI-assisted tool in our day-to-day project work, it was a natural fit to tackle test generation with it.
We applied this approach to a .NET project. Apart from a few minor .NET-specific challenges, our methodology is fundamentally technology-agnostic.
The First Attempt: Thinking Too Big
Our first prompt was kept deliberately simple: Claude should write unit tests for all projects, extend existing test classes, create new ones where none exist yet, and aim for 90% line coverage. We intentionally set the target at 90% to have a buffer and be able to delete poor tests. What happened next was instructive, if in an uncomfortable way.
Prompt:
Write unit tests for all projects in the solution. The goal is to achieve 90% line coverage. Look at the existing projects and project structure and write the tests analogously to the existing unit tests. If tests already exist for the classes being tested, extend the corresponding test class; otherwise create new test files analogously to the other test files. The production code must not be modified.
Claude started by writing its own code to measure line coverage, even though the project already had a method for that. We initially ignored this and assumed the result wouldn't differ significantly. That turned out to be a mistake. Claude generated tests in iterations, each producing less gain than the one before. When a new round produced less than 0.1% additional coverage, we stopped. After several hours of runtime and enormous token consumption, the result was an increase of just around 2%, alongside thousands of newly created tests. The reason wasn't immediately obvious.
The subsequent analysis revealed two root causes. First, Claude had not recognized which code paths were already covered by the existing tests. Instead of closing coverage gaps, it was massively retesting scenarios that were already covered. Second, the context was simply too large. Trying to process a codebase of this dimension in a single run means the AI can neither keep the big picture in view nor write precise tests at the class level. Both factors together led to the sobering result.
Focused and Iterative
We drew our conclusions from these mistakes. Instead of tackling the entire solution at once, we chose a single project with no existing tests at all as our starting point. From there, we created a skill using the skill builder built into Claude Code. Since the conventions we had captured up to that point were either incomplete or not consistently followed, we instructed the skill builder to analyze the existing test projects, derive the corresponding test conventions and patterns, and store them directly in the skill. It independently found our base classes, the frameworks in use (xUnit, FluentAssertions, NSubstitute), the test structure, and the naming conventions. Within this clearly defined context, we refined the skill until the first project had achieved good test coverage. Compared to the first attempt, the result was significantly better: the individual project reached the target line coverage of 80%.
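To illustrate the conventions the skill derived, a generated test might look roughly like this. This is a minimal sketch; the domain types and all names are invented for this example and included only to keep it self-contained.

using System.Collections.Generic;
using FluentAssertions;
using NSubstitute;
using Xunit;

// Invented production types, included only to keep the sketch compilable.
public record Invoice(int Id, decimal Amount);

public interface IInvoiceRepository
{
    IReadOnlyList<Invoice> FindOpen();
}

public class InvoiceService
{
    private readonly IInvoiceRepository _repository;
    public InvoiceService(IInvoiceRepository repository) => _repository = repository;
    public IReadOnlyList<Invoice> GetOpenInvoices() => _repository.FindOpen();
}

// Test in the derived style: xUnit as the runner, NSubstitute for mocking,
// FluentAssertions for assertions, and the naming pattern
// MethodName_WhenCondition_ReturnsSpecificResult.
public class InvoiceServiceTests
{
    private readonly IInvoiceRepository _repository = Substitute.For<IInvoiceRepository>();
    private readonly InvoiceService _sut;

    public InvoiceServiceTests() => _sut = new InvoiceService(_repository);

    [Fact]
    public void GetOpenInvoices_WhenRepositoryIsEmpty_ReturnsEmptyList()
    {
        _repository.FindOpen().Returns(new List<Invoice>());

        var result = _sut.GetOpenInvoices();

        result.Should().BeEmpty();
    }
}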
What followed were four intensive working days with three people and around 15 commits that incrementally improved the skill — not through theoretical deliberation up front, but through real mistakes during real use. Each run revealed something that wasn't working. Every insight immediately flowed back into the skill as a new rule or convention.
Prompt:
Extract all insights from the current session about writing unit tests to achieve the test coverage. Evaluate the insights by relevance for the skill. If the relevance is high enough, add the insight to the skill.
One example is how line coverage is measured in the project and how Claude can determine which lines are not yet covered. The skill became more precise with each application. This is the crucial difference from a one-off prompt: a skill accumulates knowledge. In the end, we had a tool with which we could reliably and reproducibly reach the 80% mark for each new sub-project.
The .NET solution comprised 72 projects in total. Initially the three of us equipped projects with tests one by one, spending roughly 2 to 3 hours per project per person. This time-consuming approach, where we spent most of our time waiting, quickly led us to run multiple agents in parallel.
Parallel Operation: 48 Agents Simultaneously
Once the first projects had been augmented with tests using the skill, we wanted to process the test generation across all projects as efficiently as possible. Since unit tests are by nature independent of other unit tests, this lends itself very well to parallelization. The solution was to use sub-agents from Claude Code. The principle is that there is a main agent that orchestrates a certain number of sub-agents. Using our skill, we instructed the main agent to spawn a separate sub-agent for each test class. Each of these agents had the task of creating unit tests for a specific class. We capped the number of parallel sub-agents at 8. This brought a significant boost in speed.
From the skill:
Always assign individual agents to write tests (one agent per class, model: "Sonnet"). Do NOT assign multiple classes to one agent. Start up to 8 agents in parallel.
Still, we saw further potential. The hardware resources of our laptops were barely being utilized, so why not squeeze out more? Instead of simply adding more agents, we decided to scale the skill itself.
The solution: Git worktrees. Each of our three developers worked simultaneously on two or more projects in separate worktrees, so the processes didn't block each other. In each of these six worktrees, the skill ran with up to 8 parallel agents. That adds up to a peak of 48 agents writing tests simultaneously.
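Setting this up needs nothing beyond standard Git. A sketch with invented branch and directory names:

# One worktree per sub-project, each with its own branch and working
# directory, so parallel Claude Code sessions don't block each other.
git worktree add -b tests/project-a ../codebase-project-a
git worktree add -b tests/project-b ../codebase-project-b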
The Result
What would have kept us occupied manually for months, we achieved as a team of three in four working days: we raised line coverage from 58% to 82% and generated over 16,000 new tests in the process. The path there was not a simple "put in a prompt, get tests out" — it was an iterative process of mistakes, insights, and continuous improvement.
Lessons Learned
Manage Context Deliberately
With large codebases, a single agent accumulates so much context over many iterations that the quality of its output noticeably degrades. The solution: a separate sub-agent is started for each class to be tested, knowing only that one class and terminating after the work is done. The main agent takes on exclusively the controlling role — measuring coverage, selecting the next target class, starting sub-agents, collecting results. This keeps each agent focused, the context clean, and the test quality consistently high.
Controlling Coverage Results
For Claude to review its results and plan the next iteration meaningfully, it needs precise information about where coverage is missing. In the first attempt, we had Claude parse the Cobertura files directly, and we ran into an unexpected problem: the C# compiler internally translates async methods into state machines that are compiled as their own classes. The report generator couldn't handle this and marked large portions of the code as "not coverable." Claude helped us write a small script that cleans the Cobertura files before report generation.
Prompt:
We have a problem with our code coverage measurement: In .runsettings, CompilerGeneratedAttribute is configured as an exclusion. This causes all async method bodies to be completely absent from the coverage report — they are declared as untestable and don't appear at all. The coverage numbers are artificially high as a result, because async code is counted neither as covered nor as uncovered.
The cause: The C# compiler generates state machine classes for async methods (e.g. MyClass.&lt;MyMethod&gt;d__5) and display classes for lambdas (e.g. MyClass.&lt;&gt;c__DisplayClass3_0). These carry the CompilerGeneratedAttribute and are therefore completely excluded.
If you simply remove the exclusion, the lines do appear in the report, but as separate class entries — not under the actual class. This makes the report unusable.
Create a PowerShell script to fix the cobertura files so that async methods appear correctly in the coverage of their respective class.
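For context, this is what that transformation looks like from the developer's side. A minimal sketch with invented names:

using System.Threading.Tasks;

public class ReportService
{
    // The compiler rewrites this method into a nested state machine class,
    // roughly ReportService.<LoadReportAsync>d__0, which carries the
    // CompilerGeneratedAttribute. With that attribute excluded, the lines of
    // the method body disappear from the coverage report entirely.
    public async Task<string> LoadReportAsync()
    {
        await Task.Delay(10);
        return "report";
    }
}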
For context management, we also needed a compact analysis tool: it should parse the generated Cobertura files and output, per C# file, the current coverage and the missing lines, sorted so that the files with the most uncovered lines come first and Claude immediately knows where the greatest leverage lies. Since we couldn't find a suitable tool, we had Claude create this one too.
Prompt:
Create a PowerShell script for better detection of coverage gaps that reads information from the existing Cobertura files beneath the /test-results directory and outputs it in the following compact example format to standard output:
File          Coverage   Lines         Uncovered
Service.cs    8.0%       154 / 1916    133-138, 206-211, ...
Handler.cs    16.9%      284 / 1682    190-196, ...

The files should be sorted descending by number of uncovered lines.
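The tool we actually used was the PowerShell script this prompt produced. To show the idea, here is a minimal sketch of the same logic in C#; the report location and the exact output format are simplified assumptions:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

// Aggregates the <line> entries from every Cobertura report per source file
// and prints coverage plus the uncovered line ranges, biggest gaps first.
class CoverageGaps
{
    static void Main()
    {
        // Assumed layout: Coverlet-style reports beneath ./test-results.
        var reports = Directory.EnumerateFiles("test-results", "*.cobertura.xml", SearchOption.AllDirectories);

        // file -> line number -> covered by at least one report?
        var files = new Dictionary<string, Dictionary<int, bool>>();
        foreach (var cls in reports.SelectMany(r => XDocument.Load(r).Descendants("class")))
        {
            var file = (string)cls.Attribute("filename")!;
            if (!files.TryGetValue(file, out var lines))
                files[file] = lines = new Dictionary<int, bool>();

            foreach (var line in cls.Descendants("line"))
            {
                var number = (int)line.Attribute("number")!;
                var hit = (long)line.Attribute("hits")! > 0;
                lines[number] = lines.TryGetValue(number, out var seen) ? seen || hit : hit;
            }
        }

        foreach (var (file, lines) in files.OrderByDescending(f => f.Value.Count(l => !l.Value)))
        {
            if (lines.Count == 0) continue;
            var covered = lines.Count(l => l.Value);
            var uncovered = lines.Where(l => !l.Value).Select(l => l.Key).OrderBy(n => n).ToList();
            Console.WriteLine($"{Path.GetFileName(file),-20} {100.0 * covered / lines.Count,5:F1}%  {covered} / {lines.Count}  {AsRanges(uncovered)}");
        }
    }

    // Collapses sorted line numbers into compact ranges, e.g. "133-138, 206-211".
    static string AsRanges(List<int> numbers)
    {
        var parts = new List<string>();
        for (var i = 0; i < numbers.Count;)
        {
            var j = i;
            while (j + 1 < numbers.Count && numbers[j + 1] == numbers[j] + 1) j++;
            parts.Add(i == j ? $"{numbers[i]}" : $"{numbers[i]}-{numbers[j]}");
            i = j + 1;
        }
        return string.Join(", ", parts);
    }
}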
Controlling the Tests Themselves
Not all generated tests were immediately convincing in terms of content. Tests — whether written manually or generated — are of little use if they increase line coverage but contain no meaningful assertions. We therefore equipped the skill with additional rules and capabilities, such as using snapshot testing via Verify for verifying method results.
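A minimal sketch of such a snapshot test, with invented names (depending on the Verify version, the test class may additionally need the [UsesVerify] attribute):

using System.Threading.Tasks;
using VerifyXunit;
using Xunit;

public class PriceCalculationTests
{
    [Fact]
    public Task CalculatePrice_ForStandardOrder_MatchesApprovedSnapshot()
    {
        // Stand-in for a real method result; Verify serializes the object and
        // compares it against the approved *.verified.txt file on disk.
        var result = new { Net = 100m, Tax = 19m, Gross = 119m };

        return Verifier.Verify(result);
    }
}

A failing comparison surfaces every deviation in the serialized result, which makes such tests far more meaningful than a bare "no exception was thrown" assertion.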
These rules also cover having Claude recognize "untestable" code and exclude it from test generation. Unit tests should not, for example, test access to infrastructure (DB queries, network calls, etc.), since that is handled through integration tests. Instead, untestable code should be logged for later review.
Choosing the Right Model
Initially we used the Haiku model to generate the tests. It consumed significantly fewer tokens and seemed adequate to start with. The results, however, were not reliable enough. The savings simply didn't justify the additional correction effort. We switched to model: "sonnet", and the difference was immediately noticeable. The quality and consistency of the generated code improved substantially.
Keeping an Eye on Costs
The speed at which tests were generated was impressive — as was, unfortunately, the token consumption. In the end, the API costs alone came to around 1,500 USD. Anyone considering this approach should factor that in early and weigh whether the balance of speed, quality, and cost makes sense for their own project.
Sandboxing
Using Claude in sandboxing mode eliminated the constant need to grant permissions. This considerably simplified the parallel work with multiple sub-agents.
Conclusion
Developing a dedicated skill was the decisive step. It enabled us to reach a coverage target that would simply not have been achievable manually in that timeframe — and to do so reproducibly across a large number of sub-projects.
That said, it would be dishonest to gloss over the weaknesses:
The generated test names are often long and technically heavy (MethodName_WhenCondition_ReturnsSpecificResult) and closely follow the logic of the code under test, which was frequently implemented without regard for readability or testability. To be fair, developers without deep system knowledge and domain expertise would likely have struggled just as much.
Without more precise guidance, Claude tends to generate tests with little meaningful content. Often these tests are limited to verifying that an endpoint can be called without error. The corresponding negative test merely ensures that some exception is thrown, without validating the actual result. Such tests improve coverage on paper but don't contribute to genuine quality assurance. To prevent this, it is necessary to guide Claude with precise rules and to review the generated tests.
A large portion of the tests is nonetheless useful and reliably captures business logic and edge cases. To be able to distinguish generated tests from manually written ones, we annotated them with a custom [AiGeneratedTest] attribute. This makes it possible, during later refactorings, to deliberately decide how to evaluate the tests and what to do with them. A small measure with great practical value.
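The attribute itself is trivial. A minimal sketch of how such a marker can look; the project's actual definition may differ:

using System;

// Marks tests as AI-generated so they can be found and re-evaluated
// during later refactorings.
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class)]
public sealed class AiGeneratedTestAttribute : Attribute
{
}

// Usage on a generated test:
// [Fact]
// [AiGeneratedTest]
// public void GetOpenInvoices_WhenRepositoryIsEmpty_ReturnsEmptyList() { ... }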
The AI did not reach the targeted 80% code coverage autonomously. Manual scripts for cleanup and gap identification were necessary, which puts the degree of automation provided by Claude Code in perspective. The project's success thus depended not only on the AI-generated test suite, but also on manual debugging of the infrastructure and effective context management.
AI-assisted test generation is no replacement for a well-thought-out testing concept. But as a tool for building a solid test baseline for a large codebase in a short time, it has clearly proven its worth — provided you approach it in a focused and structured way.