AI agents are powerful — but they will cheat if you let them. Letting the same agent develop and test your application risks one thing: the application no longer fulfills the specification; the agent simply learns to pass the tests. This article shows how to prevent exactly that — and along the way we will also explore a practical use case for an MCP server and how to use custom slash commands in Claude Code.
To understand where this approach fits in, it is worth taking a quick look at the five levels of AI-assisted coding — which, it seems, are being discussed everywhere right now:
Most software developers probably land somewhere between Level 2 and Level 3 right now — and Level 3 is the optimistic estimate. At Level 2, it is worth asking whether AI is actually accelerating development at all. It might feel faster, but code reviews, manual testing, and constant context-switching can easily eat up whatever time the AI saved.
Time to push for Level 4. In my opinion, specification testing with Claude Code is a solid step in that direction.
Setting up the Sample Application
Remember the days when bootstrapping a decent sample application meant hours of scaffolding, config files, and Stack Overflow tabs? Those days are gone. The entire setup for this project was generated with a single prompt in Claude Code — and that prompt itself was produced by an even shorter one, courtesy of Sonnet 4.6.
Let's see if the AI promise holds. For this we use Claude Opus 4.6 — Anthropic's most capable model as of this writing.
💡 Tip: A desktop version of Claude Code is available for macOS, but I'm sticking with the terminal to keep this accessible to everyone — and honestly, there's something satisfying about watching it work there.
Using this specification — essentially a carefully structured prompt — we can generate the complete sample application in one go. The CLAUDE.md file of the coding agent is also available in the repository, along with all other configuration files and scenarios referenced throughout this article.
Five minutes later! Ok, six minutes and 28 seconds later.
A login page, fully wired up with passcode validation — out of the box.
And after entering the correct passcode, our fictional portfolio homepage shows up. This will be our playground for specification testing with Claude Code.
Claude Code and the Playwright MCP Server
Playwright is one of the most popular frameworks today for implementing frontend tests. It automates real browser interactions — navigation, clicks, form input, screenshots — making it the ideal foundation for an AI-powered testing agent.
The Playwright integration for Claude Code comes as an MCP server.
Simply add the following file to the root of your project — right next to the CLAUDE.md file. Or, even easier, simply ask Claude Code to create the file.
```json
// .mcp.json
{
  "mcpServers": {
    "playwright": {
      "type": "stdio",
      "command": "npx",
      "args": ["@playwright/mcp@latest"],
      "env": {}
    }
  }
}
```
One of the many cool things about working with AI tools like Claude Code is being able to simply ask whether something is working.
That does not seem to work — but wait. Some things simply do not change, even in the age of AI. We need to restart Claude Code so it picks up the new file.
After the restart, the MCP server is immediately recognized.
To be on the safe side — and because it's fun — let's ask once more.
That looks promising. We are good to go.
Separation of Concerns — The Testing Agent
This is where the core of the approach lies. It is crucial that the agent testing your application is strictly separated from the agent implementing it.
The reason is straightforward: without separation, the coding agent could start optimizing for your test scenarios rather than your specifications. Both artifacts describe the same behavior but serve opposite purposes: the scenarios define exact steps and expected outcomes, while the specification establishes intent and constraints.
This separation can of course be achieved with two completely separate Git repositories — but for small to medium projects that overhead is rarely justified.
The key mechanism — and the elegant solution to this problem without separate repositories — is the .claudeignore file.
Much like a .gitignore file, it tells Claude Code to ignore certain directories of the project. If the tests are located in a qa/ subdirectory, simply add a .claudeignore with the following content.
```
qa/
```
Now you can set up and start the Claude Code testing agent from the qa/ directory. This gives us exactly what we want: two agents with completely separate contexts.
Let's take a quick look at the directory structure of the testing agent.
```
qa/
├── .mcp.json            # Playwright MCP server config
├── .claude/
│   └── settings.json    # Permission restrictions
├── CLAUDE.md            # QA context and instructions
├── scenarios/
│   ├── SC-001_login.feature
│   └── SC-002_home.feature
└── reports/             # Test output goes here
```
It has its own CLAUDE.md file with instructions for the testing agent. Those will, of course, differ fundamentally from those of the coding agent. The .mcp.json also lives here, as only the testing agent needs access to the Playwright MCP server. The coding agent could have its own MCP configuration, of course.
The settings.json file ensures the testing agent does not traverse upwards in the directory structure and thus accidentally read the source code of the project. The scenario-based tests are meant to be black-box tests.
```json
{
  "permissions": {
    "allow": [
      "Read(qa/**)",
      "Write(qa/reports/**)"
    ],
    "deny": [
      "Read(../src/**)"
    ]
  }
}
```
Let's summarize the separation of concerns in a short table.
| | Coding Agent | Testing Agent |
|---|---|---|
| Started from | project-root/ | qa/ |
| CLAUDE.md | Coding context | QA context |
| Reads | specs/, src/ | qa/scenarios/ only |
| Writes | src/ | qa/reports/ only |
| Tools | File system, shell | Playwright MCP, file system |
| Goal | Implement the spec | Validate the implementation |
| Knows about tests | No access | Yes |
| Knows about source code | Yes | No access |
Without this technical isolation, a coding agent risks no longer fulfilling the specification — but merely passing the tests. The isolation is therefore not just an architectural decision, but a necessity.
Ultimately this gives us two specialized agents — each with a deliberately limited view of the system. Neither agent has the full picture, and that is intentional. The coding agent knows what to build but not how it will be verified. The testing agent knows how to verify but has no knowledge of the implementation details. The only shared artifact is the running application itself, accessed by the testing agent purely through the browser.
This is essentially the AI equivalent of the classic separation between implementation and validation in software engineering — with the added benefit that the separation is enforced technically, not just by convention or discipline.
The Testing Agent
Now that we have a running application and a separate testing agent, we need to give that agent a clearly defined role and set of constraints.
Asking the agent to describe itself produces a reassuring result:
If you are familiar with Claude Code, you already know that this behavior is configured via the CLAUDE.md file. And of course — we are not writing it manually. We give our intent to the model of our choice (here: Sonnet 4.6) and let it do the detailed work. Additional prompts can be used to fine-tune the result.
It is worth reading through this file carefully, as it defines the testing agent's behavior to a large extent.
Three aspects of the agent instructions are worth highlighting:
Report exactly what you observe is the most important instruction for a testing agent. Without it, Claude tends to infer or assume — which is exactly the behavior you want to suppress in a testing agent.
Do not mark a scenario as passed if not all assertions have been explicitly verified — this addresses a subtle LLM tendency to be optimistic about results. The instruction forces a binary, honest outcome.
Session reset between scenarios is explicit because Claude Code won't do this automatically — and shared session state is one of the most common sources of false positives in E2E testing.
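For orientation, the three rules above might appear in the QA CLAUDE.md roughly like this. The wording is illustrative only — the actual file in the repository is more extensive:

```markdown
## Testing Rules

- Report exactly what you observe. Do not infer or assume outcomes.
- Do not mark a scenario as passed unless every assertion has been
  explicitly verified. A scenario is either passed or failed.
- Reset the browser session between scenarios. No cookies, storage,
  or navigation state may leak from one scenario into the next.
```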
The First Scenario
Before taking a look at the scenario definition, let's execute the first scenario directly with the following prompt:
Run all scenarios in qa/scenarios/SC-001_login.feature and write the report to qa/reports/.
Claude Code then starts doing its magic. The browser opens — headless mode is of course an option — and the agent will ask for permission to execute certain actions.
And this is the result:
And this is the scenario file used to execute the tests. It could hardly be simpler, in my opinion.
```gherkin
Feature: Login Page Access Control
  The application must be protected by a code-based login page.
  Only users entering the correct passcode may access protected pages.

  Background:
    Given I navigate to "http://localhost:4321"

  Scenario: Successful login with correct passcode
    Given I am on the login page
    When I enter passcode "1711"
    And I submit the login form
    Then I should be redirected to "/home"
    And I should see the heading "Thomas Jaspers"

  Scenario: Login rejected with incorrect passcode
    Given I am on the login page
    When I enter passcode "0000"
    And I submit the login form
    Then I should remain on the login page
    And I should see an error message
    And the passcode input should be empty

  Scenario: Direct navigation to protected page without session
    Given I have no active session
    When I navigate directly to "/services"
    Then I should be redirected to the login page

  Scenario: Session persists across page navigation
    Given I have logged in with passcode "1711"
    When I navigate to "/services"
    And I navigate to "/contact"
    Then I should not be redirected to the login page
```
Many will know how tedious and error-prone writing frontend tests manually can be. With the agentic approach we simply specify tests in natural language and let the agent do the hard work. And here is the best part: as long as the observable behavior of the application does not change, neither do the scenarios — no matter how much the underlying implementation evolves. A significant relief for anyone who has ever maintained a brittle test suite.
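Part of what makes this workable is that the agent does not need a Gherkin runtime at all — the .feature file is just structured text. A tiny sketch (a hypothetical helper, not part of the project and not how Claude Code actually parses files) shows how little structure it takes to split such a file into scenarios and steps:

```typescript
// Minimal .feature splitter: groups step lines under their scenario titles.
// Purely illustrative of the file format; the agent reads the file as prose.
type Scenario = { title: string; steps: string[] };

function parseFeature(text: string): Scenario[] {
  const scenarios: Scenario[] = [];
  let current: Scenario | null = null;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("Scenario:")) {
      // A new scenario begins; subsequent steps belong to it.
      current = { title: line.slice("Scenario:".length).trim(), steps: [] };
      scenarios.push(current);
    } else if (current && /^(Given|When|Then|And|But)\b/.test(line)) {
      current.steps.push(line);
    }
  }
  return scenarios;
}

const sample = `
Scenario: Successful login with correct passcode
  Given I am on the login page
  When I enter passcode "1711"
  Then I should be redirected to "/home"
`;
console.log(parseFeature(sample));
```

Because the format is this regular, both humans and agents can read the same artifact — which is exactly why it works as a shared contract between specification and validation.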
And there is a more convenient way to trigger test execution.
Repetitive Test Execution with Commands
Next we need a more convenient way to execute tests repeatedly — ideally something anyone on the team can trigger without knowing the full prompt syntax.
Let's create a slash command for this. We add a new file run-scenario.md to qa/.claude/commands/:
```
Execute the test scenarios in the specified feature file using Playwright MCP. Run each scenario independently with a fresh browser session. Write the results to qa/reports/ using the report template from CLAUDE.md.

Arguments: $FEATURE_FILE
```
After restarting Claude Code we can now simply call `/run-scenario SC-001_login.feature`. Clean and repeatable.
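Conceptually, what happens here is simple template substitution: the argument passed after the slash command replaces the placeholder in the command file before the resulting prompt is sent to the model. The mechanic can be sketched like this (illustrative only — this is not Claude Code's actual implementation):

```typescript
// Expand a slash-command template by substituting $PLACEHOLDER variables
// with the arguments supplied on invocation. Unknown placeholders are
// left untouched. Illustrative sketch, not Claude Code internals.
function expandCommand(
  template: string,
  args: Record<string, string>
): string {
  return template.replace(/\$([A-Z_]+)/g, (match: string, name: string) =>
    name in args ? args[name] : match
  );
}

const template =
  "Execute the test scenarios in $FEATURE_FILE using Playwright MCP.";
console.log(expandCommand(template, { FEATURE_FILE: "SC-001_login.feature" }));
```

The command file is thus nothing more than a reusable, parameterized prompt — which is why anyone on the team can trigger a test run without knowing the full prompt syntax.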
Failing Tests
Now let's intentionally introduce a failing test to verify that the testing agent actually catches failures.
For this we add the following scenario to our feature file — one that is deliberately contradictory: a correct passcode should never result in an error message or keep the user on the login page.
```gherkin
Scenario: Intentionally failing
  Given I am on the login page
  When I enter passcode "1711"
  And I submit the login form
  Then I should remain on the login page
  And I should see an error message
  And the passcode input should be empty
```
As expected, the generated report clearly marks it as failed:
💡 Tip Claude Code may not pick up changes to scenario files immediately, as file contents are cached in the session context. The reliable workaround is restarting the agent after modifying scenario files. For a more elegant solution, extend the command with an explicit re-read instruction:
Always re-read the feature file from the project directory before executing any scenarios.
The agent found the failure, documented it precisely, and reported it — without any manual intervention. Exactly what this approach is designed for.
Summary
We have seen how to set up two dedicated Claude Code agents — one for coding, one for testing — and how to structure a project that keeps them strictly isolated from one another. This isolation is the core principle of the approach and rests on three pillars:
- The `.claudeignore` file prevents the coding agent from reading the `qa/` directory and its test scenarios
- The `settings.json` permission restrictions prevent the testing agent from traversing into the production source code
- Dedicated `CLAUDE.md` files give each agent a clearly defined role, scope, and set of constraints
The reason this separation matters: the coding agent must not know how the tests are performed — that knowledge leads to optimization bias rather than genuine specification compliance. The testing agent must not know anything about the implementation — only then can it perform true black-box validation based purely on observable behavior.
We have also seen how the Playwright MCP server gives the testing agent real browser access, how test scenarios written in Gherkin syntax can be executed by the agent as high-level scenarios, and how Claude Code custom slash commands make repetitive test execution as simple as `/run-scenario SC-001_login.feature`.
One thing is certain: scenario-based testing of web applications has never been this accessible.
Outlook
We are living in a fast-paced time — and things will only improve from here. The next generation of models will make existing agent setups more capable without requiring changes to the architecture we have built in this article.
The testing setup can of course be integrated into a CI/CD pipeline — closing the loop in a fully agentic development pipeline.
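What such an integration could look like, sketched as a GitHub Actions job. Everything here is an assumption rather than a tested setup: the action versions, the dev-server command, the port, and the availability of the claude CLI plus an API key on the runner:

```yaml
name: qa-scenarios
on: [pull_request]

jobs:
  scenario-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start the application
        # Assumption: the dev server listens on localhost:4321
        run: npm ci && (npm run dev &) && sleep 10
      - name: Run scenarios with the testing agent
        working-directory: qa
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # Assumption: the claude CLI is installed on the runner;
        # -p runs a single prompt non-interactively.
        run: claude -p "Run all scenarios in scenarios/ and write the reports to reports/"
```

The report files written to qa/reports/ could then be uploaded as build artifacts, giving the team a reviewable test record for every pull request.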
That is the real gain: agents handle the execution, the team makes the decisions. The specification is the bridge between the two.
Blog author
Thomas Jaspers
Senior Software Engineer & AI Enthusiast
Do you still have questions? Just send me a message.