AI agents are powerful — but they will cheat if you let them. Letting the same agent develop and test your application risks one thing: the application no longer fulfills the specification; the agent simply learns to pass the tests. This article shows how to prevent exactly that — and along the way we will also explore a practical use case for an MCP server and how to use custom slash commands in Claude Code.
To understand where this approach fits in, it is worth taking a quick look at the five levels of AI-assisted coding — which, it seems, are being discussed everywhere right now:
Most software developers probably land somewhere between Level 2 and Level 3 right now — and Level 3 is the optimistic estimate. At Level 2, it is worth asking whether AI is actually accelerating development at all. It might feel faster, but code reviews, manual testing, and constant context-switching can easily eat up whatever time the AI saved.
Time to push for Level 4. In my opinion, specification testing with Claude Code is a solid step in that direction.
Setting up the Sample Application
Remember the days when bootstrapping a decent sample application meant hours of scaffolding, config files, and Stack Overflow tabs? Those days are gone. The entire setup for this project was generated with a single prompt in Claude Code — and that prompt itself was produced by an even shorter one, courtesy of Sonnet 4.6.
Let's see if the AI promise holds. For this we use Claude Opus 4.6 — Anthropic's most capable model as of this writing.
💡 Tip: A desktop version of Claude Code is available for macOS, but I'm sticking with the terminal to keep this accessible to everyone — and honestly, there's something satisfying about watching it work there.
Using this specification — essentially a carefully structured prompt — we can generate the complete sample application in one go. The CLAUDE.md file of the coding agent is also available in the repository, along with all other configuration files and scenarios referenced throughout this article.
Five minutes later! Ok, six minutes and 28 seconds later.
A login page, fully wired up with passcode validation — out of the box.
And after entering the correct passcode, our fictional portfolio homepage shows up. This will be our playground for specification testing with Claude Code.
Claude Code and the Playwright MCP Server
Playwright is one of the most popular frameworks today for implementing frontend tests. It automates real browser interactions — navigation, clicks, form input, screenshots — making it the ideal foundation for an AI-powered testing agent.
The Playwright integration for Claude Code comes as an MCP server.
Simply add the following file to the root of your project — right next to the CLAUDE.md file. Or, even easier, simply ask Claude Code to create the file.
```json
// .mcp.json
{
  "mcpServers": {
    "playwright": {
      "type": "stdio",
      "command": "npx",
      "args": ["@playwright/mcp@latest"],
      "env": {}
    }
  }
}
```
One of the many cool things about working with AI tools like Claude Code is being able to simply ask whether something is working.
That does not seem to work — but wait. Some things simply do not change, even in the age of AI. We need to restart Claude Code so it picks up the new file.
After the restart, the MCP server is immediately recognized.
To be on the safe side — and because it's fun — let's ask once more.
That looks promising. We are good to go.
Separation of Concerns — The Testing Agent
This is where the core of the approach lies. It is crucial that the agent testing your application is strictly separated from the agent implementing it.
The reason is straightforward: without separation, the coding agent could start optimizing for your test scenarios rather than your specifications. Both artifacts describe the same behavior but serve opposite purposes: the scenarios define exact steps and expected outcomes, while the specification establishes intent and constraints.
This separation can of course be achieved with two completely separate Git repositories — but for small to medium projects that overhead is rarely justified.
The key mechanism — and the elegant solution to this problem without separate repositories — is the .claudeignore file.
Much like a .gitignore file, it tells Claude Code to ignore certain directories of the project. If the tests are located in a qa/ subdirectory, simply add a .claudeignore with the following content.
```
qa/
```
Now you can set up and start the Claude Code testing agent from the qa/ directory. This gives us exactly what we want: two agents with completely separate contexts.
Let's take a quick look at the directory structure of the testing agent.
```
qa/
├── .mcp.json            # Playwright MCP server config
├── .claude/
│   └── settings.json    # Permission restrictions
├── CLAUDE.md            # QA context and instructions
├── scenarios/
│   ├── SC-001_login.feature
│   └── SC-002_home.feature
└── reports/             # Test output goes here
```
It has its own CLAUDE.md file with instructions for the testing agent. Those will, of course, differ fundamentally from those of the coding agent. The .mcp.json also lives here, as only the testing agent needs access to the Playwright MCP server. The coding agent could have its own MCP configuration, of course.
The settings.json file ensures the testing agent does not traverse upwards in the directory structure and thus accidentally read the source code of the project. The scenario-based tests are meant to be black-box tests.
```json
{
  "permissions": {
    "allow": [
      "Read(qa/**)",
      "Write(qa/reports/**)"
    ],
    "deny": [
      "Read(../src/**)"
    ]
  }
}
```
Let's summarize the separation of concerns in a short table.
| | Coding Agent | Testing Agent |
|---|---|---|
| Started from | project-root/ | qa/ |
| CLAUDE.md | Coding context | QA context |
| Reads | specs/, src/ | qa/scenarios/ only |
| Writes | src/ | qa/reports/ only |
| Tools | File system, shell | Playwright MCP, file system |
| Goal | Implement the spec | Validate the implementation |
| Knows about tests | No access | Yes |
| Knows about source code | Yes | No access |
Without this technical isolation, a coding agent risks no longer fulfilling the specification — but merely passing the tests. The isolation is therefore not just an architectural decision, but a necessity.
Ultimately this gives us two specialized agents — each with a deliberately limited view of the system. Neither agent has the full picture, and that is intentional. The coding agent knows what to build but not how it will be verified. The testing agent knows how to verify but has no knowledge of the implementation details. The only shared artifact is the running application itself, accessed by the testing agent purely through the browser.
This is essentially the AI equivalent of the classic separation between implementation and validation in software engineering — with the added benefit that the separation is enforced technically, not just by convention or discipline.
The Testing Agent
Now that we have a running application and a separate testing agent, we need to give that agent a clearly defined role and set of constraints.
Asking the agent to describe itself produces a reassuring result:
If you are familiar with Claude Code, you already know that this behavior is configured via the CLAUDE.md file. And of course — we are not writing it manually. We give our intent to the model of our choice (here: Sonnet 4.6) and let it do the detailed work. Additional prompts can be used to fine-tune the result.
It is worth reading through this file carefully, as it defines the testing agent's behavior to a large extent.
Three aspects of the agent instructions are worth highlighting:
Report exactly what you observe is the most important instruction for a testing agent. Without it, Claude tends to infer or assume — which is exactly the behavior you want to suppress in a testing agent.
Do not mark a scenario as passed if not all assertions have been explicitly verified — this addresses a subtle LLM tendency to be optimistic about results. The instruction forces a binary, honest outcome.
Session reset between scenarios is explicit because Claude Code won't do this automatically — and shared session state is one of the most common sources of false positives in E2E testing.
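For orientation, the three rules above might appear in the QA CLAUDE.md roughly like this. The wording is illustrative only — the actual file in the repository is more extensive:

```markdown
## Testing Rules

- Report exactly what you observe. Do not infer or assume outcomes.
- Do not mark a scenario as passed unless every assertion has been
  explicitly verified. A scenario is either passed or failed.
- Reset the browser session between scenarios. No cookies, storage,
  or navigation state may leak from one scenario into the next.
```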
The First Scenario
Before taking a look at the scenario definition, let's execute the first scenario directly with the following prompt:
Run all scenarios in qa/scenarios/SC-001_login.feature and write the report to qa/reports/.
Claude Code then starts doing its magic. The browser opens — headless mode is of course an option — and the agent will ask for permission to execute certain actions.
And this is the result:
And this is the scenario file used to execute the tests. It could hardly be simpler, in my opinion.
```gherkin
Feature: Login Page Access Control
  The application must be protected by a code-based login page.
  Only users entering the correct passcode may access protected pages.

  Background:
    Given I navigate to "http://localhost:4321"

  Scenario: Successful login with correct passcode
    Given I am on the login page
    When I enter passcode "1711"
    And I submit the login form
    Then I should be redirected to "/home"
    And I should see the heading "Thomas Jaspers"

  Scenario: Login rejected with incorrect passcode
    Given I am on the login page
    When I enter passcode "0000"
    And I submit the login form
    Then I should remain on the login page
    And I should see an error message
    And the passcode input should be empty

  Scenario: Direct navigation to protected page without session
    Given I have no active session
    When I navigate directly to "/services"
    Then I should be redirected to the login page

  Scenario: Session persists across page navigation
    Given I have logged in with passcode "1711"
    When I navigate to "/services"
    And I navigate to "/contact"
    Then I should not be redirected to the login page
```
Many will know how tedious and error-prone writing frontend tests manually can be. With the agentic approach we simply specify tests in natural language and let the agent do the hard work. And here is the best part: as long as the observable behavior of the application does not change, neither do the scenarios — no matter how much the underlying implementation evolves. A significant relief for anyone who has ever maintained a brittle test suite.
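Part of what makes this workable is that the agent does not need a Gherkin runtime at all — the .feature file is just structured text. A tiny sketch (a hypothetical helper, not part of the project and not how Claude Code actually parses files) shows how little structure it takes to split such a file into scenarios and steps:

```typescript
// Minimal .feature splitter: groups step lines under their scenario titles.
// Purely illustrative of the file format; the agent reads the file as prose.
type Scenario = { title: string; steps: string[] };

function parseFeature(text: string): Scenario[] {
  const scenarios: Scenario[] = [];
  let current: Scenario | null = null;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("Scenario:")) {
      // A new scenario begins; subsequent steps belong to it.
      current = { title: line.slice("Scenario:".length).trim(), steps: [] };
      scenarios.push(current);
    } else if (current && /^(Given|When|Then|And|But)\b/.test(line)) {
      current.steps.push(line);
    }
  }
  return scenarios;
}

const sample = `
Scenario: Successful login with correct passcode
  Given I am on the login page
  When I enter passcode "1711"
  Then I should be redirected to "/home"
`;
console.log(parseFeature(sample));
```

Because the format is this regular, both humans and agents can read the same artifact — which is exactly why it works as a shared contract between specification and validation.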
And there is a more convenient way to trigger test execution.
Repetitive Test Execution with Commands
Next we need a more convenient way to execute tests repeatedly — ideally something anyone on the team can trigger without knowing the full prompt syntax.
Let's create a slash command for this. We add a new file run-scenario.md to qa/.claude/commands/:
```
Execute the test scenarios in the specified feature file using Playwright MCP. Run each scenario independently with a fresh browser session. Write the results to qa/reports/ using the report template from CLAUDE.md.

Arguments: $FEATURE_FILE
```
After restarting Claude Code we can now simply call `/run-scenario SC-001_login.feature`. Clean and repeatable.
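Conceptually, what happens here is simple template substitution: the argument passed after the slash command replaces the placeholder in the command file before the resulting prompt is sent to the model. The mechanic can be sketched like this (illustrative only — this is not Claude Code's actual implementation):

```typescript
// Expand a slash-command template by substituting $PLACEHOLDER variables
// with the arguments supplied on invocation. Unknown placeholders are
// left untouched. Illustrative sketch, not Claude Code internals.
function expandCommand(
  template: string,
  args: Record<string, string>
): string {
  return template.replace(/\$([A-Z_]+)/g, (match: string, name: string) =>
    name in args ? args[name] : match
  );
}

const template =
  "Execute the test scenarios in $FEATURE_FILE using Playwright MCP.";
console.log(expandCommand(template, { FEATURE_FILE: "SC-001_login.feature" }));
```

The command file is thus nothing more than a reusable, parameterized prompt — which is why anyone on the team can trigger a test run without knowing the full prompt syntax.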
Failing Tests
Now let's intentionally introduce a failing test to verify that the testing agent actually catches failures.
For this we add the following scenario to our feature file — one that is deliberately contradictory: a correct passcode should never result in an error message or keep the user on the login page.
```gherkin
Scenario: Intentionally failing
  Given I am on the login page
  When I enter passcode "1711"
  And I submit the login form
  Then I should remain on the login page
  And I should see an error message
  And the passcode input should be empty
```
As expected, the generated report clearly marks it as failed:
💡 Tip Claude Code may not pick up changes to scenario files immediately, as file contents are cached in the session context. The reliable workaround is restarting the agent after modifying scenario files. For a more elegant solution, extend the command with an explicit re-read instruction:
Always re-read the feature file from the project directory before executing any scenarios.
The agent found the failure, documented it precisely, and reported it — without any manual intervention. Exactly what this approach is designed for.
Summary
We have seen how to set up two dedicated Claude Code agents — one for coding, one for testing — and how to structure a project that keeps them strictly isolated from one another. This isolation is the core principle of the approach and rests on three pillars:
- The `.claudeignore` file prevents the coding agent from reading the `qa/` directory and its test scenarios
- The `settings.json` permission restrictions prevent the testing agent from traversing into the production source code
- Dedicated `CLAUDE.md` files give each agent a clearly defined role, scope, and set of constraints
The reason this separation matters: the coding agent must not know how the tests are performed — that knowledge leads to optimization bias rather than genuine specification compliance. The testing agent must not know anything about the implementation — only then can it perform true black-box validation based purely on observable behavior.
We have also seen how the Playwright MCP server gives the testing agent real browser access, how test scenarios written in Gherkin syntax can be executed by the agent as high-level scenarios, and how Claude Code custom slash commands make repetitive test execution as simple as `/run-scenario SC-001_login.feature`.
One thing is certain: scenario-based testing of web applications has never been this accessible.
Outlook
We are living in a fast-paced time — and things will only improve from here. The next generation of models will make existing agent setups more capable without requiring changes to the architecture we have built in this article.
The testing setup can of course be integrated into a CI/CD pipeline — closing the loop in a fully agentic development pipeline.
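What such an integration could look like, sketched as a GitHub Actions job. Everything here is an assumption rather than a tested setup: the action versions, the dev-server command, the port, and the availability of the claude CLI plus an API key on the runner:

```yaml
name: qa-scenarios
on: [pull_request]

jobs:
  scenario-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start the application
        # Assumption: the dev server listens on localhost:4321
        run: npm ci && (npm run dev &) && sleep 10
      - name: Run scenarios with the testing agent
        working-directory: qa
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # Assumption: the claude CLI is installed on the runner;
        # -p runs a single prompt non-interactively.
        run: claude -p "Run all scenarios in scenarios/ and write the reports to reports/"
```

The report files written to qa/reports/ could then be uploaded as build artifacts, giving the team a reviewable test record for every pull request.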
That is the real gain: agents handle the execution, the team makes the decisions. The specification is the bridge between the two.
Blog author
Thomas Jaspers
Senior Software Engineer & AI Enthusiast
Do you still have questions? Just send me a message.