Stop letting AI decide when to run your tests

I built an AI code generation orchestrator where the AI never gets to skip the test suite. Deterministic QA gates, plans in the database, and why prompt engineering won't fix this.

I’ve been using Claude Code and similar tools for a while now. They’re good at writing code. They’re terrible at knowing when their code is correct.

The typical workflow: AI writes code, AI runs the tests (if it feels like it), AI says “looks good” and moves on. Sometimes it runs them. Sometimes it decides the change is “simple enough” to skip. Sometimes it runs them, they fail, and it edits the test to make it pass instead of fixing the actual code.

I got tired of this and built Forge, a tool that takes a different approach: the AI writes code, and then my scripts run. Always. Not “if the AI thinks it’s necessary.” Always.

The problem with letting the model decide

When you give Claude or GPT access to a terminal and tell it to implement a feature, it makes micro-decisions you never see. Should I run npm test? Should I lint? Should I typecheck? The model answers these based on vibes and token budget, not engineering rigor.

I’ve watched Claude skip tests because “the change is straightforward and doesn’t affect existing functionality.” I’ve watched it modify a snapshot test to match the new (broken) output instead of questioning whether the output is correct. I’ve watched it see a test failure and conclude the test was “outdated” rather than that the code was wrong.

This isn’t malicious. The model optimizes for moving forward. But “moving forward” and “shipping correct code” aren’t always the same thing.

How Forge works

Forge is a Next.js 15 app with a structured pipeline:

  1. Pick a task from the plan
  2. Send it to the AI provider (Claude SDK, OpenAI, Claude Code CLI, or a fake provider for testing) with relevant context
  3. AI writes code
  4. Run QA gates: your scripts for lint, typecheck, test, build, whatever
  5. Gates pass? Auto-commit, move to next task
  6. Gates fail? Send failure output back to the AI, go to step 3

The AI never runs the gates itself. It never decides whether to run them. It never sees the gate scripts. It writes code and gets a pass/fail result back.

This is deliberately restrictive. The AI’s job is code generation. My CI scripts decide if the code is acceptable. Same separation of concerns as “developers don’t merge their own PRs.”
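The six steps above can be sketched as a loop. This is a hypothetical illustration, not Forge's actual code: the `Provider` and `Gate` shapes, the retry limit, and the names are all assumptions.

```typescript
// Hypothetical sketch of the execution loop. Provider, Gate, and the
// retry limit are assumptions for illustration, not Forge's real API.
type GateResult = { name: string; passed: boolean; output: string };

interface Provider {
  // Takes a task description (plus failure output from the previous
  // attempt, if any) and applies code changes as a side effect.
  generate(task: string, feedback?: string): Promise<void>;
}

type Gate = { name: string; run: () => Promise<GateResult> };

async function executeTask(
  task: string,
  provider: Provider,
  gates: Gate[],
  maxAttempts = 3,
): Promise<boolean> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await provider.generate(task, feedback); // step 3: AI writes code
    const results: GateResult[] = [];
    for (const gate of gates) results.push(await gate.run()); // step 4: always run
    const failed = results.filter((r) => !r.passed);
    if (failed.length === 0) return true; // step 5: green, caller commits
    // Step 6: the AI only ever sees pass/fail output, never the gate scripts.
    feedback = failed.map((r) => `${r.name} failed:\n${r.output}`).join("\n\n");
  }
  return false;
}
```

The key property is that the gate loop is unconditional: the model's only influence on it is the code it writes.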

Architecture: vertical slices and RTK Query

The codebase uses Next.js 15 with a vertical-slice structure. Each feature (repositories, sessions, plans, qa-gates, dashboard) owns its API routes, UI components, business logic, types, and state:

src/features/
├── repositories/
│   ├── api/         # Next.js API route handlers
│   ├── components/  # React components
│   ├── hooks/       # Custom hooks
│   ├── store/       # RTK Query injected endpoints
│   └── types/
├── sessions/
├── plans/
├── qa-gates/
└── dashboard/

State management uses RTK Query’s code-splitting pattern: the base API defines zero endpoints, and each feature injects its own at runtime:

// features/plans/store/api.ts
const plansApi = baseApi.injectEndpoints({
  endpoints: (build) => ({
    getPlans: build.query<Plan[], string>({ ... }),
    createPlan: build.mutation<Plan, CreatePlanInput>({ ... }),
  }),
});

This keeps features decoupled. The plans feature doesn’t know about the qa-gates feature. They share a base API but inject their own endpoints independently.

Real-time updates during task execution use Server-Sent Events. When a task is running, the dashboard shows live output from the AI and the QA gate results as they come in.
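SSE has a simple line-oriented wire format, which is part of why it fits this use case. Here’s a hypothetical formatter for those messages (not Forge’s actual code, just the standard `event:`/`data:` framing):

```typescript
// Hypothetical helper for the SSE wire format: each message is
// "event: <name>\ndata: <payload>\n\n". Not Forge's actual code.
function formatSSE(event: string, data: unknown): string {
  // Multi-line payloads need one "data:" line per line of text.
  const lines = JSON.stringify(data).split("\n");
  return `event: ${event}\n` + lines.map((l) => `data: ${l}`).join("\n") + "\n\n";
}
```

A route handler then streams these chunks with a `Content-Type: text/event-stream` header, and the browser’s `EventSource` dispatches them by event name.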

Plans live in the database

The other design decision: the implementation plan is stored in a database (SQLite for dev, PostgreSQL for prod, both via Drizzle ORM), not in the AI’s context window.

When you ask Claude to implement a feature in a normal conversation, it builds a mental plan, holds it in context, and works through it. But context is volatile. Long conversations drift. The model forgets earlier decisions or subtly changes the approach mid-implementation. And you can’t inspect the plan, pause it, or resume it tomorrow.

In Forge, the plan is a structured object: phases, tasks, acceptance criteria, dependencies. The AI gets one task at a time with just enough context to execute it. When the gates pass, the task is marked done in the database and the next one loads.

This means you can pause at any point, come back tomorrow, and pick up where you left off. The plan is queryable data, not a conversation. You can look at it, edit it, reorder tasks, add acceptance criteria after the fact.
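A plan shaped like this might look roughly as follows; the field names and the “load next task” query are illustrative assumptions, not Forge’s actual schema:

```typescript
// Hypothetical shape of a stored plan task. Field names are
// illustrative, not Forge's actual database schema.
type TaskStatus = "pending" | "running" | "done" | "failed";

interface PlanTask {
  id: string;
  phase: string;
  description: string;
  acceptanceCriteria: string[];
  dependsOn: string[]; // ids of tasks that must be done first
  status: TaskStatus;
}

// Pick the first pending task whose dependencies are all done.
function nextTask(tasks: PlanTask[]): PlanTask | undefined {
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  return tasks.find(
    (t) => t.status === "pending" && t.dependsOn.every((d) => done.has(d)),
  );
}
```

Because it’s plain data, “resume tomorrow” is just running that query again.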

Provider abstraction

Forge supports four providers through a common interface:

  • Claude SDK: Direct API calls to Anthropic
  • OpenAI: Direct API calls to OpenAI
  • Claude Code CLI: Invokes the Claude Code CLI as a subprocess
  • Fake: Returns predefined responses for testing

Swapping providers is an environment variable change. I used this to compare output quality between models on the same tasks with the same QA gates. The results were interesting but that’s a topic for another post.

The fake provider was essential during development. I could test the entire pipeline (plan execution, gate running, commit logic, retry behavior) without spending API credits.
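A common interface plus a fake provider might look like this. This is a sketch based on the description above; the interface shape and names are assumptions:

```typescript
// Hypothetical provider interface; the four concrete providers
// (Claude SDK, OpenAI, Claude Code CLI, fake) would each implement it.
interface CodeProvider {
  readonly name: string;
  generate(prompt: string): Promise<string>;
}

// The fake provider replays canned responses in order (repeating the
// last one), so the whole pipeline runs without spending API credits.
class FakeProvider implements CodeProvider {
  readonly name = "fake";
  private i = 0;
  constructor(private responses: string[]) {}
  async generate(_prompt: string): Promise<string> {
    return this.responses[Math.min(this.i++, this.responses.length - 1)];
  }
}
```

With everything behind one interface, the environment variable only has to select which implementation gets constructed at startup.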

The meta moment

Around the midpoint of development, I pointed Forge at its own repo and let it implement features with QA gates running after each step. The code that came out was measurably better than what I got from freeform AI coding sessions, because every change had to pass lint + typecheck + test + build before it could be committed.

I’ll be honest: the early parts of Forge were “vibe coded” (AI with minimal oversight). The later parts, built with Forge, are cleaner. The difference is the QA gates forcing the model to produce code that actually passes the test suite, not just code that looks right.

It’s not about better prompts

I see a lot of effort going into making AI “more careful” through prompting. “Please be thorough.” “Always run tests.” “Double-check your work.” This is asking the model to be something it structurally isn’t. It’s a next-token predictor. It doesn’t have discipline or habits. It responds to incentives in the conversation, and “skip the tests and move on” is often the locally optimal response.

Deterministic gates aren’t a prompt technique. They’re an architecture decision. The AI structurally cannot ship untested code because the pipeline won’t advance without green gates.

Forge isn’t public yet, but the core idea is simple enough: put a script between the AI and the commit. Run it every time. Don’t make it optional.