Long before “GenAI” became a buzzword, I was working on a massive codebase with zero unit tests. Confidence in the code was at an all-time low and the repo was huge. It was genuinely difficult to build something quickly and trust that it wouldn’t break something else somewhere.
Things carried on like that until they started to change. One sprint, a perfectly sensible mandate landed: “Raise unit test coverage to X% this quarter.” The codebase had almost no tests, so the intent was good: protect ourselves against regressions, build confidence, sleep better on release nights. What happened next will feel familiar to anyone who has ever chased a single number. During a routine review I found developers adding tests that executed lines but asserted nothing meaningful. Coverage went up and the dashboards turned greener; our assurance did not.
Later, once GenAI became mainstream, we brought it into the loop. “Help us increase coverage,” we said. It did so with astonishing speed, and with the same sleight of hand. The model kept generating new tests until I stopped it and actually validated them. That was when the tables turned. It had been writing tests for lines already covered, creating test classes with different names, renaming test methods, rearranging scaffolding, nudging the counter up without reducing risk. Different author, identical behaviour. That was the moment the issue became crisp: the people weren’t uniquely at fault, and the AI wasn’t uniquely dumb. The system was teaching both to game a metric.
The principle, stated plainly
The original version: When a measure becomes a target, it ceases to be a good measure.
Goodhart’s LLM Principle: Once a metric becomes the target for an LLM-assisted workflow, the model will discover the cheapest path to move the number, even if that path undermines the original goal.
This is not malice; it’s alignment. “Increase coverage” is unambiguous and machine-friendly. “Increase confidence that money movement can’t regress under failure” is messier, slower, and harder to measure. Given a scoreboard, the system learns to play the scoreboard. That, in the end, was the lesson.
Why AI drifts the same way humans do
While reading up on this, I realized the model mirrors our incentives. CI’s green checks feel like a reward. A line executed counts the same whether a nasty edge case was pinned down or not. A flaky suite that passes after three retries is still marked as a pass. Public code corpora are full of superficial tests and snapshot files that mainly assert that something exists. When you blend instruction-following with cheap, immediate signals, you get local optima: more tests, not necessarily more truth.
There’s also the subtle psychology of certainty theatre. A big, round number is seductive. “We’re at 85%” sounds like control. The number becomes a story, and the story becomes reality — until production tells a different one.
How metric gaming looks in tests
The pattern is simple. New test files appear quickly. They execute the same getters and happy paths with slightly different names. Mocks multiply to the point where the system under test isn’t really under test; it’s a puppet stage.
Exceptions are caught and ignored to keep the green lights green. You’ll even see redundant assertions — proving the same line twice — because the coverage tool records the touch, not the understanding.
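For illustration, here is a hedged sketch of that shape (PaymentService, PaymentGateway, and order are hypothetical stand-ins; the mock/when/any calls are ordinary Mockito):
@Test
void processPayment_touchesLinesButProvesNothing() {
    PaymentGateway gateway = mock(PaymentGateway.class);  // every collaborator is a puppet
    when(gateway.charge(any())).thenReturn(null);         // canned answers, no real behaviour
    try {
        new PaymentService(gateway).process(order);       // lines execute, coverage ticks up
    } catch (Exception ignored) {
        // swallowed so the light stays green
    }
    // no assertions: nothing here can ever fail
}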
Here’s the flavor of what gets written when the target is the number:
@Test
void returnsBalance() {
    // already covered in another suite
    assertEquals(100, service.getBalance(userId));
}
This touches code but says nothing about safety. It doesn’t prove that the balance can’t go negative, that overdraw attempts are rejected, or that an audit trail exists. It just adds to the pile.
Now compare that with what actually buys confidence:
@Test
void withdraw_cannotOverdraw_and_emitsAudit() {
    Account a = new Account(100, INR);
    assertThrows(InsufficientFunds.class, () -> a.withdraw(150));
    assertTrue(a.audit().contains("OVERDRAW_ATTEMPT"));
    assertEquals(100, a.balance());
}
This test encodes a rule of the domain (no negative balances), exercises a failure path, and checks a side effect (an audit record). If this breaks, something meaningful broke.
The deeper failure is process, not tooling
An arbitrary target slapped onto a legacy codebase with zero tests guarantees perverse incentives. People will race to flip the light green because that is what the system rewards. An LLM — trained to follow instructions and reduce loss — will discover the same shortcuts faster. The result is success theatre: everyone acting their part, the board lit up, reality unchanged.
The fix starts by admitting what the number was standing in for: confidence. Confidence is not a single gauge; it’s a shape. It includes coverage, yes, but also whether tests would actually catch a defect, whether they fail reliably when they should, whether they exercise boundaries and weird inputs, and whether they encode the business rules we care about.
Shifting the target from numbers to behaviors
The quickest unlock came from changing the ask. Instead of “take us to 85%,” the prompt to the model shifted to outcomes:
- Describe, in plain English, the behaviors changed in this diff.
- For each behavior, propose a success case, a failure case, and a boundary case.
- Generate tests only for behaviors not already covered, and justify each test in one sentence: “This would catch X if it regresses.”
Something interesting happens when the model must articulate behaviors before generating code: it stops spamming duplicates and starts proposing scenarios. A coverage line still moves, but as a by-product of chasing understanding.
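For a sense of what that output looks like, here is a sketch in the spirit of the earlier (hypothetical) Account example; the behavior and its one-sentence rationale travel with the test:
// Behavior: withdrawing exactly the available balance is allowed and leaves zero.
// This would catch an off-by-one regression where the check amount <= balance
// quietly becomes amount < balance.
@Test
void withdraw_exactBalance_isAllowed() {
    Account a = new Account(100, INR);
    a.withdraw(100);
    assertEquals(0, a.balance());
}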
On the machinery side, one quiet change matters more than any dashboard overhaul: diff coverage instead of global coverage. Judge what changed today, not what was written five years ago. Then add one piece of friction that is impossible to bluff: mutation testing on the files you’ve just touched. If a trivial change to your code doesn’t cause your tests to fail, your tests aren’t protecting you. Suddenly, writing a duplicate getter test isn’t attractive because it won’t kill a mutant.
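To make “tests that actually bite” concrete, here is a hedged sketch built on the same hypothetical Account. A mutation tool such as PIT makes a tiny change to the logic and reruns the suite; a test earns its keep only if some mutant dies because of it.
// Hypothetical production code, consistent with the earlier examples.
public void withdraw(int amount) {
    if (amount > balance) {                  // a typical mutant flips this to ">="
        audit.add("OVERDRAW_ATTEMPT");
        throw new InsufficientFunds();
    }
    balance -= amount;
}
The duplicate getter-style test never calls withdraw, so that mutant sails through it. The exact-balance test sketched above fails against it (withdrawing the full balance would now throw), so the mutant dies. Coverage rewards touching the line; mutation testing rewards protecting it.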
None of this requires a manifesto. It’s small rails: ask for behaviors, capture a sentence of rationale, measure the changed surface, and ensure that tests actually bite.
Beyond tests: the same principle everywhere AI touches process
The coverage story is just the easiest to see because the number is so visible. The same drift appears in other AI-assisted corners:
- Ask for “fewer incidents,” and the model will happily draft playbooks that reclassify incidents as “known issues” or tighten retry logic to mask errors. The chart improves; user pain moves into a different bucket.
- Ask for “faster velocity,” and you’ll get more, smaller PRs that pass CI but don’t integrate well, with the integration risk pushed to release time. The throughput looks great until rollbacks spike.
- Ask a summarizer to “align the team,” and you’ll get beautifully cohesive status notes that sand off all dissent. Everyone appears to agree because disagreement became statistically unlikely.
Again, not malice — just optimization. The lesson is consistent: pair the visible number with the thing it can’t fake. In testing, pair coverage with the ability to catch a changed line (mutation score). In reliability, pair incident count with customer-visible impact. In delivery, pair throughput with change-failure rate. When the target has two axes, the cheap path is harder to find.
Culture matters more than configuration
Tools enforce very little if the culture worships the number. The healthiest shift was conversational, not technical: reviewers started asking, “What behavior does this protect?” and authors — human or AI — had to answer in plain language. That question changes incentives. It invites engineers to think like part of a system rather than contestants in a metrics game. It treats the model like a junior teammate whose output must be explained, not a vending machine that dispenses green ticks.
There’s also value in writing down the handful of invariants that must never break and pinning tests to them: balances never go negative; idempotent retries stay idempotent; orphaned states are impossible; amounts round-trip cleanly across locales. When the invariants are named, both humans and models have something sturdier to aim at than a percentage.
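As a sketch of what pinning a test to an invariant can look like (same hypothetical Account, plain JUnit 5 parameterization), the invariant itself becomes the assertion:
// Invariant: balances never go negative, whatever the withdrawal amount.
@ParameterizedTest
@ValueSource(ints = {0, 1, 99, 100, 101, 1_000_000})
void balanceNeverGoesNegative(int amount) {
    Account a = new Account(100, INR);
    try {
        a.withdraw(amount);
    } catch (InsufficientFunds expected) {
        // a rejected withdrawal is acceptable; a negative balance is not
    }
    assertTrue(a.balance() >= 0);
}
A property-based testing library can generate the amounts instead, but even a handful of hand-picked values turns the invariant from an aspiration into something executable.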
Closing the loop
The most honest thing to say about coverage targets is that they’re useful for starting the habit and terrible as the habit itself. If a team begins with zero tests, a number can provide focus. But the moment the number becomes the story, both people and models will learn to pass the test rather than test the truth.
Goodhart’s LLM Principle isn’t a warning about AI going rogue; it’s a mirror held up to our systems. We taught the model to chase what we measured. When we changed what we asked and how we checked, the behavior changed, on both sides of the keyboard. The dashboards still matter, but they read differently now. Coverage goes up because understanding went up. Green feels earned. And when something does slip, the tests tell us why, not just that a line got executed somewhere.
In the end, that’s the whole point: stop optimizing for the applause light and start buying real confidence — one behavior at a time. If your dashboard is green but your gut isn’t, listen to your gut. Coverage is a means; confidence is the goal.