We added a single sentence to a diagnostic prompt. What happened next reveals something surprising about how language models actually think under pressure.
Imagine you're an engineer on call at 2 a.m. A cascade failure is eating your infrastructure alive. You open a chat window and type a frantic description of the incident. In the next thirty seconds, an AI either gives you a tight, penetrating analysis — or it gives you a well-formatted list that misses the thing that actually matters.
That gap — between structured output and genuine insight — is exactly what this experiment was designed to measure. And what we found was not what we expected.
The difference wasn't a new model. It wasn't a longer prompt. It wasn't a dozen carefully engineered examples. It was seven words.
"What governs here regardless of improvements elsewhere?"
— The single sentence added to the treatment prompt

That question — slipped into the prompt just before the numbered list — turned out to be a kind of cognitive forcing function. It asked the model to find the thing that is true no matter what else you fix. The binding constraint. The law of the system, not just a symptom of its dysfunction.
And across twenty real-world outage and audit cases, three different AI models, and 120 total evaluation runs, the answer came back: it works.
Here's what changed. Everything underlined is identical. The gold text is the only addition.
Before diving into what the data shows, a word about what we're measuring. Each response was graded by a judge on a 13-point scale: five binary checks (pass/fail, one point each) plus four rubric dimensions scored 0-2. The rubric dimensions — binding constraint accuracy, slow variable recall, intervention quality, and distractor resistance — are the ones that actually test analytical depth.
The binary checks, it turned out, were nearly useless as a signal. Almost every response across every condition passed all five. They're the floor, not the ceiling. The action is entirely in the graded rubric.
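In code, the scoring arithmetic works out like this. This is a minimal sketch; the check names and the function are illustrative, not the experiment's actual grading harness:

```python
# Hypothetical sketch of the 13-point scale described above: five
# pass/fail checks worth one point each, plus four rubric dimensions
# graded 0-2 by the judge. Check names are illustrative.
BINARY_CHECKS = ["check_a", "check_b", "check_c", "check_d", "check_e"]
RUBRIC_DIMS = [
    "binding_constraint_accuracy",
    "slow_variable_recall",
    "intervention_quality",
    "distractor_resistance",
]

def total_score(binary: dict, rubric: dict) -> int:
    """Sum binary passes (0 or 1 each) and rubric grades (0-2 each); max = 13."""
    if any(rubric[d] not in (0, 1, 2) for d in RUBRIC_DIMS):
        raise ValueError("rubric grades must be 0, 1, or 2")
    return sum(int(binary[c]) for c in BINARY_CHECKS) + sum(rubric[d] for d in RUBRIC_DIMS)
```

With nearly every response passing all five binary checks, the five "floor" points are effectively constant, which is why the analysis focuses on the 0-8 graded portion.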
The story in that chart is not subtle. Across all three models — GPT, Claude Haiku, and Gemini — the treatment prompt produced higher graded scores. The effect was largest for GPT, which climbed from the low sevens to just under eight on average. Gemini showed the most consistent moderate lift. Claude Haiku improved more modestly and started from a relatively strong baseline.
Here is the "wait, really?" moment of this dataset.
When you look at the four rubric dimensions individually, three of them are pretty stable between baseline and treatment. Binding constraint accuracy was already near-ceiling for most models. Intervention quality was high across the board. Distractor resistance barely budged.
But slow variable recall — the ability to identify the structural, lagging forces that make a system fragile over time — that dimension jumped.
Think about what that means. The treatment prompt didn't just ask the same question more politely. It primed the model to think about system dynamics before answering, and the dimension that improved was precisely the one most connected to systems thinking: slow variables. That alignment between what the prompt primes and what actually improves is exactly what a real mechanism, rather than a generic quality boost, would predict.
Twenty cases. Two domains — cloud infrastructure outages and government audit reviews. The pattern held across both, but with some interesting variation.
The hardest cases — the ones where baseline performance was lowest — showed the largest treatment gains. The CLERK CloudSQL outage and the GH Actions incident, where Gemini's baselines were 4 and 8 respectively, both climbed meaningfully under treatment. The already-strong cases like LARAVEL Cloudflare and ZA AGSA-PFMA were near the ceiling in both conditions; there wasn't much room to improve.
This is actually a healthy sign. If the treatment had uniformly raised everything, we might suspect a generic quality boost from a slightly better-formatted prompt. Instead, the gains are concentrated where analytical depth was genuinely missing — exactly the pattern you'd expect from a structural thinking intervention.
Not every model responded the same way, and that asymmetry is its own story.
GPT showed the strongest and most consistent treatment effect. Its baseline was already competitive, but the treatment prompt unlocked something extra — particularly on the outage cases, which require sharper causal reasoning about technical systems. Under treatment, GPT reached the highest average graded score of any model-condition combination.
Claude Haiku was the most consistent baseline performer on the audit review cases. It knew its domain. The treatment prompt helped it modestly but didn't transform it — which might mean it was already doing some of the systems-thinking work implicitly, or that its ceiling on this rubric is genuinely higher.
Gemini showed the most room for improvement — and the treatment prompt delivered that improvement most reliably on the audit domain. On the outage cases, Gemini's baseline had some notable weak spots (an 8/13 on GH Actions, a 9/13 on CloudSQL). Treatment brought both up. The question worth investigating: is Gemini more sensitive to prompt framing than the other two, or just starting from a lower floor?
The two source domains in this dataset aren't just different topics — they're different cognitive challenges. Outage cases demand rapid causal diagnosis: what broke, why, what's masking the real constraint. Audit reviews demand pattern recognition across slow institutional processes: what governance failure keeps recurring regardless of individual auditor quality.
The treatment prompt's framing — "what governs here regardless of improvements elsewhere?" — maps slightly more naturally onto the audit domain, where structural constraints are often hiding in plain sight behind process language. And indeed, the treatment lift on audit cases was more uniform: almost every case-model pair improved or held steady.
On outage cases, the lift was real but spikier. The cases where models were already strong (LARAVEL, GH Cache) barely moved. The cases where models were struggling (CLERK, GH Actions for Gemini) jumped. The treatment prompt is best understood as a floor-raiser on hard cases, not a ceiling-raiser on easy ones.
The cognitive science here is worth naming explicitly. The treatment prompt does three things that the baseline doesn't.
First, it introduces a meta-frame before the task. By asking the model to think about governing constraints before producing the numbered list, it activates a different reasoning pathway — one that looks for structural necessity rather than salient symptoms. This is the same difference between a doctor who asks "what must be true for all these symptoms to coexist?" and one who pattern-matches to the most recent case they saw.
Second, it raises the standard for binding constraint identification. The phrase "regardless of improvements elsewhere" is a logical filter. It demands counterfactual reasoning: if I fixed everything else, would this constraint still be the bottleneck? That's a harder question than "what's the main problem?" — and harder questions, answered well, yield better analyses.
Third, it primes slow-variable thinking. Slow variables are, by definition, the things that don't show up loudly in incident logs. They're the architectural decisions made three years ago, the monitoring gap that persisted through four team rotations, the governance framework written before the regulation changed. Explicitly invoking them before the structured response makes models less likely to skip past them.
Seven words — one sentence asking about governing constraints — measurably improved analytical depth across 120 runs, three frontier models, two domains, and twenty diverse cases. The improvement was concentrated exactly where you'd hope: in the hardest cases, on the dimensions most connected to systems thinking, with the models most prone to surface-level analysis.
This is a single-run-per-condition experiment. With one run per case-model-condition combination, we can't separate treatment effect from response variance — any individual score might have come out differently on a second run. The confidence intervals around these averages are wide.
What we'd want to confirm: run each case-model-condition at least five times, take means, and check whether the treatment-baseline delta survives the noise. The pattern here is suggestive enough to warrant that investment. The directionality is consistent across nearly every model and case; it's unlikely to be pure variance. But it isn't proven at publication-quality confidence.
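That replication check can be scripted as something like the following permutation test. This is a sketch that assumes you have per-condition lists of mean scores; it is not the experiment's actual analysis code:

```python
# Sketch of the proposed replication check: given repeated-run mean
# scores per case-model pair under each condition, ask how often a
# random relabeling of the pooled scores produces a treatment-baseline
# delta at least as large as the observed one.
import random
from statistics import mean

def mean_delta(treatment, baseline):
    return mean(treatment) - mean(baseline)

def permutation_p_value(treatment, baseline, n_perm=10_000, seed=0):
    """One-sided p-value for the hypothesis that treatment > baseline."""
    rng = random.Random(seed)
    observed = mean_delta(treatment, baseline)
    pooled = list(treatment) + list(baseline)
    n_t = len(treatment)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_delta(pooled[:n_t], pooled[n_t:]) >= observed:
            hits += 1
    return hits / n_perm
```

If the treatment-baseline delta survives this kind of test after five or more runs per condition, the "floor-raiser" story stops being suggestive and starts being solid.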
One more caveat worth naming: all scores were assigned by a single judge. The judge's confidence ratings (averaging around 0.87) suggest strong signal, but inter-rater agreement would strengthen the conclusion considerably.
If you run diagnostic prompts against complex technical or governance cases — outages, incident reviews, audit analyses, root-cause investigations — add this sentence before your numbered list: "What governs here regardless of improvements elsewhere?"
It costs you nothing. It takes two seconds to add. And across twenty cases, three models, and one hundred twenty evaluations, it consistently produced more structurally rigorous, systems-aware analysis.
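Wiring this into a prompt pipeline is a one-liner. The function and marker handling below are hypothetical; adapt them to your own template:

```python
# Minimal sketch: insert the governing-constraint question just before
# the numbered-list instruction in an existing diagnostic prompt.
TREATMENT_SENTENCE = "What governs here regardless of improvements elsewhere?"

def add_treatment(prompt: str, list_marker: str = "1.") -> str:
    """Place the treatment sentence immediately before the numbered list,
    or append it at the end if no list marker is found."""
    head, sep, tail = prompt.partition(list_marker)
    if sep:
        return f"{head}{TREATMENT_SENTENCE}\n\n{sep}{tail}"
    return f"{prompt}\n\n{TREATMENT_SENTENCE}"
```

The point is not the string manipulation; it is that the question lands before the structured-output instruction, so the model frames the constraint hunt first and formats second.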
The best prompt engineers have always known that the question before the question matters. This experiment quantifies it. The next time your AI gives you a well-formatted list that misses the actual constraint — consider that the problem might not be the model. It might be that nobody told it to look for what's governing the system.
One line. Add it.