A Prompt Engineering Experiment · BottleneckBench · 2026

One Line.
A Measurable Mind-Shift.

We added a single sentence to a diagnostic prompt. What happened next reveals something surprising about how language models actually think under pressure.

120 RUNS · 20 CASES · 3 MODELS · 2 PROMPTS

Imagine you're an engineer on call at 2 a.m. A cascade failure is eating your infrastructure alive. You open a chat window and type a frantic description of the incident. In the next thirty seconds, an AI either gives you a tight, penetrating analysis — or it gives you a well-formatted list that misses the thing that actually matters.

That gap — between structured output and genuine insight — is exactly what this experiment was designed to measure. And what we found was not what we expected.

The difference wasn't a new model. It wasn't a longer prompt. It wasn't a dozen carefully engineered examples. It was twenty-two words.

"What governs here regardless of improvements elsewhere?"

— The question at the heart of the sentence added to the treatment prompt

That question — slipped into the prompt just before the numbered list — turned out to be a kind of cognitive forcing function. It asked the model to find the thing that is true no matter what else you fix. The binding constraint. The law of the system, not just a symptom of its dysfunction.

And across twenty real-world outage and audit cases, three different AI models, and 120 total evaluation runs, the answer came back: it works.

The Two Prompts, Side by Side

Here's what changed. The two prompts are identical except for one framing sentence — the only addition.

⬡ Baseline Prompt

Analyze this outage case. Identify:

1. the binding constraint,
2. the slow variables,
3. the top 3 interventions you would prioritize,
4. 2 interventions you would explicitly deprioritize,
5. 3 early-warning signals.

Be concrete and case-specific.
★ Treatment Prompt

Analyze this outage case.

Identify the binding constraints and slow variables — What governs here regardless of improvements elsewhere?

Then provide:
1. the binding constraint,
2. the slow variables,
3. the top 3 interventions you would prioritize,
4. 2 interventions you would explicitly deprioritize,
5. 3 early-warning signals.

Be concrete and case-specific.
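The two conditions above can be sketched as a small prompt builder. This is a minimal illustration, not the benchmark's actual harness: the `build_prompt` helper and its layout are assumptions, though the wording is taken verbatim from the prompts shown.

```python
# Shared numbered list used by both conditions (verbatim from above).
CHECKLIST = """\
1. the binding constraint,
2. the slow variables,
3. the top 3 interventions you would prioritize,
4. 2 interventions you would explicitly deprioritize,
5. 3 early-warning signals.

Be concrete and case-specific."""

# The treatment's only addition: one framing sentence before the list.
TREATMENT_FRAME = (
    "Identify the binding constraints and slow variables — "
    "What governs here regardless of improvements elsewhere?"
)

def build_prompt(case_text: str, treatment: bool) -> str:
    """Return the baseline or treatment prompt for one case."""
    if treatment:
        return (
            f"Analyze this outage case.\n\n{TREATMENT_FRAME}\n\n"
            f"Then provide:\n{CHECKLIST}\n\n{case_text}"
        )
    return f"Analyze this outage case. Identify:\n{CHECKLIST}\n\n{case_text}"
```

Running the same case through `build_prompt(case, False)` and `build_prompt(case, True)` gives the exact A/B pair the experiment compares.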

The Numbers Don't Whisper — They Shout

Before diving into what the data shows, a word about what we're measuring. Each response was graded by a judge on a 13-point scale: five binary checks (pass/fail) plus four rubric dimensions scored 0-2. The rubric dimensions — binding constraint accuracy, slow variable recall, intervention quality, and distractor resistance — are the ones that actually test analytical depth.

The binary checks, it turned out, were nearly useless as a signal. Almost every response across every condition passed all five. They're the floor, not the ceiling. The action is entirely in the graded rubric.
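The 13-point scheme described above can be expressed as a small scoring helper. A minimal sketch, assuming the rubric layout as stated (five pass/fail checks plus four 0–2 dimensions); the field names are illustrative, not the benchmark's actual schema.

```python
RUBRIC_DIMS = (
    "binding_constraint_accuracy",
    "slow_variable_recall",
    "intervention_quality",
    "distractor_resistance",
)

def score_response(binary_checks: list[bool], rubric: dict[str, int]) -> dict:
    """Combine five pass/fail checks and four 0-2 rubric dimensions."""
    assert len(binary_checks) == 5, "expected five binary checks"
    assert set(rubric) == set(RUBRIC_DIMS), "unexpected rubric dimensions"
    assert all(0 <= v <= 2 for v in rubric.values()), "rubric scores are 0-2"
    graded = sum(rubric.values())   # 0-8: where the analytical signal lives
    binary = sum(binary_checks)     # 0-5: near-ceiling in practice
    return {"graded": graded, "binary": binary, "total": graded + binary}
```

The split mirrors the finding: the `binary` component is the floor almost everyone clears, while `graded` carries the treatment effect.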

The Treatment Prompt Lifts Every Meaningful Metric
Average graded score (0–8) by condition and model · All cases combined

The story in that chart is not subtle. Across all three models — GPT, Claude Haiku, and Gemini — the treatment prompt produced higher graded scores. The effect was largest for GPT, which climbed from the low sevens to just under eight on average. Gemini showed the most consistent moderate lift. Claude Haiku improved more modestly and started from a relatively strong baseline.

[Stat cards: average graded score, baseline (all models) · average graded score, treatment (all models) · mean lift per run · share of case-model pairs that improved or held]
The Dimension That Moved Most: Slow Variables

Here is the "wait, really?" moment of this dataset.

When you look at the four rubric dimensions individually, three of them are pretty stable between baseline and treatment. Binding constraint accuracy was already near-ceiling for most models. Intervention quality was high across the board. Distractor resistance barely budged.

But slow variable recall — the ability to identify the structural, lagging forces that make a system fragile over time — that dimension jumped.

Think about what that means. The treatment prompt didn't just ask the same question more politely. It primed the model to think about system dynamics before answering. And the thing that got better was precisely the dimension most connected to systems thinking: slow variables. That alignment is hard to write off as coincidence: the prompt targets a specific mode of reasoning, and that mode is exactly what improved.

Slow Variable Recall: The Biggest Mover Across All Models
Average score per rubric dimension (0–2) · Baseline vs. Treatment

Case by Case: Where the Lift Was Real

Twenty cases. Two domains — cloud infrastructure outages and government audit reviews. The pattern held across both, but with some interesting variation.

The hardest cases — the ones where baseline performance was lowest — showed the largest treatment gains. The CLERK CloudSQL outage and the GH Actions incident, where Gemini's baselines were a 9 and an 8 out of 15 respectively, both climbed meaningfully under treatment. The already-strong cases like LARAVEL Cloudflare and ZA AGSA-PFMA were near the ceiling in both conditions; there wasn't much room to improve.

This is actually a healthy sign. If the treatment had uniformly raised everything, we might suspect a generic quality boost from a slightly better-formatted prompt. Instead, the gains are concentrated where analytical depth was genuinely missing — exactly the pattern you'd expect from a structural thinking intervention.

Treatment Gains Are Largest Where Baseline Was Weakest
Total score by case · Each dot = one model's score · Lines connect baseline → treatment for same model

Case-Level Delta Summary

Three Models, Three Personalities

Not every model responded the same way, and that asymmetry is its own story.

GPT showed the strongest and most consistent treatment effect. Its baseline was already competitive, but the treatment prompt unlocked something extra — particularly on the outage cases, which require sharper causal reasoning about technical systems. Under treatment, GPT reached the highest average graded score of any model-condition combination.

Claude Haiku was the most consistent baseline performer on the audit review cases. It knew its domain. The treatment prompt helped it modestly but didn't transform it — which might mean it was already doing some of the systems-thinking work implicitly, or that its ceiling on this rubric is genuinely higher.

Gemini showed the most room for improvement — and the treatment prompt delivered that improvement most reliably on the audit domain. On the outage cases, Gemini's baseline had some notable weak spots (an 8/15 on GH Actions, a 9/15 on CloudSQL). Treatment brought both up. The question worth investigating: is Gemini more sensitive to prompt framing than the other two, or just starting from a lower floor?

GPT Gains the Most Ground; All Three Models Trend Upward
Average total score (0–15) by model and condition

Domain Matters: Outages vs. Audits

The two source domains in this dataset aren't just different topics — they're different cognitive challenges. Outage cases demand rapid causal diagnosis: what broke, why, what's masking the real constraint. Audit reviews demand pattern recognition across slow institutional processes: what governance failure keeps recurring regardless of individual auditor quality.

The treatment prompt's framing — "what governs here regardless of improvements elsewhere?" — maps slightly more naturally onto the audit domain, where structural constraints are often hiding in plain sight behind process language. And indeed, the treatment lift on audit cases was more uniform: almost every case-model pair improved or held steady.

On outage cases, the lift was real but spikier. The cases where models were already strong (LARAVEL, GH Cache) barely moved. The cases where models were struggling (CLERK, GH Actions for Gemini) jumped. The treatment prompt is best understood as a floor-raiser on hard cases, not a ceiling-raiser on easy ones.

The Treatment Prompt Is a Floor-Raiser, Not a Ceiling-Raiser
Distribution of graded scores by condition and domain · Dot = single run

So What Do Twenty-Two Words Actually Do?

The cognitive science here is worth naming explicitly. The treatment prompt does three things that the baseline doesn't.

First, it introduces a meta-frame before the task. By asking the model to think about governing constraints before producing the numbered list, it activates a different reasoning pathway — one that looks for structural necessity rather than salient symptoms. This is the same difference between a doctor who asks "what must be true for all these symptoms to coexist?" and one who pattern-matches to the most recent case they saw.

Second, it raises the standard for binding constraint identification. The phrase "regardless of improvements elsewhere" is a logical filter. It demands counterfactual reasoning: if I fixed everything else, would this constraint still be the bottleneck? That's a harder question than "what's the main problem?" — and harder questions, answered well, yield better analyses.

Third, it primes slow-variable thinking. Slow variables are, by definition, the things that don't show up loudly in incident logs. They're the architectural decisions made three years ago, the monitoring gap that persisted through four team rotations, the governance framework written before the regulation changed. Explicitly invoking them before the structured response makes models less likely to skip past them.

↯ Core Finding

Twenty-two words — one sentence asking about governing constraints — measurably improved analytical depth across 120 runs, three frontier models, two domains, and twenty diverse cases. The improvement was concentrated exactly where you'd hope: in the hardest cases, on the dimensions most connected to systems thinking, with the models most prone to surface-level analysis.

The Honest Caveats

This is a single-run-per-condition experiment. With one run per case-model-condition combination, we can't separate treatment effect from response variance — any individual score might have come out differently on a second run. The confidence intervals around these averages are wide.

What we'd want to confirm: run each case-model-condition at least five times, take means, and check whether the treatment-baseline delta survives the noise. The pattern here is suggestive enough to warrant that investment. The directionality is consistent across nearly every model and case; it's unlikely to be pure variance. But it isn't proven at publication-quality confidence.

One more caveat worth naming: all scores were assigned by a single judge. The judge's confidence ratings (averaging around 0.87) suggest strong signal, but inter-rater agreement would strengthen the conclusion considerably.

What You Should Try Next Week

If you run diagnostic prompts against complex technical or governance cases — outages, incident reviews, audit analyses, root-cause investigations — add this sentence before your numbered list:

★ The Sentence Worth Trying

Identify the binding constraints and slow variables — What governs here regardless of improvements elsewhere?

It costs you nothing. It takes two seconds to add. And across twenty cases, three models, and one hundred twenty evaluations, it consistently produced more structurally rigorous, systems-aware analysis.

The best prompt engineers have always known that the question before the question matters. This experiment quantifies it. The next time your AI gives you a well-formatted list that misses the actual constraint — consider that the problem might not be the model. It might be that nobody told it to look for what's governing the system.

One line. Add it.

· · ·

Appendix: The Full Data

All 120 Runs at a Glance
Every case × model × condition · Color encodes total score · Click column headers to sort