Aadi Behaviour Systems

Where Models Break

March 2026 · 8 min
Dissecting how and where language models fail at chained reasoning.

Here’s something that bothered me.

Ask an AI model: “Can lettuce cause spontaneous abortion?”

It says no. Lettuce is safe during pregnancy.

Now break the question into pieces:

  1. What diseases can contaminated lettuce carry? Listeria.
  2. Can Listeria affect pregnancy? Yes, it can cause miscarriage.

Suddenly the model says yes. It knew about Listeria. It knew about the pregnancy risks. It just couldn’t connect the two on its own.

These are multi-hop reasoning questions, where you need to chain two or more pieces of information to reach an answer. This is a large part of why frontier AI firms are chasing more compute: to make progress on reasoning at all (towards superintelligence of some form), a model needs enough context to construct the logic. Another standard example, “Should I invest in this company?”, requires understanding the market, the financials, and the competitive landscape, then combining all of it into a judgment. Models are surprisingly bad at this kind of chaining, even when they know all the individual pieces. This is by no means a hidden problem; plenty of people are chasing a solution. Most recently, Moonshot released a paper on attention residuals, outlining a mechanism that lets layers attend to earlier representations rather than relying only on standard residual connections, which move information sequentially forward.

I want to probe why models fail by examining how they fail. Are they failing in one way or many? And if many, are the patterns predictable?

What I did

I tested language models on 2,417 multi-hop questions from MuSiQue, a benchmark designed specifically for this kind of evaluation. Each question comes with a gold reasoning chain: the exact steps needed to reach the answer. For example:

Question: “Who founded the company that distributed the film UHF?”

Step 1: What company distributed UHF? Orion Pictures
Step 2: Who founded Orion Pictures? Mike Medavoy

Answer: Mike Medavoy

The questions range from 2-hop to 4-hop, with both chain structures (each step depends on the last) and compositional structures (independent sub-questions that combine at the end).
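To make that structure concrete, here is roughly what one of these records looks like as data. The field names are illustrative, not MuSiQue’s exact schema:

```python
# A MuSiQue-style record, sketched as a plain dict
# (field names are illustrative, not the benchmark's exact schema).
record = {
    "question": "Who founded the company that distributed the film UHF?",
    "decomposition": [
        {"sub_question": "What company distributed UHF?",
         "answer": "Orion Pictures"},
        {"sub_question": "Who founded Orion Pictures?",
         "answer": "Mike Medavoy"},
    ],
    "answer": "Mike Medavoy",
}

# The gold chain is the ordered list of step answers;
# its last element is the final gold answer.
gold_chain = [step["answer"] for step in record["decomposition"]]
```

Everything the failure classifier needs is in that chain: each intermediate entity, in order, ending at the gold answer.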

I ran six conditions across three model scales (1 billion, 3 billion, and 8 billion parameters), for 14,502 total predictions. Then I built a classifier that leverages those gold reasoning chains to determine how each wrong answer went wrong. The classifier is fully rule-based: no AI in the loop, no subjective judgment. It compares the model’s answer to every step in the gold chain and categorises the failure based on where, if anywhere, the answer matches.

I go into the caveats in more detail towards the end, but one worth stating upfront: these results will likely deviate across model families. I used Llama MLX models due to my hardware limitations, and I hope to test this across a variety of models and architectures in the future.

All generations used temperature 0: greedy decoding, fully deterministic. Run the same question twice, you get the same answer. So the patterns I’m describing aren’t sampling noise. They’re consistent behaviours.

The four failure modes

Every incorrect prediction falls into one of four categories.

  • Wrong knowledge: 49.2%
  • Depth blindness: 23.4%
  • Wrong path: 17.9%
  • Extraction failure: 9.5%

Wrong knowledge — 49.2% of failures

The model retrieves an incorrect fact somewhere in the chain. Its answer doesn’t match any step in the gold path: not the final answer, not any intermediate. It’s pulling from the wrong place entirely.

This is the dominant failure mode by a wide margin. Nearly half of all errors are knowledge errors, not reasoning errors.

Depth blindness — 23.4% of failures

The model correctly identifies an entity along the reasoning chain but stops too early and outputs an intermediate answer instead of the final one.

The clearest example: asked “Who is the spouse of Rabbit Hole’s producer?”, the model answers “Nicole Kidman.” That is the producer, the answer to step 1. The actual answer is her spouse, Keith Urban. The model solved the first hop but didn’t take the second.

This is a navigational failure, not a knowledge failure. The model almost certainly knows who Nicole Kidman is married to. It just didn’t realise it needed to keep going.

Wrong path — 17.9% of failures

The model’s answer partially overlaps with the reasoning chain but doesn’t clearly match any step. It’s in the right neighbourhood, with some shared concepts and some keyword overlap, but it took a wrong turn somewhere.

Extraction failure — 9.5% of failures

The model actually has the right answer but in the wrong format. It says “the Latin language” when the gold answer is “Latin,” or produces a full sentence when a short phrase was expected. These aren’t reasoning failures. They’re formatting issues. The model did the hard part correctly.
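The four modes above can be sketched as a simple decision procedure. This is a minimal reconstruction, not the actual classifier: I’m assuming a bag-of-words overlap score and the 0.5 threshold the caveats mention, and the function names are my own.

```python
def overlap(pred: str, gold: str) -> float:
    """Fraction of the gold answer's words that appear in the prediction."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(g) if g else 0.0

def classify(pred: str, gold_chain: list[str], threshold: float = 0.5) -> str:
    """Map one incorrect answer to a failure mode.

    gold_chain lists each step's answer in order; the last element
    is the final gold answer. Assumes pred is already known wrong.
    """
    # Matches the final answer despite being marked wrong: a format issue.
    if overlap(pred, gold_chain[-1]) >= threshold:
        return "extraction_failure"
    # Matches an intermediate step: the model stopped too early.
    for step_answer in gold_chain[:-1]:
        if overlap(pred, step_answer) >= threshold:
            return "depth_blind"
    # Partial overlap with the chain, but no clear match to any step.
    if any(overlap(pred, s) > 0 for s in gold_chain):
        return "wrong_path"
    # No overlap with anything in the chain: a retrieval error.
    return "wrong_knowledge"
```

The check order matters: a match on the final answer has to be tested before the intermediate steps, or an extraction failure could be misread as depth blindness.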

What surprised me

The thing I didn’t expect: the primary bottleneck isn’t reasoning. It’s knowledge.

The conventional narrative is that models struggle to “chain” multiple steps together. And they do: depth blindness is a real and interesting phenomenon. But it’s 23% of failures. Wrong knowledge is 49%. Models fail nearly twice as often from pulling the wrong fact as from being unable to navigate the chain. If you’re trying to improve multi-hop performance (in small models, at least), this matters. Augmenting knowledge through retrieval, tools, or database access would address the largest chunk of failures. Improving reasoning structure would help, but it’s a smaller slice of the problem.

The patterns aren’t random

This is what I find most interesting. These failures are classifiable, predictable, and structured.

Take depth blindness. Among all depth-blind failures, 77% stop exactly one step short of the answer. Not two. Not three. One.

How far short do depth-blind models stop?
  • 1 hop short: 77.1%
  • 2 hops short: 16.1%
  • 3 hops short: 6.9%

This concentration at one-hop-short tells me something mechanical is happening. It’s not random confusion about where the model is in a reasoning chain. It’s a specific failure: the model resolves the entire chain except the final link, then outputs the second-to-last entity as if it were done. The model can’t seem to distinguish “I found something that answers a sub-question” from “I found the final answer.”
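The distance measurement itself is simple to state. A sketch, assuming the same lexical overlap score as the classifier (simplified from the actual code):

```python
def overlap(pred: str, gold: str) -> float:
    """Fraction of the gold answer's words that appear in the prediction."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(g) if g else 0.0

def hops_short(pred: str, gold_chain: list[str], threshold: float = 0.5):
    """How many steps before the final answer did the model stop?

    Returns 0 if the prediction matches the final answer, and None if
    it matches no step at all (a knowledge failure, not depth blindness).
    """
    for i, step_answer in enumerate(gold_chain):
        if overlap(pred, step_answer) >= threshold:
            return len(gold_chain) - 1 - i
    return None
```

On the Rabbit Hole example, “Nicole Kidman” against the chain [“Nicole Kidman”, “Keith Urban”] comes out one hop short, which is the bucket 77% of depth-blind failures land in.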

Why I think this matters

When we report that a model scores 40% on a reasoning benchmark, that single number conceals the structure underneath. It doesn’t tell you whether the model failed because it lacked a fact, stopped reasoning too early, followed the wrong path, or just formatted its answer poorly. These are different problems. They need different fixes.

These insights might be obvious to people more familiar with the space, but I have been thinking deeply about whether reasoning is better described in different modalities, or outside of a scaling context. Whether these problems scale, morph, or disappear in much larger models is something I would like to explore in the near future. But if they are a product of the conventional transformer architecture itself, they might point to a more overarching pattern. Perhaps we should be experimenting more with low-parameter models to see how efficient and effective the process of reasoning can be made.

Caveats

  • Model coverage: These results are from Meta’s Llama family: Llama 3.2 at 1B and 3B, Llama 3.1 at 8B. The 8B comes from a different model generation than the 1B and 3B, so some differences may reflect architecture changes rather than pure scale effects. Other model families may show different distributions.
  • Matching is lexical: The classifier uses word overlap. If the model paraphrases an intermediate answer in completely different words, it gets classified as “wrong knowledge” instead of “depth blind.” The depth blindness rate I’m reporting is likely a conservative lower bound.
  • Single benchmark: All of this is on MuSiQue. The questions are extractive short-answer, fact-based. Whether these patterns generalise to mathematical or open-ended reasoning is an open question.
  • Threshold sensitivity: The line between “wrong path” and “depth blind” depends on a 0.5 word-overlap threshold. Edge cases could go either way.

More on this as I extend the analysis to larger models and different architectures.