Realtime AI News
Knowing ≠ Steering: Study Reveals Geometric Gap Between Detection and Control Directions in LLMs
New research shows that the direction detecting a behavior in LLM activations differs significantly from the direction that causes it, challenging a key interpretability assumption.
A study published on arXiv tests a foundational assumption of mechanistic interpretability: that the direction which detects a behavior in a model's activations is the same as the direction which controls it. By measuring the angle between detection and control directions, the researchers found they are often significantly misaligned.
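To make the measurement concrete, here is a minimal sketch of computing the angle between two directions in activation space. This summary does not describe how the paper derives either direction, so `detect_dir` and `control_dir` below are hypothetical placeholder vectors, and `angle_between_degrees` is an illustrative helper, not the authors' code.

```python
import numpy as np


def angle_between_degrees(u, v):
    """Return the angle in degrees between two direction vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point drift just outside [-1, 1].
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim)))


# Hypothetical toy vectors standing in for real directions.
# In practice these would live in the model's hidden-state space.
detect_dir = np.array([1.0, 0.0, 0.0])   # assumed: direction a probe reads from
control_dir = np.array([1.0, 1.0, 0.0])  # assumed: direction used for steering

angle = angle_between_degrees(detect_dir, control_dir)
print(f"angle between directions: {angle:.1f} degrees")
```

An angle near 0 degrees would mean the two directions coincide, while an angle near 90 degrees would mean they are close to orthogonal. The toy vectors here are about 45 degrees apart.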
This geometric finding has practical consequences. If the vector that best detects a behavior is not the one that best causes it, then locating where a behavior is represented does not guarantee the ability to modify it. That directly challenges the controllability promise of mechanistic interpretability.
The paper, published on arXiv cs.CL on June 25, 2026, suggests that existing representation-editing approaches to model steering may have fundamental limitations.
Why it matters
Challenges a core assumption of mechanistic interpretability, with implications for model editing and AI safety research.