Interpretability
Research papers, repositories, and articles about interpretability
Reasoning Models Generate Societies of Thought
This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.
The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
Shows that standard activation patching does not isolate a unit's direct influence: the measured effect also includes how that unit interacts with many others. These interaction terms can hide genuinely important neurons or make unimportant ones look important. If you run mechanistic interpretability experiments, treat patching results with more skepticism. ([arxiv.org](https://arxiv.org/list/cs.LG/new))
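A toy illustration of the interaction problem, not taken from the paper: assume a hypothetical two-unit model whose output is `a + b + 2*a*b`. The function and every name below are made up for illustration. Patching unit `a` from its corrupted value to its clean value changes the output by an amount that depends on where unit `b` happens to sit.

```python
def toy_model(a: float, b: float) -> float:
    """Hypothetical two-unit model: two direct terms plus one interaction term."""
    return a + b + 2.0 * a * b


def patch_effect_of_a(a_clean: float, a_corrupt: float, b_context: float) -> float:
    """Output change when only unit `a` is patched, with `b` held at b_context."""
    return toy_model(a_clean, b_context) - toy_model(a_corrupt, b_context)


a_clean, a_corrupt = 1.0, 0.0

# Same patch on `a`, measured in two different contexts for `b`.
effect_b_off = patch_effect_of_a(a_clean, a_corrupt, b_context=0.0)  # direct term only
effect_b_on = patch_effect_of_a(a_clean, a_corrupt, b_context=1.0)   # direct + interaction

# The gap between the two measurements is the contribution of the a*b term.
interaction = effect_b_on - effect_b_off
print(effect_b_off, effect_b_on, interaction)  # 1.0 3.0 2.0
```

In this toy setup the "importance" of `a` triples depending on the context it is patched in, even though its direct contribution never changes.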
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.
Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
Defines four testable criteria that a "good" internal thought representation should satisfy, independent of task scores. Finds that current models systematically fail these tests. If you probe activations or build latent-thought pipelines, this gives a sharper evaluation target. ([arxiv.org](https://arxiv.org/list/cs.CL/new))
One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them
This work shows that popular knowledge-editing methods mostly suppress facts instead of truly removing them. A learned binary mask over edited weights can reliably undo many edits and dramatically weaken new ones. If you rely on model editing for safety, this paper says you should treat "erased" facts as buried, not gone.
Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
The authors find sparse "circuits" inside language models that drive math reasoning and selectively strengthen only those pieces. They report up to 11.4% accuracy gains while touching about 1.6% of model components, keeping other skills like MMLU almost unchanged. ([ar5iv.org](https://ar5iv.org/abs/2512.16914))
Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models
The authors probe how models like Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show specific attention heads reliably track fairness, certainty, and accountability cues, which you can exploit to design more trustworthy systems.
Bi-Orthogonal Factor Decomposition for Vision Transformers
The authors dissect attention in vision transformers into content and position factors using ANOVA and SVD. They show heads specialize into different interaction types and explain why self-supervised models like DINOv2 use attention differently from supervised ones.
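As a rough picture of how an ANOVA-style split and SVD can be combined, here is a generic sketch. It is not the paper's procedure: the score matrix is random, and the rows and columns simply stand in for two arbitrary factors. The function name `anova_svd` and its `rank` parameter are hypothetical.

```python
import numpy as np


def anova_svd(scores: np.ndarray, rank: int = 2):
    """Split a 2-D score matrix into additive effects, then factor the remainder."""
    grand = scores.mean()                                 # overall mean
    row = scores.mean(axis=1, keepdims=True) - grand      # row main effects
    col = scores.mean(axis=0, keepdims=True) - grand      # column main effects
    interaction = scores - grand - row - col              # what the additive model misses
    # SVD of the interaction term gives its dominant low-rank components.
    u, s, vt = np.linalg.svd(interaction, full_matrices=False)
    return grand, row, col, interaction, u[:, :rank], s[:rank], vt[:rank]


rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 5))  # placeholder data, not real attention scores
grand, row, col, interaction, u, s, vt = anova_svd(scores)

# The four pieces add back up to the original matrix.
recon = grand + row + col + interaction
print(np.allclose(recon, scores))
```

By construction the interaction term has zero mean along every row and column, so the SVD factors describe only structure that neither factor explains on its own.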
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.
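For readers unfamiliar with what "which expert fires" means in practice, the sketch below shows a generic softmax top-k router of the kind commonly used in MoE layers, plus a simple per-expert usage count. It is an assumption-laden illustration, not the paper's tooling: the weights are random and `route_top_k` is a hypothetical helper.

```python
import numpy as np


def route_top_k(hidden: np.ndarray, router_weights: np.ndarray, k: int = 2):
    """Return the k highest-probability experts per token and their renormalized gates."""
    logits = hidden @ router_weights                       # (tokens, experts)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    top_idx = np.argsort(-probs, axis=-1)[:, :k]           # indices of the chosen experts
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)      # gates sum to 1 per token
    return top_idx, gates


rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 16, 4
hidden = rng.normal(size=(num_tokens, d_model))            # placeholder token states
router_weights = rng.normal(size=(d_model, num_experts))   # placeholder router matrix

top_idx, gates = route_top_k(hidden, router_weights, k=2)

# How many token slots each expert received: a first, crude routing statistic.
usage = np.bincount(top_idx.ravel(), minlength=num_experts)
print(usage)
```

Counting assignments like this only shows which expert fires; explaining why requires relating the routing decisions back to token content.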
PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding
Adapts sparse autoencoders to the "pair" tensors in protein co-folding models by compressing them into token-level features first. Recovers features aligned with biological structure and binding signals. If you care about interpretability beyond plain transformers, this is a useful template. ([arxiv.org](https://arxiv.org/list/cs.LG/new))