Interpretability

Research papers, repositories, and articles about interpretability

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

Shows that standard activation patching conflates a unit's direct effect with its interactions with many other units. These interaction terms can hide genuinely important neurons or make unimportant ones look important. If you run mechanistic interpretability experiments, the takeaway is to treat patching results with more skepticism. ([arxiv.org](https://arxiv.org/list/cs.LG/new))

Sankaran Vaidyanathan, David Arbour
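
To make the interaction problem described in "The Curse of Multiple Mediators" concrete, here is a minimal toy sketch. It is not the paper's formalism: the two-unit `toy_output` function, its coefficients, and the helper `patch_effect` are all hypothetical, chosen only so the arithmetic is easy to follow. It shows that patching one unit at a time can report zero effect for a unit that matters through an interaction.

```python
# Hypothetical two-unit "model": h1 has a direct effect, and h1 and h2 interact.
def toy_output(h1, h2):
    return 2.0 * h1 + 3.0 * h1 * h2

clean = {"h1": 1.0, "h2": 1.0}    # activations on the clean prompt (assumed)
corrupt = {"h1": 0.0, "h2": 0.0}  # activations on the corrupted prompt (assumed)

def patch_effect(unit, base, donor):
    """Change in output when `unit` in the base run is overwritten with the donor value."""
    patched = dict(base)
    patched[unit] = donor[unit]
    return toy_output(**patched) - toy_output(**base)

# Denoising: patch one clean activation into the corrupted run.
denoise_h1 = patch_effect("h1", corrupt, clean)
denoise_h2 = patch_effect("h2", corrupt, clean)

# Noising: patch one corrupted activation into the clean run (sign flipped so
# a positive number means "the output drops when this unit is corrupted").
noise_h1 = -patch_effect("h1", clean, corrupt)
noise_h2 = -patch_effect("h2", clean, corrupt)

total = toy_output(**clean) - toy_output(**corrupt)
interaction = total - denoise_h1 - denoise_h2

print("denoise:", denoise_h1, denoise_h2)
print("noise:  ", noise_h1, noise_h2)
print("total:", total, "interaction:", interaction)
```

In this toy setup, `h2` scores zero under denoising but a nonzero effect under noising, so the two patching directions disagree about whether it is important. The gap is exactly the interaction term that single-unit patching cannot attribute to either unit alone.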

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

Defines four testable criteria that a "good" internal thought representation should satisfy, independent of task scores. Finds that current models systematically fail these tests. If you probe activations or build latent-thought pipelines, this gives a sharper evaluation target. ([arxiv.org](https://arxiv.org/list/cs.CL/new))

Fahd Seddik, Fatemeh Fard

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

This work shows that popular knowledge-editing methods mostly suppress facts instead of truly removing them. A learned binary mask over edited weights can reliably undo many edits and dramatically weaken new ones. If you rely on model editing for safety, this paper says you should treat "erased" facts as buried, not gone.

Ali Holmov, Paul Youssef

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

The authors find sparse "circuits" inside language models that drive math reasoning and selectively strengthen only those pieces. They report up to 11.4% accuracy gains while touching about 1.6% of model components, keeping other abilities, as measured by benchmarks like MMLU, almost unchanged. ([ar5iv.org](https://ar5iv.org/abs/2512.16914))

Nikhil Prakash, Donghao Ren

Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models

The authors probe how models like Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show specific attention heads reliably track fairness, certainty, and accountability cues, which you can exploit to design more trustworthy systems.

Gerard Yeo, Svetlana Churina

Bi-Orthogonal Factor Decomposition for Vision Transformers

The authors dissect attention in vision transformers into content and position factors using ANOVA and SVD. They show heads specialize into different interaction types and explain why self-supervised models like DINOv2 use attention differently from supervised ones.

Fenil R. Doshi, Thomas Fel

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Studies how mixture-of-experts language models actually route work between experts. Offers tools to inspect which expert fires and why, instead of treating MoE as a black box.

Jeremy Herbst, Jae Hee Lee
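
As a rough illustration of what "inspecting which expert fires" can mean, the sketch below implements generic top-k softmax routing and counts expert usage per token. It is an assumption-laden toy, not the tooling from "The Expert Strikes Back": the router logits, the four-expert setup, and the names `top_k_route` and `router_logits` are all made up for the example.

```python
import math
from collections import Counter

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(logits, k=2):
    """Return [(expert_index, renormalized_weight), ...] for the k highest-scoring experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for three tokens over four experts.
router_logits = {
    "def":    [2.0, 0.1, -1.0, 0.3],
    "return": [1.8, 0.0, -0.5, 0.9],
    "the":    [-1.0, 2.2, 0.4, 0.0],
}

routes = {tok: top_k_route(logits) for tok, logits in router_logits.items()}
usage = Counter(expert for route in routes.values() for expert, _ in route)

for tok, route in routes.items():
    print(tok, [(e, round(w, 3)) for e, w in route])
print("expert usage:", dict(usage))
```

With these made-up logits, the two code-like tokens share the same pair of experts while "the" is routed elsewhere. That kind of per-token routing table is the starting point for asking whether an expert specializes in something.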

PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding

Adapts sparse autoencoders to the "pair" tensors in protein co-folding models by compressing them into token-level features first. Recovers features aligned with biological structure and binding signals. If you care about interpretability beyond plain transformers, this is a useful template. ([arxiv.org](https://arxiv.org/list/cs.LG/new))

Giosue Migliorini, Aristofanis Rontogiannis