The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
The paper shows that standard activation patching conflates a unit's direct influence with its interactions with many other units. These interaction terms can hide genuinely important neurons or make unimportant ones look important. If you run mechanistic interpretability experiments, the takeaway is to treat patching results with more skepticism. ([arxiv.org](https://arxiv.org/list/cs.LG/new))
Sankaran Vaidyanathan, David Arbour
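
To make the interaction claim concrete, here is a minimal toy sketch. It is not the paper's setup or method: the two-mediator `output` function, its weights, and the `patch_effect` helper are all hypothetical, chosen only to show how the measured effect of patching one unit can depend on the state of another unit.

```python
# Toy illustration only (hypothetical model, not taken from the paper).
# Two mediators feed the output additively plus one interaction term.

def output(m1, m2, w1=1.0, w2=1.0, w12=2.0):
    """Toy 'network' output: direct terms plus an m1*m2 interaction."""
    return w1 * m1 + w2 * m2 + w12 * m1 * m2

def patch_effect(unit, source, base):
    """Change in output when `unit` in the base run is overwritten
    with its value from the source run; other units are left alone."""
    patched = dict(base)
    patched[unit] = source[unit]
    return output(**patched) - output(**base)

clean = {"m1": 1.0, "m2": 1.0}    # mediator values on the clean input
corrupt = {"m1": 0.0, "m2": 0.0}  # mediator values on the corrupted input

# Patch clean m1 into the corrupted run (m2 stays at its corrupted value).
denoise_m1 = patch_effect("m1", clean, corrupt)

# Patch corrupted m1 into the clean run (m2 stays at its clean value).
noise_m1 = patch_effect("m1", corrupt, clean)

# If m1 acted alone, the two effects would be equal and opposite.
# The mismatch is exactly the interaction term w12 * delta_m1 * delta_m2.
interaction_gap = -noise_m1 - denoise_m1

print(denoise_m1, noise_m1, interaction_gap)
```

In this toy, the same unit looks three times as important in one patching direction as in the other, purely because of the `w12` term. Setting `w12=0.0` in `output` makes the two effects equal in magnitude, which is the no-interaction case.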