ArXiv Paper

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

Sankaran Vaidyanathan, David Arbour, Aaron Mueller +2

June 29, 2026

Summary

Shows that standard activation patching measures more than a unit's direct influence: the result also folds in how that unit interacts with many other units. These interaction terms can hide genuinely important neurons or make unimportant ones look important. If you run mechanistic interpretability experiments, the takeaway is to treat patching results with more skepticism. ([arxiv.org](https://arxiv.org/list/cs.LG/new))
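
To make the idea concrete, here is a minimal toy sketch. It is not taken from the paper. The function `toy_model`, its coefficients, and the two units `m1` and `m2` are all hypothetical, chosen only to show how an interaction term can change what patching reports for a single unit.

```python
def toy_model(m1, m2, a=0.0, b=1.0, c=2.0):
    """Hypothetical output: a direct term per unit plus one interaction term."""
    return a * m1 + b * m2 + c * m1 * m2

# Assumed activations of the two units on a clean and a corrupted input.
clean = {"m1": 1.0, "m2": 1.0}
corrupt = {"m1": 0.0, "m2": 0.0}

def patch_effect(base, donor, unit):
    """Change in output when `unit` in the base run takes the donor run's value."""
    patched = dict(base)
    patched[unit] = donor[unit]
    return toy_model(**patched) - toy_model(**base)

# Patch the clean m1 into the corrupted run: m2 is still 0, so the
# interaction term vanishes and only the direct coefficient a = 0 is seen.
denoise_m1 = patch_effect(corrupt, clean, "m1")

# Patch the corrupted m1 into the clean run: m2 is 1, so the measured
# change is -(a + c), which includes the interaction coefficient c.
noise_m1 = patch_effect(clean, corrupt, "m1")

print(denoise_m1, noise_m1)
```

In this toy setup the same unit looks irrelevant in one patching direction and important in the other, purely because of the `c * m1 * m2` term. That is the kind of ambiguity the summary describes, in a deliberately simplified form.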

Topics

interpretability, causality

Related Content

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

HYDRA-X unifies image and video tokenization inside a single vision transformer, then uses that tokenizer to drive a 7B multimodal model. It hits strong scores on both understanding and generation while proposing a better way to edit content by operating in the tokenizer’s latent space instead of inside the language model.