Recurrent Video Masked Autoencoders
Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go‑to backbone for long-horizon video understanding.
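A minimal numpy sketch of the core idea as described above: encode only the visible patches of each frame, fold them into a recurrent temporal state, and decode the masked patches from that state. All names, dimensions, the pooled-state update, and the toy weights here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper).
D_PATCH, D_HID, N_PATCHES = 16, 32, 8
MASK_RATIO = 0.75

# Random matrices standing in for learned parameters.
W_enc = rng.normal(0, 0.1, (D_PATCH, D_HID))  # patch encoder
W_rec = rng.normal(0, 0.1, (D_HID, D_HID))    # recurrent state update
W_dec = rng.normal(0, 0.1, (D_HID, D_PATCH))  # patch decoder

def step(frame_patches, state):
    """Process one frame: encode visible patches, update the
    temporal state, and reconstruct masked patches from it."""
    n_vis = int(N_PATCHES * (1 - MASK_RATIO))
    vis_idx = rng.choice(N_PATCHES, n_vis, replace=False)
    masked = np.ones(N_PATCHES, dtype=bool)
    masked[vis_idx] = False

    # MAE-style efficiency: only visible patches pass through the
    # encoder; their pooled code is folded into the recurrent state.
    enc = np.tanh(frame_patches[vis_idx] @ W_enc).mean(axis=0)
    state = np.tanh(state @ W_rec + enc)

    # Decode every masked patch from the shared temporal state,
    # so earlier frames inform the current reconstruction.
    recon = np.tile(state @ W_dec, (masked.sum(), 1))
    loss = np.mean((recon - frame_patches[masked]) ** 2)
    return state, loss

state = np.zeros(D_HID)
video = rng.normal(size=(4, N_PATCHES, D_PATCH))  # 4 toy frames
losses = []
for frame in video:
    state, loss = step(frame, state)
    losses.append(loss)
```

The efficiency argument is visible even in this toy: per frame, the encoder touches only `(1 - MASK_RATIO)` of the patches, and temporal context is carried in a fixed-size state rather than by attending over or convolving across all past frames.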
Daniel Zoran, Nikhil Parthasarathy