Representation Learning
Research papers, repositories, and articles about representation learning
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Fuses a contrastive vision encoder with a self-supervised encoder, then feeds the combined tokens into a language model. Yields stronger results on visual understanding and grounding benchmarks.
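To make the fusion step concrete, here is a minimal sketch in plain Python. It assumes the two encoders emit the same number of patch tokens, that fusion is a channel-wise concatenation followed by a linear projection, and that the visual tokens are prepended to the text tokens. None of these details come from the blurb above; the paper's actual fusion mechanism may differ, and all names (`fuse_tokens`, `project`, the dimensions) are hypothetical.

```python
import random

def fuse_tokens(contrastive_tokens, ssl_tokens):
    """Concatenate per-patch features from two encoders along the channel axis.

    Assumption: both encoders produce one token per patch, in the same order.
    """
    if len(contrastive_tokens) != len(ssl_tokens):
        raise ValueError("both encoders must produce the same number of patch tokens")
    return [a + b for a, b in zip(contrastive_tokens, ssl_tokens)]

def project(tokens, weight):
    """Linear projection: each token (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) for j in range(d_out)]
        for tok in tokens
    ]

rng = random.Random(0)
num_patches, d_contrastive, d_ssl, d_model = 4, 3, 5, 6

# Random stand-ins for the outputs of the two (hypothetical) vision encoders.
contrastive = [[rng.gauss(0, 1) for _ in range(d_contrastive)] for _ in range(num_patches)]
ssl = [[rng.gauss(0, 1) for _ in range(d_ssl)] for _ in range(num_patches)]

# Random stand-in for a learned projection into the language model's width.
weight = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_contrastive + d_ssl)]

fused = fuse_tokens(contrastive, ssl)      # shape: num_patches x (d_contrastive + d_ssl)
visual_tokens = project(fused, weight)     # shape: num_patches x d_model

text_tokens = [[0.0] * d_model for _ in range(2)]  # stand-in for embedded prompt tokens
lm_input = visual_tokens + text_tokens             # visual prefix followed by text
```

Concatenating along the channel axis keeps the sequence length fixed at the number of patches; concatenating along the sequence axis would instead double the number of visual tokens the language model has to attend over.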
Recurrent Video Masked Autoencoders
Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go‑to backbone for long-horizon video understanding.
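The idea of reusing temporal state to fill in masked patches can be illustrated with a toy sketch. This is not the paper's architecture: an exponential moving average stands in for the learned recurrent update, each patch is a single number, and the "decoder" simply reads out the carried state. The function name `recurrent_reconstruct` and the `decay` parameter are hypothetical.

```python
def recurrent_reconstruct(frames, masks, decay=0.5):
    """Reconstruct masked patches from a state carried across frames.

    frames: list of frames, each a list of patch values (floats).
    masks:  same shape; True where the patch is hidden from the model.
    Returns one list per frame holding the prediction for each masked patch,
    or None where the patch was visible or no state exists yet.
    """
    num_patches = len(frames[0])
    state = [None] * num_patches  # temporal state, one slot per patch position
    outputs = []
    for frame, mask in zip(frames, masks):
        recon = []
        for p in range(num_patches):
            if mask[p]:
                # Masked patch: predict from the carried state only.
                recon.append(state[p])
            else:
                # Visible patch: nothing to reconstruct, fold it into the state.
                recon.append(None)
                if state[p] is None:
                    state[p] = frame[p]
                else:
                    state[p] = decay * state[p] + (1 - decay) * frame[p]
        outputs.append(recon)
    return outputs

# A static three-frame "video" with three patches per frame.
frames = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
masks = [
    [False, False, True],
    [True, False, False],
    [False, True, True],
]
outputs = recurrent_reconstruct(frames, masks)
```

Because the state is updated once per frame and then reused, the cost grows linearly with clip length, which is the efficiency argument the description makes for recurrence over processing whole clips jointly.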
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Adapts standard causal (left-to-right) language models into bidirectional encoders that can handle text and other modalities. Bridges chat models and BERT-style encoders.
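The difference between a causal model and a bidirectional encoder comes down to which positions each token may attend to. The sketch below shows only that difference in the attention mask, using plain Python and uniform attention scores. It does not reproduce the paper's adaptation procedure, which the description above does not spell out; all function names are hypothetical.

```python
import math

def causal_mask(n):
    """Position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores, giving zero weight where mask is False."""
    out = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

n = 3
scores = [[0.0] * n for _ in range(n)]  # uniform scores, for illustration only

causal_weights = masked_softmax(scores, causal_mask(n))
bidir_weights = masked_softmax(scores, bidirectional_mask(n))
```

Under the causal mask the first token can only attend to itself, while under the bidirectional mask it spreads its attention over the whole sequence, which is what lets an encoder build a representation of each token from both left and right context.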