Multimodal
Research papers, repositories, and articles about multimodal AI
huggingface/transformers
The standard library for state-of-the-art models in text, vision, audio, and combined formats. If you build with open models, you almost certainly depend on this already.
STEP3-VL-10B Technical Report
STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.
Kling-Omni Technical Report
Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.
microsoft/VibeVoice
Open-source frontier voice model stack from Microsoft. Aims at natural, low-latency speech AI that builders can inspect and extend.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Fuses a contrastive vision encoder and a self-supervised encoder, then feeds the combined tokens into a language model. Yields stronger results on visual understanding and grounding benchmarks.
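A minimal sketch of the fusion step described above. It assumes each encoder emits a list of token vectors, that both streams are linearly projected to a shared width, and that they are then concatenated along the sequence axis. The function names, toy weights, and the concatenation choice are illustrative assumptions, not the paper's confirmed design.

```python
from typing import List

Vector = List[float]

def project(tokens: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Apply a linear map (weight has shape out_dim x in_dim) to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

def fuse_tokens(contrastive: List[Vector], self_supervised: List[Vector],
                w_c: List[Vector], w_s: List[Vector]) -> List[Vector]:
    """Project both token streams to a shared width, then concatenate them
    along the sequence axis (assumed fusion rule, for illustration only)."""
    return project(contrastive, w_c) + project(self_supervised, w_s)

# Toy inputs: hypothetical encoder outputs with different widths.
contrastive_tokens = [[1.0, 2.0], [3.0, 4.0]]   # 2 tokens, width 2
ssl_tokens = [[1.0, 0.0, 1.0]]                  # 1 token, width 3
w_c = [[1.0, 0.0], [0.0, 1.0]]                  # identity map, 2 -> 2
w_s = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]        # 3 -> 2

fused = fuse_tokens(contrastive_tokens, ssl_tokens, w_c, w_s)
print(len(fused), "tokens would be passed to the language model")
```

The fused sequence holds one entry per input token from either encoder, all at the same width, which is the shape a language model's input layer would expect.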
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.
bytedance/UI-TARS-desktop
UI‑TARS is a full desktop stack for multimodal AI agents, connecting top models with tools, memory, and UI. If you want to ship serious agent apps, this gives you infrastructure instead of starting from scratch.
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
AuditDM trains an "auditor" model that hunts for cases where strong vision-language models disagree. Teams can reuse these hard examples to patch weaknesses without manual labeling.
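To make the idea of disagreement mining concrete, here is a small sketch that flags examples on which several models give different answers. It is a hand-written stand-in, not the trained auditor from the paper; the model names and predictions are hypothetical.

```python
from typing import Dict, List

def find_disagreements(predictions: Dict[str, List[str]]) -> List[int]:
    """Return indices of examples where the models do not all agree."""
    per_example = zip(*predictions.values())
    return [i for i, answers in enumerate(per_example) if len(set(answers)) > 1]

# Hypothetical predictions from two models on the same four examples.
preds = {
    "model_a": ["cat", "dog", "car", "tree"],
    "model_b": ["cat", "wolf", "car", "bush"],
}

hard_indices = find_disagreements(preds)
print(hard_indices)
```

The returned indices are candidates for a "hard example" pool that could be reused for fine-tuning.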
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
WorldCanvas lets you script rich video scenes using text, object trajectories, and reference images instead of frame-by-frame editing. Use this as a template for building controllable world simulators and advanced video tools.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
Gemma 4 on Edge: Running Multimodal AI on Mobile, Raspberry Pi & IoT Devices
Walks through running Gemma 4’s edge models on phones, Pis, and Jetson boards. Covers quantization, latency numbers, and when to stay off the cloud.
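As a rough illustration of what quantization does to model weights, the sketch below applies symmetric per-tensor int8 quantization to a toy weight list. This is a generic textbook scheme chosen for illustration; the article's exact quantization method is not specified here.

```python
from typing import List, Tuple

def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a bounded rounding error.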
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Hugging Face highlights WorldCanvas for its controllable video generation via text, paths, and reference images. Builders of simulation-style UIs should copy the interface ideas here.
Thinking with Images via Self-Calling Agent
Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))
Let ViT Speak: Generative Language-Image Pre-training
Trains a Vision Transformer to predict language tokens directly from image tokens using a standard language-model objective. Removes contrastive tricks and extra decoders while staying competitive on many multimodal benchmarks. If you maintain vision backbones for language models, this is a simpler pretraining recipe to test. ([huggingface.co](https://huggingface.co/papers/2605.00809))
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V teaches vision-language models when to call external tools, not just how. That cuts unnecessary tool calls, reducing costs while often boosting accuracy on vision tasks.
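The "when to call a tool" decision can be pictured as a gate in front of the tool. The sketch below uses a fixed confidence threshold as a simple stand-in for the learned policy; the threshold, function names, and the `zoom_tool` helper are all hypothetical.

```python
from typing import Callable, Tuple

def answer_with_optional_tool(question: str, direct_answer: str, confidence: float,
                              tool: Callable[[str], str],
                              threshold: float = 0.8) -> Tuple[str, bool]:
    """Return (answer, tool_was_called). The tool runs only when the
    model's own confidence falls below the threshold."""
    if confidence >= threshold:
        return direct_answer, False
    return tool(question), True

def zoom_tool(question: str) -> str:
    """Hypothetical external tool, e.g. crop-and-zoom before re-reading."""
    return f"tool-answer for: {question}"

easy = answer_with_optional_tool("What color is the sky?", "blue", 0.95, zoom_tool)
hard = answer_with_optional_tool("Read the tiny label", "unsure", 0.30, zoom_tool)
print(easy, hard)
```

Skipping the tool on confident cases is where the cost saving comes from; a learned policy would replace the fixed threshold.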
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Presents DrivePI, a 4D (3D + time) multimodal large model for autonomous driving that unifies perception, prediction, and planning. Instead of separate stacks, DrivePI treats driving as a holistic spatial-temporal understanding problem, ingesting sensor data and outputting both scene interpretations and future trajectories. It’s another sign that end-to-end or semi-end-to-end ‘driving MLLMs’ are becoming a serious research direction.
Thinking with Images via Self-Calling Agent
Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))
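The control flow of a self-calling agent can be sketched as a main loop that invokes the same model once per atomic visual subtask and keeps only text in its reasoning trace. The stub model, region format, and prompts below are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1) crop in pixels

def run_self_calling(model: Callable[[str, Region], str],
                     subtasks: List[Tuple[str, Region]]) -> List[str]:
    """Call the same model (shared parameters) once per subtask and
    collect the answers as a language-only reasoning trace."""
    trace = []
    for prompt, region in subtasks:
        answer = model(prompt, region)
        trace.append(f"{prompt} -> {answer}")
    return trace

def stub_model(prompt: str, region: Region) -> str:
    """Stand-in for a vision-language model looking at one crop."""
    x0, y0, x1, y1 = region
    return f"looked at {x1 - x0}x{y1 - y0} crop"

trace = run_self_calling(stub_model, [
    ("read sign", (0, 0, 100, 50)),
    ("count cars", (100, 50, 400, 250)),
])
print("\n".join(trace))
```

Because every subagent call returns plain text, the main agent's chain of thought never has to interleave image tokens.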
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Pushes vision-language models toward smaller, cheaper designs built from language-style encoders. Targets strong image+text performance while keeping running costs low.
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Reuses standard causal (one-directional) language models to build bidirectional encoders that can handle text and other signals. Bridges chat models and BERT-style encoders.
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Avatar Forcing combines a diffusion-based generator with a preference-optimization trick to drive talking-head avatars in real time. It reacts to a user’s speech and body motion with low latency, producing more expressive, conversational faces without needing labeled interaction data.
MMhops-R1: Multimodal Multi-hop Reasoning
Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences—over images and text—to reach a final answer, going beyond simple single-hop VQA. As LLMs get better at basic multimodal QA, these kinds of chain-of-thought, multi-hop setups are where reasoning gaps now show up, so having a dedicated resource here is valuable.
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.
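One way to picture "hard, aligned" selection is a filter followed by a ranking. The sketch below assumes each example already carries a difficulty score and an alignment score; both fields, the threshold, and the toy pool are hypothetical.

```python
from typing import Dict, List

def curate(examples: List[Dict], k: int, min_alignment: float = 0.7) -> List[Dict]:
    """Keep examples that are aligned with the target task, then take
    the k hardest of those."""
    aligned = [e for e in examples if e["alignment"] >= min_alignment]
    return sorted(aligned, key=lambda e: e["difficulty"], reverse=True)[:k]

# Hypothetical scored pool.
pool = [
    {"id": "a", "difficulty": 0.90, "alignment": 0.90},
    {"id": "b", "difficulty": 0.95, "alignment": 0.20},  # hard but off-task
    {"id": "c", "difficulty": 0.40, "alignment": 0.80},
    {"id": "d", "difficulty": 0.70, "alignment": 0.75},
]

selected_ids = [e["id"] for e in curate(pool, k=2)]
print(selected_ids)
```

Example "b" is the hardest in the pool but is dropped for poor alignment, which mirrors the point that difficulty alone is not enough.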
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data
Introduces MedInsightBench, a benchmark for ‘analytics agents’ that must reason over multimodal medical data—think tables, images, and reports—to extract multi-step clinical insights rather than just answer single questions. The tasks force agents to chain together retrieval, interpretation, and aggregation across data sources, closer to what real analytics workflows look like in hospitals. This is important if you care about LLM agents that move beyond toy QA into realistic decision support.
Blaizzy/mlx-vlm
Lets you run and customize vision-language models on Apple Silicon using MLX. Great for building local image-and-text assistants on a Mac.
SFTok: Bridging the Performance Gap in Discrete Tokenizers
SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Seedance 1.5 pro jointly generates video and sound from one model rather than bolting audio on later. Content teams can use this to explore tightly synced audio-visual experiences.
Multimodal Large Language Models as Image Classifiers
Tests how large text-and-image models perform when used as plain image classifiers. Shows when they beat or trail classic vision networks.
How Descript enables multilingual video dubbing at scale
Descript uses GPT‑5 series models to balance timing and meaning in dubbed audio. They treat syllable counts and pacing as hard constraints, then layer speech and video generation on top.
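Treating syllable count as a hard constraint can be sketched as a filter over candidate translations. The vowel-group counter below is a crude heuristic that ignores accented vowels, and the target count and tolerance are hypothetical; Descript's actual pipeline is not described at this level of detail.

```python
import re
from typing import List

def count_syllables(text: str) -> int:
    """Crude estimate: count runs of ASCII vowels in each word."""
    return sum(len(re.findall(r"[aeiouy]+", word)) for word in text.lower().split())

def fits_timing(candidates: List[str], target_syllables: int,
                tolerance: int = 1) -> List[str]:
    """Keep only candidates whose syllable count is within tolerance
    of the target, so the dubbed line fits the original timing."""
    return [c for c in candidates
            if abs(count_syllables(c) - target_syllables) <= tolerance]

candidates = ["Hola amigo", "Hola mi muy querido amigo"]
kept = fits_timing(candidates, target_syllables=6)
print(kept)
```

Only candidates that pass the filter would move on to speech generation; meaning is then balanced among the survivors.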
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Defines V‑REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
MiSI-Bench introduces "Microscopic Spatial Intelligence"—the ability to reason about invisible molecular 3D structures—and builds a massive VLM benchmark spanning 163k QA pairs over 4k molecules. Current VLMs lag well behind humans on many tasks, but a tuned 7B model can exceed human performance on some spatial transformations, highlighting both the promise and the need for domain knowledge in scientific AGI.
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
VQRAE introduces a unified visual tokenizer that can simultaneously support high-level multimodal understanding and discrete-token image generation. Building on a pretrained vision encoder and a high-dimensional semantic VQ codebook, it yields continuous semantic features for reasoning and discrete tokens for reconstruction, showing that quantizing semantic encoders with large codebooks can preserve both meaning and detail.
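The step that turns continuous features into discrete tokens is nearest-neighbour lookup in a codebook. The sketch below shows that generic vector-quantization step on toy 2D vectors; the codebook size, dimensions, and values are illustrative and far smaller than anything the paper uses.

```python
from typing import List

Vector = List[float]

def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Replace each feature vector with the index of its nearest code."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(f, codebook[i]))
        for f in features
    ]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]   # continuous encoder outputs

token_ids = quantize(features, codebook)            # discrete tokens
reconstructed = [codebook[i] for i in token_ids]    # decoder-side lookup
print(token_ids)
```

The continuous `features` remain available for understanding tasks, while `token_ids` are what a discrete-token generator would predict.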
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Omni-Attribute is an open-vocabulary attribute encoder that learns to isolate specific visual factors—like style, lighting, or expression—rather than entangling everything into a single holistic embedding. Using curated positive/negative pairs and a dual generative/contrastive objective, it produces attribute-specific embeddings that are better for retrieval, personalization, and compositional image generation.
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
DuetSVG proposes a unified multimodal model that generates both raster images and SVG code jointly, using the image stream to guide SVG token decoding. By letting the model "see" what it’s drawing during generation, it produces vector graphics that are more visually faithful, semantically correct, and syntactically clean than text-only SVG generators.
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
The authors augment multimodal LLMs with a "Video Toolkit" and a STAR (Spatiotemporal Reasoning) framework that orchestrates calls to temporal and spatial tools for video question answering. Instead of treating the video as a black-box embedding, the model actively localizes key regions over time using tools, yielding sizable gains on VideoMME and LongVideoBench when wrapped around GPT-4o.
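The orchestration pattern described above, temporal localization first and spatial localization second, can be sketched with two stub tools. The tool functions, the caption-matching rule, and the frame format are hypothetical and stand in for the paper's Video Toolkit.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]

def temporal_tool(frames: List[Dict], keyword: str) -> List[int]:
    """Return indices of frames whose caption mentions the keyword."""
    return [i for i, f in enumerate(frames) if keyword in f["caption"]]

def spatial_tool(frame: Dict, keyword: str) -> Optional[Box]:
    """Return the box of the first object whose label matches, if any."""
    for obj in frame["objects"]:
        if obj["label"] == keyword:
            return obj["box"]
    return None

def localize(frames: List[Dict], keyword: str) -> List[Tuple[int, Box]]:
    """Narrow down in time first, then in space within the kept frames."""
    hits = []
    for i in temporal_tool(frames, keyword):
        box = spatial_tool(frames[i], keyword)
        if box is not None:
            hits.append((i, box))
    return hits

video = [
    {"caption": "empty street", "objects": []},
    {"caption": "a dog runs", "objects": [{"label": "dog", "box": (10, 20, 50, 60)}]},
    {"caption": "dog barking off screen", "objects": []},
]

evidence = localize(video, "dog")
print(evidence)
```

The resulting (frame, box) pairs are the kind of localized evidence that would be handed back to the language model for answering.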
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
ReViSE defines a new Reason-Informed Video Editing task and benchmark, then introduces a unified video model that edits while continuously self-evaluating its own reasoning. A built-in VLM judges whether the edited video logically satisfies the instruction, providing self-reflective feedback that tightens the link between "understanding" and actual visual edits.