Multimodal AI
Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.
Recent Papers
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Taekyung Ki, Sangwon Jang, Jaehyeong Jo +2 more
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Xiaoyan Cong, Haotian Yang, Angtian Wang +4 more
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Xin Lin, Meixi Song, Dizhe Zhang +6 more
AdaTooler-V: Adaptive Tool-Use for Images and Videos
Chaoyang Wang, Kaituo Feng, Dongyang Chen +8 more
SFTok: Bridging the Performance Gap in Discrete Tokenizers
Qihang Rao, Borui Zhang, Wenzhao Zheng +2 more
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Yuqun Wu, Chih-hao Lin, Henry Che +4 more
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu, Chengzhi Mao, Yaojie Liu +2 more
Next-Embedding Prediction Makes Strong Vision Learners
Sihan Xu, Ziqiao Ma, Wenhao Chai +5 more
Kling-Omni Technical Report
Kling Team, Jialu Chen, Yuanzheng Ci +19 more
EasyV2V: A High-quality Instruction-based Video Editing Framework
Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian +5 more
Recent Milestones
MiniMax’s $619M IPO fuels China’s multimodal push
Chinese AI startup MiniMax Group said on January 8, 2026 that it raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most of the proceeds on AI R&D over the next five years.([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))
OpenAI Plans Real-Time Voice Model and Device
On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.
Qwen-Image-2512 targets Gemini 3 Pro quality
On December 31, 2025, Alibaba’s Qwen team released Qwen-Image-2512, a new open-source text-to-image model update aimed at matching the enterprise-grade image quality of Google’s Gemini 3 Pro Image. The model’s weights, demos, and API access became available via Qwen Chat, Hugging Face, ModelScope, and Alibaba Cloud.
Gemini 2.0 Flash Thinking Released
Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.
Gemini 2.0 Flash Released
Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming Gemini 1.5 Pro at twice the speed.
Sora Public Release
OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.
GPT-4o Released
OpenAI releases GPT-4o with native audio, vision, and text in a single omni model.
Sora Video Generation Preview
OpenAI previews Sora, a text-to-video model capable of generating realistic videos of up to 60 seconds from text prompts.