Multimodal AI
Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.
Recent Papers
NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos +3 more
DFlash: Block Diffusion for Flash Speculative Decoding
Jian Chen, Yesheng Liang, Zhijian Liu
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Yosub Shin, Michael Buriek, Boris Sobolev +5 more
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
Sen Wang, Bangwei Liu, Zhenkun Gao +4 more
Future Optical Flow Prediction Improves Robot Control & Video Generation
Kanchana Ranasinghe, Honglu Zhou, Yu Fang +7 more
HeartMuLa: A Family of Open Sourced Music Foundation Models
Dongchao Yang, Yuxin Xie, Yuguo Yin +19 more
Action100M: A Large-scale Video Action Dataset
Delong Chen, Tejaswi Kasarla, Yejin Bang +6 more
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Xingjun Ma, Yixu Wang, Hengyuan Xu +18 more
STEP3-VL-10B Technical Report
Ailin Huang, Chengyuan Yao, Chunrui Han +19 more
STResNet & STYOLO: A New Family of Compact Classification and Object Detection Models for MCUs
Sudhakar Sah, Ravish Kumar
Recent Milestones
ByteDance Seedance 2.0 tops AI video league
On February 10, 2026, Chinese media reported that ByteDance's Seedance 2.0 AI video generation model, currently in limited testing, is drawing wide praise for creating cinema-grade, multi-scene videos with synchronized audio from text or image prompts. High-profile creators and game developers called it the "strongest video generation model," triggering a rally in China's media and content stocks and sparking debate over copyright and deepfake risks.
ByteDance Seedance 2.0 Raises China's Video-Generation Game
On February 9, 2026, Chinese media highlighted ByteDance’s Seedance 2.0, a next‑generation AI video model that can generate multi‑shot 2K videos with synchronized audio in under a minute. Official materials say the dual‑branch diffusion transformer model delivers cinematic, multi‑scene narratives about 30% faster than rivals like Kuaishou’s Kling, triggering a rally in China’s AI application stocks. ([finance.sina.com.cn](https://finance.sina.com.cn/roll/2026-02-09/doc-inhmesxr9395387.shtml))
Single‑prompt AI spins 16‑minute anime episodes
Hong Kong–based LAiPIC announced on February 8, 2026 that its Doratoon platform can automatically generate up to 16 minutes of continuous, story-driven anime from a single text prompt. The company says Doratoon uses a proprietary visual intelligence engine and a library of 18 million assets to handle scripting, storyboarding, character design, scene rendering, voice acting, and music with minimal human input.
Flux 2 small puts open image models on one GPU
German startup Black Forest Labs released Flux 2 small on January 17, 2026: a pair of open image models (9B and 4B parameters) that combine text-to-image generation, editing, and multi-reference generation in a compact architecture. The 4B model runs on consumer GPUs like the RTX 3090 in around 13 GB of VRAM and is licensed under Apache 2.0 for commercial use, while the larger 9B model carries a non-commercial license.
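That VRAM figure is easy to sanity-check with parameter arithmetic: at two bytes per parameter in bf16, the 4B backbone alone needs roughly 7.5 GB, with the text encoder and activations plausibly accounting for the rest. A minimal back-of-the-envelope sketch; the text-encoder size and overhead terms are illustrative assumptions, not figures published by Black Forest Labs:

```python
# Back-of-the-envelope VRAM estimate for a 4B-parameter image model in bf16.
# The text-encoder size and activation/VAE overhead below are illustrative
# assumptions, not figures published by Black Forest Labs.

BYTES_PER_PARAM_BF16 = 2  # bf16 stores one parameter in two bytes


def weight_vram_gb(params_billions: float) -> float:
    """GB of memory needed just to hold the weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1024**3


backbone_gb = weight_vram_gb(4.0)       # ~7.5 GB for the 4B diffusion backbone
text_encoder_gb = weight_vram_gb(1.5)   # assumed ~1.5B-parameter text encoder
overhead_gb = 2.5                       # assumed activations, VAE, CUDA context

total_gb = backbone_gb + text_encoder_gb + overhead_gb
print(f"backbone ~{backbone_gb:.1f} GB, total ~{total_gb:.1f} GB")
# backbone ~7.5 GB, total ~12.7 GB -- consistent with the reported ~13 GB
```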
MiniMax’s $619M IPO fuels China multimodal push
Chinese AI startup MiniMax Group said on January 8, 2026 that it had raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most of the proceeds on AI R&D over the next five years. ([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))
OpenAI Plans Real-Time Voice Model and Device
On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.
Qwen-Image-2512 targets Gemini 3 Pro quality
On December 31, 2025, Alibaba's Qwen team released Qwen‑Image‑2512, an update to its open-source text‑to‑image model aimed at matching the enterprise‑grade image quality of Google's Gemini 3 Pro Image. The model's weights, demos, and API access became available via Qwen Chat, Hugging Face, ModelScope, and Alibaba Cloud.
Gemini 2.0 Flash Thinking Released
Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.
Gemini 2.0 Flash Released
Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming Gemini 1.5 Pro at twice the speed.
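For a sense of what "native multimodal" means in practice, a single request can interleave text and image parts. A minimal sketch using the google-generativeai Python SDK; the model identifier, API key placeholder, and image path are illustrative assumptions and may not match current availability:

```python
# Minimal multimodal request sketch using the google-generativeai SDK.
# Model name, API key, and image path are placeholders/assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio (assumed)
model = genai.GenerativeModel("gemini-2.0-flash-exp")

# One generate_content call can mix text and image parts in a single prompt.
response = model.generate_content(
    ["Describe what is happening in this image.", Image.open("photo.jpg")]
)
print(response.text)
```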
Sora Public Release
OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.
GPT-4o Released
OpenAI releases GPT-4o with native audio, vision, and text in a single omni model.
Sora Video Generation Preview
OpenAI previews Sora, capable of generating realistic videos of up to 60 seconds from text prompts.