Multimodal AI
Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.
Recent Papers
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal +5 more
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Ankan Deria, Komal Kumar, Xilin He +4 more
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef +2 more
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Yitong Chen, Zuxuan Wu, Xipeng Qiu +1 more
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Yakov Pyotr Shkolnikov
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Hila Chefer, Patrick Esser, Dominik Lorenz +5 more
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Boqiang Zhang, Lei Ke, Ruihan Yang +5 more
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen +6 more
Multimodal Large Language Models as Image Classifiers
Nikita Kisel, Illia Volkov, Klara Janouskova +1 more
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Xiang Zhang, Sohyun Yoo, Hongrui Wu +3 more
Recent Milestones
Microsoft MAI Models Cut OpenAI Dependence
On April 2, 2026, Microsoft's MAI division launched three in-house foundation models through its Foundry and MAI Playground platforms: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for speech generation, and MAI-Image-2 for image generation. Follow-up coverage on April 3-4 detailed aggressive pricing that undercuts rival offerings from OpenAI and Google and confirmed the models are already being integrated into Copilot, Teams, Bing, PowerPoint, and Azure Speech.
Krafton Raon: Open Voice + Vision Model Suite
On April 2, 2026, Krafton launched its new AI model brand 'Raon' and released four open-source models on Hugging Face. The suite includes a 9B-parameter speech LLM, a real-time full-duplex speech chat model, a text-to-speech model, and a vision encoder that outperforms Google's SigLIP2 on some tasks.
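For context, the comparison baseline is openly available. Below is a minimal zero-shot classification sketch with the google/siglip2-base-patch16-224 checkpoint via Hugging Face transformers; the Raon announcement does not give exact checkpoint ids, so no Raon repo id is shown, and swapping one in would only change the `from_pretrained` argument.

```python
# Minimal zero-shot classification sketch with SigLIP2, the baseline the
# Raon vision encoder is compared against. The image path and labels are
# placeholders; a Raon checkpoint could be substituted via the repo id.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

repo_id = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(repo_id)
processor = AutoProcessor.from_pretrained(repo_id)

image = Image.open("photo.jpg")  # placeholder image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP models score image-text pairs with a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```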
ByteDance Seedance 2.0 tops AI video league
On February 9-10, 2026, Chinese media highlighted ByteDance's Seedance 2.0, a next-generation AI video model in limited testing that generates multi-shot 2K videos with synchronized audio from text or images in under a minute. Official materials describe a dual-branch diffusion transformer that delivers cinematic, multi-scene narratives about 30% faster than rivals such as Kuaishou's Kling. High-profile creators and game developers called it the "strongest video generation model," triggering a rally in China's media and AI application stocks and sparking debate over copyright and deepfake risks. ([finance.sina.com.cn](https://finance.sina.com.cn/roll/2026-02-09/doc-inhmesxr9395387.shtml))
Single‑prompt AI spins 16‑minute anime episodes
Hong Kong–based LAiPIC announced on February 8, 2026 that its Doratoon platform can automatically generate up to 16 minutes of continuous, story‑driven anime from a single text prompt. The company says Doratoon uses a proprietary visual intelligence engine and a library of 18 million assets to handle scripting, storyboarding, character design, scene rendering, voice acting and music with minimal human input.
Flux 2 small puts open image models on one GPU
On January 17, 2026, German startup Black Forest Labs released Flux 2 small, a pair of image models (9B and 4B parameters) that combine text-to-image generation, editing, and multi-reference generation in a compact architecture. The 4B model runs on consumer GPUs such as the RTX 3090 in around 13 GB of VRAM and is licensed under Apache 2.0 for commercial use, while the larger 9B model is restricted to non-commercial use.
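As a rough illustration of the consumer-GPU claim, here is a minimal diffusers-style sketch, assuming the 4B checkpoint loads through the same FluxPipeline class as earlier Flux releases; the repo id below is a hypothetical placeholder, not a confirmed name.

```python
# Hedged sketch: text-to-image with the 4B Flux 2 small model via diffusers.
# The repo id is a hypothetical placeholder; earlier Flux weights live under
# the black-forest-labs/ namespace and load with FluxPipeline.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-small-4B",  # hypothetical repo id
    torch_dtype=torch.bfloat16,           # halves weight memory vs. fp32
)
pipe.enable_model_cpu_offload()  # helps stay within ~13 GB on an RTX 3090

image = pipe(
    prompt="a lighthouse on a cliff at dawn, oil painting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")
```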
MiniMax’s $619M IPO fuels China multimodal push
Chinese AI startup MiniMax Group said on January 8, 2026 that it raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most of the proceeds on AI R&D over the next five years. ([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))
OpenAI Plans Real-Time Voice Model and Device
On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.
Qwen-Image-2512 targets Gemini 3 Pro quality
On December 31, 2025, Alibaba's Qwen team released Qwen-Image-2512, an update to its open-source text-to-image model aimed at matching the enterprise-grade image quality of Google's Gemini 3 Pro Image. The model's weights, demos, and API access became available via Qwen Chat, Hugging Face, ModelScope, and Alibaba Cloud.
Gemini 2.0 Flash Thinking Released
Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.
Gemini 2.0 Flash Released
Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming Gemini 1.5 Pro at twice the speed.
Sora Public Release
OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.
GPT-4o Released
OpenAI releases GPT-4o with native audio, vision, and text in a single omni model.
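A minimal sketch of what a single omni model means in practice: text and an image go into one Chat Completions request via the OpenAI Python SDK. The prompt and image URL are placeholders.

```python
# Minimal sketch: one request mixing text and image inputs to GPT-4o
# via the OpenAI Python SDK (v1.x). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is unusual about this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```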
Sora Video Generation Preview
OpenAI previews Sora, capable of generating realistic 60-second videos from text.