
Multimodal AI

Growing: 56%

Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.

Tags: vision-language, GPT-4V, Gemini, CLIP, multimodal, video understanding, audio
89 Papers · 16 Milestones · $1.0B Funding · 2 Benchmarks

Key Benchmarks

Video-MME

Comprehensive video understanding benchmark

Best: 84.8% (Human: 92%)
Leader: Gemini 2.5 Pro · medium saturation

MMMU

Massive Multi-discipline Multimodal Understanding benchmark

Best: 82.9% (Human: 88.6%)
Leader: o3 · medium saturation


Recent Milestones

Microsoft debuts in-house MAI speech & image AIs

On April 4, 2026, Tech Insider detailed how Microsoft has launched three in‑house MAI models — MAI‑Transcribe‑1, MAI‑Voice‑1 and MAI‑Image‑2 — following their April 2 release. The models target speech recognition, voice generation and image creation, and are being rolled out via Microsoft’s Foundry and MAI Playground platforms.

Apr 4, 2026 · release · Impact: 70/100

Microsoft MAI Models Cut OpenAI Dependence

On April 2–3, 2026, Microsoft’s MAI division rolled out three in‑house foundational models—MAI‑Transcribe‑1 for speech‑to‑text, MAI‑Voice‑1 for speech generation and MAI‑Image‑2 for image generation—through its Foundry and MAI Playground platforms. Coverage on April 3 details aggressive pricing that undercuts rival cloud providers and confirms the models are already being integrated into Copilot, Teams, Bing, PowerPoint and Azure Speech.

Apr 3, 2026 · release · Impact: 70/100

Krafton Raon: Open Voice + Vision Model Suite

On April 2, 2026, Krafton launched its new AI model brand "Raon" and released four open-source models on Hugging Face. The suite includes a 9B-parameter speech LLM, a real-time full-duplex speech chat model, a text-to-speech model, and a vision encoder that on some tasks outperforms Google's SigLIP2.

Apr 2, 2026 · release · Impact: 70/100

Microsoft MAI launches cheaper AI model stack

On April 2, 2026, Microsoft AI unveiled three new foundation models for speech transcription, voice generation, and image generation: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. The models are available immediately through Microsoft Foundry and MAI Playground, with pricing pitched as cheaper than rival offerings from OpenAI and Google.

Apr 2, 2026 · release · Impact: 70/100

ByteDance Seedance 2.0 tops AI video league

On February 10, 2026, Chinese media reported that ByteDance's Seedance 2.0 AI video generation model, currently in limited testing, is being widely praised for creating movie-level, multi-scene videos with synchronized audio from text or images. High-profile creators and game developers called it the "strongest video generation model," triggering a rally in China's media and content stocks and sparking debate over copyright and deepfake risks.

Feb 9, 2026 · release · Impact: 80/100

ByteDance Seedance 2.0 Ups China’s Video Game

On February 9, 2026, Chinese media highlighted ByteDance’s Seedance 2.0, a next‑generation AI video model that can generate multi‑shot 2K videos with synchronized audio in under a minute. Official materials say the dual‑branch diffusion transformer model delivers cinematic, multi‑scene narratives about 30% faster than rivals like Kuaishou’s Kling, triggering a rally in China’s AI application stocks. ([finance.sina.com.cn](https://finance.sina.com.cn/roll/2026-02-09/doc-inhmesxr9395387.shtml))

Feb 9, 2026 · release · Impact: 80/100

Single‑prompt AI spins 16‑minute anime episodes

On February 8, 2026, Hong Kong-based LAiPIC announced that its Doratoon platform can automatically generate up to 16 minutes of continuous, story-driven anime from a single text prompt. The company says Doratoon uses a proprietary visual intelligence engine and a library of 18 million assets to handle scripting, storyboarding, character design, scene rendering, voice acting, and music with minimal human input.

Feb 8, 2026 · release · Impact: 80/100

Flux 2 small puts open image models on one GPU

On January 17, 2026, German startup Black Forest Labs released Flux 2 small, a pair of image models (9B and 4B parameters) that combine text-to-image, editing, and multi-reference generation in a compact architecture. The 4B model runs on consumer GPUs like the RTX 3090 with around 13 GB of VRAM and is licensed under Apache 2.0 for commercial use, while the larger 9B model is non-commercial.

Jan 17, 2026 · release · Impact: 70/100

MiniMax’s $619M IPO fuels China multimodal push

Chinese AI startup MiniMax Group said on January 8, 2026 that it raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most of the proceeds on AI R&D over the next five years. ([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))

Jan 8, 2026 · funding · Impact: 70/100

OpenAI Plans Real-Time Voice Model and Device

On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.

Jan 5, 2026 · release · Impact: 80/100

Qwen-Image-2512 targets Gemini 3 Pro quality

On December 31, 2025, Alibaba's Qwen team released Qwen-Image-2512, an update to its open-source text-to-image model aimed at matching the enterprise-grade image quality of Google's Gemini 3 Pro Image. The model's weights, demos, and API access became available via Qwen Chat, Hugging Face, ModelScope, and Alibaba Cloud.

Dec 31, 2025 · release · Impact: 70/100

Gemini 2.0 Flash Thinking Released

Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.

Dec 19, 2024 · release · Impact: 85/100

Gemini 2.0 Flash Released

Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming Gemini 1.5 Pro at twice the speed.

Dec 11, 2024 · release · Impact: 90/100

Sora Public Release

OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.

Dec 9, 2024 · release · Impact: 85/100

GPT-4o Released

OpenAI releases GPT-4o, a single omni model with native audio, vision, and text capabilities.

May 13, 2024 · release · Impact: 92/100

Sora Video Generation Preview

OpenAI previews Sora, a model capable of generating realistic 60-second videos from text.

Feb 15, 2024 · breakthrough · Impact: 88/100

Leading Organizations

Google
OpenAI
Meta
Anthropic

ArXiv Categories

cs.CV · cs.CL · cs.LG · cs.MM
