Back to Frontiers

Multimodal AI

Growing54%

Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.

vision-languageGPT-4VGeminiCLIPmultimodalvideo understandingaudio
51
Papers
8
Milestones
$1.0B
Funding
2
Benchmarks

Key Benchmarks

Video-MME

Comprehensive video understanding benchmark

84.8%Human: 92%
Leader: Gemini 2.5 Promedium saturation

MMMU

Massive Multi-discipline Multimodal Understanding benchmark

82.9%Human: 88.6%
Leader: o3medium saturation

Recent Papers

Recent Milestones

MiniMax’s $619M IPO fuels China multimodal push

Chinese AI startup MiniMax Group said on January 8, 2026 it raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most proceeds on AI R&D over the next five years.([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))

Jan 8, 2026fundingImpact: 70/100

OpenAI Plans Real-Time Voice Model and Device

On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.

Jan 5, 2026releaseImpact: 80/100

Qwen-Image-2512 targets Gemini 3 Pro quality

On December 31, 2025, Alibaba’s Qwen team released Qwen‑Image‑2512, a new open-source text‑to‑image model update aimed at matching enterprise‑grade image quality from Google’s Gemini 3 Pro Image. The model’s weights, demos and API access became available via Qwen Chat, Hugging Face, ModelScope and Alibaba Cloud.

Dec 31, 2025releaseImpact: 70/100

Gemini 2.0 Flash Thinking Released

Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.

Dec 19, 2024releaseImpact: 85/100

Gemini 2.0 Flash Released

Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming 1.5 Pro at 2x speed.

Dec 11, 2024releaseImpact: 90/100

Sora Public Release

OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.

Dec 9, 2024releaseImpact: 85/100

GPT-4o Released

OpenAI releases GPT-4o with native audio, vision, and text in a single omni model.

May 13, 2024releaseImpact: 92/100

Sora Video Generation Preview

OpenAI previews Sora, capable of generating realistic 60-second videos from text.

Feb 15, 2024breakthroughImpact: 88/100

Leading Organizations

Google
OpenAI
Meta
Anthropic

ArXiv Categories

cs.CVcs.CLcs.LGcs.MM

Related Frontiers