Multimodal AI
Vision-language-audio unification, cross-modal understanding, and unified sequence modeling. Making AI see, hear, and understand the world.
Recent Papers
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal +5 more
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Ankan Deria, Komal Kumar, Xilin He +4 more
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef +2 more
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Yitong Chen, Zuxuan Wu, Xipeng Qiu +1 more
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Yakov Pyotr Shkolnikov
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Hila Chefer, Patrick Esser, Dominik Lorenz +5 more
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Boqiang Zhang, Lei Ke, Ruihan Yang +5 more
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Lijiang Li, Zuwei Long, Yunhang Shen +6 more
Multimodal Large Language Models as Image Classifiers
Nikita Kisel, Illia Volkov, Klara Janouskova +1 more
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Xiang Zhang, Sohyun Yoo, Hongrui Wu +3 more
Recent Milestones
Microsoft MAI Models Cut OpenAI Dependence
On April 2, 2026, Microsoft's MAI division launched three in-house foundation models through its Foundry and MAI Playground platforms: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for speech generation, and MAI-Image-2 for image generation. Follow-up coverage on April 3-4 detailed aggressive pricing that undercuts rival offerings from OpenAI and Google and confirmed the models are already being integrated into Copilot, Teams, Bing, PowerPoint, and Azure Speech.
Krafton Raon: Open Voice + Vision Model Suite
On April 2, 2026, Krafton launched its new AI model brand 'Raon' and released four open-source models on Hugging Face. The suite includes a 9B-parameter speech LLM, a real-time full-duplex speech chat model, a text-to-speech model, and a vision encoder that outperforms Google's SigLIP2 on some tasks.
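For context, the comparison baseline is openly available. Below is a minimal zero-shot classification sketch with the google/siglip2-base-patch16-224 checkpoint via Hugging Face transformers; the Raon announcement does not give exact checkpoint ids, so no Raon repo id is shown, and swapping one in would only change the `from_pretrained` argument.

```python
# Minimal zero-shot classification sketch with SigLIP2, the baseline the
# Raon vision encoder is compared against. The image path and labels are
# placeholders; a Raon checkpoint could be substituted via the repo id.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

repo_id = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(repo_id)
processor = AutoProcessor.from_pretrained(repo_id)

image = Image.open("photo.jpg")  # placeholder image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP models score image-text pairs with a sigmoid, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```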
ByteDance Seedance 2.0 tops AI video league
On February 9-10, 2026, Chinese media highlighted ByteDance's Seedance 2.0, a next-generation AI video model in limited testing that generates multi-shot 2K videos with synchronized audio from text or images in under a minute. Official materials describe a dual-branch diffusion transformer that delivers cinematic, multi-scene narratives about 30% faster than rivals such as Kuaishou's Kling. High-profile creators and game developers called it the "strongest video generation model," triggering a rally in China's media and AI application stocks and sparking debate over copyright and deepfake risks. ([finance.sina.com.cn](https://finance.sina.com.cn/roll/2026-02-09/doc-inhmesxr9395387.shtml))
Single‑prompt AI spins 16‑minute anime episodes
Hong Kong–based LAiPIC announced on February 8, 2026 that its Doratoon platform can automatically generate up to 16 minutes of continuous, story‑driven anime from a single text prompt. The company says Doratoon uses a proprietary visual intelligence engine and a library of 18 million assets to handle scripting, storyboarding, character design, scene rendering, voice acting and music with minimal human input.
Flux 2 small puts open image models on one GPU
On January 17, 2026, German startup Black Forest Labs released Flux 2 small, a pair of image models (9B and 4B parameters) that combine text-to-image generation, editing, and multi-reference generation in a compact architecture. The 4B model runs on consumer GPUs such as the RTX 3090 in around 13 GB of VRAM and is licensed under Apache 2.0 for commercial use, while the larger 9B model is restricted to non-commercial use.
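As a rough illustration of the consumer-GPU claim, here is a minimal diffusers-style sketch, assuming the 4B checkpoint loads through the same FluxPipeline class as earlier Flux releases; the repo id below is a hypothetical placeholder, not a confirmed name.

```python
# Hedged sketch: text-to-image with the 4B Flux 2 small model via diffusers.
# The repo id is a hypothetical placeholder; earlier Flux weights live under
# the black-forest-labs/ namespace and load with FluxPipeline.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-small-4B",  # hypothetical repo id
    torch_dtype=torch.bfloat16,           # halves weight memory vs. fp32
)
pipe.enable_model_cpu_offload()  # helps stay within ~13 GB on an RTX 3090

image = pipe(
    prompt="a lighthouse on a cliff at dawn, oil painting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")
```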
MiniMax’s $619M IPO fuels China multimodal push
Chinese AI startup MiniMax Group said on January 8, 2026 that it raised HK$4.82 billion (about $618.6 million) in its Hong Kong IPO by pricing shares at HK$165, the top of its range. The multimodal model developer, founded in 2022 by a former SenseTime executive, plans to spend most of the proceeds on AI R&D over the next five years. ([reuters.com](https://www.reuters.com/world/asia-pacific/chinas-ai-startup-minimax-group-raises-619-million-hong-kong-ipo-2026-01-08/))
OpenAI Plans Real-Time Voice Model and Device
On January 5, 2026, multiple tech outlets reported that OpenAI is preparing a new audio-model architecture for release by the end of Q1 2026, capable of speaking while users talk and handling natural interruptions. The model is reportedly tied to an audio-first personal device designed by Jony Ive’s team and targeted for 2026–2027.
Qwen-Image-2512 targets Gemini 3 Pro quality
On December 31, 2025, Alibaba's Qwen team released Qwen-Image-2512, an update to its open-source text-to-image model aimed at matching the enterprise-grade image quality of Google's Gemini 3 Pro Image. The model's weights, demos, and API access became available via Qwen Chat, Hugging Face, ModelScope, and Alibaba Cloud.
Gemini 2.0 Flash Thinking Released
Google releases experimental Gemini 2.0 Flash Thinking with enhanced reasoning capabilities.
Gemini 2.0 Flash Released
Google releases Gemini 2.0 Flash with native multimodal understanding and generation, outperforming Gemini 1.5 Pro at twice the speed.
Sora Public Release
OpenAI releases Sora publicly, enabling video generation for ChatGPT Plus and Pro users.
GPT-4o Released
OpenAI releases GPT-4o with native audio, vision, and text in a single omni model.
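A minimal sketch of what a single omni model means in practice: text and an image go into one Chat Completions request via the OpenAI Python SDK. The prompt and image URL are placeholders.

```python
# Minimal sketch: one request mixing text and image inputs to GPT-4o
# via the OpenAI Python SDK (v1.x). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is unusual about this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```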
Sora Video Generation Preview
OpenAI previews Sora, capable of generating realistic 60-second videos from text.