Diffusion

Research papers, repositories, and articles about diffusion

stable-diffusion-webui

stable-diffusion-webui by AUTOMATIC1111 is the de facto standard local web interface for Stable Diffusion. It runs on consumer GPUs and offers a broad feature set: txt2img, img2img, inpainting/outpainting, upscaling, LoRA/embeddings support, training utilities, and a large extension ecosystem. For image generation or fine-tuning with Stable Diffusion in a local or lab environment, it is usually the first tool people reach for and the one most community workflows target. ([github.com](https://github.com/AUTOMATIC1111/stable-diffusion-webui?utm_source=openai))

158,945
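
For lab use, the web UI can also be driven programmatically: when launched with the `--api` flag it serves an HTTP endpoint at `/sdapi/v1/txt2img`. The following is a minimal sketch, assuming the server runs at the default local address; the helper names `build_txt2img_payload` and `txt2img` are hypothetical and exist only for this illustration.

```python
import base64
import json
import urllib.request

# Assumption: the web UI was started with --api and listens on the default port.
BASE_URL = "http://127.0.0.1:7860"


def build_txt2img_payload(prompt, negative_prompt="", steps=20,
                          width=512, height=512, cfg_scale=7.0, seed=-1):
    """Build the JSON body for a txt2img request (seed=-1 means random)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "width": width,
        "height": height,
        "cfg_scale": cfg_scale,
        "seed": seed,
    }


def txt2img(payload, base_url=BASE_URL, timeout=300):
    """POST the payload and return the generated images as raw bytes."""
    request = urllib.request.Request(
        f"{base_url}/sdapi/v1/txt2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The response carries base64-encoded images in the "images" list.
    return [base64.b64decode(image) for image in body["images"]]
```

With a server running, `txt2img(build_txt2img_payload("a lighthouse at dusk"))` would return a list of encoded image bytes that can be written straight to disk.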

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.

Yiwen Tang, Zoey Guo

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

Google AI Blog

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion is a masked diffusion model for text that decodes in parallel over contiguous ‘slots’ instead of individual tokens. By combining diffusion-based planning with autoregressive infilling, it recovers much of the quality of strong autoregressive LLMs while massively speeding up generation and allowing KV-cache reuse. This is one of the more serious attempts to rethink LLM decoding beyond the usual left-to-right paradigm.

Jia-Nan Li, Jian Guan

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Builds a single diffusion model for both understanding and generating across images and other formats. Aims to replace many task-specific generators with one backbone.

Lijiang Li, Zuwei Long

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.

Hila Chefer, Patrick Esser
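
For readers new to flow matching, the generic training objective can be sketched in a few lines. This is the standard straight-path formulation, not the self-supervised variant this paper proposes; the function names and the list-based vectors are assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, data_batch, rng):
    """Mean squared error between predicted and target velocities."""
    total, count = 0.0, 0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
        t = rng.random()                        # time drawn from [0, 1)
        predicted = model(interpolate(x0, x1, t), t)
        target = target_velocity(x0, x1)
        total += sum((p - v) ** 2 for p, v in zip(predicted, target))
        count += len(x1)
    if count == 0:
        raise ValueError("data_batch must contain at least one value")
    return total / count
```

Here `model` is any callable that takes a point and a time and returns a velocity of the same length; in practice it would be a neural network trained to minimize this loss.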

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.

Lizhuo Luo, Shenggui Li

EasyV2V: A High-quality Instruction-based Video Editing Framework

EasyV2V improves instruction-based video editing by generating training pairs from existing expert models and images. If you're building video tools, this paper offers a recipe for better data and architectures.

Jinjie Mai, Chaoyang Wang

Image Diffusion Preview with Consistency Solver

From DeepMind, this work uses consistency-based solvers to let users preview diffusion model outputs much more quickly than running a full sampling schedule. The idea is to generate rough-but-faithful previews that can guide prompt iteration and editing, then refine on demand. It’s another example of how inference-side tricks—not just bigger models—are improving practical usability of image generation.

Fu-Yun Wang, Hao Zhou

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing combines a diffusion-based generator with a preference-optimization trick to drive talking-head avatars in real time. It reacts to a user’s speech and body motion with low latency, producing more expressive, conversational faces without needing labeled interaction data.

Taekyung Ki, Sangwon Jang

NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

NanoFLUX distills large text-to-image generators into much smaller models that still follow prompts well on phones. Its distillation losses are designed to preserve visual quality while cutting memory and compute.

Ruchika Chavhan, Malcolm Chadwick

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

H2R-Grounder is a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It is notable because it points toward scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

Diffusion Language Models Are Natively Length-Aware

Argues that diffusion-style language models naturally handle short and long prompts without special tricks. Points to a promising path for huge-context text models.

Vittorio Rossi, Giacomo Cirò

upscayl/upscayl

A cross-platform image upscaler that uses open-source models to enlarge and sharpen low-resolution photos. It’s a simple way to add high-quality upscaling to creative pipelines.

43,133

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or doing warping. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.

Tjark Behrens, Anton Obukhov

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video at scale. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it can translate complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.

Pei Yang, Hai Ci