Generation

Research papers, repositories, and articles about generation

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Builds a single masked discrete diffusion model that handles both understanding and generation across images and other modalities. Aims to replace many task-specific generators with one backbone.

Lijiang Li, Zuwei Long

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.

Jianxiong Gao, Zhaoxi Chen

resemble-ai/chatterbox

Chatterbox is a state-of-the-art open-source text-to-speech stack. If you need production-quality voices without a SaaS bill, start here.

16,307

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.

Hila Chefer, Patrick Esser

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh turns a single RGB image into a full 3D indoor mesh using a token-based decoder. It skips voxels and point clouds and targets artist-ready meshes in one shot.

Xiang Zhang, Sohyun Yoo

HeartMuLa: A Family of Open Sourced Music Foundation Models

HeartMuLa bundles an audio–text matcher, robust lyric recognizer, music codec, and a music-generating LLM. You get controllable, prompt-driven song generation plus tools for indexing and understanding songs at scale.

Dongchao Yang, Yuxin Xie

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.

Wanlong Liu, Bo Zhang

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.

Longbin Ji, Xiaoxiong Liu

Towards Scalable Pre-training of Visual Tokenizers for Generation

Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.

Jingfeng Yao, Yuda Song

Leonxlnx/taste-skill

taste-skill acts as a style filter for generative models, steering them away from bland, generic output. You drop it into an agent and it scores drafts for "taste," nudging the system toward bolder, more specific writing.

43,611

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))

Wenda Chu, Bingliang Zhang

AIDC-AI/Pixelle-Video

End-to-end pipeline for fully automated short-form video creation with AI. Takes scripts or prompts and generates clips, edits, and captions. If you run content operations, this shows where AI video automation is headed in practice. ([github.com](https://github.com/trending/python?since=daily))

10,029

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Seedance 1.5 pro jointly generates video and sound from one model rather than bolting audio on later. Content teams can use this to explore tightly synced audio-visual experiences.

Heyi Chen, Siyan Chen

SFTok: Bridging the Performance Gap in Discrete Tokenizers

SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.

Qihang Rao, Borui Zhang

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces a causal, one-dimensional image tokenizer that encodes an image as an ordered token sequence. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

StereoPilot uses powerful generative models as priors for turning 2D content into stereo. If you care about 3D, VR, or depth effects, this is a new playbook.

Guibao Shen, Yihua Du

DragMesh: Interactive 3D Generation Made Easy

DragMesh offers a real-time framework for interactively generating articulated 3D motion by decoupling kinematics from motion generation, using a dual-quaternion VAE and FiLM conditioning. For 3D/graphics folks, it’s a signal that interactive, physically plausible articulation is becoming practical, not just offline. ([huggingface.co](https://huggingface.co/papers/2512.06424))

Tianshan Zhang, Zeyu Zhang

Multi-Objective Molecular Generation with Frequency-Controlled Evolutionary Dynamics

Represents molecules in a Fourier basis and evolves them with a multi-objective evolutionary algorithm. Separates coarse scaffold changes from fine local tweaks in a clean way. If you care about drug discovery or interpretable molecule search, this offers a training-free alternative to diffusion models. ([arxiv.org](https://arxiv.org/list/cs.NE/new))

Elia Colleoni, Paolo Guida

hardikpandya/stop-slop

A skill file that strips obvious AI "tells" from text. It pushes models to sound less generic and more human. If you care about brand voice or passing AI detectors, this is a handy, if controversial, tool.

7,720