Generation
Research papers, repositories, and articles about generation
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Builds a single masked discrete diffusion model for both understanding and generation across images and other modalities. Aims to replace many task-specific generators with one backbone.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
resemble-ai/chatterbox
Chatterbox is a state-of-the-art open-source text-to-speech stack. If you need production-quality voices without a SaaS bill, start here.
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Proposes a new way for models to draft and prune multiple candidate token paths in parallel, organized as speculation trees. Targets faster text generation with no additional training.
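For readers new to the idea, here is a minimal sketch of plain draft-and-verify speculative decoding in its greedy, single-chain form. It is not Goose's tree-based method. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical stand-ins, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify decoding; the result matches greedy decoding with `target`."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes up to k tokens.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(min(k, limit - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed position. A real system would score
        #    all positions in one batched forward pass; here it is a plain loop.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], max_new=8))
```

The speedup in practice comes from accepting several drafted tokens per target pass. Tree-based variants such as the one this paper describes draft several branches at once instead of a single chain.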
Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
PixARMesh turns a single RGB image into a full 3D indoor mesh using a token-based decoder. It skips voxels and point clouds and targets artist-ready meshes in one shot.
HeartMuLa: A Family of Open Sourced Music Foundation Models
HeartMuLa bundles an audio–text matcher, robust lyric recognizer, music codec, and a music-generating LLM. You get controllable, prompt-driven song generation plus tools for indexing and understanding songs at scale.
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.
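A critique-then-revise loop of this general shape can be sketched as follows. This is an illustration under stated assumptions, not R2-Write's actual procedure: `reflect_and_revise`, `toy_critique`, and `toy_revise` are hypothetical names, and the critic and reviser are simple string functions standing in for language-model calls.

```python
from typing import Callable, List

Critic = Callable[[str], List[str]]        # returns a list of issues; empty means "good enough"
Reviser = Callable[[str, List[str]], str]  # returns a new draft that addresses the issues


def reflect_and_revise(draft: str, critique: Critic, revise: Reviser,
                       max_rounds: int = 3) -> str:
    """Alternate critique and revision until the critic is satisfied or rounds run out."""
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break
        draft = revise(draft, issues)
    return draft


FILLER = ("very", "really")  # toy notion of a writing flaw (assumption)


def toy_critique(text: str) -> List[str]:
    """Hypothetical critic: flags filler words that appear in the draft."""
    words = text.split()
    return [w for w in FILLER if w in words]


def toy_revise(text: str, issues: List[str]) -> str:
    """Hypothetical reviser: drops the flagged words."""
    return " ".join(w for w in text.split() if w not in issues)


if __name__ == "__main__":
    print(reflect_and_revise("This is a very really good plan.", toy_critique, toy_revise))
```

The `max_rounds` cap is a design choice in this sketch: it bounds cost when the critic never reports a clean draft.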
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.
Towards Scalable Pre-training of Visual Tokenizers for Generation
Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.
Leonxlnx/taste-skill
taste-skill acts as a style filter for generative models, steering them away from bland, generic output. You drop it into an agent and it scores drafts for "taste," nudging the system toward bolder, more specific writing.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))
AIDC-AI/Pixelle-Video
End-to-end pipeline for fully automated short-form video creation with AI. Takes scripts or prompts and generates clips, edits, and captions. If you run content operations, this shows where AI video automation is headed in practice. ([github.com](https://github.com/trending/python?since=daily))
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Seedance 1.5 pro jointly generates video and sound from one model rather than bolting audio on later. Content teams can use this to explore tightly synced audio-visual experiences.
SFTok: Bridging the Performance Gap in Discrete Tokenizers
SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Introduces a one-dimensional causal image tokenizer, in which tokens follow a fixed order across space or time. Aims at sharper, more stable video and streaming image generation.
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
StereoPilot uses powerful generative models as priors for turning 2D content into stereo. If you care about 3D, VR, or depth effects, this is a new playbook.
DragMesh: Interactive 3D Generation Made Easy
DragMesh offers a real-time framework for interactively generating articulated 3D motion by decoupling kinematics from motion generation, using a dual-quaternion VAE and FiLM conditioning. For 3D/graphics folks, it’s a signal that interactive, physically plausible articulation is becoming practical, not just offline. ([huggingface.co](https://huggingface.co/papers/2512.06424))
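FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each feature with values computed from a conditioning input. The sketch below shows that operation in plain Python. How DragMesh wires it into its model is not stated here, so `film_layer`, the weight matrices, and the example values are all illustrative assumptions.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Sequence[float], gamma: Sequence[float],
         beta: Sequence[float]) -> List[float]:
    """Feature-wise modulation: out[i] = gamma[i] * features[i] + beta[i]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def film_layer(features: Sequence[float], condition: Sequence[float],
               w_gamma: Matrix, b_gamma: Sequence[float],
               w_beta: Matrix, b_beta: Sequence[float]) -> List[float]:
    """Compute gamma and beta from the conditioning vector, then modulate the features."""
    gamma = linear(w_gamma, b_gamma, condition)
    beta = linear(w_beta, b_beta, condition)
    return film(features, gamma, beta)


if __name__ == "__main__":
    # Hypothetical values: a 2-dim feature vector and a 2-dim conditioning vector.
    out = film_layer(
        features=[1.0, 2.0],
        condition=[1.0, 0.0],
        w_gamma=[[2.0, 0.0], [0.5, 0.0]], b_gamma=[0.0, 0.0],
        w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.0, 0.0],
    )
    print(out)
```

In a trained model the two affine maps would be learned, and the features would usually be an intermediate layer's activations rather than raw inputs.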
Multi-Objective Molecular Generation with Frequency-Controlled Evolutionary Dynamics
Represents molecules in a Fourier basis and evolves them with a multi-objective evolutionary algorithm. Separates coarse scaffold changes from fine local tweaks in a clean way. If you care about drug discovery or interpretable molecule search, this offers a training-free alternative to diffusion models. ([arxiv.org](https://arxiv.org/list/cs.NE/new))
hardikpandya/stop-slop
A skill file that strips obvious AI "tells" from text. It pushes models to sound less generic and more human. If you care about brand voice or passing AI detectors, this is a handy, if controversial, tool.