3D

Research papers, repositories, and articles about 3D

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

This paper systematically explores reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1, which they present as the first RL-tuned text-to-3D model and which improves both global shape quality and fine-grained texture alignment.

Yiwen Tang, Zoey Guo
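
For readers unfamiliar with the GRPO family that Hi-GRPO's name points to, here is a minimal sketch of the group-relative advantage step used in standard GRPO. It is not the paper's Hi-GRPO, whose hierarchical details are not given in this summary. The function name, the sample rewards, and the choice of population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples drawn for the same prompt.

    Standard GRPO-style normalization: subtract the group mean and divide by
    the group standard deviation. `eps` (an assumption) avoids division by
    zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four 3D generations sampled from one text prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scored above the group average get a positive advantage and are reinforced; those below are discouraged, without training a separate value network.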

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh turns a single RGB image into a full 3D indoor mesh using a token-based decoder. It skips voxels and point clouds and targets artist-ready meshes in one shot.

Xiang Zhang, Sohyun Yoo

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Depth Any Panoramas builds a single depth-estimation model for 360° indoor and outdoor scenes. Robotics and AR teams can reuse it instead of training a separate depth network per dataset.

Xin Lin, Meixi Song

Map2World: Segment Map Conditioned Text to 3D World Generation

Map2World generates full 3D worlds from user-drawn segment maps, then adds fine detail with a separate enhancement network. It uses priors from existing asset generators to generalize across domains with limited training data. If you care about simulation, robotics, or game tools, this is a blueprint for controllable world generation. ([huggingface.co](https://huggingface.co/papers/2605.00781))

Jaeyoung Chung, Suyoung Lee

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

SceneDiff provides a new benchmark and a strong baseline for detecting object changes across views and time. It is useful for robots that must notice what actually moved rather than mistaking viewpoint shifts for changes.

Yuqun Wu, Chih-hao Lin

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

StereoPilot uses generative models as priors for converting 2D content into stereo, aiming for a unified and efficient conversion pipeline. If you care about 3D, VR, or depth effects, this is a new playbook.

Guibao Shen, Yihua Du

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.

Kehong Gong, Zhengyu Wen

DragMesh: Interactive 3D Generation Made Easy

DragMesh offers a real-time framework for interactively generating articulated 3D motion by decoupling kinematics from motion generation, using a dual-quaternion VAE and FiLM conditioning. For 3D/graphics folks, it’s a signal that interactive, physically plausible articulation is becoming practical, not just offline. ([huggingface.co](https://huggingface.co/papers/2512.06424))

Tianshan Zhang, Zeyu Zhang
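
The FiLM conditioning mentioned in the DragMesh summary can be illustrated with a minimal sketch of the general technique: a conditioning vector is mapped to a per-feature scale (gamma) and shift (beta), which then modulate the features. How DragMesh wires this into its model is not described here, so the helper names, weights, and inputs below are all hypothetical.

```python
def linear(weights, bias, cond):
    """Plain affine map: one output per weight row."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out_i = gamma_i * x_i + beta_i.

    gamma and beta are predicted from the conditioning vector `cond`,
    so the same features are transformed differently per condition.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical two-channel feature vector and one-dimensional condition.
features = [1.0, -2.0]
cond = [0.5]
out = film(
    features, cond,
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0], [4.0]], b_beta=[0.5, 0.0],
)
```

In practice the two affine maps are small learned layers and the modulation is applied inside a deeper network; the list-based version above only shows the arithmetic.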

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or doing warping. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.

Tjark Behrens, Anton Obukhov

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.

Sangwoon Kwak, Weeyoung Kwon