3D
Research papers, repositories, and articles about 3D
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
This paper systematically explores reinforcement learning (RL) for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1, which they describe as the first RL-tuned text-to-3D model; it improves both global shape quality and fine-grained texture alignment.
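The summary does not spell out how Hi-GRPO works, but its name suggests it builds on GRPO (Group Relative Policy Optimization). As a minimal sketch of the standard GRPO ingredient, assuming Hi-GRPO keeps it: several 3D outputs are sampled for the same prompt, each is scored by a reward model, and each sample's advantage is its reward normalized against its own group. The function name is hypothetical, and the global-to-local hierarchy is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled outputs.

    Each sample's advantage is its reward minus the group mean, divided by
    the group's (population) standard deviation; eps avoids division by zero
    when every sample in the group received the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four 3D samples generated from the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
```

Because the baseline is the group's own mean, the policy is pushed toward outputs that beat their siblings for the same prompt, with no separate value network needed.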
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
PixARMesh reconstructs a full 3D mesh of an indoor scene from a single RGB image using an autoregressive, token-based decoder. It generates mesh geometry directly, without intermediate voxel or point-cloud representations, and targets artist-ready meshes.
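A common way for autoregressive mesh generators to treat a mesh as tokens is to quantize vertex coordinates into a fixed number of bins and emit them triangle by triangle. The sketch below shows one simplified scheme of that kind purely to illustrate the idea; it is an assumption, not PixARMesh's actual tokenizer, and all function names are hypothetical.

```python
def quantize(coord, bins=128, lo=-1.0, hi=1.0):
    """Map a float in [lo, hi] (clamped) to an integer bin in [0, bins - 1]."""
    t = (min(max(coord, lo), hi) - lo) / (hi - lo)
    return min(int(t * bins), bins - 1)


def mesh_to_tokens(vertices, faces, bins=128):
    """Flatten triangles into one token sequence: each face contributes its
    three vertices, and each vertex its quantized (x, y, z)."""
    tokens = []
    for face in faces:
        for vi in face:
            tokens.extend(quantize(c, bins) for c in vertices[vi])
    return tokens


def tokens_to_mesh(tokens, bins=128, lo=-1.0, hi=1.0):
    """Invert mesh_to_tokens up to quantization error (bin centers).
    Vertices shared between faces are not merged back together."""
    step = (hi - lo) / bins
    coords = [lo + (t + 0.5) * step for t in tokens]
    verts = [tuple(coords[i:i + 3]) for i in range(0, len(coords), 3)]
    faces = [(i, i + 1, i + 2) for i in range(0, len(verts), 3)]
    return verts, faces


tri = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)]
print(mesh_to_tokens(tri, [(0, 1, 2)]))  # [0, 0, 64, 127, 0, 64, 64, 127, 64]
```

A decoder then predicts this sequence one token at a time, so the output is a mesh by construction rather than a volume that must be converted afterwards.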
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Depth Any Panoramas is a single foundation model for depth estimation on 360° panoramas of both indoor and outdoor scenes. Robotics and AR teams could reuse it instead of training a separate depth network for each dataset.
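To use a panoramic depth map downstream, it is usually lifted into a 3D point cloud. A minimal sketch, assuming an equirectangular layout (columns map to longitude, rows to latitude) and that each value is the distance along the viewing ray rather than planar z-depth. The axis convention (y up, z forward) and the function name are assumptions, not part of the paper.

```python
import math


def equirect_to_points(depth):
    """Lift an H x W equirectangular depth map to (x, y, z) points.

    depth[v][u] is assumed to be the distance along the ray through pixel (u, v).
    Convention (assumed): y points up, z points forward at the image's
    horizontal center, longitude runs from -pi (left) to +pi (right).
    """
    h, w = len(depth), len(depth[0])
    points = []
    for v, row in enumerate(depth):
        lat = math.pi / 2 - (v + 0.5) / h * math.pi  # near +pi/2 at top row
        for u, d in enumerate(row):
            lon = (u + 0.5) / w * 2 * math.pi - math.pi
            dx = math.cos(lat) * math.sin(lon)
            dy = math.sin(lat)
            dz = math.cos(lat) * math.cos(lon)
            points.append((d * dx, d * dy, d * dz))
    return points


pts = equirect_to_points([[2.0] * 4, [2.0] * 4])
print(len(pts))  # 8 points, all at distance 2 from the camera
```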
Map2World: Segment Map Conditioned Text to 3D World Generation
Map2World generates full 3D worlds from user-drawn segment maps, then adds fine detail with a separate enhancement network. It uses priors from existing asset generators to generalize across domains despite limited training data. If you care about simulation, robotics, or game tools, this is a blueprint for controllable world generation. ([huggingface.co](https://huggingface.co/papers/2605.00781))
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
SceneDiff introduces a new benchmark and a strong baseline method for detecting object changes across multiple views captured at different times. It is useful for robots that must recognize which objects actually moved, as opposed to apparent changes caused by viewpoint shifts.
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
StereoPilot uses large generative models as priors for converting 2D (monocular) content into stereo, aiming for a single, efficient conversion approach. If you work on 3D, VR, or depth effects, it offers a new playbook for stereo conversion.
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.
DragMesh: Interactive 3D Generation Made Easy
DragMesh offers a real-time framework for interactively generating articulated 3D motion by decoupling kinematics from motion generation, using a dual-quaternion VAE and FiLM conditioning. For 3D/graphics folks, it’s a signal that interactive, physically plausible articulation is becoming practical in real time, not just as an offline process. ([huggingface.co](https://huggingface.co/papers/2512.06424))
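Two standard building blocks named in this summary can be sketched in isolation. A unit dual quaternion packs a rotation and a translation (a rigid motion, such as that of one articulated part) into a single 8-number object, and FiLM (feature-wise linear modulation) scales and shifts features using values predicted from a conditioning vector. This minimal pure-Python sketch shows only those two operations; it is not DragMesh's VAE or network, and the function names and weight layout are hypothetical.

```python
def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def dq_from_rt(r, t):
    """Unit dual quaternion (real, dual) for unit rotation r followed by translation t."""
    dual = tuple(0.5 * c for c in qmul((0.0, *t), r))
    return r, dual


def dq_apply(dq, p):
    """Apply the rigid motion encoded by dq to the 3D point p."""
    real, dual = dq
    half_t = qmul(dual, qconj(real))  # equals (0, t / 2) for a unit dual quaternion
    rotated = qmul(qmul(real, (0.0, *p)), qconj(real))[1:]
    return tuple(a + 2.0 * b for a, b in zip(rotated, half_t[1:]))


def film(x, cond, gamma_w, beta_w):
    """FiLM: per-feature scale (gamma) and shift (beta), each predicted linearly
    from cond. gamma_w and beta_w are hypothetical matrices, one row per feature."""
    gamma = [sum(w * c for w, c in zip(row, cond)) for row in gamma_w]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in beta_w]
    return [g * xi + b for g, xi, b in zip(gamma, x, beta)]


# 90 degrees about z, then translate by (1, 2, 3): (1, 0, 0) -> approximately (1, 3, 3).
c = s = 0.5 ** 0.5
print(dq_apply(dq_from_rt((c, 0.0, 0.0, s), (1.0, 2.0, 3.0)), (1.0, 0.0, 0.0)))
```

Dual quaternions are a popular choice for articulated motion because they represent a rigid transform compactly and can be interpolated without the shearing artifacts of blending matrices.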
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or warping pixels. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show that their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.
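For contrast, here is roughly what a depth-and-warp baseline does, as a minimal sketch assuming a rectified stereo rig: convert depth to disparity (focal length in pixels times baseline, divided by depth), shift each pixel of the left view, and let nearer pixels win when two land on the same spot. The holes it leaves (disoccluded regions) are one reason warp-based pipelines usually need an inpainting step. The function name and parameters are hypothetical, and only one image row is processed for clarity.

```python
def warp_scanline(colors, depths, focal_px, baseline):
    """Forward-warp one row of a left image to a right-eye view.

    disparity (pixels) = focal_px * baseline / depth; a pixel at x moves to x - disparity.
    When two pixels land on the same target, the nearer one (larger disparity) wins.
    Targets that receive nothing stay None: these are the disocclusion holes.
    """
    w = len(colors)
    out = [None] * w
    best = [-1.0] * w  # disparity of the pixel currently occupying each target
    for x, (c, z) in enumerate(zip(colors, depths)):
        d = focal_px * baseline / z
        tx = round(x - d)
        if 0 <= tx < w and d > best[tx]:
            out[tx] = c
            best[tx] = d
    return out


row = warp_scanline(["a", "b", "c", "d", "e"], [10, 10, 5, 10, 10], focal_px=10, baseline=1)
print(row)  # ['c', None, 'd', 'e', None]
```

Here the near pixel "c" overwrites the farther "b" at index 0, and indices 1 and 4 are left empty; an end-to-end generator that never warps does not produce holes of this explicit kind.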
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.
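MoRel's exact anchor-relay scheme is not described in this summary. As a generic, hypothetical illustration of bidirectional blending between two anchors: between anchor times t0 and t1, a quantity (for example a Gaussian's position) is predicted forward from the earlier anchor and backward from the later one, and the two predictions are cross-faded with a weight that changes linearly over time. Assuming each prediction reproduces its own anchor's state at that anchor's time, the blended trajectory has no hard switch at segment boundaries, which is the kind of discontinuity that shows up as flicker. The function names and the linear weighting are assumptions.

```python
def blend_weight(t, t0, t1):
    """Weight of the backward (later-anchor) prediction: 0 at t0, 1 at t1, clamped."""
    return min(max((t - t0) / (t1 - t0), 0.0), 1.0)


def bidirectional_blend(forward_pred, backward_pred, t, t0, t1):
    """Cross-fade two per-element predictions (e.g. x, y, z of one Gaussian).

    forward_pred: value propagated forward from the anchor at t0 to time t.
    backward_pred: value propagated backward from the anchor at t1 to time t.
    """
    w = blend_weight(t, t0, t1)
    return [(1.0 - w) * f + w * b for f, b in zip(forward_pred, backward_pred)]


print(bidirectional_blend([0.0, 0.0, 0.0], [2.0, 2.0, 2.0], t=5, t0=0, t1=10))  # [1.0, 1.0, 1.0]
```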