Vision

Research papers, repositories, and articles about vision

stable-diffusion-webui

stable-diffusion-webui by AUTOMATIC1111 is a widely used local web interface for Stable Diffusion. It offers txt2img, img2img, inpainting and outpainting, upscaling, LoRA and embeddings support, training utilities, and a large extension ecosystem, all running on consumer GPUs. For image generation or fine-tuning with Stable Diffusion in a local or lab environment, it is usually the first tool people reach for and the one most community workflows target. ([github.com](https://github.com/AUTOMATIC1111/stable-diffusion-webui?utm_source=openai))

158,945
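
For scripted use, the web UI can also be driven over HTTP when it is launched with the `--api` flag. The sketch below only builds a request for the `/sdapi/v1/txt2img` endpoint and decodes the base64 images in a response; the helper names, the default address `http://127.0.0.1:7860`, and the chosen parameter values are illustrative assumptions, and the exact payload fields may differ between versions.

```python
import base64
import json
import urllib.request


def build_txt2img_request(prompt, base_url="http://127.0.0.1:7860",
                          steps=20, width=512, height=512):
    """Build (but do not send) a POST request for the txt2img endpoint.

    Hypothetical helper: the base URL and defaults are assumptions about a
    local install started with --api.
    """
    payload = {"prompt": prompt, "steps": steps, "width": width, "height": height}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/sdapi/v1/txt2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_images(response_body):
    """Decode the base64-encoded images found under the "images" key."""
    return [base64.b64decode(image) for image in response_body.get("images", [])]


# Sending the request needs a running server, so it is left as a comment:
# with urllib.request.urlopen(build_txt2img_request("a red fox")) as resp:
#     images = decode_images(json.load(resp))
```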

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

This paper systematically explores reinforcement learning for text-to-3D generation, examining reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1, which they present as the first RL-tuned text-to-3D model, improving both global shape quality and fine-grained texture alignment.

Yiwen Tang, Zoey Guo
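
As background for Hi-GRPO, here is a minimal sketch of the group-relative advantage used in plain GRPO: several outputs are sampled for one prompt, and each reward is normalized by the group's mean and standard deviation. This is not the paper's hierarchical variant; the function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (plain GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    that beat the group average get a positive advantage.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, some code uses the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical 3D generations for the same prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every output in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.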

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal where you can upload images, audio, video, or text generated with Google AI tools. It checks for imperceptible SynthID watermarks and highlights which parts of the content are likely watermarked. For developers and media teams, it offers an authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

Google AI Blog

Recurrent Video Masked Autoencoders

Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Given its authorship from the Zisserman/Carreira lineage, this could become a go-to backbone for long-horizon video understanding.

Daniel Zoran, Nikhil Parthasarathy

Thinking with Images via Self-Calling Agent

Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))

Wenxi Yang, Yuzhong Zhao

Towards Scalable Pre-training of Visual Tokenizers for Generation

Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.

Jingfeng Yao, Yuda Song

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.

Songqiao Hu, Zeyi Liu
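
To make the idea of a bolt-on safety layer concrete, the toy sketch below filters a proposed action before execution without touching the policy that produced it. It is not VLSA's learned module: the 1D setting, the `SafetyLayer` class, and the distance rule are all hypothetical and chosen only to show the wrap-and-filter pattern.

```python
from dataclasses import dataclass


@dataclass
class SafetyLayer:
    """Toy 1D safety filter (hypothetical; not the paper's learned module)."""

    min_distance: float  # required gap to the obstacle, in metres
    dt: float = 0.1      # control step, in seconds

    def filter(self, position, obstacle, velocity):
        """Return the proposed velocity if the next step stays safe.

        Otherwise return the largest velocity that keeps the required gap,
        or zero if the gap is already violated and the action moves closer.
        """
        gap = obstacle - position  # assumes the obstacle lies ahead (+x)
        if gap - velocity * self.dt >= self.min_distance:
            return velocity
        max_safe = (gap - self.min_distance) / self.dt
        return min(velocity, max(0.0, max_safe))


layer = SafetyLayer(min_distance=0.5)
safe_velocity = layer.filter(position=0.0, obstacle=1.0, velocity=10.0)
```

The policy stays a black box here: only its output passes through `filter`, which is the property that makes such a layer plug-and-play.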

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Defines V-REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.

Chenrui Fan, Yijun Liang

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.

Kehong Gong, Zhengyu Wen

Depixelization_poc

A proof-of-concept attack that uses computer vision to recover the underlying text from pixelated screenshots. It is a reminder that pixelating text in a UI is not a reliable way to anonymize it. ([github.com](https://github.com/trending?since=daily))

3,659

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or doing warping. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.

Tjark Behrens, Anton Obukhov

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

MiSI-Bench introduces "Microscopic Spatial Intelligence"—the ability to reason about invisible molecular 3D structures—and builds a massive VLM benchmark spanning 163k QA pairs over 4k molecules. Current VLMs lag well behind humans on many tasks, but a tuned 7B model can exceed human performance on some spatial transformations, highlighting both the promise and the need for domain knowledge in scientific AGI.

Zongzhao Li, Xiangzhe Kong

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

VQRAE introduces a unified visual tokenizer that can simultaneously support high-level multimodal understanding and discrete-token image generation. Building on a pretrained vision encoder and a high-dimensional semantic VQ codebook, it yields continuous semantic features for reasoning and discrete tokens for reconstruction, showing that quantizing semantic encoders with large codebooks can preserve both meaning and detail.

Sinan Du, Jiahao Guo
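
The core step behind any VQ codebook is a nearest-neighbour lookup: each continuous feature vector is replaced by its closest codebook entry, and the entry's index becomes the discrete token. The sketch below shows that generic step in plain Python with a tiny hand-made codebook; it is not VQRAE's training procedure, and the function name and values are illustrative assumptions.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2).

    Returns the token indices and the corresponding quantized vectors.
    """
    indices = []
    for feature in features:
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code))
            for code in codebook
        ]
        indices.append(min(range(len(codebook)), key=distances.__getitem__))
    return indices, [codebook[i] for i in indices]


# Hypothetical 2D features and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.8]]
tokens, quantized = quantize(features, codebook)
```

Real tokenizers run this lookup in batched tensor form and need a gradient workaround such as a straight-through estimator, which this sketch leaves out.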

geoai

geoai is a Python package from the opengeos ecosystem that integrates deep-learning frameworks (PyTorch, Transformers, segmentation models) with geospatial tooling to handle everything from remote-sensing data download and tiling to training, inference, and interactive map visualization. It’s aimed at practitioners who want a higher-level, batteries-included stack for tasks like land-cover classification, building footprint extraction, and change detection, without reinventing all the GIS + ML plumbing. ([github.com](https://github.com/opengeos/geoai?utm_source=openai))

2,116

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Omni-Attribute is an open-vocabulary attribute encoder that learns to isolate specific visual factors—like style, lighting, or expression—rather than entangling everything into a single holistic embedding. Using curated positive/negative pairs and a dual generative/contrastive objective, it produces attribute-specific embeddings that are better for retrieval, personalization, and compositional image generation.

Tsai-Shien Chen, Aliaksandr Siarohin

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

DuetSVG proposes a unified multimodal model that generates both raster images and SVG code jointly, using the image stream to guide SVG token decoding. By letting the model "see" what it’s drawing during generation, it produces vector graphics that are more visually faithful, semantically correct, and syntactically clean than text-only SVG generators.

Peiying Zhang, Nanxuan Zhao

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

The authors augment multimodal LLMs with a "Video Toolkit" and a STAR (Spatiotemporal Reasoning) framework that orchestrates calls to temporal and spatial tools for video question answering. Instead of treating the video as a black-box embedding, the model actively localizes key regions over time using tools, yielding sizable gains on VideoMME and LongVideoBench when wrapped around GPT-4o.

Sunqi Fan, Jiashuo Cui

ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

ReViSE defines a new Reason-Informed Video Editing task and benchmark, then introduces a unified video model that edits while continuously self-evaluating its own reasoning. A built-in VLM judges whether the edited video logically satisfies the instruction, providing self-reflective feedback that tightens the link between "understanding" and actual visual edits.

Xinyu Liu, Hangjie Yuan

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.

Sangwoon Kwak, Weeyoung Kwon

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video at scale. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it can translate complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.

Pei Yang, Hai Ci

RO-ViT: Region-aware pre-training for open-vocabulary ...

RO-ViT proposes a region-aware pretraining scheme for vision transformers that uses cropped positional embeddings and focal loss to better align image–text pretraining with region-level object detection. Developers building open-vocabulary detectors can reuse these ideas (plus the released code) to boost novel-class detection without changing model capacity, especially when fine-tuning ViT backbones on detection datasets. ([ai.googleblog.com](https://ai.googleblog.com/2023/08/ro-vit-region-aware-pre-training-for.html))

Google AI Blog
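
For readers unfamiliar with focal loss, the sketch below computes the standard binary (sigmoid) form for a single logit: the `(1 - p_t) ** gamma` factor shrinks the loss on easy, well-classified examples so that hard ones dominate. This is the generic formulation, not RO-ViT's exact contrastive setup; the function name and the default `gamma` and `alpha` values are assumptions.

```python
import math


def sigmoid_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one logit and a 0/1 target.

    loss = -alpha_t * (1 - p_t) ** gamma * log(p_t), where p_t is the
    predicted probability of the true class.
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = sigmoid_focal_loss(logit=4.0, target=1)   # confident and correct
hard = sigmoid_focal_loss(logit=-4.0, target=1)  # confident and wrong
```

Setting `gamma=0` removes the down-weighting and leaves an alpha-weighted cross-entropy, which is a quick way to compare the two behaviours.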