Vision

Research papers, repositories, and articles about vision

stable-diffusion-webui

stable-diffusion-webui by AUTOMATIC1111 is the de facto standard local web interface for Stable Diffusion, providing a massive feature set—txt2img, img2img, inpainting/outpainting, upscaling, LoRA/embeddings support, training utilities, and a huge extension ecosystem—on top of consumer GPUs. If you’re doing any kind of image generation or fine-tuning with Stable Diffusion in a local or lab environment, this is usually the first tool people reach for and the one most community workflows target. ([github.com](https://github.com/AUTOMATIC1111/stable-diffusion-webui?utm_source=openai))

158,945

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.

Yiwen Tang, Zoey Guo

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

Google AI Blog

STEP3-VL-10B Technical Report

STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.

Ailin Huang, Chengyuan Yao

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))

Keming Wu, Zuhao Yang

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Fuses a contrastive vision encoder and a self-supervised encoder, then feeds the combined tokens into a language model. Yields stronger results on visual understanding and grounding benchmarks.

Ankan Deria, Komal Kumar

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Builds a single masked discrete diffusion model that handles both understanding and generation across images and other modalities. Aims to replace many task-specific generators with one backbone.

Lijiang Li, Zuwei Long

Action100M: A Large-scale Video Action Dataset

Action100M is a dataset built fully automatically from over a million how-to videos, yielding around 100 million labeled action snippets. It uses V-JEPA features plus a GPT-based pipeline to label segments, and it unlocks clean scaling curves for action recognition models.

Delong Chen, Tejaswi Kasarla

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

WorldCanvas lets you script rich video scenes using text, object trajectories, and reference images instead of frame-by-frame editing. Use this as a template for building controllable world simulators and advanced video tools.

Hanlin Wang, Hao Ouyang

Make Your LVLM KV Cache More Lightweight

Targets the memory blow-up that vision tokens cause in the KV cache of large vision–language models at inference time. Uses a prompt-aware method, LightKV, to merge redundant vision tokens before decoding. If you ship LVLMs, this is a concrete way to cut GPU memory and costs without killing quality. ([arxiv.org](https://arxiv.org/list/cs.CV/pastweek?show=100))

Anonymous (ICLR and TMLR drafts; arXiv metadata lists named authors)
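
To make the token-merging idea in the LightKV entry concrete, here is a minimal sketch that greedily averages near-duplicate vision tokens by cosine similarity. It is an illustration only, not LightKV's algorithm: the function names, the 0.95 threshold, and the greedy strategy are all assumptions, and the prompt-aware part of the method is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_redundant_tokens(tokens, threshold=0.95):
    """Hypothetical helper: average each token into the first kept token it closely matches."""
    kept = []    # running mean vector for each surviving token
    counts = []  # how many original tokens each survivor represents
    for tok in tokens:
        for i, rep in enumerate(kept):
            if cosine(tok, rep) >= threshold:
                n = counts[i]
                kept[i] = [(r * n + t) / (n + 1) for r, t in zip(rep, tok)]
                counts[i] = n + 1
                break
        else:
            # No close match found, so this token survives on its own.
            kept.append(list(tok))
            counts.append(1)
    return kept

# Toy "vision tokens": two near-duplicate pairs.
vision_tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
merged = merge_redundant_tokens(vision_tokens)
print(len(vision_tokens), "->", len(merged))
```

Fewer tokens entering the decoder means fewer key/value entries to cache, which is where the memory saving would come from.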

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Uses self-supervision to train flow-matching generators across data types like images and video. Seeks more stable, high-quality generation with fewer labeled examples.

Hila Chefer, Patrick Esser

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh turns a single RGB image into a full 3D indoor mesh using a token-based decoder. It skips voxels and point clouds and targets artist-ready meshes in one shot.

Xiang Zhang, Sohyun Yoo

Recurrent Video Masked Autoencoders

Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go‑to backbone for long-horizon video understanding.

Daniel Zoran, Nikhil Parthasarathy

Thinking with Images via Self-Calling Agent

Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))

Wenxi Yang, Yuzhong Zhao

Towards Scalable Pre-training of Visual Tokenizers for Generation

Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.

Jingfeng Yao, Yuda Song

Let ViT Speak: Generative Language-Image Pre-training

Trains a Vision Transformer to predict language tokens directly from image tokens using a standard language-model objective. Removes contrastive tricks and extra decoders while staying competitive on many multimodal benchmarks. If you maintain vision backbones for language models, this is a simpler pretraining recipe to test. ([huggingface.co](https://huggingface.co/papers/2605.00809))

Yan Fang, Mengcheng Lan

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

The authors release LMEE-Bench to test how agents explore and remember in long-horizon 3D tasks. Their MemoryExplorer method trains a vision-language model with reinforcement learning to actively query and use episodic memory.

Sen Wang, Bangwei Liu

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Pushes vision-language models toward smaller, cheaper designs built on LLM-based vision encoders. Targets strong image-text performance while keeping running costs low.

Boqiang Zhang, Lei Ke

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

This paper teaches vision‑language agents to actively consult maps while guessing where a photo was taken. With reinforcement learning and parallel search paths, their system beats even strong commercial baselines on real‑world geolocation benchmarks.

Yuxiang Ji, Yong Wang

EasyV2V: A High-quality Instruction-based Video Editing Framework

EasyV2V upgrades text-controlled video editing by cleverly generating training pairs from existing experts and images. If you're building video tools, this paper is a recipe for better data and architectures.

Jinjie Mai, Chaoyang Wang

Next-Embedding Prediction Makes Strong Vision Learners

Instead of predicting pixels or patches, this method predicts the next embedding in a learned space. Vision folks can plug this into pretraining to squeeze more out of ImageNet-scale data.

Sihan Xu, Ziqiao Ma
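
As a rough illustration of the next-embedding idea in the entry above, the sketch below scores a predictor by how well its output for position t matches the embedding at position t+1. The cosine-distance loss, the toy predictors, and all names are assumptions made for illustration; the paper's actual objective and architecture may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def next_embedding_loss(embeddings, predictor):
    """Mean (1 - cosine) between predictor(e[t]) and e[t+1] over the sequence."""
    losses = []
    for current, target in zip(embeddings, embeddings[1:]):
        predicted = predictor(current)
        losses.append(1.0 - cosine(predicted, target))
    return sum(losses) / len(losses)

# Toy sequence: each embedding is the previous one rotated by 90 degrees.
sequence = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

def identity(v):
    return list(v)

def rotate90(v):
    return [-v[1], v[0]]

loss_identity = next_embedding_loss(sequence, identity)  # poor predictor
loss_rotate = next_embedding_loss(sequence, rotate90)    # matches the toy dynamics
print(loss_identity, loss_rotate)
```

In a real setup the predictor would be a trained network and the embeddings would come from image patches; here both are stand-ins chosen so the loss is easy to check by hand.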

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Depth Any Panoramas builds a single model for depth on 360° indoor and outdoor scenes. Robotics and AR teams can reuse this instead of training per-dataset depth nets.

Xin Lin, Meixi Song

VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

VIVA uses a vision-language model to encode instructions and a reward-optimized diffusion model to edit videos. Great blueprint for anyone mixing video generation with RL-style feedback.

Xiaoyan Cong, Haotian Yang

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))

Wenda Chu, Bingliang Zhang

Map2World: Segment Map Conditioned Text to 3D World Generation

Generates full 3D worlds from user-drawn segment maps, then adds fine detail with a separate enhancement network. Uses priors from existing asset generators to generalize across domains with limited training data. If you care about simulation, robotics, or game tools, this is a blueprint for controllable world generation. ([huggingface.co](https://huggingface.co/papers/2605.00781))

Jaeyoung Chung, Suyoung Lee

Blaizzy/mlx-vlm

Lets you run and customize vision-language models on Apple Silicon using MLX. Great for building local image-and-text assistants on a Mac.

3,911

Bi-Orthogonal Factor Decomposition for Vision Transformers

The authors dissect attention in vision transformers into content and position factors using ANOVA and SVD. They show heads specialize into different interaction types and explain why self-supervised models like DINOv2 use attention differently from supervised ones.

Fenil R. Doshi, Thomas Fel

SFTok: Bridging the Performance Gap in Discrete Tokenizers

SFTok narrows the quality gap between discrete and continuous image tokenizers using multi-step reconstruction tricks. That matters if you want autoregressive image or video generators that scale.

Qihang Rao, Borui Zhang

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.

Songqiao Hu, Zeyi Liu
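
The "filter unsafe actions before execution" pattern from the VLSA entry can be sketched as a small wrapper around a policy's proposed action. This is a generic safety-shield illustration under stated assumptions (2D positions, circular keep-out zones, step halving, a current position that is already safe), not the paper's learned safety module; every name and parameter here is hypothetical.

```python
import math

def is_safe(position, obstacles, radius):
    """True if the position is at least `radius` away from every obstacle."""
    return all(math.dist(position, obs) >= radius for obs in obstacles)

def safety_filter(position, action, obstacles, radius=0.5, max_halvings=5):
    """Return the proposed step, shrunk if needed so the next position stays safe."""
    scale = 1.0
    for _ in range(max_halvings):
        step = [a * scale for a in action]
        next_position = [p + s for p, s in zip(position, step)]
        if is_safe(next_position, obstacles, radius):
            return step
        scale /= 2.0  # try a more conservative version of the same action
    return [0.0 for _ in action]  # fall back to standing still

obstacles = [(1.0, 0.0)]
blocked = safety_filter((0.0, 0.0), (1.0, 0.0), obstacles)  # heads straight at the obstacle
free = safety_filter((0.0, 0.0), (0.0, 1.0), obstacles)     # moves away from it
print(blocked, free)
```

The upstream policy is untouched; only its output passes through the filter, which is what makes this kind of layer attachable without retraining the control stack.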

roboflow/supervision

Utility library that standardizes computer-vision pre- and post-processing. Saves you from rewriting glue code around detection and segmentation models.

37,698
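
For a sense of the glue code the roboflow/supervision entry refers to, here is a plain-Python non-maximum suppression routine of the sort people often rewrite around detection models. It deliberately does not use the supervision API; the function names and the 0.5 IoU threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices into the original list, best score first

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_indices = nms(boxes, scores)
print(kept_indices)
```

A shared utility library replaces hand-rolled routines like this with tested, model-agnostic equivalents.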

Multimodal Large Language Models as Image Classifiers

Evaluates multimodal large language models as plain image classifiers. Shows where they beat or trail classic vision networks.

Nikita Kisel, Illia Volkov

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces a one-dimensional causal image tokenizer built on mean flows. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu

Future Optical Flow Prediction Improves Robot Control & Video Generation

FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.

Kanchana Ranasinghe, Honglu Zhou

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

SceneDiff gives a new benchmark and a strong baseline for detecting object changes across views and time. Useful for robots that must notice what actually moved, not just viewpoint shifts.

Yuqun Wu, Chih-hao Lin

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

StereoPilot uses powerful generative models as priors for turning 2D content into stereo. If you care about 3D, VR, or depth effects, this is a new playbook.

Guibao Shen, Yihua Du

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Defines V‑REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.

Chenrui Fan, Yijun Liang

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.

Kehong Gong, Zhengyu Wen

STResNet & STYOLO: A New Family of Compact Classification and Object Detection Models for MCUs

STResNet and STYOLO target microcontrollers, hitting competitive ImageNet and COCO scores with just a few million parameters. If you care about on-device vision, these architectures offer stronger accuracy–latency tradeoffs than classic MobileNet-style baselines.

Sudhakar Sah, Ravish Kumar

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Builds a plug-and-play framework for 3D medical imaging that unifies detection and risk prediction. Targets hospital workflows, not just leaderboard benchmarks.

Daniel C. MacRae, Luuk van der Hoek

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Probes frozen vision backbones for tasks like measuring lengths and angles. Tests how much real-world geometry is already baked into general-purpose models.

Yakov Pyotr Shkolnikov

Depixelization_poc

A proof-of-concept attack showing how pixelated screenshots can be reverse-engineered to recover underlying text using computer vision. A stark reminder that naive anonymization in UIs is often not privacy-safe. ([github.com](https://github.com/trending?since=daily))

3,659

jomjol/AI-on-the-edge-device

Firmware that reads analog meters and similar devices with a tiny on-device vision model. It’s a practical template for bringing AI to legacy hardware.

8,044

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

StereoSpace is a diffusion-based monocular-to-stereo system that learns geometric consistency purely from viewpoint conditioning, without explicitly predicting depth or doing warping. The authors also propose a strictly "geometry-free at test time" evaluation protocol and show their method produces sharper parallax and more comfortable stereo than existing depth- or warp-based pipelines.

Tjark Behrens, Anton Obukhov

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

MiSI-Bench introduces "Microscopic Spatial Intelligence"—the ability to reason about invisible molecular 3D structures—and builds a massive VLM benchmark spanning 163k QA pairs over 4k molecules. Current VLMs lag well behind humans on many tasks, but a tuned 7B model can exceed human performance on some spatial transformations, highlighting both the promise and the need for domain knowledge in scientific AGI.

Zongzhao Li, Xiangzhe Kong

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

VQRAE introduces a unified visual tokenizer that can simultaneously support high-level multimodal understanding and discrete-token image generation. Building on a pretrained vision encoder and a high-dimensional semantic VQ codebook, it yields continuous semantic features for reasoning and discrete tokens for reconstruction, showing that quantizing semantic encoders with large codebooks can preserve both meaning and detail.

Sinan Du, Jiahao Guo

geoai

geoai is a Python package from the opengeos ecosystem that integrates deep-learning frameworks (PyTorch, Transformers, segmentation models) with geospatial tooling to handle everything from remote-sensing data download and tiling to training, inference, and interactive map visualization. It’s aimed at practitioners who want a higher-level, batteries-included stack for tasks like land-cover classification, building footprint extraction, and change detection, without reinventing all the GIS + ML plumbing. ([github.com](https://github.com/opengeos/geoai?utm_source=openai))

2,116

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Omni-Attribute is an open-vocabulary attribute encoder that learns to isolate specific visual factors—like style, lighting, or expression—rather than entangling everything into a single holistic embedding. Using curated positive/negative pairs and a dual generative/contrastive objective, it produces attribute-specific embeddings that are better for retrieval, personalization, and compositional image generation.

Tsai-Shien Chen, Aliaksandr Siarohin

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

DuetSVG proposes a unified multimodal model that generates both raster images and SVG code jointly, using the image stream to guide SVG token decoding. By letting the model "see" what it’s drawing during generation, it produces vector graphics that are more visually faithful, semantically correct, and syntactically clean than text-only SVG generators.

Peiying Zhang, Nanxuan Zhao

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

The authors augment multimodal LLMs with a "Video Toolkit" and a STAR (Spatiotemporal Reasoning) framework that orchestrates calls to temporal and spatial tools for video question answering. Instead of treating the video as a black-box embedding, the model actively localizes key regions over time using tools, yielding sizable gains on VideoMME and LongVideoBench when wrapped around GPT-4o.

Sunqi Fan, Jiashuo Cui

ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

ReViSE defines a new Reason-Informed Video Editing task and benchmark, then introduces a unified video model that edits while continuously self-evaluating its own reasoning. A built-in VLM judges whether the edited video logically satisfies the instruction, providing self-reflective feedback that tightens the link between "understanding" and actual visual edits.

Xinyu Liu, Hangjie Yuan