Towards Scalable Pre-training of Visual Tokenizers for Generation
Summary
Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment here is quite timely.
Related Content
stable-diffusion-webui
stable-diffusion-webui by AUTOMATIC1111 is the de facto standard local web interface for Stable Diffusion, providing a massive feature set—txt2img, img2img, inpainting/outpainting, upscaling, LoRA/embeddings support, training utilities, and a huge extension ecosystem—on top of consumer GPUs. If you’re doing any kind of image generation or fine-tuning with Stable Diffusion in a local or lab environment, this is usually the first tool people reach for and the one most community workflows target. ([github.com](https://github.com/AUTOMATIC1111/stable-diffusion-webui?utm_source=openai))
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.
SynthID Detector: Identify content made with Google's AI tools
Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-&-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))
Stronger Normalization-Free Transformers
Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))