HuggingFace Paper

Stronger Normalization-Free Transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu +2

December 11, 2025

Summary

Hugging Face highlights this work for showing that a carefully designed point-wise activation (Derf) can fully replace normalization layers in Transformers and still improve performance across multiple domains. For practitioners, it points toward simpler, potentially faster architectures that no longer compute the per-token activation statistics (a mean and variance reduction over the hidden dimension) that layer norm requires. ([huggingface.co](https://huggingface.co/papers/2512.10938))
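
The summary does not state Derf's formula, so the following is only a rough sketch of what "point-wise" means here. It assumes a form of erf(alpha * x + shift) followed by a per-channel scale and offset. The function names, parameter names, and values are hypothetical and are not taken from the authors' code. A standard layer norm is included for contrast, since it must first reduce over the whole feature vector.

```python
import math

def derf(x, alpha, shift, gamma, beta):
    # Assumed form (not confirmed by the summary):
    #   gamma * erf(alpha * x + shift) + beta
    # Each output depends only on its own input element.
    return [g * math.erf(alpha * v + shift) + b
            for v, g, b in zip(x, gamma, beta)]

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard layer norm, shown for comparison: every output depends on
    # the mean and variance of the entire feature vector.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

# One token's hidden vector with a large outlier (illustrative values).
features = [0.5, -1.0, 3.0, 40.0]
ones, zeros = [1.0] * 4, [0.0] * 4

derf_out = derf(features, alpha=0.5, shift=0.0, gamma=ones, beta=zeros)
ln_out = layer_norm(features, gamma=ones, beta=zeros)
```

With unit scale and zero offset, the erf-based outputs stay within [-1, 1] no matter how large the outlier is, and changing one element of `features` leaves the other outputs untouched. The layer norm outputs, by contrast, all shift whenever any single element changes.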

Topics

transformers, training, architecture


Related Content

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

T-pro 2.0 is an open-weight Russian large language model focused on hybrid reasoning: it can answer directly or emit explicit reasoning traces, and it’s optimized for low-latency inference via speculative decoding. Alongside the model, the authors release a Russian instruction corpus, a math benchmark, and an EAGLE-based inference stack, making it a practical foundation for Russian-language reasoning applications.