ArXiv Paper

Stronger Normalization-Free Transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu +2
December 11, 2025

Summary

Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))
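
To make "a point-wise activation that replaces normalization layers" concrete, here is a minimal sketch comparing LayerNorm with two point-wise stand-ins. The Dynamic Tanh form, gamma * tanh(alpha * x) + beta, is the published one. The summary above does not state Derf's formula, so `derf_like` is an assumption: an erf-based analogue with a hypothetical extra `shift` term. In a real model these parameters would be learned; here they are fixed scalars chosen only for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Reference LayerNorm over one token vector (affine terms omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, applied element-wise."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]


def derf_like(x, alpha=0.5, shift=0.0, gamma=1.0, beta=0.0):
    """ASSUMED Derf-style form: gamma * erf(alpha * x + shift) + beta.

    The exact parameterization is not given in the summary; this is a guess
    modeled on Dynamic Tanh with erf swapped in for tanh.
    """
    return [gamma * math.erf(alpha * v + shift) + beta for v in x]


# One token vector with two outlier activations.
token = [-8.0, -1.0, 0.0, 1.0, 8.0]

print("LayerNorm:", layer_norm(token))
print("DyT:      ", dyt(token))
print("Derf-like:", derf_like(token))
```

Unlike LayerNorm, the two point-wise functions need no mean or variance computed across the vector: each element is squashed independently into a bounded range, which is what lets them stand in for a normalization layer.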
