Architecture

Research papers, repositories, and articles about architecture

Stronger Normalization-Free Transformers

Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))

Mingzhi Chen, Taiming Lu
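
To make the idea of swapping a normalization layer for a point-wise function concrete, here is a minimal sketch in plain Python. It assumes Derf takes the form `gamma * erf(alpha * x + s) + beta`; that formula is not given in this listing, so treat it as an assumption to check against the paper. The parameter names and default values (`alpha`, `s`, `gamma`, `beta`) are illustrative only, and in a real model they would be learned rather than fixed.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over one token's features (no affine parameters).

    Needs the mean and variance of the whole vector, so every output
    depends on every input feature.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def derf(x, alpha=0.5, s=0.0, gamma=1.0, beta=0.0):
    """Assumed Derf form: gamma * erf(alpha * v + s) + beta, per element.

    No statistics are computed; each output depends only on its own input.
    All parameter defaults here are hypothetical placeholders.
    """
    return [gamma * math.erf(alpha * v + s) + beta for v in x]


# One token with an outlier feature.
token = [0.5, -1.0, 2.0, 30.0]
print("LayerNorm:", layer_norm(token))
print("Derf:     ", derf(token))
```

With the default `gamma=1.0` and `beta=0.0`, the Derf outputs stay within [-1, 1] because `erf` saturates, so the outlier is squashed without computing any mean or variance. LayerNorm, by contrast, rescales every feature using statistics that the outlier itself influences.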

Stronger Normalization-Free Transformers

Hugging Face highlights this work for showing that a carefully designed point-wise activation (Derf) can fully replace normalization layers in Transformers and still improve performance across multiple domains. For practitioners, it points toward simpler, potentially faster architectures that avoid the per-token mean and variance reductions that LayerNorm requires. ([huggingface.co](https://huggingface.co/papers/2512.10938))

Mingzhi Chen, Taiming Lu