Transformers
Research papers, repositories, and articles about transformers
Stronger Normalization-Free Transformers
Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))
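To make the contrast concrete, here is a minimal sketch of why a point-wise function is structurally simpler than a normalization layer. LayerNorm needs the mean and variance of the whole feature vector, while a point-wise layer transforms each element independently. The `dynamic_tanh` form follows the published Dynamic Tanh definition (a scaled tanh with learnable scale and shift); the `pointwise_erf` function is only an assumed erf-based form used for illustration, and its parameter names (`alpha`, `shift`, `gamma`, `beta`) are hypothetical, so consult the paper for the exact Derf definition.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over one feature vector (no affine terms).

    Needs two reductions over the whole vector: mean and variance.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dynamic_tanh(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: gamma * tanh(alpha * v) + beta, applied per element."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]

def pointwise_erf(x, alpha=0.5, shift=0.0, gamma=1.0, beta=0.0):
    """ASSUMED erf-based point-wise layer (illustrative, not the paper's
    confirmed formula): gamma * erf(alpha * v + shift) + beta."""
    return [gamma * math.erf(alpha * v + shift) + beta for v in x]

features = [-3.0, -0.5, 0.0, 0.5, 3.0]
print("layer_norm   :", layer_norm(features))
print("dynamic_tanh :", dynamic_tanh(features))
print("pointwise_erf:", pointwise_erf(features))
```

Because each output of a point-wise layer depends only on its own input, changing one feature leaves the others untouched, whereas in LayerNorm a single outlier shifts every output in the vector.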
Stronger Normalization-Free Transformers
Hugging Face highlights this work for showing that a carefully designed point-wise activation (Derf) can fully replace normalization layers in Transformers and still improve performance across multiple domains. For practitioners, it points toward simpler, potentially faster architectures that avoid the per-token mean and variance reductions that layers such as LayerNorm require. ([huggingface.co](https://huggingface.co/papers/2512.10938))
Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing
Prism Transformer keeps the total parameter count fixed but redistributes attention heads across depth: a few wide heads in early layers, many narrow heads in late layers. The authors report that this simple change consistently beats standard uniform head layouts on language benchmarks. If you're running expensive training runs, this is a cheap architectural tweak to test. ([arxiv.org](https://arxiv.org/list/cs.LG/new))
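The idea of a progressive head schedule can be sketched in a few lines. The example below is an illustration under stated assumptions, not the paper's actual schedule: the function names `head_schedule` and `attention_params`, the stepwise ramp, and the sizes (`d_model=512`, 8 layers, 2 to 16 heads) are all hypothetical. It relies only on the standard multi-head attention constraint that `num_heads * head_dim == d_model`, which is why the projection parameter count stays the same in every layer.

```python
def head_schedule(d_model, num_layers, min_heads, max_heads):
    """Return one (num_heads, head_dim) pair per layer.

    Early layers get few wide heads, late layers get many narrow heads.
    Only head counts that divide d_model evenly are considered, so that
    num_heads * head_dim == d_model holds in every layer.
    """
    candidates = [h for h in range(min_heads, max_heads + 1) if d_model % h == 0]
    schedule = []
    for layer in range(num_layers):
        # Step through the candidate head counts as depth increases.
        idx = (layer * len(candidates)) // num_layers
        heads = candidates[idx]
        schedule.append((heads, d_model // heads))
    return schedule

def attention_params(d_model, num_heads, head_dim):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    inner = num_heads * head_dim
    return 3 * d_model * inner + inner * d_model

# Hypothetical sizes chosen for illustration only.
schedule = head_schedule(d_model=512, num_layers=8, min_heads=2, max_heads=16)
for layer, (heads, dim) in enumerate(schedule):
    params = attention_params(512, heads, dim)
    print(f"layer {layer}: {heads:2d} heads x {dim:3d} dims, {params} params")
```

With these sizes, the schedule moves from 2 heads of width 256 to 16 heads of width 32, and every layer has the same number of attention projection weights, so the comparison against a uniform layout stays parameter-matched.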
Bi-Orthogonal Factor Decomposition for Vision Transformers
The authors dissect attention in vision transformers into content and position factors using ANOVA and SVD. They show heads specialize into different interaction types and explain why self-supervised models like DINOv2 use attention differently from supervised ones.