
Architecture

Research papers, repositories, and articles about architecture

Stronger Normalization-Free Transformers

Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))

Mingzhi Chen, Taiming Lu
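
To make "point-wise activation in place of normalization" concrete, here is a minimal sketch. It assumes Derf has the form `gamma * erf(alpha * x + shift) + beta`, by analogy with Dynamic Tanh; the parameter names and default values are illustrative assumptions rather than the paper's settings. A plain LayerNorm is included only for contrast: it needs a mean and variance over the whole feature vector, while the point-wise map touches each element independently.

```python
import math


def derf(x, alpha=0.5, shift=0.0, gamma=1.0, beta=0.0):
    """Assumed form: gamma * erf(alpha * v + shift) + beta, applied per element.

    In a real model alpha, shift, gamma and beta would be learned; here they
    are fixed illustrative constants.
    """
    return [gamma * math.erf(alpha * v + shift) + beta for v in x]


def layer_norm(x, eps=1e-5):
    """Reference LayerNorm without affine terms: needs statistics over all of x."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


activations = [-8.0, -1.0, 0.0, 1.0, 8.0]
squashed = derf(activations)        # each output depends on one input only
normalized = layer_norm(activations)  # each output depends on every input
```

With these defaults the outputs stay inside (-1, 1) and large activations saturate smoothly, which is the bounded, normalization-like behavior the entry describes.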

Stronger Normalization-Free Transformers

Hugging Face highlights this work for showing that a carefully designed point-wise activation (Derf) can fully replace normalization layers in Transformers and still improve performance across multiple domains. For practitioners, it points toward simpler, potentially faster architectures that avoid the per-token mean and variance reductions LayerNorm requires. ([huggingface.co](https://huggingface.co/papers/2512.10938))

Mingzhi Chen, Taiming Lu

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

Fast-weight Product Key Memory

Fast-weight Product Key Memory adds a dynamic, scratchpad-like memory alongside the usual attention in language models. It aims to keep the efficiency of linear attention while recovering much of softmax attention’s ability to remember rare, long-range details.

Tianyu Zhao, Llion Jones
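
The "scratchpad" idea can be sketched with a toy product-key memory. The lookup follows the general product-key technique: the query is split in two halves, each half is scored against its own small set of sub-keys, and a slot is identified by a pair of sub-keys. The write rule (`pkm_write`, with a hypothetical step size `lr`) is an assumption added for illustration and is not the paper's update.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def pkm_lookup(query, keys_a, keys_b, k):
    """Return up to k (score, slot) pairs; slot = i * len(keys_b) + j."""
    half = len(query) // 2
    scores_a = [dot(query[:half], key) for key in keys_a]
    scores_b = [dot(query[half:], key) for key in keys_b]
    # Only k x k candidate pairs are scored, not every slot in the memory.
    candidates = [
        (scores_a[i] + scores_b[j], i * len(keys_b) + j)
        for i in top_k(scores_a, k)
        for j in top_k(scores_b, k)
    ]
    candidates.sort(reverse=True)
    return candidates[:k]


def pkm_read(query, keys_a, keys_b, values, k):
    """Softmax-weighted sum of the selected value slots."""
    hits = pkm_lookup(query, keys_a, keys_b, k)
    peak = max(score for score, _ in hits)
    weights = [math.exp(score - peak) for score, _ in hits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for weight, (_, slot) in zip(weights, hits):
        for d, value in enumerate(values[slot]):
            out[d] += weight / total * value
    return out


def pkm_write(query, keys_a, keys_b, values, target, k, lr):
    """Hypothetical fast-weight step: move selected slots toward target."""
    for _, slot in pkm_lookup(query, keys_a, keys_b, k):
        values[slot] = [v + lr * (t - v) for v, t in zip(values[slot], target)]


keys_a = [[1.0, 0.0], [0.0, 1.0]]
keys_b = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.0] for _ in range(4)]  # 2 x 2 sub-keys address 4 slots

rare_query = [1.0, 0.0, 0.0, 1.0]
pkm_write(rare_query, keys_a, keys_b, values, target=[2.0, -1.0], k=1, lr=1.0)
recalled = pkm_read(rare_query, keys_a, keys_b, values, k=1)
unrelated = pkm_read([0.0, 1.0, 1.0, 0.0], keys_a, keys_b, values, k=1)
```

Because each write touches only the few slots the query selects, a rare detail stored under one key pair is not overwritten by unrelated queries, which is the property the entry attributes to the memory.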

Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing

Keeps total parameters fixed but redistributes attention heads across depth: a few wide heads in early layers, many narrow heads in late layers. The authors report that this change consistently beats standard uniform layouts on language benchmarks. For anyone running expensive training runs, it is a cheap architectural tweak to test. ([arxiv.org](https://arxiv.org/list/cs.LG/new))

Shubham Aggarwal
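
A progressive head schedule of this kind can be sketched as a small helper. The doubling-by-stage rule and the function name `head_schedule` are hypothetical; the listing does not state the paper's exact schedule. The sketch relies on the fact that in standard multi-head attention the projection sizes depend only on the model width, so changing the head count while keeping `n_heads * head_dim` constant leaves the parameter count unchanged.

```python
def head_schedule(n_layers, d_model, first_heads, last_heads):
    """Return a (n_heads, head_dim) pair per layer.

    Head count doubles from first_heads to last_heads in equal-sized stages,
    so early layers get few wide heads and late layers many narrow ones.
    """
    stages = []
    heads = first_heads
    while heads <= last_heads:
        if d_model % heads != 0:
            raise ValueError(f"d_model={d_model} is not divisible by {heads} heads")
        stages.append(heads)
        heads *= 2

    schedule = []
    for layer in range(n_layers):
        stage = min(layer * len(stages) // n_layers, len(stages) - 1)
        n_heads = stages[stage]
        schedule.append((n_heads, d_model // n_heads))
    return schedule


schedule = head_schedule(n_layers=8, d_model=512, first_heads=2, last_heads=16)
for layer, (n_heads, head_dim) in enumerate(schedule):
    print(f"layer {layer}: {n_heads} heads x {head_dim} dims")
```

Every layer still satisfies `n_heads * head_dim == d_model`, so the schedule can be dropped into an existing configuration without changing the size of the query, key, value, or output projections.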

Deep Delta Learning

The authors replace standard residual skip connections with a learnable "Delta" operator that can flexibly distort the identity path. This lets deep nets control how much old information to erase versus new information to write, improving how they model complex dynamics while keeping training stable.

Yifan Zhang, Yifeng Liu
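
One way to picture "erase versus write" on the skip path is a delta-rule style rank-1 update. The form below is an assumption for illustration, not a formula quoted from the listing: the state `x` is changed only along a unit direction `k`, a gate `beta` sets how strongly the old component along `k` is erased, and `v` is the new value written there. In a trained network these quantities would be learned and input-dependent; here they are fixed by hand.

```python
import math


def delta_update(x, k, beta, v):
    """Assumed update: x + beta * k * (v - <k, x>), with k normalized.

    beta = 0 leaves x untouched (plain identity skip), beta = 1 replaces the
    component of x along k with v, and beta = 2 with v = 0 reflects x across
    the hyperplane orthogonal to k.
    """
    norm = math.sqrt(sum(c * c for c in k))
    unit = [c / norm for c in k]
    old = sum(a * b for a, b in zip(unit, x))  # what is currently stored along k
    return [xi + beta * ki * (v - old) for xi, ki in zip(x, unit)]


x = [3.0, 4.0]
k = [1.0, 0.0]

kept = delta_update(x, k, beta=0.0, v=5.0)       # identity path
rewritten = delta_update(x, k, beta=1.0, v=5.0)  # erase old, write new
reflected = delta_update(x, k, beta=2.0, v=0.0)  # flip the component along k
```

In every case the component of `x` orthogonal to `k` is untouched, which gives some intuition for how a layer could overwrite one piece of information while passing the rest through unchanged.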