ArXiv Paper

Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing

Shubham Aggarwal, June 29, 2026

Summary

The Prism Transformer keeps the total parameter count fixed but redistributes attention heads across depth: a few wide heads in the early layers, many narrow heads in the late layers. According to the paper, this simple change consistently beats standard head layouts on language benchmarks. If you're running expensive training runs, it is a cheap architectural tweak to test. ([arxiv.org](https://arxiv.org/list/cs.LG/new))
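To make the idea concrete, here is a minimal sketch of one possible progressive head schedule. It is not the paper's method: the doubling rule, the function names `head_schedule` and `attention_params`, and all parameter values are assumptions chosen for illustration. It relies only on the standard multi-head attention property that the Q, K, V, and output projections are each `d_model x d_model`, so changing the head count changes the per-head width but not the parameter count.

```python
def head_schedule(n_layers, d_model, min_heads, max_heads):
    """Return one (n_heads, head_dim) pair per layer.

    Hypothetical schedule: head counts double from min_heads to max_heads,
    and layers are split evenly across those stages, so early layers get
    few wide heads and late layers get many narrow heads.
    """
    if min_heads < 1 or min_heads > max_heads:
        raise ValueError("need 1 <= min_heads <= max_heads")

    # Candidate head counts: min_heads, 2*min_heads, ... up to max_heads.
    options = []
    h = min_heads
    while h <= max_heads:
        if d_model % h != 0:
            raise ValueError(f"d_model={d_model} is not divisible by {h} heads")
        options.append(h)
        h *= 2

    schedule = []
    for layer in range(n_layers):
        # Map the layer index onto a stage; integer division keeps it in range.
        stage = layer * len(options) // n_layers
        n_heads = options[stage]
        schedule.append((n_heads, d_model // n_heads))
    return schedule


def attention_params(d_model):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    return 4 * d_model * d_model


if __name__ == "__main__":
    for layer, (n_heads, head_dim) in enumerate(head_schedule(8, 512, 2, 16)):
        print(f"layer {layer}: {n_heads} heads x {head_dim} dims, "
              f"{attention_params(512)} attention weights")
```

In every layer `n_heads * head_dim` equals `d_model`, which is why the attention weight count printed for each layer is identical. A real experiment would swap this schedule into an existing model config and compare against a uniform head count at the same `d_model`.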
