Theory
Research papers, repositories, and articles about theory
Showing 5 of 5 items
ZJU-LLMs/Foundations-of-LLMs
An open book and course materials on the foundations of large language models, covering theory, architectures, training, and deployment. With >14k stars, it’s quickly becoming a go‑to learning resource for people trying to move from ‘user’ to ‘builder’ of LLMs. If you want a structured, code-linked path into the guts of modern LMs, this is a strong candidate.
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Claims an exact, error-free formulation of linear attention derived from a continuous-time view of transformer dynamics. The authors argue they can match the behavior of standard softmax attention while enjoying linear-time complexity, avoiding the approximation errors that plague many fast-attention variants. If the theory and practice hold up, this could become a key building block for large-context models and resource-constrained deployments.
Deep Delta Learning
The authors replace standard residual skip connections with a learnable "Delta" operator that can flexibly distort the identity path. This lets deep nets control how much old information to erase versus new information to write, improving how they model complex dynamics while keeping training stable.
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
This paper dissects why "learning from verifiable rewards" can improve math reasoning even when rewards look noisy or misleading. It shows how clipping and reward noise reduce the model’s randomness in useful ways and offers principles for designing better reasoning-focused training runs. ([ar5iv.org](https://ar5iv.org/abs/2512.16912))
GeoDM: Geometry-aware Distribution Matching for Dataset Distillation
Proposes GeoDM, a dataset distillation framework that performs distribution matching in a product space of Euclidean, hyperbolic, and spherical manifolds, with learnable curvature and weights. This geometry-aware approach yields lower generalization error bounds and consistently outperforms prior distillation methods by better aligning synthetic and real-data manifolds. ([arxiv.org](https://arxiv.org/abs/2512.08317?utm_source=openai))