ArXiv Paper

Deep Delta Learning

Yifan Zhang, Yifeng Liu, Mengdi Wang +1January 1, 2026

Summary

The authors replace standard residual skip connections with a learnable "Delta" operator that can flexibly distort the identity path. This lets deep nets control how much old information to erase versus new information to write, improving how they model complex dynamics while keeping training stable.

Topics

architecture training optimization theory

View Original View PDF

Related Content

Kling-Omni Technical Report

Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.

Stronger Normalization-Free Transformers

Introduces Derf, a simple point-wise activation that replaces normalization layers like LayerNorm and RMSNorm while improving generalization across vision, speech, DNA sequence modeling, and GPT-style language models. The authors systematically study properties of point-wise functions, run a large-scale search, and show Derf outperforms prior normalization-free approaches (e.g., Dynamic Tanh) with similar or better stability. ([arxiv.org](https://arxiv.org/abs/2512.10938))

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.