Hardware

Research papers, repositories, and articles about hardware

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Argues today’s popular 4‑bit number format systematically underestimates values and destabilizes large-model training. Proposes a uniform 4‑bit recipe that stays closer to BF16 while saving memory and compute.

Qian Zhao, Kunlong Chen

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Diagnoses a subtle numeric bias in current 4‑bit training formats and proposes a uniform alternative that stays stable on models up to 124B parameters. Hardware and training teams should read closely.

Qian Zhao, Kunlong Chen

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

AgentPerf, the first benchmark for agent workloads, shows NVIDIA’s Blackwell platform running many more agents per megawatt than older GPUs. It frames agent performance as an energy and density game, not just raw tokens per second.

NVIDIA Blog

Mugi: Value Level Parallelism For Efficient LLMs

Mugi extends hardware techniques based on value-level parallelism to full LLM workloads. It speeds up core math operations and softmax, yielding over 2x throughput and substantial energy savings on custom chips.

Daniel Price, Prabhu Vellaisamy