Hardware
Research papers, repositories, and articles about hardware
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Argues today’s popular 4‑bit number format systematically underestimates values and destabilizes large-model training. Proposes a uniform 4‑bit recipe that stays closer to BF16 while saving memory and compute.
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Diagnoses a subtle numeric bias in current 4‑bit training formats and proposes a uniform alternative that stays stable on models up to 124B parameters. Hardware and training teams should read closely.
NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark
AgentPerf, the first benchmark for agent workloads, shows NVIDIA’s Blackwell platform running many more agents per megawatt than older GPUs. It frames agent performance as an energy and density game, not just raw tokens per second.
Mugi: Value Level Parallelism For Efficient LLMs
Mugi extends value-level parallelism, a hardware technique, to full LLM workloads. It accelerates core math operations and softmax, delivering over 2x throughput and large energy savings on custom chips.