JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetSpec adds a draft head that proposes a large token tree in a single forward pass while keeping outputs consistent with the base model. On Qwen3 models it reaches speedups of up to ~9.6x on math tasks without a significant loss in quality, and it integrates with vLLM. If you serve heavy inference workloads, this is a must-read for cutting serving costs. ([huggingface.co](https://huggingface.co/papers/2606.18394))
Lanxiang Hu, Zhaoxiang Feng
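
To make "token trees" concrete, here is a minimal sketch of how a drafted tree can be checked against a base model under greedy decoding. It is a generic illustration of tree-based speculative decoding, not JetSpec's draft head or its vLLM integration. The names `verify_tree`, `target_next`, and `toy_target` are hypothetical, and the base model is replaced by a toy function. Real systems score every tree node in one batched forward pass; this sketch calls the model once per step for readability.

```python
from typing import Callable, Dict, List, Sequence

# A draft tree as nested dicts: token -> subtree of drafted continuations.
Tree = Dict[int, "Tree"]


def verify_tree(
    prefix: Sequence[int],
    tree: Tree,
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Walk the draft tree, keeping tokens the base model agrees with.

    At each step the base model's greedy token is emitted. If that token
    was drafted at the current node, we descend and keep going; otherwise
    it acts as the correction token and the walk stops. The result is
    identical to plain greedy decoding with the base model.
    """
    accepted: List[int] = []
    node = tree
    while True:
        want = target_next(list(prefix) + accepted)  # base model's choice
        accepted.append(want)
        if want not in node:  # draft missed (or leaf reached): stop here
            break
        node = node[want]
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the base model: next = last + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


# Drafted tree after the prefix [1]: branches 2 -> {3 -> 9, 5} and 7.
draft = {2: {3: {9: {}}, 5: {}}, 7: {}}
print(verify_tree([1], draft, toy_target))
```

In this toy run, the drafted path 2, 3 is accepted and the base model then supplies 4 in place of the drafted 9, so three tokens are produced from one verification step. A wider or deeper tree raises the chance that some branch matches, which is the lever that tree drafting pulls.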