Decoding
Research papers, repositories, and articles about decoding
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash pairs a small diffusion model, which drafts text in large blocks, with a large LLM that verifies each block. It is currently among the most-upvoted inference speedup methods on Hugging Face.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash uses a small diffusion model to draft whole blocks of tokens in parallel, then has a larger model verify them quickly. It preserves output quality while generating more than 6x faster than standard decoding on common LLMs.
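The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration of greedy block speculative decoding in general, not DFlash's actual implementation: the names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain functions over integer token ids, and the acceptance rule (keep the longest drafted prefix that matches the target's greedy choice) is an assumption. In a real system the target would score the whole block in one forward pass rather than one call per position.

```python
from typing import Callable, List

def greedy_decode(target_next: Callable[[List[int]], int],
                  prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next: Callable[[List[int]], int],
                       draft_block: Callable[[List[int], int], List[int]],
                       prompt: List[int], max_new_tokens: int,
                       block_size: int = 4) -> List[int]:
    """Draft a block, then keep the longest prefix the target agrees with."""
    tokens = list(prompt)
    final_len = len(prompt) + max_new_tokens
    while len(tokens) < final_len:
        draft = draft_block(tokens, block_size)  # whole block proposed at once
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction ends the block
                break
            accepted.append(tok)
        else:
            # Every drafted token matched: the target adds one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:final_len]

# Hypothetical toy models over integer token ids (prompt must be non-empty).
def toy_target(prefix: List[int]) -> int:
    return (3 * prefix[-1] + len(prefix)) % 17

def toy_draft(prefix: List[int], k: int) -> List[int]:
    """Mostly agrees with toy_target, but errs at every fifth position."""
    cur, block = list(prefix), []
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 5 == 0:
            tok = (tok + 1) % 17  # deliberate drafting mistake
        block.append(tok)
        cur.append(tok)
    return block
```

Under this acceptance rule every emitted token is the target's own greedy choice, so the output matches plain greedy decoding exactly; any speedup comes only from how many drafted tokens are accepted per verification step.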
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
ReFusion is a masked diffusion model for text that decodes in parallel over contiguous ‘slots’ instead of individual tokens. By combining diffusion-based planning with autoregressive infilling, it recovers much of the quality of strong autoregressive LLMs while massively speeding up generation and allowing KV-cache reuse. This is one of the more serious attempts to rethink LLM decoding beyond the usual left-to-right paradigm.
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
This work replaces fixed block sizes in diffusion-style language models with blocks that expand or shrink based on how hard the text is. It also adds a cache that reuses past computation, cutting compute while keeping or improving generation quality.
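One way to picture blocks that expand or shrink with difficulty is a simple confidence-driven rule. The sketch below is a hypothetical heuristic, not the scheduling algorithm from the DSB work: the function name `next_block_size`, the use of mean per-token confidence as a proxy for "how hard the text is", and all thresholds and size limits are assumptions chosen for illustration.

```python
from typing import Sequence

def next_block_size(current: int,
                    confidences: Sequence[float],
                    low: float = 0.5,
                    high: float = 0.9,
                    min_size: int = 2,
                    max_size: int = 32) -> int:
    """Pick the next block size from the last block's per-token confidences.

    Hypothetical rule: double the block when the model was confident (easy
    text), halve it when confidence was low (hard text), otherwise keep it.
    """
    if not confidences:
        return current  # nothing to judge by; leave the size unchanged
    mean_conf = sum(confidences) / len(confidences)
    if mean_conf >= high:
        proposed = current * 2
    elif mean_conf < low:
        proposed = current // 2
    else:
        proposed = current
    # Clamp to the allowed range.
    return max(min_size, min(max_size, proposed))
```

For example, under these assumed defaults a block of 8 tokens decoded with high confidence would be followed by a block of 16, while a low-confidence block would be followed by one of 4.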