Serving

Research papers, repositories, and articles about serving


ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW decides at runtime, per request, which layers' KV cache stays on the GPU, rather than relying on a static offloading policy. It cuts tail latency and raises throughput for long-context chat and agent workloads without requiring additional GPUs.

Xinyue Ma, Heelim Hong

sgl-project/mini-sglang

A slimmed-down Python version of the SGLang runtime, aimed at easier experimentation. It focuses on fast text generation for modern language models. ([github.com](https://github.com/trending))

1,895