Serving
Research papers, repositories, and articles about serving
ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
ORBITFLOW decides per request which layers of the KV cache stay on the GPU, rather than relying on static offloading. It cuts tail latency and raises throughput for long-context chat and agent workloads without requiring additional GPUs.
Xinyue Ma, Heelim Hong
sgl-project/mini-sglang
A slimmed-down Python version of the SGLang runtime aimed at easier experimentation. It focuses on fast text generation pipelines for modern language models. ([github.com](https://github.com/trending))
1,895