Latency
Research papers, repositories, and articles about latency
ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
ORBITFLOW decides per request which KV cache layers stay on the GPU, rather than relying on static offloading. It cuts tail latency and raises throughput for long-context chat and agent workloads without requiring additional GPUs.
Xinyue Ma, Heelim Hong
Gemma 4 on Edge: Running Multimodal AI on Mobile, Raspberry Pi & IoT Devices
Walks through running Gemma 4’s edge models on phones, Raspberry Pis, and Jetson boards. Covers quantization, latency numbers, and when to run on-device instead of in the cloud.
Lushbinary Blog
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Proposes a way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.
Tao Jin, Phuong Minh Nguyen
google-ai-edge/LiteRT-LM
Runtime for running language models efficiently on edge hardware. Targets phones, Jetson-class boards, and other low-power devices.
1,547