Latency

Research papers, repositories, and articles about latency

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW decides dynamically, per request, which layers' KV cache stays on the GPU instead of relying on static offloading. It cuts tail latency and raises throughput for long-context chat and agent workloads without requiring additional GPUs.

Xinyue Ma, Heelim Hong

Gemma 4 on Edge: Running Multimodal AI on Mobile, Raspberry Pi & IoT Devices

Walks through running Gemma 4’s edge models on phones, Raspberry Pis, and Jetson boards. Covers quantization, latency numbers, and when on-device inference makes more sense than the cloud.

Lushbinary Blog

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen

google-ai-edge/LiteRT-LM

Runtime for executing language models efficiently on edge hardware. Targets phones, Jetson-class boards, and other low-power devices.

1,547