Systems

Research papers, repositories, and articles about systems

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

FlashPrefill discovers sparse attention patterns during the prefill phase and drops low-importance connections on the fly. It reports large prefill speedups on 256K-token contexts while matching baseline accuracy.

Qihang Fan, Huaibo Huang

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.

Jian Yang, Wei Zhang

colbymchenry/codegraph

Pre-indexed code knowledge graph tuned for AI coding tools such as Claude Code and Cursor. It cuts token usage and external tool calls by precomputing code structure. If you’re scaling code assistants, this is a serious pattern for making them cheaper and faster.

35,280

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW decides per request which layers' KV cache stays on the GPU, instead of relying on static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.

Xinyue Ma, Heelim Hong

MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs

MoEBlaze redesigns mixture-of-experts training to cut activation memory and data movement on GPUs. It claims speedups of over 4× and 50% memory savings versus existing frameworks, which directly matters for anyone pushing bigger sparse models.

Jiyuan Zhang, Yining Liu

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA distills dozens of separate image-style LoRAs into a single adapter without the usual "style bleeding". You get 50 visual effects with one small file. If you host many custom styles, this is a direct way to slash storage and server overhead.

Fangtai Wu, Hailong Guo

revfactory/harness

Meta-agent that designs domain-specific agent teams, defines their roles, and generates their skills. It treats agent setups as code, not ad-hoc prompts. If you’re scaling from one agent to many, this shows how to standardize the "harness" layer.

4,549

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.

Tao Jin, Phuong Minh Nguyen

Kernel Foundry: A Diagnosis-Driven Evolutionary Kernel Optimizer with Multi-Experts

Kernel Foundry evolves GPU kernels using feedback from correctness checks and performance diagnostics instead of blind search. It reaches 100% correctness on a benchmark and beats hand-tuned baselines. If you’re fighting GPU bottlenecks, this hints that AI-guided kernel search is starting to work in practice.

Zixuan Huang, Da Chen

D4Vinci/Scrapling

Adaptive web-scraping framework that scales from one-off fetches to large crawls. It’s designed to play nicely with AI agents that need to browse and extract data. If your agents keep breaking on websites, Scrapling is worth testing as the web layer.

56,576

rohitg00/ai-engineering-from-scratch

Opinionated curriculum and code for going from zero to shipping AI products. Covers data, evaluation, serving, and real-world tradeoffs, not just models. If you’re training new hires or upskilling yourself, this is a solid backbone syllabus.

25,717