Systems
Research papers, repositories, and articles about systems
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
FlashPrefill discovers sparse attention patterns during the prefill phase and drops low-importance connections on the fly. It reports large prefill speedups on 256K-token contexts while matching baseline accuracy.
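To make "dropping low-importance connections" concrete, here is a minimal sketch of threshold-based sparsification for a single attention row. It is not FlashPrefill's algorithm: the function names, the `ratio` parameter, and the rule "keep keys whose weight is at least `ratio` times the row maximum" are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_mask(scores, ratio=0.05):
    """Hypothetical rule: keep keys whose weight is >= ratio * row maximum."""
    weights = softmax(scores)
    cutoff = ratio * max(weights)
    return [w >= cutoff for w in weights]

def sparse_attention_row(scores, values, ratio=0.05):
    """Attend only over the kept keys, renormalizing their weights."""
    mask = threshold_mask(scores, ratio)
    kept_scores = [s for s, keep in zip(scores, mask) if keep]
    kept_values = [v for v, keep in zip(values, mask) if keep]
    weights = softmax(kept_scores)
    output = sum(w * v for w, v in zip(weights, kept_values))
    return output, mask

# Two strong keys and two negligible ones (scalar values for simplicity).
output, mask = sparse_attention_row([5.0, 4.8, -3.0, -4.0], [1.0, 2.0, 100.0, 100.0])
print(mask)  # [True, True, False, False]
```

The two negligible keys are skipped, so their values never enter the weighted sum. A real kernel would presumably make this decision per block of keys rather than per key so that whole tiles of work can be skipped.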
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.
colbymchenry/codegraph
Pre-indexed code knowledge graph tuned for AI coding tools such as Claude Code and Cursor. It cuts token usage and external tool calls by precomputing code structure. If you’re scaling code assistants, this is a serious pattern for making them cheaper and faster.
ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
ORBITFLOW dynamically decides which attention-cache layers stay on GPU per request instead of using static offloading. It cuts tail latency and raises throughput for long-context chat and agents without buying more GPUs.
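As a rough illustration of per-request placement (as opposed to a static offloading rule), the sketch below splits a fixed GPU memory budget across requests, serving the tightest latency target first. Everything here is hypothetical: the `Request` fields, the `plan_placement` function, and the greedy policy are assumptions for illustration, not ORBITFLOW's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    num_layers: int          # layers whose KV cache could live on GPU
    kv_bytes_per_layer: int  # KV cache size per layer for this request
    slo_ms: float            # hypothetical latency target

def plan_placement(requests, gpu_budget_bytes):
    """Greedy sketch: tightest SLO first, keep as many layers on GPU as fit."""
    plan = {}
    remaining = gpu_budget_bytes
    for req in sorted(requests, key=lambda r: r.slo_ms):
        fit = min(req.num_layers, remaining // req.kv_bytes_per_layer)
        plan[req.name] = {
            "gpu_layers": fit,
            "offloaded_layers": req.num_layers - fit,
        }
        remaining -= fit * req.kv_bytes_per_layer
    return plan

requests = [
    Request("chat", num_layers=4, kv_bytes_per_layer=100, slo_ms=50.0),
    Request("batch", num_layers=4, kv_bytes_per_layer=200, slo_ms=200.0),
]
print(plan_placement(requests, gpu_budget_bytes=700))
# {'chat': {'gpu_layers': 4, 'offloaded_layers': 0},
#  'batch': {'gpu_layers': 1, 'offloaded_layers': 3}}
```

Re-running the planner whenever requests arrive or finish is what would make the placement dynamic. A static scheme would instead fix the number of offloaded layers up front.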
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
MoEBlaze redesigns mixture-of-experts training to cut activation memory and data movement on GPUs. It claims over 4× speedups and 50% memory savings versus existing frameworks, which directly matters for anyone pushing bigger sparse models.
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
CollectionLoRA distills dozens of separate image-style LoRAs into a single adapter without the usual "style bleeding". You get 50 visual effects with one small file. If you host many custom styles, this is a direct way to slash storage and server overhead.
revfactory/harness
Meta-agent that designs domain-specific agent teams, defines their roles, and generates their skills. It treats agent setups as code, not ad-hoc prompts. If you’re scaling from one agent to many, this shows how to standardize the "harness" layer.
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Proposes a new way for models to draft and prune multiple token paths in parallel. Targets faster text generation without any additional training.
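For readers new to speculative decoding, the sketch below shows the generic greedy version of "draft several paths, keep what the target model agrees with". It is not Goose's tree construction: `toy_target`, `longest_accepted_path`, and `speculative_step` are hypothetical names, and the paths are checked one at a time here, whereas real systems typically verify a whole tree in one batched forward pass.

```python
def toy_target(context):
    """Stand-in for the target model's greedy next token (toy rule)."""
    return sum(context) % 5

def longest_accepted_path(prefix, candidate_paths, target_next):
    """Return the longest drafted prefix that matches greedy target decoding."""
    best = []
    for path in candidate_paths:
        accepted = []
        context = list(prefix)
        for token in path:
            if target_next(context) != token:
                break  # first mismatch prunes the rest of this path
            accepted.append(token)
            context.append(token)
        if len(accepted) > len(best):
            best = accepted
    return best

def speculative_step(prefix, candidate_paths, target_next):
    """Accepted draft tokens plus one token from the target itself."""
    accepted = longest_accepted_path(prefix, candidate_paths, target_next)
    bonus = target_next(list(prefix) + accepted)
    return accepted + [bonus]

drafts = [[3, 1, 0], [3, 4], [0, 1]]  # three drafted continuations
print(speculative_step([1, 2], drafts, toy_target))  # [3, 1, 2]
```

Because every accepted token equals what greedy decoding with the target would have produced, the output is unchanged and only the number of sequential target steps drops. No training is involved, since the check only calls the existing target model.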
Kernel Foundry: A Diagnosis-Driven Evolutionary Kernel Optimizer with Multi-Experts
Kernel Foundry evolves GPU kernels using feedback from correctness checks and performance diagnostics instead of blind search. It reaches 100% correctness on a benchmark and beats hand-tuned baselines. If you’re fighting GPU bottlenecks, this hints that AI-guided kernel search is starting to work in practice.
D4Vinci/Scrapling
Adaptive web-scraping framework that scales from one-off fetches to large crawls. It’s designed to play nicely with AI agents that need to browse and extract data. If your agents keep breaking on websites, Scrapling is worth testing as the web layer.
rohitg00/ai-engineering-from-scratch
Opinionated curriculum and code for going from zero to shipping AI products. Covers data, evaluation, serving, and real-world tradeoffs, not just models. If you’re training new hires or upskilling yourself, this is a solid backbone syllabus.