Native KV Cache Offloading to Any Filesystem with llm-d

· 11 min read
Kfir Toledo
Research Staff Member, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Effi Ofer
Research Staff Member, IBM
Or Ozeri
Research Staff Member, IBM
Guy Margalit
Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet, in a distributed environment, cache hits do not occur across nodes because the KV cache is local to each vLLM instance. In addition, this local cache is limited in size, further constraining KV data reuse. This blog presents a new way to offload KV cache to storage, tackling both of these challenges: KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage, built on vLLM's native Offloading Connector. While the llm-d FS backend can speed up the serving of individual requests (improving TTFT), its main goal is to preserve stable throughput and low latency at scale as concurrency and context lengths grow. It does so by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.

While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, can run with llm-d and vLLM as the only dependency, and exhibits improved performance over state-of-the-art shared storage connectors.
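The core idea behind a shared-filesystem KV backend can be illustrated with a small sketch. This is our own simplified model, not the actual llm-d connector code: KV blocks are addressed by a cumulative hash of the token prefix they cover, so any replica that computes the same prefix resolves to the same path on shared storage and can reuse the block instead of recomputing it.

```python
import hashlib
import os

def block_paths(root: str, token_ids: list[int], block_size: int = 16) -> list[str]:
    """Map a token sequence to per-block file paths on shared storage.

    Each path is derived from the hash of the entire token prefix up to and
    including that block, so two prompts share paths exactly as far as their
    common prefix extends (illustrative sketch, not the real on-disk layout).
    """
    paths = []
    prefix_hash = hashlib.sha256()
    # Only full blocks are offloaded; the trailing partial block is skipped.
    for i in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        block = token_ids[i:i + block_size]
        prefix_hash.update(repr(block).encode())
        paths.append(os.path.join(root, prefix_hash.hexdigest()[:32]))
    return paths
```

Because the path is a function of the prefix alone, a cache lookup on any node is just a filesystem stat: if the file exists, some replica already produced that block.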

llm-d 0.5: Sustaining Performance at Scale

· 13 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our previous release (v0.4), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depended on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.

With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:

  1. Developer Experience and reproducibility: We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.
  2. Hierarchical KV Offloading: A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.
  3. Advanced Scheduling: Cache-aware routing now supports LoRA adapters and active-active high availability.
  4. Resilient Networking: A new transport backend (UCCL) improves stability in congested networks.
  5. Autoscaling Updates: We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.

llm-d 0.4: Achieve SOTA Performance Across Accelerators

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our 0.3 release we enabled wide expert parallelism for large mixture-of-expert models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.

This release brings the complement to expert parallelism throughput: improving end-to-end request latency of production serving. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support to Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the amount of time a model deployment spends queuing requests - lessening the operational toil of running multiple models over constrained accelerator capacity.

These OSS inference stack optimizations, surfaced through our well-lit paths, ensure you reach SOTA latency on frontier OSS models in real world scenarios.

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

· 10 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

In our 0.2 release, we introduced the first well-lit paths, tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on the mission: to provide a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.

This release delivers:

  • Expanded hardware support, now including Google TPU and Intel accelerators
  • TCP and RDMA over RoCE validated for disaggregation
  • A predicted-latency-based balancing preview that improves P90 latency by up to 3x in long-prefill workloads
  • Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU
  • The GA release of the Inference Gateway (IGW v1.0)

Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extracting more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

· 21 min read
Maroon Ayoub
Research Scientist & Architect, IBM
Danny Harnik
Senior Technical Staff Member, IBM
Tyler Smith
Member of Technical Staff, Red Hat
Kellen Swain
Software Engineer, Google
Xining Wang
Senior Technical Expert, Alibaba Cloud
Hang Yin
Senior R&D Engineer, Alibaba Cloud
Kay Yan
Principal Software Engineer, DaoCloud

The llm-d project provides a series of “well-lit paths” - tested, benchmarked solutions for deploying large language models in production. Our first path, Intelligent Inference Scheduling, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an approximate method for the latter, predicting cache locality based on request traffic.

This blog illuminates a more advanced and powerful path: precise prefix-cache aware scheduling.

We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates, unlocking a new level of performance and cost-efficiency in your distributed deployments.

Blog key takeaways
  • KV-cache hit rates directly impact your bottom line: With 10x cost differences between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver
  • This isn't theoretical: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels
  • vLLM's prefix caching breaks in distributed deployments: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation
  • Precise prefix-cache aware scheduling delivers order-of-magnitude gains: Our benchmarks show 57x faster response times and double the throughput on identical hardware
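The scoring idea behind precise prefix-cache aware scheduling can be sketched in a few lines. The names below are ours, not the llm-d scheduler API: each pod reports the set of KV-block hashes it currently holds, and the router scores pods by how many *leading* blocks of the incoming prompt they already cache, routing to the best match so related requests land where their prefix lives.

```python
import hashlib

BLOCK = 16  # tokens per KV block (illustrative value)

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """Cumulative hash per full block, so each hash identifies a whole prefix."""
    h, out = hashlib.sha256(), []
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        h.update(repr(token_ids[i:i + BLOCK]).encode())
        out.append(h.hexdigest())
    return out

def pick_pod(token_ids: list[int], pod_caches: dict[str, set]) -> str:
    """Route to the pod holding the longest contiguous cached prefix."""
    blocks = prefix_block_hashes(token_ids)
    def cached_prefix_len(pod: str) -> int:
        n = 0
        for b in blocks:          # stop at the first miss: only a contiguous
            if b not in pod_caches[pod]:  # prefix is reusable
                break
            n += 1
        return n
    return max(pod_caches, key=cached_prefix_len)
```

A real scheduler would combine this score with load signals (the "Intelligent Inference Scheduling" path's balancing), but the core win is the same: requests that share a prefix converge on the pod that already holds it.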

Intelligent Inference Scheduling with llm-d

· 10 min read
Nili Guy
R&D Manager, AI Infrastructure, IBM
Vita Bortnikov
IBM Fellow, IBM
Etai Lev Ran
Cloud Architect, IBM
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google

The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework - Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: intelligent inference scheduling. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.

Why Intelligent Inference Is Needed for LLM Inference

Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.
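A toy illustration of why round-robin struggles (our own made-up numbers, not an llm-d benchmark): when a few requests are far heavier than the rest, round-robin can pile the heavy ones onto the same replica, while even a simple load-aware assignment keeps the worst-case backlog far lower.

```python
def max_backlog(costs: list[int], assign) -> int:
    """Assign each request to one of 4 replicas; return the heaviest backlog."""
    load = [0, 0, 0, 0]
    for i, cost in enumerate(costs):
        load[assign(i, load)] += cost
    return max(load)

# A few long-prompt requests (cost 100) interleaved with short ones (cost 1).
costs = [100, 1, 1, 1] * 4

# Round-robin sends every heavy request to replica 0 in this pattern.
rr = max_backlog(costs, lambda i, load: i % 4)

# Least-loaded assignment spreads the heavy requests across replicas.
least = max_backlog(costs, lambda i, load: load.index(min(load)))
```

Here round-robin leaves one replica with roughly 4x the work of a load-aware policy; LLM-aware scheduling generalizes this by using richer signals (queue depth, KV-cache state) than raw cost.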

Intelligent inference scheduling diagram

llm-d 0.2: Our first well-lit paths (mind the tree roots!)

· 11 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

Our 0.2 release delivers progress against our three well-lit paths to accelerate deploying large scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Expert models like DeepSeek-R1.

We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.

llm-d Community Update - June 2025

· 4 min read
Pete Cheslock
AI Community Architect, Red Hat

Hey everyone! We've been making great progress with the llm-d project, and I wanted to share some important updates and opportunities to get involved.

Help Shape the Future of the llm-d Project

To guide the future development of the llm-d project, we need to understand the real-world challenges, configurations, and performance needs of our community. We've created a short survey to gather insight into how you serve Large Language Models, from the hardware you use to the features you need most.

This anonymous, vendor-agnostic survey will take approximately 5 minutes to complete. Your input will directly influence the project's roadmap and priorities. The aggregated results will be shared with the llm-d-contributors mailing list to benefit the entire community.

Your Input Will Define Our Roadmap

We've created an llm-d Community Roadmap Survey to gather information about your LLM workloads. We are looking to learn more about:

  • Your Serving Environment: This includes the hardware you use now and anticipate using in a year (like NVIDIA GPUs, AMD GPUs, or CPUs), and whether you run on-premise, in the cloud, or on edge devices.
  • Your Model Strategy: Whether you serve a few large models or many smaller ones, which model families (like Llama or Mistral) are most common, and how you utilize techniques like LoRA adapters.
  • Your Performance Requirements: Your real-world SLOs for latency and throughput and the biggest LLM serving challenges you face—from cost optimization to operational ease of use.
  • Your Future Needs: What single new feature you would prioritize for an LLM Model-as-a-Service to help guide our innovation.

Take the 5-Minute Survey

Your participation is invaluable. Please take a few minutes to complete the survey. We encourage you to share it with other users or proxy their needs in your response to ensure our direction reflects the community's diverse requirements.

Announcing the llm-d community!

· 12 min read
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google
Carlos Costa
Distinguished Engineer, IBM

llm-d is a Kubernetes-native, high-performance distributed LLM inference framework - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).