DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
How DeepSeek uses idle decode-side NICs to double KV-Cache loading throughput in prefill-decode disaggregated serving.
Terminology
| Term | Full Form | What It Is |
|---|---|---|
| PE / DE | Prefill Engine / Decode Engine | PE processes the full prompt in parallel (compute-heavy); DE generates tokens autoregressively (memory-heavy). |
| PD | Prefill-Decode | Disaggregated serving: prefill and decode on separate GPU pools. |
| KV-Cache | Key-Value Cache | Cached attention tensors from previous tokens, reused across turns. |
| NIC / CNIC | Network Interface Card / Compute NIC | NIC connects a server to a network. CNIC is the high-bandwidth NIC for inter-GPU collectives (AllToAll, ReduceScatter). |
| RDMA | Remote Direct Memory Access | One machine reads/writes another's memory directly, bypassing the CPU. |
| HBM / DRAM | High Bandwidth Memory / Dynamic RAM | HBM is GPU on-chip memory; DRAM is CPU-side host memory used as a staging buffer. |
| PCIe | Peripheral Component Interconnect Express | High-speed bus connecting GPUs, NICs, and other devices in a server. |
| 3FS | Fire-Flyer File System | DeepSeek's distributed filesystem for persistent KV-Cache storage. |
| QoS / VL | Quality of Service / Virtual Lane | QoS prioritizes traffic. VLs are hardware channels in an InfiniBand link for independent flow control. |
| RoCE | RDMA over Converged Ethernet | RDMA on standard Ethernet (alternative to InfiniBand). |
| H2D / D2H | Host-to-Device / Device-to-Host | Memory transfers between CPU DRAM and GPU HBM. |
| JCT | Job Completion Time | Wall-clock time from request submission to full response. |
| TTFT / TTST | Time to First Token / Time to Second Token | TTFT: latency before first output. TTST: proxy for per-token decode latency. |
| FLOPS | Floating Point Ops Per Second | GPU compute throughput. PFLOP = 10^15 FLOP. |
1. The Problem: Storage NICs Can't Keep Up
Modern LLM serving uses prefill-decode (PD) disaggregation: prefill and decode run on separate GPU pools. For agentic workloads (multi-turn tool-using agents), the context grows across turns, and nearly all of it is reusable via KV-Cache. DeepSeek reports a 98.7% KV-Cache hit rate in their agentic RL training workloads.
This sounds like good news, but it creates a bottleneck. All that cached KV data lives on remote storage (like 3FS) and must be loaded into GPU memory before prefill can begin. The storage NICs on the prefill side become the chokepoint.
- GPU compute is not the bottleneck. For DeepSeek-V3.2 with a 98.7% hit rate, the cache-compute ratio is ~22 GB/PFLOP. The GPUs are waiting for data, not the other way around.
- Hardware trends make it worse. From Ampere to Blackwell, GPU FLOPS grew 28.8x but NIC bandwidth only grew 2.0x. The I/O-to-compute ratio has dropped 14.4x.
- Bandwidth is wasted. In conventional PD systems, only prefill engines read KV-Cache from storage. Decode engines' storage NICs sit idle. Half the cluster's storage bandwidth goes unused.
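The mismatch in the bullets above can be checked with quick arithmetic. The 22 GB/PFLOP, 28.8x/2.0x, and ~50 GB/s figures come from the text; the 1 PFLOP/s-per-GPU prefill rate is a hypothetical round number for illustration only:

```python
# Back-of-envelope for the bottleneck claims above.
cache_per_pflop_gb = 22   # GB of KV-Cache read per PFLOP of prefill compute (from the text)
gpu_pflops = 1.0          # HYPOTHETICAL sustained prefill throughput per GPU
gpus_per_node = 8
storage_nic_gbps = 50     # GB/s per node: a single storage NIC (from the text)

# GB/s of KV reads the node's GPUs would consume if never stalled.
demand = cache_per_pflop_gb * gpu_pflops * gpus_per_node
print(f"KV read demand: {demand:.0f} GB/s vs {storage_nic_gbps} GB/s supplied")

# Hardware trend: compute grew 28.8x, NIC bandwidth 2.0x, so the
# I/O-to-compute ratio shrank by their quotient.
print(f"I/O-to-compute ratio dropped {28.8 / 2.0:.1f}x")
```

Even under generous assumptions, demand exceeds a single storage NIC by a wide margin, and the trend line makes the gap worse every generation.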
2. Background: PD Disaggregation and Agentic Workloads
Prefill-Decode Disaggregation
In PD-disaggregated inference, the cluster is split into two pools of GPUs:
- Prefill Engines (PEs) process the full prompt (all input tokens in parallel). This is compute-heavy.
- Decode Engines (DEs) generate tokens one at a time autoregressively. This is memory-bandwidth-heavy.
After prefill completes, the KV-Cache is transferred from the PE to a DE, which then handles decoding. Separating the two stages lets you optimize hardware and batching independently for each.
Why Agentic Workloads Are Special
An agentic workload is a multi-turn conversation where the LLM calls tools, reads outputs, and continues reasoning. Each turn appends new tokens (tool call + tool result) to a growing context.
The critical property: each new turn only adds a few hundred tokens to a context of tens of thousands. So 98%+ of the KV-Cache from previous turns can be reused. This cache is stored in distributed storage (3FS) and must be loaded before each prefill.
3. The Key Insight
The observation that makes DualPath work is simple: the cluster has two separate networks, and they have very different utilization patterns.
| Network | Purpose | Bandwidth | Utilization Pattern |
|---|---|---|---|
| Storage network | KV-Cache read/write to 3FS | ~50 GB/s per node | Saturated on PE side, idle on DE side |
| Compute network | AllToAll, ReduceScatter (model parallelism) | ~400 Gbps per NIC (8 NICs/node) | Bursty: sub-ms bursts with idle gaps |
The compute network (InfiniBand RDMA) has far higher bandwidth: each node has 8 CNICs at 400 Gbps each (~400 GB/s aggregate) versus a single storage NIC at ~50 GB/s. This is by design; model parallelism requires moving large activation tensors between GPUs every forward pass, so the compute fabric is provisioned for peak throughput. But collective operations (AllToAll, ReduceScatter) happen in short bursts with idle gaps between them, leaving most of that bandwidth unused most of the time. Meanwhile, decode engines have their own storage NICs that do nothing during prefill-heavy phases.
DualPath's idea: load KV-Cache through both prefill and decode engines' storage NICs, then use the high-bandwidth compute network to shuttle data from decode engines to prefill engines. This transforms storage I/O from a single-sided bottleneck into a distributed, schedulable resource across all nodes.
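The gain from pooling both sides' storage NICs is easy to make concrete. A sketch using the per-node figure from the table above and the 2P4D node split that appears later in the text:

```python
# Aggregate KV-Cache loading bandwidth, conventional vs. DualPath.
storage_nic_gbps = 50      # GB/s storage bandwidth per node (from the table above)
pe_nodes, de_nodes = 2, 4  # example 2P4D deployment

conventional = pe_nodes * storage_nic_gbps            # only PE NICs read from storage
dualpath = (pe_nodes + de_nodes) * storage_nic_gbps   # every node's NIC reads
print(f"conventional: {conventional} GB/s, DualPath: {dualpath} GB/s "
      f"({dualpath / conventional:.0f}x)")
```

The multiplier depends on the P/D ratio: the more decode-heavy the deployment, the more previously idle storage bandwidth DualPath recruits.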
4. DualPath Architecture
Worked Example: Baseline vs. DualPath
Consider a concrete scenario: Agent Turn 15 of an RL training rollout. The context is 32K tokens, 98.7% of it cached (~31.6K tokens of KV-Cache on storage). Both systems must load the same amount of cached data before prefill can begin.
PE Read Path (Conventional)
This is the standard path. KV-Cache is read from persistent storage into the PE's DRAM buffer via its storage NIC. From DRAM, it moves through the CNIC to GPU HBM. During layerwise prefill, this happens one attention layer at a time: the PE reads one layer's KV-Cache into HBM, computes that attention layer, then loads the next.
DE Read Path (New)
This is DualPath's contribution. KV-Cache is read from storage into a decode engine's DRAM buffer (via the DE's otherwise-idle storage NIC). Then the DE's CNIC sends it to the PE's CNIC via high-bandwidth RDMA over the compute network. From there it flows into PE GPU HBM the same as the PE path.
Layerwise Prefill
A critical enabler for DualPath is layerwise prefill: instead of loading the entire KV-Cache (all layers) into GPU HBM at once, the system loads and processes one layer at a time. This is necessary because HBM capacity is limited, but it also means KV-Cache data is transferred in many small chunks (one layer's worth at a time).
This creates a design challenge. A model with 30 layers means 30 sequential load-compute cycles per request. Each load is a small transfer that must be efficiently overlapped with computation. DualPath uses Layer Blocks (shape: [1, tokens, bytes]) for these per-layer transfers and Full Blocks (shape: [layer, tokens, bytes]) for storage interactions.
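The load-compute overlap this requires can be sketched as a double-buffered pipeline. This is a simplified illustration, not the paper's implementation; `load_layer_kv` and `compute_attention` are caller-supplied placeholders:

```python
import queue
import threading

def layerwise_prefill(num_layers, load_layer_kv, compute_attention):
    """Overlap per-layer KV-Cache loads with attention compute.

    load_layer_kv(i) fetches layer i's KV-Cache into HBM (I/O-bound);
    compute_attention(i, kv) runs layer i's attention (compute-bound).
    """
    loaded = queue.Queue(maxsize=2)  # double buffer: prefetch at most 2 layers ahead

    def loader():
        for i in range(num_layers):
            loaded.put((i, load_layer_kv(i)))  # blocks while the buffer is full

    t = threading.Thread(target=loader)
    t.start()
    for _ in range(num_layers):
        i, kv = loaded.get()   # waits only if layer i's load hasn't finished
        compute_attention(i, kv)
    t.join()
```

The bounded queue caps HBM staging to two layers' worth of KV-Cache while keeping the next load in flight behind the current layer's compute.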
5. CNIC-Centric Traffic Manager
Adding a second data path creates a practical problem: KV-Cache transfer traffic now shares the compute network and PCIe bus with latency-sensitive model execution operations (AllToAll for expert parallel, ReduceScatter for tensor parallel). These collective operations happen in sub-millisecond bursts and are critical for end-to-end latency.
The Problem with Existing Approaches
Existing GPU data transfer technologies (GPUDirect Storage, CUDA copy engine) don't provide fine-grained QoS control. They can't prevent KV-Cache traffic from interfering with collective communications. The paper measured CUDA copy engine overhead at 5-7 microseconds per operation, while RDMA write submission takes only ~1 microsecond.
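That per-operation gap matters because layerwise prefill issues many small transfers. A rough tally, using the per-op latencies quoted above and the 30-layer example from earlier; the batch size is a hypothetical round number:

```python
# Submission overhead per batch: CUDA copy engine vs. RDMA write.
cuda_copy_us = 6.0    # midpoint of the 5-7 us measured per operation
rdma_write_us = 1.0   # ~1 us RDMA write submission
layers = 30           # example layer count used earlier in the text
reqs = 64             # HYPOTHETICAL concurrent requests per batch

ops = layers * reqs   # one small transfer per (layer, request)
print(f"CUDA copy overhead: {ops * cuda_copy_us / 1e3:.1f} ms per batch")
print(f"RDMA overhead:      {ops * rdma_write_us / 1e3:.2f} ms per batch")
```

At thousands of small transfers per batch, a few microseconds of per-op overhead compounds into milliseconds that must be hidden behind compute.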
CNIC as Central Traffic Controller
DualPath routes all GPU data traffic (including local H2D/D2H copies) through the GPU's paired CNIC using GPUDirect RDMA. This seems like a detour, but it has a key benefit: the CNIC becomes the single point of QoS control for all PCIe traffic.
For InfiniBand, DualPath uses Virtual Lanes (VLs) to isolate traffic:
- Model inference traffic (AllToAll, etc.) gets a dedicated high-priority VL with ~99% of bandwidth via Weighted Round Robin scheduling.
- KV-Cache transfer traffic gets a low-priority VL that opportunistically uses idle bandwidth.
This ensures KV-Cache traffic is essentially invisible to model execution. The same principle works on RoCE networks using Traffic Classes and DSCP markings.
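The effect of the weighted arbitration can be shown with a toy simulation of two lanes. This models the principle, not InfiniBand's actual VL arbiter; the 99:1 weights are an assumption matching the ~99% split above:

```python
def wrr_share(queues, weights, rounds=10_000):
    """Toy weighted-round-robin arbiter: each round, a lane with pending
    work sends up to its weight in packets. Returns each lane's share of
    the link slots actually used."""
    sent = [0] * len(queues)
    for _ in range(rounds):
        for lane, w in enumerate(weights):
            take = min(queues[lane], w)
            sent[lane] += take
            queues[lane] -= take
    total = sum(sent) or 1
    return [s / total for s in sent]

# Both lanes backlogged: collectives (weight 99) dominate KV-Cache (weight 1).
share = wrr_share(queues=[10**9, 10**9], weights=[99, 1])
print(f"collectives: {share[0]:.0%}, kv-cache: {share[1]:.0%}")

# Collectives idle: KV-Cache traffic opportunistically takes every used slot.
share_idle = wrr_share(queues=[0, 10**9], weights=[99, 1])
print(f"collectives: {share_idle[0]:.0%}, kv-cache: {share_idle[1]:.0%}")
```

The two runs capture both halves of the design: strict protection when collectives are active, and full reuse of idle bandwidth when they are not.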
6. Adaptive Request Scheduler
With two paths available, the system needs to decide for each request: which PE handles prefill, which DE handles decode, and which path loads the KV-Cache. A naive policy (e.g., round-robin assignment or always routing through the PE path first) can overload one side's storage NIC while the other sits idle, recreating the original bottleneck.
Inter-Engine Scheduling
The scheduler assigns each incoming request to a PE-DE pair. It picks a PE by checking two things: how backed up the GPU is (how many tokens are queued for computation) and how backed up the disk is (how many tokens are waiting to be read from storage).
- If a PE's GPU queue is too long (more than ~5 seconds of work), it's overloaded and skipped entirely.
- If both the GPU queue and the disk queue are short (disk queue under ~3 seconds of reads), it's a best candidate and preferred.
- If the GPU has room but the disk queue is long, it's a fallback candidate, used only when no best candidates are available.
The intuition: avoid sending new work to a PE that's already drowning in either compute or I/O. Prefer PEs that are light on both.
For DEs, scheduling is two-phase: first spread requests evenly across DE groups (by total token count), then within a group pick the DE with the most free HBM.
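The selection rules above can be written out as a short sketch. The thresholds come from the text; the data structures and field names are hypothetical:

```python
from dataclasses import dataclass

GPU_QUEUE_MAX_S = 5.0    # skip PEs with more than ~5 s of queued compute
DISK_QUEUE_BEST_S = 3.0  # "best" PEs also have under ~3 s of queued reads

@dataclass
class PE:
    gpu_queue_s: float   # seconds of queued prefill compute
    disk_queue_s: float  # seconds of queued storage reads

def pick_pe(pes):
    """Prefer PEs light on both compute and I/O; fall back to PEs whose
    only problem is a long disk queue; skip overloaded GPUs entirely."""
    best, fallback = [], []
    for pe in pes:
        if pe.gpu_queue_s > GPU_QUEUE_MAX_S:
            continue                      # GPU overloaded: skip
        if pe.disk_queue_s < DISK_QUEUE_BEST_S:
            best.append(pe)               # light on both: best candidate
        else:
            fallback.append(pe)           # GPU has room, disk backed up
    pool = best or fallback
    return min(pool, key=lambda p: p.gpu_queue_s) if pool else None

def pick_de(de_groups):
    """Two-phase DE selection: least-loaded group by total tokens,
    then the DE within it holding the most free HBM."""
    group = min(de_groups, key=lambda g: sum(de["tokens"] for de in g))
    return max(group, key=lambda de: de["free_hbm_gb"])
```

The two-tier PE pool mirrors the best/fallback distinction in the bullets: a long disk queue demotes a PE but does not disqualify it, since the KV read can still be routed through the DE side.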
KV-Cache Read Task Scheduling
After a PE-DE pair is selected, the scheduler checks which side has the shorter storage read queue and routes the KV-Cache read through that side. This simple heuristic naturally balances storage NIC utilization across the cluster.
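The path decision itself reduces to a comparison of the two queues (a sketch; `read_queue_s` is a hypothetical field for pending storage reads, measured in seconds of work):

```python
def pick_read_path(pe, de):
    """Route the KV-Cache read through whichever side's storage NIC
    has the shorter pending-read queue."""
    return "PE" if pe["read_queue_s"] <= de["read_queue_s"] else "DE"

print(pick_read_path({"read_queue_s": 2.5}, {"read_queue_s": 0.4}))
```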
Intra-Engine Scheduling
Within a PE, the system uses a compute quota to decide how many requests to include in each forward batch. Each request is described by a pair $(cached, bsz)$: how many tokens have KV-Cache already available, and how many tokens need fresh computation. The scheduler estimates attention layer execution time and packs requests until reaching the quota.
If a request would exceed the quota, binary search finds a smaller $bsz'$ and performs chunked prefill on the remainder. This keeps GPU utilization high while preventing individual large requests from creating stragglers in data-parallel setups.
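A sketch of the quota-packing loop described above. The cost model `attn_time` is a hypothetical stand-in; the real scheduler estimates attention-layer execution time from $(cached, bsz)$:

```python
def attn_time(cached, bsz):
    # HYPOTHETICAL linear cost model: bsz new tokens attending to
    # (cached + bsz) total tokens.
    return bsz * (cached + bsz) * 1e-9

def pack_batch(requests, quota_s):
    """Pack (cached, bsz) requests into one forward batch up to a compute
    quota. The request that overflows is split: binary search finds the
    largest bsz' that still fits, and the remainder becomes a chunked
    prefill for a later batch."""
    batch, used, leftover = [], 0.0, []
    for idx, (cached, bsz) in enumerate(requests):
        cost = attn_time(cached, bsz)
        if used + cost <= quota_s:
            batch.append((cached, bsz))
            used += cost
            continue
        lo, hi = 0, bsz  # binary search the largest fitting bsz'
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if used + attn_time(cached, mid) <= quota_s:
                lo = mid
            else:
                hi = mid - 1
        if lo > 0:
            batch.append((cached, lo))
            used += attn_time(cached, lo)
            leftover.append((cached + lo, bsz - lo))  # chunked remainder
        else:
            leftover.append((cached, bsz))
        leftover.extend(requests[idx + 1:])  # rest waits for the next batch
        break
    return batch, leftover
```

Stopping at the first overflow is a simplification; the point is that a single oversized request is trimmed to the quota rather than allowed to straggle past it.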
7. Results
Offline Inference (RL Training Rollouts)
| Model | Config | DualPath Speedup |
|---|---|---|
| DS 660B | 2P4D, 32K-64K context | Up to 1.87x over baseline |
| DS 27B | 1P1D | Up to 1.78x over baseline |
| Qwen 32B | 1P2D | Similar trends to DS 27B |
On DS 660B, DualPath nearly matches the Oracle configuration. Oracle is an idealized baseline where all KV-Cache is assumed to already be in GPU HBM (zero storage I/O). Matching Oracle means DualPath hides storage latency so effectively that the system behaves as if KV-Cache loading were free.
Online Serving
DualPath achieves 1.96x higher agent runs per second on average compared to the baseline. Key latency metrics:
- TTST is comparable to baseline, meaning no additional decode overhead.
- TTFT remains stable as load increases, while baseline's TTFT spikes due to storage NIC saturation.
Ablation: What Contributes What
On DS 660B with 64K context and 2048 agents, each component contributes cumulatively to JCT reduction: layerwise prefill alone gives 17% by hiding HBM transfer overhead, dual-path loading adds another 21% (total 38%) by doubling available storage bandwidth, and the adaptive scheduler adds another 8% (total 46%) by balancing load effectively.
Large-Scale Scalability
Tested up to 1,152 GPUs. Scaling from 2P4D (2K agents) to 48P96D (48K agents) achieves near-linear speedup with comparable JCT. For online serving, a 44P88D configuration achieves 22x throughput (8.8 vs. 0.4 agent runs per second) while maintaining similar latency. Scheduler CPU usage stays below 10 cores.
8. Practical Notes
When does DualPath help most? When append lengths are short and context is long (high cache-hit ratio). With longer appends, GPU compute becomes the bottleneck instead of storage I/O, and DualPath's advantage shrinks. The paper shows that at 3x append-length scaling, the Basic (conventional single-path) baseline approaches Oracle performance.
P/D ratio matters. DualPath and Basic perform comparably when they have equivalent total storage bandwidth. A Basic 1P2D system (one prefill node, two decode nodes) has the same storage bandwidth as DualPath 2P1D. The advantage of DualPath is that it can exploit any P/D ratio without wasting storage bandwidth on the idle side.
Implementation cost is modest. The entire DualPath implementation is approximately 5,000 lines of code on top of their existing inference framework, using FlashMLA, DeepGEMM, and DeepEP.
Storage backend. All experiments use 3FS (DeepSeek's distributed filesystem). 3FS has no internal DRAM cache, and its storage NIC can saturate its full 400 Gbps bandwidth. DualPath could be combined with a distributed DRAM cache (like Mooncake), but the paper notes the marginal performance gain would be small.
Bottleneck-free range. The paper proves analytically that for typical configurations ($g=8$ GPUs per node, $s=1$ storage NIC, $M \approx 500$ GB/s memory bandwidth, $B_s \approx 50$ GB/s storage bandwidth), DualPath is bottleneck-free when $\frac{1}{7} \leq P/D \leq \frac{7}{2}$. This covers most practical deployments.
Reference: Shang et al., DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (2025)