DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
How DeepSeek uses idle decode-side NICs to double KV-Cache loading throughput in prefill-decode disaggregated serving.
Terminology
| Term | Full Form | What It Is |
|---|---|---|
| PE / DE | Prefill Engine / Decode Engine | PE processes the full prompt in parallel (compute-heavy); DE generates tokens autoregressively (memory-heavy). |
| PD | Prefill-Decode | Disaggregated serving: prefill and decode on separate GPU pools. |
| KV-Cache | Key-Value Cache | Cached attention tensors from previous tokens, reused across turns. |
| NIC / CNIC | Network Interface Card / Compute NIC | NIC connects a server to a network. CNIC is the high-bandwidth NIC for inter-GPU collectives (AllToAll, ReduceScatter). |
| RDMA | Remote Direct Memory Access | One machine reads/writes another's memory directly, bypassing the CPU. |
| HBM / DRAM | High Bandwidth Memory / Dynamic RAM | HBM is GPU on-chip memory; DRAM is CPU-side host memory used as a staging buffer. |
| PCIe | Peripheral Component Interconnect Express | High-speed bus connecting GPUs, NICs, and other devices in a server. |
| 3FS | Fire-Flyer File System | DeepSeek's distributed filesystem for persistent KV-Cache storage. |
| QoS / VL | Quality of Service / Virtual Lane | QoS prioritizes traffic. VLs are hardware channels in an InfiniBand link for independent flow control. |
| RoCE | RDMA over Converged Ethernet | RDMA on standard Ethernet (alternative to InfiniBand). |
| H2D / D2H | Host-to-Device / Device-to-Host | Memory transfers between CPU DRAM and GPU HBM. |
| JCT | Job Completion Time | Wall-clock time from request submission to full response. |
| TTFT / TTST | Time to First Token / Time to Second Token | TTFT: latency before first output. TTST: proxy for per-token decode latency. |
| FLOPS | Floating Point Ops Per Second | GPU compute throughput. PFLOP = 10^15 FLOP. |
1. The Problem: Storage NICs Can't Keep Up
Modern LLM serving uses prefill-decode (PD) disaggregation: prefill and decode run on separate GPU pools. For agentic workloads (multi-turn tool-using agents), the context grows across turns, and nearly all of it is reusable via KV-Cache. DeepSeek reports a 98.7% KV-Cache hit rate in their agentic RL training workloads.
This sounds like good news, but it creates a bottleneck. All that cached KV data lives on remote storage (like 3FS) and must be loaded into GPU memory before prefill can begin. The storage NICs on the prefill side become the chokepoint.
- GPU compute is not the bottleneck. For DeepSeek-V3.2 with a 98.7% hit rate, the cache-compute ratio is ~22 GB/PFLOP. The GPUs are waiting for data, not the other way around.
- Hardware trends make it worse. From Ampere to Blackwell, GPU FLOPS grew 28.8x but NIC bandwidth only grew 2.0x. The I/O-to-compute ratio has dropped 14.4x.
- Bandwidth is wasted. In conventional PD systems, only prefill engines read KV-Cache from storage. Decode engines' storage NICs sit idle. Half the cluster's storage bandwidth goes unused.
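The mismatch in the bullets above can be checked with quick arithmetic. The 22 GB/PFLOP, 28.8x/2.0x, and ~50 GB/s figures come from the text; the 1 PFLOP/s-per-GPU prefill rate is a hypothetical round number for illustration only:

```python
# Back-of-envelope for the bottleneck claims above.
cache_per_pflop_gb = 22   # GB of KV-Cache read per PFLOP of prefill compute (from the text)
gpu_pflops = 1.0          # HYPOTHETICAL sustained prefill throughput per GPU
gpus_per_node = 8
storage_nic_gbps = 50     # GB/s per node: a single storage NIC (from the text)

# GB/s of KV reads the node's GPUs would consume if never stalled.
demand = cache_per_pflop_gb * gpu_pflops * gpus_per_node
print(f"KV read demand: {demand:.0f} GB/s vs {storage_nic_gbps} GB/s supplied")

# Hardware trend: compute grew 28.8x, NIC bandwidth 2.0x, so the
# I/O-to-compute ratio shrank by their quotient.
print(f"I/O-to-compute ratio dropped {28.8 / 2.0:.1f}x")
```

Even under generous assumptions, demand exceeds a single storage NIC by a wide margin, and the trend line makes the gap worse every generation.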
2. Background: PD Disaggregation and Agentic Workloads
Prefill-Decode Disaggregation
In PD-disaggregated inference, the cluster is split into two pools of GPUs:
- Prefill Engines (PEs) process the full prompt (all input tokens in parallel). This is compute-heavy.
- Decode Engines (DEs) generate tokens one at a time autoregressively. This is memory-bandwidth-heavy.
After prefill completes, the KV-Cache is transferred from the PE to a DE, which then handles decoding. Separating the two stages lets you optimize hardware and batching independently for each.
Why Agentic Workloads Are Special
An agentic workload is a multi-turn conversation where the LLM calls tools, reads outputs, and continues reasoning. Each turn appends new tokens (tool call + tool result) to a growing context.
The critical property: each new turn only adds a few hundred tokens to a context of tens of thousands. So 98%+ of the KV-Cache from previous turns can be reused. This cache is stored in distributed storage (3FS) and must be loaded before each prefill.
3. The Key Insight
The observation that makes DualPath work is simple: the cluster has two separate networks, and they have very different utilization patterns.
| Network | Purpose | Bandwidth | Utilization Pattern |
|---|---|---|---|
| Storage network | KV-Cache read/write to 3FS | ~50 GB/s per node | Saturated on PE side, idle on DE side |
| Compute network | AllToAll, ReduceScatter (model parallelism) | ~400 Gbps per NIC (8 NICs/node) | Bursty: sub-ms bursts with idle gaps |
The compute network (InfiniBand RDMA) has far higher bandwidth: each node has 8 CNICs at 400 Gbps each (~400 GB/s aggregate) versus a single storage NIC at ~50 GB/s. This is by design; model parallelism requires moving large activation tensors between GPUs every forward pass, so the compute fabric is provisioned for peak throughput. But collective operations (AllToAll, ReduceScatter) happen in short bursts with idle gaps between them, leaving most of that bandwidth unused most of the time. Meanwhile, decode engines have their own storage NICs that do nothing during prefill-heavy phases.
DualPath's idea: load KV-Cache through both prefill and decode engines' storage NICs, then use the high-bandwidth compute network to shuttle data from decode engines to prefill engines. This transforms storage I/O from a single-sided bottleneck into a distributed, schedulable resource across all nodes.
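The gain from pooling both sides' storage NICs is easy to make concrete. A sketch using the per-node figure from the table above and the 2P4D node split that appears later in the text:

```python
# Aggregate KV-Cache loading bandwidth, conventional vs. DualPath.
storage_nic_gbps = 50      # GB/s storage bandwidth per node (from the table above)
pe_nodes, de_nodes = 2, 4  # example 2P4D deployment

conventional = pe_nodes * storage_nic_gbps            # only PE NICs read from storage
dualpath = (pe_nodes + de_nodes) * storage_nic_gbps   # every node's NIC reads
print(f"conventional: {conventional} GB/s, DualPath: {dualpath} GB/s "
      f"({dualpath / conventional:.0f}x)")
```

The multiplier depends on the P/D ratio: the more decode-heavy the deployment, the more previously idle storage bandwidth DualPath recruits.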
4. DualPath Architecture
Worked Example: Baseline vs. DualPath
Consider a concrete scenario: Agent Turn 15 of an RL training rollout. The context is 32K tokens, 98.7% of it cached (~31.6K tokens of KV-Cache on storage). Both systems must load the same amount of cached data before prefill can begin.
PE Read Path (Conventional)
This is the standard path. KV-Cache is read from persistent storage into the PE's DRAM buffer via its storage NIC. From DRAM, it moves through the CNIC to GPU HBM. During layerwise prefill, this happens one attention layer at a time: the PE reads one layer's KV-Cache into HBM, computes that attention layer, then loads the next.
DE Read Path (New)
This is DualPath's contribution. KV-Cache is read from storage into a decode engine's DRAM buffer (via the DE's otherwise-idle storage NIC). Then the DE's CNIC sends it to the PE's CNIC via high-bandwidth RDMA over the compute network. From there it flows into PE GPU HBM the same as the PE path.
Layerwise Prefill
A critical enabler for DualPath is layerwise prefill: instead of loading the entire KV-Cache (all layers) into GPU HBM at once, the system loads and processes one layer at a time. This is necessary because HBM capacity is limited, but it also means KV-Cache data is transferred in many small chunks (one layer's worth at a time).
This creates a design challenge. A model with 30 layers means 30 sequential load-compute cycles per request. Each load is a small transfer that must be efficiently overlapped with computation. DualPath uses Layer Blocks (shape: [1, tokens, bytes]) for these per-layer transfers and Full Blocks (shape: [layer, tokens, bytes]) for storage interactions.
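The load-compute overlap this requires can be sketched as a double-buffered pipeline. This is a simplified illustration, not the paper's implementation; `load_layer_kv` and `compute_attention` are caller-supplied placeholders:

```python
import queue
import threading

def layerwise_prefill(num_layers, load_layer_kv, compute_attention):
    """Overlap per-layer KV-Cache loads with attention compute.

    load_layer_kv(i) fetches layer i's KV-Cache into HBM (I/O-bound);
    compute_attention(i, kv) runs layer i's attention (compute-bound).
    """
    loaded = queue.Queue(maxsize=2)  # double buffer: prefetch at most 2 layers ahead

    def loader():
        for i in range(num_layers):
            loaded.put((i, load_layer_kv(i)))  # blocks while the buffer is full

    t = threading.Thread(target=loader)
    t.start()
    for _ in range(num_layers):
        i, kv = loaded.get()   # waits only if layer i's load hasn't finished
        compute_attention(i, kv)
    t.join()
```

The bounded queue caps HBM staging to two layers' worth of KV-Cache while keeping the next load in flight behind the current layer's compute.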
5. CNIC-Centric Traffic Manager
Adding a second data path creates a practical problem: KV-Cache transfer traffic now shares the compute network and PCIe bus with latency-sensitive model execution operations (AllToAll for expert parallel, ReduceScatter for tensor parallel). These collective operations happen in sub-millisecond bursts and are critical for end-to-end latency.
The Problem with Existing Approaches
Existing GPU data transfer technologies (GPUDirect Storage, CUDA copy engine) don't provide fine-grained QoS control. They can't prevent KV-Cache traffic from interfering with collective communications. The paper measured CUDA copy engine overhead at 5-7 microseconds per operation, while RDMA write submission takes only ~1 microsecond.
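That per-operation gap matters because layerwise prefill issues many small transfers. A rough tally, using the per-op latencies quoted above and the 30-layer example from earlier; the batch size is a hypothetical round number:

```python
# Submission overhead per batch: CUDA copy engine vs. RDMA write.
cuda_copy_us = 6.0    # midpoint of the 5-7 us measured per operation
rdma_write_us = 1.0   # ~1 us RDMA write submission
layers = 30           # example layer count used earlier in the text
reqs = 64             # HYPOTHETICAL concurrent requests per batch

ops = layers * reqs   # one small transfer per (layer, request)
print(f"CUDA copy overhead: {ops * cuda_copy_us / 1e3:.1f} ms per batch")
print(f"RDMA overhead:      {ops * rdma_write_us / 1e3:.2f} ms per batch")
```

At thousands of small transfers per batch, a few microseconds of per-op overhead compounds into milliseconds that must be hidden behind compute.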
CNIC as Central Traffic Controller
DualPath routes all GPU data traffic (including local H2D/D2H copies) through the GPU's paired CNIC using GPUDirect RDMA. This seems like a detour, but it has a key benefit: the CNIC becomes the single point of QoS control for all PCIe traffic.
For InfiniBand, DualPath uses Virtual Lanes (VLs) to isolate traffic:
- Model inference traffic (AllToAll, etc.) gets a dedicated high-priority VL with ~99% of bandwidth via Weighted Round Robin scheduling.
- KV-Cache transfer traffic gets a low-priority VL that opportunistically uses idle bandwidth.
This ensures KV-Cache traffic is essentially invisible to model execution. The same principle works on RoCE networks using Traffic Classes and DSCP markings.
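The effect of the weighted arbitration can be shown with a toy simulation of two lanes. This models the principle, not InfiniBand's actual VL arbiter; the 99:1 weights are an assumption matching the ~99% split above:

```python
def wrr_share(queues, weights, rounds=10_000):
    """Toy weighted-round-robin arbiter: each round, a lane with pending
    work sends up to its weight in packets. Returns each lane's share of
    the link slots actually used."""
    sent = [0] * len(queues)
    for _ in range(rounds):
        for lane, w in enumerate(weights):
            take = min(queues[lane], w)
            sent[lane] += take
            queues[lane] -= take
    total = sum(sent) or 1
    return [s / total for s in sent]

# Both lanes backlogged: collectives (weight 99) dominate KV-Cache (weight 1).
share = wrr_share(queues=[10**9, 10**9], weights=[99, 1])
print(f"collectives: {share[0]:.0%}, kv-cache: {share[1]:.0%}")

# Collectives idle: KV-Cache traffic opportunistically takes every used slot.
share_idle = wrr_share(queues=[0, 10**9], weights=[99, 1])
print(f"collectives: {share_idle[0]:.0%}, kv-cache: {share_idle[1]:.0%}")
```

The two runs capture both halves of the design: strict protection when collectives are active, and full reuse of idle bandwidth when they are not.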
6. Adaptive Request Scheduler
With two paths available, the system needs to decide for each request: which PE handles prefill, which DE handles decode, and which path loads the KV-Cache. A naive policy (e.g., round-robin assignment or always routing through the PE path first) can overload one side's storage NIC while the other sits idle, recreating the original bottleneck.
Inter-Engine Scheduling
The scheduler assigns each incoming request to a PE-DE pair. It picks a PE by checking two things: how backed up the GPU is (how many tokens are queued for computation) and how backed up the disk is (how many tokens are waiting to be read from storage).
- If a PE's GPU queue is too long (more than ~5 seconds of work), it's overloaded and skipped entirely.
- If both the GPU queue and the disk queue are short (disk queue under ~3 seconds of reads), it's a best candidate and preferred.
- If the GPU has room but the disk queue is long, it's a fallback candidate, used only when no best candidates are available.
The intuition: avoid sending new work to a PE that's already drowning in either compute or I/O. Prefer PEs that are light on both.
For DEs, scheduling is two-phase: first spread requests evenly across DE groups (by total token count), then within a group pick the DE with the most free HBM.
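The selection rules above can be written out as a short sketch. The thresholds come from the text; the data structures and field names are hypothetical:

```python
from dataclasses import dataclass

GPU_QUEUE_MAX_S = 5.0    # skip PEs with more than ~5 s of queued compute
DISK_QUEUE_BEST_S = 3.0  # "best" PEs also have under ~3 s of queued reads

@dataclass
class PE:
    gpu_queue_s: float   # seconds of queued prefill compute
    disk_queue_s: float  # seconds of queued storage reads

def pick_pe(pes):
    """Prefer PEs light on both compute and I/O; fall back to PEs whose
    only problem is a long disk queue; skip overloaded GPUs entirely."""
    best, fallback = [], []
    for pe in pes:
        if pe.gpu_queue_s > GPU_QUEUE_MAX_S:
            continue                      # GPU overloaded: skip
        if pe.disk_queue_s < DISK_QUEUE_BEST_S:
            best.append(pe)               # light on both: best candidate
        else:
            fallback.append(pe)           # GPU has room, disk backed up
    pool = best or fallback
    return min(pool, key=lambda p: p.gpu_queue_s) if pool else None

def pick_de(de_groups):
    """Two-phase DE selection: least-loaded group by total tokens,
    then the DE within it holding the most free HBM."""
    group = min(de_groups, key=lambda g: sum(de["tokens"] for de in g))
    return max(group, key=lambda de: de["free_hbm_gb"])
```

The two-tier PE pool mirrors the best/fallback distinction in the bullets: a long disk queue demotes a PE but does not disqualify it, since the KV read can still be routed through the DE side.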
KV-Cache Read Task Scheduling
After a PE-DE pair is selected, the scheduler checks which side has the shorter storage read queue and routes the KV-Cache read through that side. This simple heuristic naturally balances storage NIC utilization across the cluster.
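The path decision itself reduces to a comparison of the two queues (a sketch; `read_queue_s` is a hypothetical field for pending storage reads, measured in seconds of work):

```python
def pick_read_path(pe, de):
    """Route the KV-Cache read through whichever side's storage NIC
    has the shorter pending-read queue."""
    return "PE" if pe["read_queue_s"] <= de["read_queue_s"] else "DE"

print(pick_read_path({"read_queue_s": 2.5}, {"read_queue_s": 0.4}))
```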
Intra-Engine Scheduling
Within a PE, the system uses a compute quota to decide how many requests to include in each forward batch. Each request is described by a pair $(cached, bsz)$: how many tokens have KV-Cache already available, and how many tokens need fresh computation. The scheduler estimates attention layer execution time and packs requests until reaching the quota.
If a request would exceed the quota, binary search finds a smaller $bsz'$ and performs chunked prefill on the remainder. This keeps GPU utilization high while preventing individual large requests from creating stragglers in data-parallel setups.
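A sketch of the quota-packing loop described above. The cost model `attn_time` is a hypothetical stand-in; the real scheduler estimates attention-layer execution time from $(cached, bsz)$:

```python
def attn_time(cached, bsz):
    # HYPOTHETICAL linear cost model: bsz new tokens attending to
    # (cached + bsz) total tokens.
    return bsz * (cached + bsz) * 1e-9

def pack_batch(requests, quota_s):
    """Pack (cached, bsz) requests into one forward batch up to a compute
    quota. The request that overflows is split: binary search finds the
    largest bsz' that still fits, and the remainder becomes a chunked
    prefill for a later batch."""
    batch, used, leftover = [], 0.0, []
    for idx, (cached, bsz) in enumerate(requests):
        cost = attn_time(cached, bsz)
        if used + cost <= quota_s:
            batch.append((cached, bsz))
            used += cost
            continue
        lo, hi = 0, bsz  # binary search the largest fitting bsz'
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if used + attn_time(cached, mid) <= quota_s:
                lo = mid
            else:
                hi = mid - 1
        if lo > 0:
            batch.append((cached, lo))
            used += attn_time(cached, lo)
            leftover.append((cached + lo, bsz - lo))  # chunked remainder
        else:
            leftover.append((cached, bsz))
        leftover.extend(requests[idx + 1:])  # rest waits for the next batch
        break
    return batch, leftover
```

Stopping at the first overflow is a simplification; the point is that a single oversized request is trimmed to the quota rather than allowed to straggle past it.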
7. Results
Offline Inference (RL Training Rollouts)
| Model | Config | DualPath Speedup |
|---|---|---|
| DS 660B | 2P4D, 32K-64K context | Up to 1.87x over baseline |
| DS 27B | 1P1D | Up to 1.78x over baseline |
| Qwen 32B | 1P2D | Similar trends to DS 27B |
On DS 660B, DualPath nearly matches the Oracle configuration. Oracle is an idealized baseline where all KV-Cache is assumed to already be in GPU HBM (zero storage I/O). Matching Oracle means DualPath hides storage latency so effectively that the system behaves as if KV-Cache loading were free.
Online Serving
DualPath achieves 1.96x higher agent runs per second on average compared to the baseline. Key latency metrics:
- TTST is comparable to baseline, meaning no additional decode overhead.
- TTFT remains stable as load increases, while baseline's TTFT spikes due to storage NIC saturation.
Ablation: What Contributes What
On DS 660B with 64K context and 2048 agents, each component contributes cumulatively to JCT reduction: layerwise prefill alone gives 17% by hiding HBM transfer overhead, dual-path loading adds another 21% (total 38%) by doubling available storage bandwidth, and the adaptive scheduler adds another 8% (total 46%) by balancing load effectively.
Large-Scale Scalability
Tested up to 1,152 GPUs. Scaling from 2P4D (2K agents) to 48P96D (48K agents) achieves near-linear speedup with comparable JCT. For online serving, a 44P88D configuration achieves 22x throughput (8.8 vs. 0.4 agent runs per second) while maintaining similar latency. Scheduler CPU usage stays below 10 cores.
8. Practical Notes
When does DualPath help most? When append lengths are short and context is long (high cache-hit ratio). With longer appends, GPU compute becomes the bottleneck instead of storage I/O, and DualPath's advantage shrinks. The paper shows that at 3x append-length scaling, the Basic (conventional single-path) baseline approaches Oracle performance.
P/D ratio matters. DualPath and Basic perform comparably when they have equivalent total storage bandwidth. A Basic 1P2D system (one prefill node, two decode nodes) has the same storage bandwidth as DualPath 2P1D. The advantage of DualPath is that it can exploit any P/D ratio without wasting storage bandwidth on the idle side.
Implementation cost is modest. The entire DualPath implementation is approximately 5,000 lines of code on top of their existing inference framework, using FlashMLA, DeepGEMM, and DeepEP.
Storage backend. All experiments use 3FS (DeepSeek's distributed filesystem). 3FS has no internal DRAM cache, and its storage NIC can saturate its full 400 Gbps bandwidth. DualPath could be combined with a distributed DRAM cache (like Mooncake), but the paper notes the marginal performance gain would be small.
Bottleneck-free range. The paper proves analytically that for typical configurations ($g=8$ GPUs per node, $s=1$ storage NIC, $M \approx 500$ GB/s memory bandwidth, $B_s \approx 50$ GB/s storage bandwidth), DualPath is bottleneck-free when $\frac{1}{7} \leq P/D \leq \frac{7}{2}$. This covers most practical deployments.
Reference: Shang et al., DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (2025)