How DeepSeek uses idle decode-side NICs to double KV-Cache loading throughput in prefill-decode disaggregated serving.

Terminology

| Term | Full Form | What It Is |
|------|-----------|------------|
| PE / DE | Prefill Engine / Decode Engine | PE processes the full prompt in parallel (compute-heavy); DE generates tokens autoregressively (memory-heavy). |
| PD | Prefill-Decode | Disaggregated serving: prefill and decode on separate GPU pools. |
| KV-Cache | Key-Value Cache | Cached attention tensors from previous tokens, reused across turns. |
| NIC / CNIC | Network Interface Card / Compute NIC | A NIC connects a server to a network. The CNIC is the high-bandwidth NIC used for inter-GPU collectives (AllToAll, ReduceScatter). |
| RDMA | Remote Direct Memory Access | One machine reads/writes another's memory directly, bypassing the CPU. |
| HBM / DRAM | High Bandwidth Memory / Dynamic RAM | HBM is memory stacked on the GPU package; DRAM is CPU-side host memory used as a staging buffer. |
| PCIe | Peripheral Component Interconnect Express | High-speed bus connecting GPUs, NICs, and other devices in a server. |
| 3FS | Fire-Flyer File System | DeepSeek's distributed filesystem for persistent KV-Cache storage. |
| QoS / VL | Quality of Service / Virtual Lane | QoS prioritizes traffic. VLs are hardware channels in an InfiniBand link with independent flow control. |
| RoCE | RDMA over Converged Ethernet | RDMA on standard Ethernet (an alternative to InfiniBand). |
| H2D / D2H | Host-to-Device / Device-to-Host | Memory transfers between CPU DRAM and GPU HBM. |
| JCT | Job Completion Time | Wall-clock time from request submission to full response. |
| TTFT / TTST | Time to First Token / Time to Second Token | TTFT: latency before the first output token. TTST: a proxy for per-token decode latency. |
| FLOPS | Floating Point Ops Per Second | GPU compute throughput. 1 PFLOP = 10^15 FLOP. |

1. The Problem: Storage NICs Can't Keep Up

Modern LLM serving uses prefill-decode (PD) disaggregation: prefill and decode run on separate GPU pools. For agentic workloads (multi-turn tool-using agents), the context grows across turns, and nearly all of it is reusable via KV-Cache. DeepSeek reports a 98.7% KV-Cache hit rate in their agentic RL training workloads.

This sounds like good news, but it creates a bottleneck. All that cached KV data lives on remote storage (like 3FS) and must be loaded into GPU memory before prefill can begin. The storage NICs on the prefill side become the chokepoint.

2. Background: PD Disaggregation and Agentic Workloads

Prefill-Decode Disaggregation

In PD-disaggregated inference, the cluster is split into two pools of GPUs:

- Prefill engines (PEs) process the full prompt in parallel; this stage is compute-bound.
- Decode engines (DEs) generate output tokens autoregressively; this stage is memory-bandwidth-bound.

After prefill completes, the KV-Cache is transferred from the PE to a DE, which then handles decoding. Separating the two stages lets you optimize hardware and batching independently for each.

Why Agentic Workloads Are Special

An agentic workload is a multi-turn conversation where the LLM calls tools, reads outputs, and continues reasoning. Each turn appends new tokens (tool call + tool result) to a growing context.

The critical property: each new turn only adds a few hundred tokens to a context of tens of thousands. So 98%+ of the KV-Cache from previous turns can be reused. This cache is stored in distributed storage (3FS) and must be loaded before each prefill.

[Figure: Agentic turn structure. A Turn N context of 30,000 tokens is ~98% cached KV from previous turns (~29,400 tokens), loaded from distributed storage (3FS) through the storage NIC into the prefill engine's GPU. The bottleneck: storage NIC bandwidth limits how fast the KV-Cache reaches the PE.]
Intuition: Think of a restaurant kitchen with two stations: a prep station (PE) that assembles ingredients and a cooking station (DE) that cooks dishes. Both stations have their own loading docks for receiving deliveries from the warehouse. But right now, only the prep station ever orders from the warehouse. The cooking station's loading dock sits empty. Meanwhile, the prep station's dock is jammed with delivery trucks. DualPath lets the cooking station receive deliveries too, and pass ingredients to the prep station through an internal service window that's barely used during meal service.

3. The Key Insight

The observation that makes DualPath work is simple: the cluster has two separate networks, and they have very different utilization patterns.

| Network | Purpose | Bandwidth | Utilization Pattern |
|---------|---------|-----------|---------------------|
| Storage network | KV-Cache read/write to 3FS | ~50 GB/s per node | Saturated on the PE side, idle on the DE side |
| Compute network | AllToAll, ReduceScatter (model parallelism) | ~400 Gbps per NIC (8 NICs/node) | Bursty: sub-ms bursts with idle gaps |

The compute network (InfiniBand RDMA) has far higher bandwidth: each node has 8 CNICs at 400 Gbps each (~400 GB/s aggregate) versus a single storage NIC at ~50 GB/s. This is by design; model parallelism requires moving large activation tensors between GPUs every forward pass, so the compute fabric is provisioned for peak throughput. But collective operations (AllToAll, ReduceScatter) happen in short bursts with idle gaps between them, leaving most of that bandwidth unused most of the time. Meanwhile, decode engines have their own storage NICs that do nothing during prefill-heavy phases.
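A quick back-of-envelope check of these numbers (using the figures quoted above, converting line rates from Gbit/s to GB/s):

```python
# Aggregate compute-fabric bandwidth vs. a single storage NIC,
# using the per-node figures quoted in the text.
CNIC_GBPS = 400            # per-CNIC line rate, Gbit/s
CNICS_PER_NODE = 8
STORAGE_NIC_GBS = 50       # storage NIC bandwidth, GB/s

compute_gbs = CNIC_GBPS * CNICS_PER_NODE / 8   # Gbit/s -> GB/s
print(compute_gbs)                    # 400.0 GB/s aggregate compute fabric
print(compute_gbs / STORAGE_NIC_GBS)  # 8.0x the storage NIC
```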

DualPath's idea: load KV-Cache through both prefill and decode engines' storage NICs, then use the high-bandwidth compute network to shuttle data from decode engines to prefill engines. This transforms storage I/O from a single-sided bottleneck into a distributed, schedulable resource across all nodes.

4. DualPath Architecture

[Figure: DualPath's two paths for KV-Cache loading. PE read path (conventional): persistent storage (3FS) → PE storage NIC → PE DRAM buffer → CNIC → GPU HBM. DE read path (new): storage → the DE's otherwise-idle storage NIC → DE DRAM buffer → RDMA over the compute network → PE CNIC → GPU HBM. A request scheduler dynamically splits each KV-Cache load across both paths.]

Example Scenario: Baseline vs. DualPath

Consider a real scenario: Agent Turn 15 of an RL training rollout. The context is 32K tokens, 98.7% of it cached (31.6K tokens of KV-Cache on storage). Baseline and DualPath must both load the same amount of cached data before prefill can begin.

PE Read Path (Conventional)

This is the standard path. KV-Cache is read from persistent storage into the PE's DRAM buffer via its storage NIC. From DRAM, it moves through the CNIC to GPU HBM. During layerwise prefill, this happens one attention layer at a time: the PE reads one layer's KV-Cache into HBM, computes that attention layer, then loads the next.

DE Read Path (New)

This is DualPath's contribution. KV-Cache is read from storage into a decode engine's DRAM buffer (via the DE's otherwise-idle storage NIC). Then the DE's CNIC sends it to the PE's CNIC via high-bandwidth RDMA over the compute network. From there it flows into PE GPU HBM the same as the PE path.

Intuition: Instead of having one door into the building for deliveries (PE storage NIC), you open a second door through the neighboring building (DE storage NIC) and use an internal hallway (compute network RDMA) to move packages between buildings. The hallway has massive capacity and is only used in short bursts, so adding delivery traffic barely affects it.

Layerwise Prefill

A critical enabler for DualPath is layerwise prefill: instead of loading the entire KV-Cache (all layers) into GPU HBM at once, the system loads and processes one layer at a time. This is necessary because HBM capacity is limited, but it also means KV-Cache data is transferred in many small chunks (one layer's worth at a time).

This creates a design challenge. A model with 30 layers means 30 sequential load-compute cycles per request. Each load is a small transfer that must be efficiently overlapped with computation. DualPath uses Layer Blocks (shape: [1, tokens, bytes]) for these per-layer transfers and Full Blocks (shape: [layer, tokens, bytes]) for storage interactions.
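The load/compute overlap can be sketched as a simple prefetch pipeline: while the GPU computes attention for layer i, the I/O side fetches the Layer Block for layer i+1. This is a conceptual model only; the function names and the 30-layer count are illustrative, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 30  # illustrative; matches "a model with 30 layers" above

def load_layer_kv(layer):
    # Stand-in for a storage/RDMA read of one Layer Block.
    return f"kv[{layer}]"

def compute_attention(layer, kv):
    # Stand-in for the attention kernel over that layer's KV-Cache.
    return f"out[{layer}] from {kv}"

def layerwise_prefill():
    """Overlap the load of layer i+1 with the compute of layer i."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer_kv, 0)        # prefetch first layer
        for layer in range(NUM_LAYERS):
            kv = pending.result()                    # wait for this layer's KV
            if layer + 1 < NUM_LAYERS:
                pending = io.submit(load_layer_kv, layer + 1)  # prefetch next
            outputs.append(compute_attention(layer, kv))       # compute now
    return outputs
```

With real kernels, the per-layer load time hides behind the per-layer compute time whenever I/O bandwidth keeps pace, which is exactly what the DE read path is meant to guarantee.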

5. CNIC-Centric Traffic Manager

Adding a second data path creates a practical problem: KV-Cache transfer traffic now shares the compute network and PCIe bus with latency-sensitive model execution operations (AllToAll for expert parallel, ReduceScatter for tensor parallel). These collective operations happen in sub-millisecond bursts and are critical for end-to-end latency.

The Problem with Existing Approaches

Existing GPU data transfer technologies (GPUDirect Storage, CUDA copy engine) don't provide fine-grained QoS control. They can't prevent KV-Cache traffic from interfering with collective communications. The paper measured CUDA copy engine overhead at 5-7 microseconds per operation, while RDMA write submission takes only ~1 microsecond.

CNIC as Central Traffic Controller

DualPath routes all GPU data traffic (including local H2D/D2H copies) through the GPU's paired CNIC using GPUDirect RDMA. This seems like a detour, but it has a key benefit: the CNIC becomes the single point of QoS control for all PCIe traffic.

For InfiniBand, DualPath uses Virtual Lanes (VLs) to isolate traffic: latency-critical collectives (AllToAll, ReduceScatter) travel on a high-priority VL, while KV-Cache transfers travel on a lower-priority VL, so collective bursts are always served first.

This ensures KV-Cache traffic is essentially invisible to model execution. The same principle works on RoCE networks using Traffic Classes and DSCP markings.
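Conceptually, VL arbitration behaves like a strict-priority queue at the NIC: whenever a high-priority packet is pending, it goes out before any low-priority one. A toy model (priority values and lane names are illustrative, not real VL configuration):

```python
import heapq

# Smaller value = higher priority; the second tuple field is arrival order.
HIGH, LOW = 0, 1

def drain(nic_queue):
    """Serve the queue the way a strict-priority arbiter would."""
    order = []
    while nic_queue:
        _, _, name = heapq.heappop(nic_queue)
        order.append(name)
    return order

q = []
heapq.heappush(q, (LOW, 0, "kv_chunk_0"))
heapq.heappush(q, (LOW, 1, "kv_chunk_1"))
heapq.heappush(q, (HIGH, 2, "alltoall_burst"))  # arrives later, served first
print(drain(q))  # ['alltoall_burst', 'kv_chunk_0', 'kv_chunk_1']
```

The real mechanism is in NIC hardware (VL arbitration tables), but the effect is the same: KV-Cache chunks only consume link time that collectives leave idle.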

6. Adaptive Request Scheduler

With two paths available, the system needs to decide for each request: which PE handles prefill, which DE handles decode, and which path loads the KV-Cache. A naive policy (e.g., round-robin assignment or always routing through the PE path first) can overload one side's storage NIC while the other sits idle, recreating the original bottleneck.

Inter-Engine Scheduling

The scheduler assigns each incoming request to a PE-DE pair. It picks PEs by checking two things: how backed up is the GPU (how many tokens are queued for computation), and how backed up is the disk (how many tokens are waiting to be read from storage).

The intuition: avoid sending new work to a PE that's already drowning in either compute or I/O. Prefer PEs that are light on both.

For DEs, scheduling is two-phase: first spread requests evenly across DE groups (by total token count), then within a group pick the DE with the most free HBM.
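These two policies can be sketched as follows. The engine-state fields (`queued_compute_tokens`, `queued_read_tokens`, `total_tokens`, `free_hbm`) are illustrative names, and summing the two PE backlogs is one plausible combination rather than the paper's exact formula.

```python
def pick_pe(pes):
    # Prefer the PE that is light on BOTH compute backlog and
    # storage-read backlog (token counts used as a rough proxy).
    return min(pes, key=lambda pe: pe["queued_compute_tokens"]
                                   + pe["queued_read_tokens"])

def pick_de(de_groups):
    # Phase 1: the DE group with the fewest total tokens in flight.
    group = min(de_groups, key=lambda g: sum(de["total_tokens"] for de in g))
    # Phase 2: within that group, the DE with the most free HBM.
    return max(group, key=lambda de: de["free_hbm"])
```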

KV-Cache Read Task Scheduling

After a PE-DE pair is selected, the scheduler checks which side has the shorter storage read queue and routes the KV-Cache read through that side. This simple heuristic naturally balances storage NIC utilization across the cluster.
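The path choice itself is a one-liner; a sketch, with `queued_read_tokens` again an illustrative field name for the pending storage-read backlog:

```python
def pick_read_path(pe, de):
    # Route the KV-Cache read through whichever side's storage NIC
    # has the shorter pending-read queue; ties go to the PE, which
    # avoids the extra hop over the compute network.
    return "PE" if pe["queued_read_tokens"] <= de["queued_read_tokens"] else "DE"
```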

Intra-Engine Scheduling

Within a PE, the system uses a compute quota to decide how many requests to include in each forward batch. Each request is described by a pair $(cached, bsz)$: how many tokens have KV-Cache already available, and how many tokens need fresh computation. The scheduler estimates attention layer execution time and packs requests until reaching the quota.

If a request would exceed the quota, binary search finds a smaller $bsz'$ and performs chunked prefill on the remainder. This keeps GPU utilization high while preventing individual large requests from creating stragglers in data-parallel setups.
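A minimal sketch of this packing loop, assuming `est_time` is a stand-in for the paper's attention-time estimator and time is measured in the same units as the quota. The chunked remainder is re-queued with its computed tokens counted as cached:

```python
def pack_batch(requests, quota, est_time):
    """Pack (cached, bsz) requests into one batch under a compute quota.

    est_time(cached, bsz): estimated attention time for one request.
    Returns (batch, leftover); leftover holds chunked remainders and
    any requests deferred once the quota is reached.
    """
    batch, leftover, used = [], [], 0.0
    for i, (cached, bsz) in enumerate(requests):
        t = est_time(cached, bsz)
        if used + t <= quota:
            batch.append((cached, bsz))
            used += t
            continue
        # Request overflows: binary-search the largest bsz' that fits.
        lo, hi = 0, bsz
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if used + est_time(cached, mid) <= quota:
                lo = mid
            else:
                hi = mid - 1
        if lo > 0:
            batch.append((cached, lo))
            used += est_time(cached, lo)
            # Remainder: the lo computed tokens are now cached.
            leftover.append((cached + lo, bsz - lo))
        else:
            leftover.append((cached, bsz))
        leftover.extend(requests[i + 1:])  # quota reached: defer the rest
        break
    return batch, leftover
```

For example, with `est_time = lambda cached, bsz: bsz` (time proportional to fresh tokens) and a quota of 100, two requests of 60 fresh tokens each pack as the first request plus a 40-token chunk of the second, deferring a 20-token remainder.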

7. Results

Offline Inference (RL Training Rollouts)

| Model | Config | DualPath Speedup |
|-------|--------|------------------|
| DS 660B | 2P4D, 32K-64K context | Up to 1.87x over baseline |
| DS 27B | 1P1D | Up to 1.78x over baseline |
| Qwen 32B | 1P2D | Similar trends to DS 27B |

On DS 660B, DualPath nearly matches the Oracle configuration. Oracle is an idealized baseline where all KV-Cache is assumed to already be in GPU HBM (zero storage I/O). Matching Oracle means DualPath hides storage latency so effectively that the system behaves as if KV-Cache loading were free.

Online Serving

DualPath achieves 1.96x higher agent runs per second on average compared to the baseline. Key latency metrics:

Ablation: What Contributes What

On DS 660B with 64K context and 2048 agents, each component contributes cumulatively to JCT reduction: layerwise prefill alone gives 17% by hiding HBM transfer overhead, dual-path loading adds another 21% (total 38%) by doubling available storage bandwidth, and the adaptive scheduler adds another 8% (total 46%) by balancing load effectively.

Large-Scale Scalability

Tested up to 1,152 GPUs. Scaling from 2P4D (2K agents) to 48P96D (48K agents) achieves near-linear speedup with comparable JCT. For online serving, a 44P88D configuration achieves 22x throughput (8.8 vs. 0.4 agent runs per second) while maintaining similar latency. Scheduler CPU usage stays below 10 cores.

8. Practical Notes

When does DualPath help most? When append lengths are short and context is long (high cache-hit ratio). With longer appends, GPU compute becomes the bottleneck instead of storage I/O, and DualPath's advantage shrinks. The paper shows that at 3x append length scaling, Basic performance approaches Oracle.

P/D ratio matters. DualPath and Basic perform comparably when they have equivalent total storage bandwidth. A Basic 1P2D system (one prefill node, two decode nodes) has the same storage bandwidth as DualPath 2P1D. The advantage of DualPath is that it can exploit any P/D ratio without wasting storage bandwidth on the idle side.

Key limitation: DualPath adds DRAM pressure on decode engines (DE buffer) and introduces additional PCIe traffic. The CNIC-centric approach, while enabling QoS, adds a small detour compared to direct GPUDirect Storage or CUDA copy. For small models where PCIe bandwidth is already tight, this overhead may not be negligible.

Implementation cost is modest. The entire DualPath implementation is approximately 5,000 lines of code on top of their existing inference framework, using FlashMLA, DeepGEMM, and DeepEP.

Storage backend. All experiments use 3FS (DeepSeek's distributed filesystem). The 3FS storage NIC has no internal DRAM cache and can saturate its 400 Gbps bandwidth. DualPath could be combined with a distributed DRAM cache (like Mooncake), but the paper notes the marginal performance gain is small.

Bottleneck-free range. The paper proves analytically that for typical configurations ($g=8$ GPUs per node, $s=1$ storage NIC, $M \approx 500$ GB/s memory bandwidth, $Bs \approx 50$ GB/s storage bandwidth), DualPath is bottleneck-free when $\frac{1}{7} \leq P/D \leq \frac{7}{2}$. This covers most practical deployments.


Reference: Shang et al., DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (2025)