Unlimited In-Cache RX Processing

Updated 8 December 2025
  • The paper demonstrates that in-cache RX processing decouples per-packet operations from global working-set size, enabling line-rate throughput on SmartNICs.
  • It employs dedicated on-chip caches and programmable pipelines to manage dynamic flow and state lookups without incurring DRAM or PCIe bottlenecks.
  • Empirical results show up to 2.2× higher throughput compared to microkernel baselines, highlighting its potential for scalable data center applications.

Unlimited-working-set in-cache processing on the RX path refers to architectural and algorithmic techniques that enable the receive datapath (RX) of a networked system (typically SmartNICs, programmable switching ASICs, or similar devices) to process arbitrarily large or rapidly changing sets of flows, rules, or connection states while maintaining wire-rate throughput. The key idea is to serve the majority of per-packet operations from high-speed on-chip caches, bypassing main memory. This paradigm is critical for network stacks and appliances that must support dynamic, per-flow programmability without bottlenecking on memory bandwidth, PCIe transactions, or per-flow DMA operations.

1. Motivations and Architectural Context

In conventional networked systems, especially those built around CPU-centric stacks or traditional offload engines, the working set of active flows or processing rules is constrained to the size of fast, local memory or suffers from high-latency main memory accesses. As SmartNICs and programmable switches increase in programmability—introducing general-purpose cores, complex match-action tables, and user-accessible state—they must support large, frequently changing working sets typical of modern datacenter and disaggregated applications. The gap between wire speeds (up to 400 Gbps per port) and available DRAM or host-access bandwidth motivates designs that avoid payload or state lookups outside of fast, on-chip storage (Chen et al., 25 Apr 2025).

Unlimited-working-set in-cache RX path processing is enabled by separating fast-path per-packet operations (such as header parsing and packet classification) from infrequent state reconciliations, and by moving the fast-path operations entirely into on-chip caches or SRAM. This bounds per-packet latency and decouples it from the size of the global working set.
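One concrete consequence of this split is that per-flow fast-path state is typically packed so that a single lookup touches exactly one line of on-chip memory. The following C sketch is illustrative only; the field names and the 64-byte line size are assumptions, not details taken from any cited system.

```c
#include <stdint.h>

/* Hypothetical per-flow RX state packed into one 64-byte cache line, so
 * that a fast-path lookup touches exactly one line of on-chip memory. */
struct rx_flow_state {
    uint64_t flow_key;      /* hash of the 5-tuple used for demultiplexing */
    uint32_t next_expected; /* e.g., next expected sequence number         */
    uint32_t action;        /* cached classification verdict               */
    uint64_t pkt_count;     /* per-flow counters updated on the fast path  */
    uint64_t byte_count;
    uint8_t  pad[32];       /* pad to the assumed 64-byte line size        */
} __attribute__((aligned(64)));

_Static_assert(sizeof(struct rx_flow_state) == 64,
               "one flow entry must occupy exactly one cache line");
```

Keeping each entry within a single aligned line avoids false sharing between flows and makes the cost of a hit predictable, which is what bounds per-packet latency.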

2. Mechanisms of In-Cache RX Processing

The primary design pattern for unlimited-working-set in-cache RX path processing involves:

  • Dedicated On-Chip Caches: Using the SmartNIC or switch’s SRAM or equivalent on-chip memory to store a working set of active flow/rule state for rapid lookup at line rate.
  • Programmable Pipelines: Exploiting match-action pipelines or SmartNIC Arm cores to implement programmable logic for packet classification, header parsing, and flow state updates without main memory access on each packet.
  • State Miss Handling: If a lookup misses in the on-chip cache, a background thread or pipeline synchronizes state from host memory or an external controller into the cache. Per-packet RX logic never blocks on these events, guaranteeing non-blocking, wire-rate operation (see the sketch following this list).
  • Prefetching and Caching Policies: Advanced replacement and prefetch heuristics (e.g., LRU, LFU, or application-aware schemes) prioritize in-cache residency for the most active flows, effectively providing the illusion of an "unlimited" working set.
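The non-blocking miss-handling pattern above can be sketched in C as follows. The table layout, the default verdict, and miss_queue_push are assumptions made for illustration; real devices use vendor-specific rings and doorbells.

```c
#include <stdint.h>

#define CACHE_SLOTS 4096u          /* assumed on-chip table size */

struct rx_flow_state {
    uint64_t flow_key;             /* identity check for the hashed slot */
    uint32_t action;               /* cached classification verdict      */
};

static struct rx_flow_state *slots[CACHE_SLOTS];

/* Hypothetical: hands the key to a background thread that fetches the
 * entry from host memory and installs it; never runs on the fast path. */
static void miss_queue_push(uint64_t flow_key)
{
    (void)flow_key;                /* enqueue to a ring buffer in a real system */
}

/* Fast-path lookup: never blocks. On a miss the packet takes a default
 * verdict (buffer, drop, or punt) while the fill happens asynchronously. */
struct rx_flow_state *rx_lookup(uint64_t flow_key)
{
    struct rx_flow_state *e = slots[flow_key % CACHE_SLOTS];
    if (e && e->flow_key == flow_key)
        return e;                  /* hit: bounded in-cache latency        */
    miss_queue_push(flow_key);     /* miss: resolved off the critical path */
    return NULL;
}
```

The essential property is that rx_lookup has constant cost regardless of how many flows exist system-wide; only the hit rate, not the lookup latency, depends on the working set.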

For example, FlexiNS on the BlueField-3 SmartNIC utilizes an in-cache RX processing pipeline in which flow tables, access control lists, and even per-flow protocol state are kept resident in shared on-chip caches. The RX fast-path logic (packet parsing, flow demultiplexing, packet classification) is executed entirely in these caches, and only management, updates, or infrequent state lookups invoke more expensive DRAM operations or DMA fetches (Chen et al., 25 Apr 2025).

3. Concrete Realizations and Prototype Systems

FlexiNS is emblematic of this class of systems (Chen et al., 25 Apr 2025):

  • The SmartNIC-centric network stack's RX path caches RDMA QP state (e.g., access keys, WQE pointers, flow counters) and fast-path protocol state in the local L2 caches of the Arm cores.
  • Per-packet receive operations (header parsing, demultiplexing, ACK/NACK generation, protocol handling) are serviced from these caches, avoiding Arm-side DRAM or PCIe reads.
  • The "unlimited-working-set" claim holds because the control plane (running on host or management cores) can asynchronously evict, insert, or refresh RX state, while state lookup and action execution on RX packets remain bounded by on-chip cache hit latency (a generic sketch of this pattern follows the list).
  • This architecture allows FlexiNS to sustain 400 Gbps full-duplex packet rates for arbitrary numbers of QPs, constrained only by cache hit rates rather than by global QP or flow-table sizes.
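The asynchronous evict/insert/refresh behavior can be realized with a publish-by-pointer-swap discipline, so the RX fast path never takes a lock and never observes a half-written entry. The C11 sketch below is a generic illustration of that pattern under assumed structures; it is not FlexiNS code, and safe reclamation of the old entry (an RCU-style grace period) is elided.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed QP state layout, for illustration only. */
struct qp_state { uint32_t access_key; uint64_t wqe_ptr; uint64_t pkt_count; };

/* RX threads only ever load this pointer; the control plane publishes a
 * fully built replacement with a single atomic store.                   */
static _Atomic(struct qp_state *) qp_slot;

/* Fast path: one acquire load, after which reads are served from cache. */
const struct qp_state *rx_read_qp(void)
{
    return atomic_load_explicit(&qp_slot, memory_order_acquire);
}

/* Control plane: build the new entry off to the side, then publish it.
 * Freeing the previous entry safely is elided (grace period required).  */
void cp_refresh_qp(const struct qp_state *fresh)
{
    struct qp_state *copy = malloc(sizeof *copy);
    if (!copy)
        return;
    memcpy(copy, fresh, sizeof *copy);
    atomic_store_explicit(&qp_slot, copy, memory_order_release);
}
```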

A similar paradigm is observed in highly programmable switching ASICs, where match-action pipelines operate on large tables stored in on-chip TCAM/SRAM, combined with background table management to support arbitrarily large network-wide policies or routing tables (Goswami et al., 2020).

4. Comparative Performance and Implications

Empirical measurements from FlexiNS's implementation illustrate the value of unlimited-working-set RX path processing (Chen et al., 25 Apr 2025):

  • RX fast-path throughput is decoupled from the number of concurrent active flows or the total size of RX protocol state. For example, over hundreds of thousands of QPs, packet demultiplexing speed remains constant as long as the corresponding state fragments reside in cache.
  • Memory bandwidth usage is sharply reduced since cache fills and evictions only occur on state misses or table churn.
  • FlexiNS achieves 2.2× higher throughput than a microkernel-based baseline and 1.3× higher throughput than a hardware-offloaded baseline in representative datacenter data-plane benchmarks.
  • As the supported working set grows, the cache miss rate may rise. However, carefully tuned caching and synchronization strategies keep miss handling off the critical path, substantiating the "unlimited" working-set claim from an RX-processing perspective (the latency model below makes this concrete).
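A simple latency model makes the off-critical-path argument concrete. The notation here is ours, not drawn from the cited paper: let h be the cache hit rate, t_hit the on-chip hit latency, and t_miss the DRAM/PCIe fill latency.

```latex
% Synchronous miss handling: mean per-packet cost grows as 1-h grows.
T_{\mathrm{sync}} = h\,t_{\mathrm{hit}} + (1-h)\,t_{\mathrm{miss}}
% Asynchronous miss handling: the fast path always pays the hit latency;
% h affects only how many packets take the default verdict, not the
% per-packet processing latency.
T_{\mathrm{async}} = t_{\mathrm{hit}}
```

Under synchronous handling, throughput degrades linearly in the miss fraction; under asynchronous handling, the fast-path cost stays at t_hit regardless of working-set size, which is exactly the decoupling reported above.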

A direct implication of this architecture is the resulting decoupling of control and data plane operations. The RX path can process packets at line rate independent of background table maintenance, supporting full software transport programmability on the SmartNIC (Chen et al., 25 Apr 2025). The primary trade-off is that cache-efficient state representations and update policies require careful engineering:

System        RX-Path State Location    Working-Set Scalability    RX Throughput Dependency
CPU stack     Host DRAM                 Bounded by cache/DRAM      Linear in working-set size
Microkernel   CPU cache/DRAM            Bounded by DRAM            Degrades with large working set
FlexiNS       On-chip cache             Effectively unlimited      Constant, independent of set size

In contrast, pure hardware-offloaded solutions (e.g., programmable switches) offer wire-rate throughput but limited flexibility in updating working sets or protocol behavior (Goswami et al., 2020). CPU-centric approaches cannot achieve the same independence of RX throughput from working-set size, as cache and memory bandwidth remain the bottleneck.

5. Applications, Limitations, and Future Directions

Unlimited-working-set in-cache RX processing is fundamental for large-scale, high-performance network functions, including RDMA virtualization, distributed storage, and programmable load balancing at scale. However, effectiveness depends on:

  • The size and associativity of on-chip caches relative to the temporal locality of practical working sets (see the capacity sketch after this list).
  • The latency and bandwidth of cache miss handling components (e.g., DRAM channels, PCIe links, and DMA engines).
  • The efficiency of control-plane synchronization to avoid excessive cache thrashing or stale state.
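For the first of these factors, a trivial capacity estimate gives a sense of scale. All figures in this C sketch are illustrative assumptions, not specifications of any device.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not specifications of any device. */
    const uint64_t cache_bytes = 16u * 1024 * 1024; /* shared on-chip cache */
    const uint64_t entry_bytes = 64;                /* one line per flow    */
    const uint64_t resident    = cache_bytes / entry_bytes;

    /* ~262,144 resident entries: the "unlimited" claim holds while the
     * hot set of concurrently active flows stays under this bound;
     * colder flows are refilled by the slow path on demand.             */
    printf("cache-resident flow entries: %llu\n",
           (unsigned long long)resident);
    return 0;
}
```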

This approach enables systems such as FlexiNS to achieve high throughput for operations like block storage disaggregation and KVCache transfer, as demonstrated empirically (Chen et al., 25 Apr 2025). A plausible implication is continued research into adaptive cache management algorithms and hardware architectures that further expand on-chip cache sizes or enable hybrid in-cache/in-DRAM lookup pipelines, to better align wire speeds and software flexibility at scale.
