ImmCounter Primitive: RDMA Completion Notifications
- The ImmCounter primitive is a transport-agnostic mechanism that reliably tracks one-sided RDMA WriteImm completions without relying on message ordering.
- It aggregates completion counts across multiple NICs and GPU domains, enabling scalable and asynchronous operations critical for LLM architectures such as disaggregated inference and Mixture-of-Experts routing.
- The implementation uses atomic counters and background CQ polling to ensure thread safety and portability across hardware such as NVIDIA ConnectX and AWS EFA, eliminating vendor lock-in.
The ImmCounter primitive is a hardware-agnostic, point-to-point completion notification mechanism introduced in the TransferEngine library for RDMA WriteImm operations, designed to support modern, asynchronous LLM system architectures. ImmCounter addresses the challenge of robust completion detection for one-sided RDMA writes across heterogeneous network transports—which may or may not guarantee in-order delivery—and is engineered to be fully portable across NIC (Network Interface Controller) hardware, including NVIDIA ConnectX (reliable connections/RC) and AWS Elastic Fabric Adapter (EFA) (scalable reliable datagrams/SRD).
1. Rationale and Design Goals
The ImmCounter primitive was developed to satisfy several interlocking requirements fundamental to new LLM system patterns:
- Portability: Seamless operation across different RDMA hardware and transport protocols, especially environments that lack guaranteed message ordering (ConnectX RC vs. EFA SRD).
- Efficient and reliable notification: Accurate completion notification for one-sided WriteImm transfers without requiring implicit ordering assumptions or per-message handshakes.
- Scalability and asynchronicity: Support for high-concurrency, groupwise notification and multi-NIC/GPU sharding, essential in disaggregated inference, Mixture-of-Experts routing, and large-scale RL weight transfer.
- Simplicity and transparency: Avoids vendor lock-in by not relying on transport-specific notification features or GPU-initiated RDMA functionality.
2. Primitive Semantics and Mechanism
2.1. WriteImm Interaction
WriteImm is an RDMA operation that writes payload data to a target memory region at the receiver and carries a 32-bit immediate value alongside it. On the receive side, the immediate is delivered in a completion queue (CQ) event entry.
- Each WriteImm operation increments a software-maintained counter associated with its immediate value.
- The ImmCounter is a logical per-imm counter, tracking the number of completions observed for the immediate.
2.2. Completion Detection and Notification
ImmCounter provides precise completion semantics via direct CQ polling and immediate-count aggregation:
- The receiver registers its expectation for a given immediate using:

```rust
fn expect_imm_count(imm: u32, count: u32, cb: fn() -> ());
```

where `cb` is a callback invoked once `count` completions with immediate `imm` have been observed.
- CQ polling is performed in a background thread; each event of type WRITE_IMM increments the atomic counter for its immediate.
- Once the counter reaches the user-registered threshold, the callback is triggered.
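The registration-and-threshold semantics above can be sketched in a few lines of Rust. This is an illustrative, single-threaded model—the `ImmCounter` struct, its field names, and the `on_write_imm` hook are assumptions, not the TransferEngine implementation. It also covers the edge case where the expectation is registered only after the threshold has already been met.

```rust
use std::collections::HashMap;

/// Toy model of per-immediate completion counting with
/// threshold-triggered callbacks (illustrative names only).
struct ImmCounter {
    counts: HashMap<u32, u32>,                           // completions seen per imm
    expectations: HashMap<u32, (u32, Box<dyn FnMut()>)>, // (threshold, callback)
}

impl ImmCounter {
    fn new() -> Self {
        ImmCounter { counts: HashMap::new(), expectations: HashMap::new() }
    }

    /// Register `cb` to fire once `count` completions with immediate
    /// `imm` have been observed.
    fn expect_imm_count(&mut self, imm: u32, count: u32, mut cb: Box<dyn FnMut()>) {
        if *self.counts.get(&imm).unwrap_or(&0) >= count {
            cb(); // threshold already reached before registration
        } else {
            self.expectations.insert(imm, (count, cb));
        }
    }

    /// Invoked by the CQ-polling loop for each WRITE_IMM completion.
    fn on_write_imm(&mut self, imm: u32) {
        let seen = {
            let c = self.counts.entry(imm).or_insert(0);
            *c += 1;
            *c
        };
        let due = matches!(self.expectations.get(&imm), Some((count, _)) if seen >= *count);
        if due {
            let (_, mut cb) = self.expectations.remove(&imm).unwrap();
            cb(); // threshold reached: notify exactly once
        }
    }
}
```

Note that the callback fires exactly once per registered expectation, regardless of how CQ events interleave with registration.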
2.3. Independence from Transport Ordering
ImmCounter explicitly does not rely on any ordering guarantees from the underlying network transport. Instead:
- The only invariant is that all WriteImm events will eventually appear in the CQ.
- Arrival order is not considered; correctness depends solely on the total count of completions, not sequence.
- This design is essential for full compatibility between ConnectX RC (where in-order is possible) and EFA SRD (fundamentally out-of-order).
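A toy check of this invariant, assuming only that each CQ event carries its immediate: counting completions over two different arrival orders of the same events yields identical tallies. The `tally` helper is hypothetical.

```rust
use std::collections::HashMap;

/// Count completions per immediate value. Because correctness depends
/// only on the multiset of observed immediates, any permutation of the
/// event stream produces the same result.
fn tally(events: &[u32]) -> HashMap<u32, u32> {
    let mut counts = HashMap::new();
    for &imm in events {
        *counts.entry(imm).or_insert(0) += 1;
    }
    counts
}
```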
2.4. Multi-NIC and DomainGroup Aggregation
- Counters for each immediate value are managed per `DomainGroup`, the TransferEngine abstraction for a GPU's attached NICs (possibly more than one per GPU).
- Immediate arrival counts are aggregated across all NICs—critical for correct notification on sharded infrastructure.
- Sharding and balancing of notification and data transfer is handled transparently, so all WriteImm completions received for a GPU are globally tracked for that destination independently of the number or mapping of NICs.
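One way to picture the aggregation, as a sketch: each NIC's polling loop increments a single shared atomic for the destination GPU, so the threshold check observes the global total regardless of how transfers were sharded across NICs. The `aggregate_across_nics` helper and the per-NIC thread split are illustrative assumptions.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Each entry in `completions_per_nic` models the WRITE_IMM events one
/// NIC observes; one thread stands in for that NIC's CQ-polling loop.
/// All loops bump the same shared counter, yielding the global total.
fn aggregate_across_nics(completions_per_nic: &[u32]) -> u32 {
    let total = Arc::new(AtomicU32::new(0));
    let mut handles = Vec::new();
    for &n in completions_per_nic {
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            for _ in 0..n {
                total.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    total.load(Ordering::Relaxed)
}
```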
3. Implementation Overview
3.1. Threading and Synchronization
- Completion queue polling is executed by worker threads pinned to the same NUMA node as the associated GPU.
- Per-imm counters are implemented as atomics, guaranteeing thread safety and callback correctness even under heavy concurrency.
- Notifications are delivered via registered callbacks; flag-based or other synchronization mechanisms may be used for applications needing blocking semantics.
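The threading model can be sketched as follows, with an in-process channel standing in for the CQ and a condvar providing the flag-based blocking semantics mentioned above. The `wait_for_imm_count` function and its structure are assumptions for illustration, not the library's internals.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;

/// A background "poller" drains the channel (one value per WRITE_IMM
/// CQ entry), bumps an atomic counter for the watched immediate, and
/// signals a condvar when the threshold is reached so a caller can
/// block until completion. Returns the matching-event count observed.
fn wait_for_imm_count(events: Vec<u32>, imm: u32, threshold: u32) -> u32 {
    let count = Arc::new(AtomicU32::new(0));
    let gate = Arc::new((Mutex::new(false), Condvar::new()));
    let (tx, rx) = mpsc::channel::<u32>();

    let c = Arc::clone(&count);
    let g = Arc::clone(&gate);
    let poller = thread::spawn(move || {
        for ev in rx {
            if ev == imm {
                let seen = c.fetch_add(1, Ordering::AcqRel) + 1;
                if seen >= threshold {
                    let (lock, cv) = &*g;
                    *lock.lock().unwrap() = true; // flag for blocking waiters
                    cv.notify_all();
                }
            }
        }
    });

    // "Network" side: deliver completions, then close the channel so
    // the poller exits once it has drained every event.
    for ev in events {
        tx.send(ev).unwrap();
    }
    drop(tx);

    // Blocking wait, as an application needing sync semantics might do.
    let (lock, cv) = &*gate;
    let mut done = lock.lock().unwrap();
    while !*done {
        done = cv.wait(done).unwrap();
    }
    drop(done);
    poller.join().unwrap();
    count.load(Ordering::Acquire)
}
```

In a real deployment the poller would be pinned to the GPU's NUMA node and would poll the hardware CQ rather than a channel.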
3.2. Hardware-specific Optimizations
- On ConnectX hardware, two RC queue pairs per peer are employed: one for Send/Recv, another for Write/WriteImm. This avoids cross-traffic interference in CQ event handling, since receivers must disentangle mixed Receive and WriteImm events.
- For EFA, which provides only out-of-order SRD semantics, ImmCounter's arrival-counting approach handles reordering natively.
3.3. API Usage Pattern
Sender:
```rust
fn submit_single_write(
    len: u64,
    imm: Option<u32>,
    src: (MrHandle, Offset),
    dst: (MrDesc, Offset),
    cb: OnDone,
);
```

Receiver:

```rust
expect_imm_count(imm, count, cb);
```

The receiver's callback fires once the counter for `imm` reaches `count`. No ordering guarantees on CQ event arrival are assumed.
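A hypothetical end-to-end trace of this pattern: the sender issues several WriteImm operations tagged with one immediate, and the receiver's expectation is satisfied exactly when the matching completion count reaches the threshold. `run_transfer` is a stand-in that drives the counter directly in place of a real network and CQ.

```rust
/// Replay a stream of WRITE_IMM events (one per sender-side
/// submit_single_write) and report whether the receiver's
/// expectation for `imm` reaching `expected` was satisfied.
fn run_transfer(events: &[u32], imm: u32, expected: u32) -> bool {
    let mut seen = 0u32;
    let mut fired = false;
    for &ev in events {
        if ev == imm {
            seen += 1;
            if seen == expected && !fired {
                fired = true; // expect_imm_count's callback would run here
            }
        }
    }
    fired
}
```

Events carrying other immediates are simply counted under their own keys and never perturb this expectation.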
4. Integration with LLM System Patterns
ImmCounter is applied in production to support several key LLM communication patterns:
4.1. Disaggregated Inference
- Enables elastic scaling by allowing decoders to transfer KV Cache pages from dynamically chosen prefillers.
- The decoder registers how many KV pages it should receive with a given immediate, calls `expect_imm_count`, and begins decoding as soon as all have arrived—no explicit sender completion messages or global synchronization required.
4.2. Mixture-of-Experts Routing
- Low-latency, high-concurrency operation: MoE dispatch/combine phases may involve numerous bulk token transfers.
- ImmCounter allows the receiver to determine precisely when group-wise token arrival is complete, which enables phase transition without handshake or sequencing complexities.
- Efficient group operations (`submit_scatter`, `submit_barrier`) can use immediates for fine-grained notification.
4.3. Asynchronous Reinforcement Learning Updates
- Training GPUs may independently push new weight shards to inference GPUs.
- Each inference GPU uses ImmCounter to discover completion—when all weight shards have been received, a cluster-scale update completes (demonstrated at 1.3 seconds for trillion-parameter models).
- Removes the need for explicit rank-0 coordination and enables full bandwidth utilization; critical, given that with EFA, completions can arrive out-of-order.
5. Distinctive Properties, Advantages, and Comparison
5.1. Core Properties
- No reliance on message order: Semantics are determined exclusively by CQ event counts, never by sequence.
- Full transport agnosticism: No knowledge of RDMA hardware ordering or network-level delivery is required.
- Multi-NIC and multi-GPU aware: Counters reflect the aggregate state across all relevant hardware resources.
- Transparent and scalable: Works identically on either ConnectX (in-order possible) or EFA (out-of-order alone), making it uniquely suited for vendor-neutral, cloud-scale LLM deployments.
5.2. Benefits vs. Alternatives
| Property | ImmCounter | Collective Libraries/Ordered Schemes |
|---|---|---|
| Ordering sensitivity | None | Typically required |
| Vendor lock-in | None | Often significant |
| Multi-NIC transparency | Yes | Frequent manual intervention |
| Group notification support | Built-in | Typically nontrivial |
| Suited to asynchronous LLM | Yes | Fundamental limitations |
- Essential for workloads requiring decoupled, asynchronous transfer patterns, including those not accommodated by existing collective libraries.
- Removes reliance on vendor-specific ordering features (e.g., GPU-initiated RDMA or ConnectX-specific ordering), thus unlocking cross-cloud deployment.
6. Visual Summary and Program Logic
High-level logic (editor’s summary):
```
// Initialization
counters = dict<imm_value, AtomicInteger>()          // completions per imm
expectations = dict<imm_value, (count, callback)>()  // registered thresholds

function expect_imm_count(imm, count, callback):
    if counters[imm] >= count:
        callback()                        // already satisfied
    else:
        expectations[imm] = (count, callback)

// CQ polling thread
while true:
    event = poll_completion_queue()
    if event.type == WRITE_IMM:
        counters[event.imm] += 1
        if event.imm in expectations and counters[event.imm] >= expectations[event.imm].count:
            expectations.remove(event.imm).callback()
```
7. Application Impact and Portability
- ImmCounter is critical for enabling high-throughput, high-concurrency, and asynchronous operations in LLM clusters, as attested by production use cases demonstrating 400 Gbps throughput across ConnectX-7 and EFA hardware.
- By eliminating ordering dependencies, ImmCounter affords true interoperability between on-premises and cloud hardware, substantially reducing vendor lock-in for operators scaling LLM systems.
- Its efficient group synchronization and notification primitives substantially simplify distributed implementation of barriers, scatters/gathers, and many-to-one/one-to-many notification mechanisms essential in modern NLP infrastructure.
In summary, ImmCounter enables robust, scalable, and transport-agnostic notification of one-sided RDMA transfer completion, making it indispensable for disaggregated, dynamic, and heterogeneous LLM system architectures (Licker et al., 31 Oct 2025).