- The paper introduces Octopus, a novel design for CXL memory pooling that uses asymmetric topologies and small Multi-Headed Devices (MHDs) to connect more hosts at lower cost per host.
- The design classifies asymmetric topologies (Regular, Dense, Sparse) and shows that configurations can connect up to three times as many hosts with a 17% reduction in cost per host compared to existing symmetric designs.
- Practical considerations include software-based memory interleaving at page granularity to manage bandwidth and dynamic memory allocation strategies tailored for different use cases like resource pooling and communication queues.
The paper introduces "Octopus," a cost-effective design for Compute Express Link (CXL) memory pools, addressing the limitations of existing proposals that rely on expensive CXL switches or large multi-headed devices (MHDs). The central concept involves asymmetric CXL topologies, where hosts connect to distinct sets of small CXL devices, facilitating memory pooling and sharing across multiple hosts. This approach leverages readily available hardware and demonstrates a trade-off between CXL pod size and cost per host. The authors assert that Octopus enhances the Pareto frontier compared to existing policies, potentially connecting three times as many hosts with a 17% reduction in cost per host.
The paper highlights the significance of CXL as an interconnect standard that enables memory disaggregation in data centers. CXL provides memory semantics with bandwidth scaling comparable to PCIe but with lower latency. The adoption of CXL by CPU vendors, device manufacturers, and data center operators underscores its importance. While CXL allows multiple hosts to access a shared memory pool, the standard lacks specific guidelines for constructing such a pool.
The paper uses the term CXL pool to describe a specific CXL pooling device; a server may be connected to multiple CXL pools. The term CXL pod refers to a set of servers that connect to overlapping sets of CXL pooling devices and can therefore share CXL memory resources. The paper aims to maximize CXL pod size (number of hosts) while minimizing the price per host, which is primarily determined by the choice of CXL pooling devices.
The paper reviews CXL concepts, controller designs, and topologies, emphasizing the CXL.mem protocol where CPUs forward load/store instructions to an external CXL memory controller (EMC). EMCs connect to CPUs via CXL ports with physical PCIe pins and cables, with cable length limitations due to signal integrity. An EMC typically supports multiple DRAM channels (DDR4 or DDR5). CPUs often have multiple CXL ports and interleave data across multiple EMCs at a 256B granularity.
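To make the 256B hardware interleaving concrete, the following Python sketch maps a physical address to an EMC index under a simple round-robin scheme; the modulo-based address layout and the four-EMC example are assumptions for illustration, not details from the CXL specification or the paper.

```python
# Round-robin interleaving of physical addresses across EMCs at a 256 B
# granularity (assumed address layout; real CPUs may hash or offset bits).
INTERLEAVE_GRANULARITY = 256  # bytes

def emc_for_address(phys_addr: int, num_emcs: int) -> int:
    """Return the index of the EMC that would serve this physical address."""
    return (phys_addr // INTERLEAVE_GRANULARITY) % num_emcs

# With 4 EMCs, consecutive 256 B blocks rotate across EMCs 0..3.
for addr in (0x000, 0x100, 0x200, 0x300, 0x400):
    print(hex(addr), "->", emc_for_address(addr, num_emcs=4))
```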
MHDs, featuring multiple CXL ports, enable multiple CPUs to access the same memory. The paper presents a Microsoft MHD with two x8 CXL ports as an example. Besides EMCs and MHDs, CXL switches offer multiple CXL ports and facilitate packet forwarding.
The paper analyzes feasible MHD designs and estimates die area as a proxy for cost. Estimates are based on a 5nm MHD die area model by ARM and the 6nm design of the AMD Zen4 IO Die. MHDs with varying numbers of CXL ports and DDR5 channels (XSmall, Small, Large, XLarge) are compared. The die area estimates for the XSmall, Small, and Large MHDs are 30, 69, and 181 mm², respectively, with the XLarge larger still. The increase in area is driven largely by the number of IO pads required for the CXL and DDR5 ports.
These die area estimates are translated into relative cost estimates, assuming a standard 300mm wafer and Murphy's model of die yield. The XSmall and Small MHDs are 0.19x and 0.42x the cost of the Large MHD, respectively, while the XLarge MHD is 3x the cost.
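The cost translation can be sketched in a few lines. The die areas come from the text above, but the defect density and the dies-per-wafer approximation are assumptions for illustration, so the resulting ratios will not exactly match the 0.19x/0.42x/3x figures, which depend on cost-model details not restated here.

```python
import math

WAFER_DIAMETER_MM = 300.0          # standard 300 mm wafer (from the text)
DEFECT_DENSITY_PER_MM2 = 0.001     # assumed defect density; not given in the paper

def dies_per_wafer(die_area_mm2: float) -> float:
    """Common approximation for gross dies on a circular wafer."""
    r = WAFER_DIAMETER_MM / 2
    return (math.pi * r ** 2) / die_area_mm2 - (math.pi * WAFER_DIAMETER_MM) / math.sqrt(2 * die_area_mm2)

def murphy_yield(die_area_mm2: float) -> float:
    """Murphy's model of die yield: ((1 - e^{-AD}) / (AD))^2."""
    ad = die_area_mm2 * DEFECT_DENSITY_PER_MM2
    return ((1 - math.exp(-ad)) / ad) ** 2

def relative_cost(die_area_mm2: float, reference_area_mm2: float) -> float:
    """Cost per good die, normalized to the reference (Large) MHD."""
    cost = lambda a: 1.0 / (dies_per_wafer(a) * murphy_yield(a))
    return cost(die_area_mm2) / cost(reference_area_mm2)

# Die areas from the text: XSmall 30, Small 69, Large 181 mm^2.
for name, area in [("XSmall", 30), ("Small", 69), ("Large", 181)]:
    print(f"{name}: {relative_cost(area, 181):.2f}x the Large MHD")
```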
The paper discusses existing CXL pod designs, including switched CXL pods and symmetric CXL pods based on MHDs. Switched designs offer flexibility but suffer from higher power consumption, cost, and latency due to the need for serialization and deserialization of CXL packets. Symmetric CXL pods, where CPUs connect directly to MHDs, reduce cost, power, and latency compared to switched topologies. In symmetric topologies, all CPUs connect to the same set of MHDs, with the MHD port count determining the CXL pod size.
The Octopus design extends CXL pods to include more CPUs than an N-ported MHD would typically allow. By employing multiple MHDs and connecting each MHD to a distinct set of servers, Octopus achieves larger CXL pod sizes. For instance, a pod built from small 4-ported MHDs can connect 13 hosts, reducing the cost per host.
The paper introduces a classification of Octopus topologies:
- Regular: Any pair of hosts in the CXL pod connects to exactly one MHD.
- Dense: Pairs of hosts connect to more than one MHD.
- Sparse: Some pairs of hosts share no MHD, so they can be arbitrarily far apart in the topology.
Regular Octopus topologies grow the pod size multiplicatively with the number of CXL ports per host (X): with N-ported MHDs, each host shares an MHD with X(N-1) distinct peers, giving a pod of H = X(N-1)+1 hosts (13 for X=N=4), whereas symmetric topologies keep a constant pod size as X grows. Sparse Octopus topologies scale at O(X²) or faster. A regular Octopus topology can be formalized as a balanced incomplete block design (BIBD), arranging v=H treatments in b=M blocks of size k=N each.
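To make the regular-topology property and the BIBD view concrete, here is a Python sketch. It uses the Fano plane (a BIBD with v=7, b=7, k=3) as a hypothetical regular Octopus topology built from 3-ported MHDs and hosts with 3 CXL ports, checks that every pair of hosts shares exactly one MHD, and checks the pod-size count H = X(N-1)+1 against the 13-host example above; the mapping of this particular design onto hardware is illustrative, not taken from the paper.

```python
from itertools import combinations

# Hypothetical regular Octopus topology: 7 hosts, 7 three-ported MHDs.
# Each block lists the hosts attached to one MHD (the Fano plane, a BIBD
# with v=7 treatments in b=7 blocks of size k=3).
mhd_to_hosts = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5),
]

# Regular-topology check: every pair of hosts shares exactly one MHD.
shared = {pair: 0 for pair in combinations(range(7), 2)}
for hosts in mhd_to_hosts:
    for pair in combinations(sorted(hosts), 2):
        shared[pair] += 1
assert all(count == 1 for count in shared.values())

# Pod-size count for regular topologies: each host with X ports reaches
# X * (N - 1) distinct peers, so H = X * (N - 1) + 1.
def regular_pod_size(host_ports_x: int, mhd_ports_n: int) -> int:
    return host_ports_x * (mhd_ports_n - 1) + 1

assert regular_pod_size(3, 3) == 7    # the Fano-plane example above
assert regular_pod_size(4, 4) == 13   # the 13-host example from the text
print("regular-topology properties hold")
```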
The paper addresses memory interleaving across MHDs, crucial for achieving high memory bandwidth in CXL pods. In Octopus topologies, hardware interleaving is challenging due to uneven memory allocation across MHDs. The authors propose software-based interleaving policies at page granularity (4kB, 2MB, 1GB), enabling dynamic interleaving decisions based on capacity ratios.
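A minimal sketch of such a policy, assuming a hypothetical allocator that places each page on the MHD currently furthest below its capacity-proportional share (the names and the specific heuristic are illustrative, not the paper's implementation):

```python
# Capacity-weighted page placement: each new page goes to the MHD whose
# allocation is furthest below its share of total capacity.
def interleave_pages(num_pages: int, capacity_per_mhd: dict[str, int]) -> list[str]:
    allocated = {mhd: 0 for mhd in capacity_per_mhd}
    placement = []
    for _ in range(num_pages):
        # Pick the MHD that would stay most "under-served" relative to capacity.
        mhd = min(allocated, key=lambda m: (allocated[m] + 1) / capacity_per_mhd[m])
        allocated[mhd] += 1
        placement.append(mhd)
    return placement

# A host connected to two MHDs with a 2:1 capacity ratio ends up with
# two thirds of its pages on mhd0: ['mhd0', 'mhd0', 'mhd1', 'mhd0', 'mhd0', 'mhd1']
print(interleave_pages(6, {"mhd0": 2, "mhd1": 1}))
```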
The paper discusses memory allocation strategies for three use cases:
- Resource Pooling: Pool capacity is allocated proportionally to the available capacity on all connected MHDs, balancing memory bandwidth and long-term system fairness (see the sketch after this list).
- Shuffling and Communication Queues: For regular Octopus topologies, a shuffling input queue and/or message receive queue is allocated for every pair of hosts on their common MHD, ensuring consistent communication latency.
- Pod-wide AllGather and Transactions: These patterns require multi-hop communication in Octopus topologies, making them less suitable than symmetric topologies for this use case.
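As referenced in the resource-pooling item above, here is a minimal Python sketch of proportional allocation under the assumption that a host's request is simply split in proportion to the free capacity of its connected MHDs; function and variable names are illustrative, not the paper's API.

```python
# Resource pooling: split a host's request across its connected MHDs in
# proportion to each MHD's currently available capacity.
def allocate_proportionally(request_bytes: int, free_bytes: dict[str, int]) -> dict[str, int]:
    total_free = sum(free_bytes.values())
    if request_bytes > total_free:
        raise MemoryError("not enough pooled capacity across connected MHDs")
    shares = {mhd: request_bytes * free // total_free for mhd, free in free_bytes.items()}
    # Hand any rounding remainder to the MHD with the most free capacity.
    shares[max(free_bytes, key=free_bytes.get)] += request_bytes - sum(shares.values())
    return shares

# A 12 GB request on a host whose three MHDs have 60/30/30 GB free lands as
# 6/3/3 GB, keeping bandwidth use and long-term capacity consumption balanced.
GB = 1 << 30
print(allocate_proportionally(12 * GB, {"mhd0": 60 * GB, "mhd1": 30 * GB, "mhd2": 30 * GB}))
```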
The paper presents practical Octopus examples, considering that hosts commonly connect to devices using 32-64 CXL 2.0 / PCIe5 lanes. The paper analyzes several practical configurations, considering their pod size and the number of MHDs needed. Symmetric topologies define a Pareto frontier where costs increase with CXL pod size, but Octopus topologies enhance this frontier by achieving larger pod sizes at equal or lower cost.
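One way to see how the frontier shifts is to account for cost per host as (number of MHDs × relative MHD cost) / pod size. The sketch below does this for one assumed symmetric configuration and one assumed regular Octopus configuration; the port counts and relative costs are illustrative choices, not the paper's specific design points.

```python
# Cost-per-host accounting for a symmetric pod versus a regular Octopus pod.
# The configurations below (port counts and relative MHD costs) are assumed
# for illustration and do not reproduce the paper's exact 3x / 17% figures.
def symmetric_pod(host_ports_x: int, mhd_ports_n: int, mhd_cost: float):
    pod_size = mhd_ports_n          # every host connects to every MHD
    num_mhds = host_ports_x         # one MHD per host CXL port
    return pod_size, num_mhds * mhd_cost / pod_size

def regular_octopus_pod(host_ports_x: int, mhd_ports_n: int, mhd_cost: float):
    pod_size = host_ports_x * (mhd_ports_n - 1) + 1
    num_mhds = pod_size * host_ports_x // mhd_ports_n   # BIBD counting: b = v*r/k
    return pod_size, num_mhds * mhd_cost / pod_size

# Hosts with 4 CXL ports; symmetric uses assumed 8-port "Large" MHDs
# (relative cost 1.0), Octopus uses assumed 4-port "Small" MHDs (cost 0.42x).
print(symmetric_pod(4, 8, 1.0))          # (8, 0.5)   -> 8 hosts at 0.5x per host
print(regular_octopus_pod(4, 4, 0.42))   # (13, 0.42) -> 13 hosts at 0.42x per host
```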
In summary, the paper introduces Octopus as a cost-effective and scalable design for CXL memory pooling, leveraging asymmetric topologies and small MHDs to improve upon existing switched and symmetric approaches. The analysis includes MHD design considerations, topological classifications, memory interleaving strategies, and memory allocation algorithms, culminating in practical deployment examples.