Architectural Disaggregation & Module-Level Parallelism

Updated 17 March 2026
  • Architectural disaggregation is the deliberate partitioning of monolithic systems into independent, scalable modules interconnected by high-bandwidth fabrics.
  • Module-level parallelism exploits concurrent operations within these modules to overcome latency and bandwidth constraints, enhancing overall performance.
  • Dynamic resource allocation, hardware-software co-design, and adaptive flow control are key strategies for addressing bottlenecks in disaggregated architectures.

Architectural disaggregation refers to the deliberate partitioning of previously monolithic compute, memory, or model structures into independently deployable, scalable modules interconnected by standardized, high-bandwidth fabrics. Module-level parallelism denotes the explicit exploitation of concurrency at the granularity of these disaggregated blocks—be they physical memory modules, compute chiplets, or modular neural network components—to accelerate throughput, hide remote access latency, and maximize resource utilization at scale. Disaggregation enables elastic scaling, heterogeneous integration, and dynamic resource pooling, while module-level parallelism is the principal mechanism for overcoming the challenges of increased access latency and bandwidth contention inherent to disaggregated architectures.

1. Foundations of Architectural Disaggregation

In the memory and systems domain, architectural disaggregation decouples resources such as memory and compute, allowing them to be independently scaled, pooled, or provisioned on demand. For example, CXL (Compute Express Link) memory systems expose remote DRAM devices to processors via standardized cache-coherent protocols over PCIe, instituting a two-tier hierarchy: local DDR memory with high bandwidth and concurrency, and remote CXL-attached memory with higher latency and reduced per-module parallelism (Yang et al., 22 Mar 2025). Disaggregation extends to chiplet architectures, where independently manufactured IP blocks are integrated at package-level via high-bandwidth, low-latency on-chip networks, and to deep learning inference, where network modules such as attention layers and feed-forward experts in transformer models are deployed on physically distinct acceleration resources (Fox et al., 2022, Zhu et al., 3 Apr 2025).
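As a rough illustration of why module-level parallelism matters in such a two-tier hierarchy, Little's law gives the number of outstanding cache-line requests needed to saturate a memory link of a given bandwidth and round-trip latency. The sketch below uses assumed bandwidth and latency figures, not measurements from the cited work.

```python
# Illustrative only: outstanding 64 B requests needed to saturate a memory
# link, by Little's law (bytes in flight = bandwidth * latency).
# The bandwidth/latency figures are assumptions, not from the cited papers.

def required_outstanding_requests(bandwidth_gbs: float, latency_ns: float,
                                  request_bytes: int = 64) -> float:
    """Outstanding requests needed to keep the link full (Little's law)."""
    bytes_in_flight = bandwidth_gbs * latency_ns  # 1 GB/s == 1 byte/ns
    return bytes_in_flight / request_bytes

# Hypothetical local DDR channel vs. remote CXL-attached module.
for name, bw, lat in [("local DDR channel", 38.4, 90.0),
                      ("remote CXL module", 32.0, 300.0)]:
    print(f"{name}: ~{required_outstanding_requests(bw, lat):.0f} "
          f"outstanding 64 B requests to stay saturated")
```

The higher round-trip latency of the remote tier translates directly into a larger concurrency requirement, which is exactly the module-level parallelism discussed in the remainder of this article.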

Fundamentally, disaggregation shifts the performance balance from intra-module locality and single-resource optimization towards inter-module communication, coordination, and parallel resource utilization. Consequently, the bottlenecks transition from on-die buses and shared memory fabrics to interconnect bandwidth, queue management, and out-of-module latency.

2. Module-Level Parallelism: Definitions and Measurement

Module-level parallelism refers to the degree of simultaneous independent operations that can be issued to or within each disaggregated module—such as the number of outstanding memory fetches a DRAM or CXL module can process, or the number of codelets executed concurrently across chiplets. In memory systems, this encompasses bank-level, channel-level, and inter-module concurrency (Yang et al., 22 Mar 2025). For disaggregated compute, it is precisely characterized via directed acyclic graphs (DAGs) derived from instruction traces or program representations: the out-degree of concurrent vertices quantifies the achievable memory-level parallelism (MLP) at any instant (Shen et al., 15 Dec 2025).

EDAN exemplifies this: by constructing the execution DAG from instruction traces, it exposes $W$ (the total number of cache-miss vertices) and $\mathcal{D}$ (the maximum number of dependent misses on a critical path), directly calculating parallelism as $\min(m, W/\mathcal{D})$, where $m$ is the maximum number of outstanding issues per cycle. In accelerator-domain chiplet or MoE architectures, module-level parallelism manifests as concurrent data-parallel or pipeline-parallel execution across functionally specialized chip modules or GPU micro-batch streams (Fox et al., 2022, Zhu et al., 3 Apr 2025).
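The following minimal sketch illustrates this DAG-based characterization. It is not EDAN's implementation; the miss vertices, dependency edges, and issue width are assumed inputs for the example.

```python
# Minimal sketch of DAG-based parallelism estimation in the spirit of EDAN
# (not the tool's actual implementation). Vertices are cache-miss events;
# an edge (u, v) means miss v depends on miss u.
from collections import defaultdict
from functools import lru_cache

def module_level_parallelism(vertices: set[int],
                             edges: list[tuple[int, int]],
                             m: int) -> float:
    """Return min(m, W / D): W = number of miss vertices, D = length of the
    longest chain of dependent misses, m = max outstanding issues per cycle."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    @lru_cache(maxsize=None)
    def chain_len(v: int) -> int:
        # Longest dependent-miss chain starting at v (graph assumed acyclic).
        return 1 + max((chain_len(s) for s in succ[v]), default=0)

    W = len(vertices)
    D = max(chain_len(v) for v in vertices)
    return min(m, W / D)

# Example: 6 misses, one 3-deep dependence chain, the rest independent.
verts = {0, 1, 2, 3, 4, 5}
deps = [(0, 1), (1, 2)]  # critical path 0 -> 1 -> 2, so D = 3
print(module_level_parallelism(verts, deps, m=8))  # min(8, 6/3) = 2.0
```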

Quantitatively, module-level parallelism is measured by:

  • Peak and sustained bandwidth per module, scaling with the number of concurrently issuing sources (Yang et al., 22 Mar 2025).
  • Effective queue depth and request queue occupancy under load, reflecting the hardware’s ability to absorb concurrent accesses.
  • Critical-path span and schedule-based metrics in execution DAG models, bounding performance gains under various $m$ and latency regimes (Shen et al., 15 Dec 2025).

3. Disaggregated Systems: Architectural Mechanisms

Disaggregated memory systems adopt several architectural structures to maximize module-level parallelism:

Memory Systems (CXL, DaeMon, Tiered DDR–CXL)

  • Explicit Channel/Module Provisioning: Scaling the number of CXL or DDR channels increases bandwidth and parallelism linearly until limited by protocol or device constraints (Yang et al., 22 Mar 2025).
  • Dynamic Memory Request Control (MIKU): Monitors per-module queue service times via hardware PMUs, dynamically throttles CXL-originated requests during congestion to preserve DDR performance, and adapts access patterns at runtime (Yang et al., 22 Mar 2025); a simplified feedback-controller sketch follows this list.
  • Bandwidth Partitioning and Decoupled Engines (DaeMon): Independent sub-block and page engines per CC/MC pair, with hardware partitioners enforcing service rate splits, adaptive granular request scheduling, and link-level lossless compression to optimize bandwidth utilization and fetch latency, ensuring parallel progress of bulk and fine-grained accesses (Giannoula et al., 2023).
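The control loop below is an illustrative sketch in the spirit of such dynamic request throttling; the target service time, adjustment step, and monitoring window are assumptions for the example, not MIKU's actual parameters or algorithm.

```python
# Illustrative feedback-driven throttling of remote (CXL-originated) requests.
# Thresholds and step sizes are assumptions, not values from the cited work.

class RemoteRequestThrottle:
    def __init__(self, target_service_ns: float = 120.0,
                 min_rate: float = 0.1, step: float = 0.05):
        self.target = target_service_ns   # acceptable DDR queue service time
        self.rate = 1.0                   # fraction of remote requests admitted
        self.min_rate = min_rate
        self.step = step

    def update(self, measured_service_ns: float) -> float:
        """Adjust the admitted fraction of remote requests once per monitoring
        window, based on the queue service time reported by PMU counters."""
        if measured_service_ns > self.target:
            # Congestion: throttle remote traffic to protect local DDR requests.
            self.rate = max(self.min_rate, self.rate - self.step)
        else:
            # Headroom: gradually re-admit remote traffic.
            self.rate = min(1.0, self.rate + self.step)
        return self.rate

throttle = RemoteRequestThrottle()
for sample_ns in [100, 150, 180, 160, 130, 110, 95]:  # per-window measurements
    admitted = throttle.update(sample_ns)
    print(f"service={sample_ns} ns -> admit {admitted:.2f} of remote requests")
```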

Compute Chiplet Architectures (Codelet Model)

  • Fine-Grained Task Graph Scheduling: Work is statically or dynamically mapped to chiplet-local compute units via codelet graphs, with hardware-supported dependency counters, streaming FIFOs, and explicit data-movement engines for inter-chiplet transfers (Fox et al., 2022); a minimal software sketch of the dependency-counter mechanism follows this list.
  • Coherence Domain Restriction: Data coherence is scoped to clusters or chiplets to remove global bottlenecks, with explicit point-to-point DMA transfers for inter-module communication.
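The sketch below shows dependency-counter scheduling over a small codelet graph in software; the graph, chiplet assignment, and codelet bodies are hypothetical, and the per-chiplet queues stand in for chiplet-local work queues.

```python
# Minimal sketch of dependency-counter scheduling over a codelet graph.
# The example graph and chiplet assignment are invented for illustration.
from collections import deque

class Codelet:
    def __init__(self, name: str, deps: int, chiplet: int):
        self.name = name
        self.remaining_deps = deps   # software analogue of a dependency counter
        self.chiplet = chiplet
        self.successors = []

def run(codelets):
    # Per-chiplet ready queues stand in for chiplet-local work queues.
    ready = {}
    for c in codelets:
        if c.remaining_deps == 0:
            ready.setdefault(c.chiplet, deque()).append(c)
    while any(ready.values()):
        for chiplet, queue in list(ready.items()):
            if not queue:
                continue
            c = queue.popleft()
            print(f"chiplet {chiplet}: executing {c.name}")
            for s in c.successors:            # signal completion to successors
                s.remaining_deps -= 1
                if s.remaining_deps == 0:     # counter hit zero: codelet ready
                    ready.setdefault(s.chiplet, deque()).append(s)

# Example: a -> {b, c} -> d, with b and c on different chiplets.
a, b, c, d = (Codelet("a", 0, 0), Codelet("b", 1, 0),
              Codelet("c", 1, 1), Codelet("d", 2, 1))
a.successors = [b, c]; b.successors = [d]; c.successors = [d]
run([a, b, c, d])
```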

Composable GPUs (COPA-GPU)

  • Separation of Compute and Memory Modules: Streaming multiprocessors reside in their own chiplet; L3 and DRAM modules are split into adjacent chiplets interconnected with ultra-high-bandwidth links (Fu et al., 2021).
  • L2→L3/DRAM Switches and Bank-Parallel Structures: Microarchitectural switches route global memory traffic to independent memory-system modules (MSMs), each with banked L3 caches and HBM stacks, maximizing concurrent servicing and bandwidth scaling; a toy interleaving sketch follows this list.
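As a toy illustration of how interleaving spreads traffic across modules and banks so that all of them service requests concurrently, the sketch below maps addresses to (module, bank) pairs; the module counts and bit slicing are invented for the example and are not COPA-GPU's actual mapping.

```python
# Toy address interleaving across memory-system modules (MSMs) and L3 banks.
# Module/bank counts and the address slicing are illustrative assumptions.

N_MSM = 4            # memory-system module chiplets
BANKS_PER_MSM = 16   # banked L3 slices per module
LINE_BYTES = 128

def route(addr: int) -> tuple[int, int]:
    """Map a physical address to (msm, bank) so consecutive cache lines
    land on different modules, keeping every module busy under streaming."""
    line = addr // LINE_BYTES
    msm = line % N_MSM
    bank = (line // N_MSM) % BANKS_PER_MSM
    return msm, bank

# A streaming access pattern touches every MSM before reusing one.
for addr in range(0, 8 * LINE_BYTES, LINE_BYTES):
    msm, bank = route(addr)
    print(f"addr {addr:#06x} -> MSM {msm}, bank {bank}")
```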

Disaggregated Deep Learning (MoE, Attention–FFN Split)

  • Per-Module Parallel Strategy: Attention modules receive intra-batch data-parallel or tensor-parallel execution, while FFN experts operate in model-parallel or expert-parallel patterns, often residing on specialized, cost- or memory-optimized accelerators (Zhu et al., 3 Apr 2025).
  • Pipeline and Micro-Batch Pipelining: Ping-pong or three-batch overlap strategies pipeline micro-batch compute and communication to maintain high hardware utilization across disaggregated roles (Zhu et al., 3 Apr 2025, Liu et al., 10 Feb 2026); a simple overlap estimate is sketched after this list.
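A simple analytic sketch shows why two-micro-batch ("ping-pong") overlap helps: with compute and transfer overlapped, the steady-state step time drops from their sum to roughly their maximum. The per-micro-batch times below are assumptions, not measurements from the cited systems.

```python
# Analytic sketch of ping-pong micro-batch overlap between disaggregated
# attention and FFN roles; the per-micro-batch times are hypothetical.

def step_time(compute_ms: float, comm_ms: float, overlapped: bool) -> float:
    """Steady-state time to advance one micro-batch through a module."""
    if overlapped:
        # While micro-batch A computes, micro-batch B's activations are in
        # flight, so the longer of the two phases dominates.
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms   # serialized compute, then transfer

compute, comm = 1.8, 1.5   # ms per micro-batch (assumed)
serial = step_time(compute, comm, overlapped=False)
pingpong = step_time(compute, comm, overlapped=True)
print(f"serial: {serial:.1f} ms/micro-batch, ping-pong: {pingpong:.1f} ms "
      f"({serial / pingpong:.2f}x throughput)")
```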

4. Performance Modeling and Theoretical Limits

The performance of disaggregated systems under module-level parallelism is fundamentally bounded by hardware and workload topology:

  • Brent-style Parallel Execution Bounds: For $W$ total misses, critical-path length $\mathcal{D}$, pipeline width $m$, and miss latency $\alpha$, the execution time is bounded below by $\max(\mathcal{D}, W/m)\,\alpha$ and above by $((W-\mathcal{D})/m + \mathcal{D})\,\alpha$ (Shen et al., 15 Dec 2025); these bounds are evaluated numerically in the sketch after this list.
  • Bandwidth Scaling Laws: Total attainable memory bandwidth is $BW_{1T} \cdot N_{mod}$, capped by the slowest module's bandwidth under heavy concurrency (Yang et al., 22 Mar 2025).
  • Roofline Models for Compute–Communication Tradeoff: Disaggregation in MoE inference is governed by the balance between arithmetic intensity, interconnect (scale-out/scale-up) bandwidth, and operator active time, resulting in well-defined “dead zones” where communication, not compute, becomes the limiter (Liu et al., 10 Feb 2026). In attention–FFN disaggregation, increased FFN instances yield diminishing returns when inbound tokens per rank are capped by network bandwidth rather than local compute capacity.
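These bounds and the scaling law are easy to evaluate directly. The sketch below plugs assumed values of $W$, $\mathcal{D}$, $m$, $\alpha$, and per-module bandwidth into the formulas above; the reading of the slowest-module cap is likewise an assumption of the sketch, not a result from the cited papers.

```python
# Evaluate the Brent-style execution-time bounds and the bandwidth scaling
# law for assumed workload and hardware parameters (illustrative numbers).

def brent_bounds(W: int, D: int, m: int, alpha_ns: float) -> tuple[float, float]:
    """Bounds on miss-handling time (ns):
    max(D, W/m) * alpha  <=  T  <=  ((W - D)/m + D) * alpha."""
    lower = max(D, W / m) * alpha_ns
    upper = ((W - D) / m + D) * alpha_ns
    return lower, upper

def aggregate_bandwidth(bw_per_module_gbs: float, n_modules: int,
                        slowest_module_gbs: float) -> float:
    """BW_total = BW_1T * N_mod, with each module capped at the slowest
    module's sustainable rate under heavy concurrency (assumed reading)."""
    return min(bw_per_module_gbs, slowest_module_gbs) * n_modules

lo, hi = brent_bounds(W=10_000, D=400, m=16, alpha_ns=300.0)
print(f"miss-handling time between {lo / 1e3:.0f} us and {hi / 1e3:.0f} us")
print(f"aggregate BW: {aggregate_bandwidth(32.0, 8, 28.0):.0f} GB/s")
```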

A synthesis of these models underscores that parallelism within and across modules is necessary but not sufficient—interconnect bandwidth, queue depth, and scheduling granularity are co-determinants of end-to-end system throughput and utilization.

5. Empirical Case Studies and Design Implications

Multiple benchmarks and architectural studies offer empirical quantification of module-level parallelism:

| System / Workload | Parallelism-Limiting Metric | Parallelism Enhancement |
|---|---|---|
| PolyBench (dense kernels) (Shen et al., 15 Dec 2025) | $\mathcal{D} =$ const as $N \rightarrow$ large | High MLP via in-flight misses ($W \gg \mathcal{D}$) |
| HPCG, LULESH (sparse HPC) (Shen et al., 15 Dec 2025) | Large $W$, moderate $\mathcal{D}$ | Caches reduce $W$/$\mathcal{D}$ by >70% |
| CXL-enabled DDR–CXL (Yang et al., 22 Mar 2025) | DDR BW loss of 81% under naive concurrency | MIKU restores BW by throttling CXL |
| MegaScale-Infer (MoE inference) (Zhu et al., 3 Apr 2025) | Module-specific pipeline/copy bottlenecks | Up to 1.90× throughput, 4.2× lower comm. latency |
| DaeMon (disaggregated CC/MC) (Giannoula et al., 2023) | Cache-line / page traffic interference | Bandwidth partitioning, parallel DMA engines |

Key insights include:

  • For memory bandwidth–limited codes, high MLP (low $\mathcal{D}$ relative to $W$) enables performance retention under disaggregation by amortizing remote-access latency over many in-flight requests. Data-oblivious kernels (e.g., matrix-multiply) are robust; pointer-chasing codes are not (Shen et al., 15 Dec 2025).
  • Hardware and system software must provision request issue slots or engines (e.g., multiple CXL/MM modules, sub-block DMA engines) commensurate with the workload's native parallelism; under-provisioning induces queueing bottlenecks.
  • Dynamic, feedback-driven resource management (MIKU in tiered memory, selection granularity in DaeMon) is necessary to avoid head-of-line blocking and unfair bandwidth contention.
  • In neural network inference, module-level pipeline depth and balanced mapping are key; micro-batching and efficient communication libraries (M2N) ameliorate the effects of asymmetric resource demand (Zhu et al., 3 Apr 2025).
  • Attention–FFN disaggregation (AFD) in MoE is beneficial only if interconnects are superpod-class and expert granularity is coarse; otherwise, bandwidth dead zones and imbalance penalties dominate (Liu et al., 10 Feb 2026).

6. Limitations, Practical Design Guidelines, and Applicability

While disaggregation and module-level parallelism enable scalability and heterogeneity, several fundamental trade-offs and caveats are documented:

  • The benefit of additional module-level parallelism saturates under bandwidth or queueing bottlenecks; there exists a regime (“dead zone”) where further scaling of modules or micro-batching cannot increase throughput (Liu et al., 10 Feb 2026).
  • Discrete role scaling (AFD) is more susceptible to imbalance and pipeline bubbles than continuous adjustment schemes (EP); performance penalty curves reveal that continuous tuning provides superior stability except at certain integral ratios.
  • For real-world memory systems, static resource allocations (e.g., fixed throttling or bandwidth reservations) cause underutilization; dynamic monitoring and adaptive flow control are preferred (Yang et al., 22 Mar 2025, Giannoula et al., 2023).
  • Small, high-speed caches remain critical adjuncts to module-level parallelism, absorbing locality and reducing remote-access burden in memory-disaggregated designs (Shen et al., 15 Dec 2025).
  • Disaggregation is most effective when combined with architecture-aware software (e.g., data placement, pipeline depth) and hardware co-design (e.g., fine-grained DMA, message-passing, or credit-based flow control).

A plausible implication is that future systems must expose explicit module-level parallelism not only via hardware primitives, but also via software abstractions that can dynamically adapt mapping, pipelining, and resource allocation based on observed workload and network regime.

7. Future Directions and Research Opportunities

The frontiers of architectural disaggregation and module-level parallelism are strongly influenced by advances in on-package interconnects, system-level network fabrics, and adaptive, programmable flow-control mechanisms. Trends include:

  • Increased chiplet heterogeneity and integration of domain-specific accelerators under uniform program execution models, leveraging stateless task graphs for cross-module execution (Fox et al., 2022).
  • Enhanced system software for dynamic data/resource placement in tiered disaggregated environments (e.g., hot/cold page migration, per-module monitoring).
  • Performance modeling tools (e.g., EDAN-style DAG analyzers) that accelerate design-space exploration of disaggregated architectures without elaborate hardware prototyping (Shen et al., 15 Dec 2025).
  • New memory system protocols and DMA interfaces capable of fine-grained bandwidth partitioning, decomposable granularity, and robust consistency maintenance across high-latency fabrics (Giannoula et al., 2023).
  • Co-design of distributed model architectures and deployment strategies in large neural inference—balancing expert sparsity, micro-batch pipelining, and hardware mapping under measured cluster interconnect topologies (Zhu et al., 3 Apr 2025, Liu et al., 10 Feb 2026).

Ultimately, the field is converging toward architectures and software stacks where disaggregation and module-level parallelism are first-class, transparently orchestrated abstractions that enable efficient, scalable, and heterogeneous deployment across the full spectrum of high-performance computing and machine learning workloads.
