Domain Parallelism & Co-Design

Updated 6 April 2026

Domain parallelism is the use of workload-specific partitioning, such as tensor or spatial sharding, to distribute computation and increase throughput.
Co-design integrates algorithmic strategies with hardware architectures—using methods like RL-based search and DSLs—to optimize performance and energy efficiency.
Reported results show substantial speedups, from up to 3.5× in distributed LLM sharding to as much as 761× in genome analysis, highlighting the impact of combining parallelism with joint design approaches.

Domain parallelism denotes the exploitation of problem- or operator-specific partitioning to distribute computational load over multiple hardware units and maximize concurrency and throughput. The partitioning usually follows a domain inherent in the computation, such as tensor axes, spatial fields, frequency/wavelength, data tables, or circuit sub-blocks. Co-design refers to the systematic, often automated, joint optimization of algorithmic (software) and architectural (hardware) choices to improve efficiency, scalability, or energy per operation by leveraging domain parallelism. This paradigm is central in areas ranging from distributed LLM inference and DLRM training to quantum control, photonic computing, linear algebra, genome analysis, and stream-processing FPGAs (Yin et al., 29 Aug 2025, Mudigere et al., 2021, Agrawal et al., 2023, Ronde et al., 18 Nov 2025, Zhou et al., 31 Dec 2025, Gholami et al., 2017, Orenes-Vera et al., 2022, Sano, 2015, Zacharopoulos et al., 2022, Cali, 2021, Merchant et al., 2016).

1. Principles and Taxonomy of Domain Parallelism

Domain parallelism is parameterized along axes intrinsic to the underlying workload or algorithm:

Tensor and Operator Axes (Transformers/DNNs): Sharding along hidden, feed-forward, or attention head dimensions allows splitting computation across multiple NPUs (Tensor Parallelism, TP), sometimes further specialized for MoE architectures via Expert Parallelism (EP) (Yin et al., 29 Aug 2025, Agrawal et al., 2023).
Spatial/Spectral/Temporal/Mode Domains (Photonics): Parallel computation proceeds over wavelength-division multiplexed (WDM) channels, spatial waveguides, or time-division slots (Zhou et al., 31 Dec 2025).
Graph/Array/Sequence Domains (Sparse/Graph Kernels, Genomics): Partitioning nodes, edges, or sequence segments among tile-grids or specialized processing elements (Orenes-Vera et al., 2022, Cali, 2021).
Instruction/Controller Domains (Quantum): Broadcasting identical instructions across qubit controller subnets to execute parallel, same-parameter operations (Ronde et al., 18 Nov 2025).
Embedding Table Dimensions (RecSys): Row-wise and column-wise sharding of high-cardinality embeddings across workers, combined with table-wise and data-parallel axes (Mudigere et al., 2021).
Loop/Task/Pipeline Hierarchies: Replicating hardware units across loop trip-counts, independent functional kernels, or pipeline stages (Zacharopoulos et al., 2022).
Domain and Model Partitioning (DNN Training): Using a 2D process grid to simultaneously slice by model/domain and batch axes for optimal load balance (Gholami et al., 2017).
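
As an illustration of the tensor-axis case above, the following minimal sketch shards the output (column) axis of a linear layer across simulated devices and checks that the concatenated result matches the unsharded product. It is not taken from any of the cited systems; the function names (`shard_columns`, `column_parallel_matmul`) are hypothetical, and the "devices" are plain Python lists.

```python
"""Tensor (column) parallelism for a linear layer, in pure Python."""


def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def shard_columns(w, num_shards):
    """Split the output (column) axis of w into equal contiguous shards."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Each simulated device multiplies x by its own shard of w."""
    partials = [matmul(x, shard) for shard in shard_columns(w, num_shards)]
    # Concatenating the per-device outputs stands in for an all-gather
    # along the sharded axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, num_shards=2)
```

Here `sharded` equals `full`, because each output column depends only on the matching column of `w`; this independence is what makes the column axis a natural sharding domain.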

Domain parallelism is typically orthogonal or complementary to classic data (batch) parallelism, enabling scaling beyond the per-example concurrency limit or allowing strong scaling in bandwidth-bound applications.

2. Joint Algorithm-Hardware Co-Design Frameworks

Domain-parallel strategies achieve their full potential via integrated co-design approaches:

RL-based Joint Search: In distributed LLM inference, "Learn to Shard" (Yin et al., 29 Aug 2025) formulates the choice of (TP, EP, PP) degrees and operator sharding dimensions as a multi-discrete reinforcement learning problem. A transformer policy network attends over elite historical configurations to converge rapidly on high-throughput sharding strategies, reaching up to 3.5× higher throughput than metaheuristic baselines and about 1.06× over Megatron-LM heuristics.
Cross-Layer Photonic Co-Design: SimPhony (Zhou et al., 31 Dec 2025) coordinates device, circuit, architecture, and algorithm-level simulation, supporting domain-parallel mapping along spectral, spatial, and temporal modalities and injecting device-level constraints directly into ML graph partitioning and scheduling.
Task & Data Placement: Dalorex (Orenes-Vera et al., 2022) divides data structures into equal-sized local domains per tile, leverages task dispatch to local data via fine-grained task queues, and scales to >16,000 tiles with O(10¹²) edges/s throughput by aligning distributed memory, compute, and NoC design.
Hierarchical Hardware/Software Search: DEAP (Agrawal et al., 2023) unifies hardware settings (chip count, topology, on-chip/off-chip memory) and software partitioning (tensor vs pipeline splits) within a simulation-driven design space exploration, surfacing Pareto-optimal compute/fabric configurations.
Spatial/Temporal Design DSLs: SPD (Sano, 2015) expresses both computation and parallel structure for stream FPGAs, allowing high-level sweep over parallelism factors and automatic HDL generation.
Hierarchical Parallelism Extraction: Trireme (Zacharopoulos et al., 2022) models loop-, task-, and pipeline-level parallelism, mapping them to hardware accelerators under area constraints and automatically searching for speedup-maximizing configurations.
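
The search problems described above share a common shape: enumerate or sample configurations, score each with a cost model or simulator, and keep the best. The sketch below uses plain exhaustive enumeration over (TP, EP, PP) degrees rather than the RL policy of "Learn to Shard" or the simulators of DEAP. The cost function `toy_step_time` and all of its constants are hypothetical and chosen only to make the trade-off visible.

```python
"""Exhaustive search over (TP, EP, PP) degrees under a toy cost model."""
from itertools import product


def candidate_configs(num_devices):
    """Yield every (tp, ep, pp) whose product equals num_devices."""
    degrees = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    for tp, ep, pp in product(degrees, repeat=3):
        if tp * ep * pp == num_devices:
            yield tp, ep, pp


def toy_step_time(tp, ep, pp, work=64.0, comm=1.0, microbatches=8):
    """Hypothetical cost model (an assumption, not from any cited paper).

    Compute divides perfectly across devices, collective cost grows with
    the TP and EP degrees, and the pipeline bubble stretches the total.
    """
    compute = work / (tp * ep * pp)
    collective = comm * ((tp - 1) + 0.5 * (ep - 1))
    bubble = (pp - 1) / microbatches
    return (compute + collective) * (1.0 + bubble)


def best_config(num_devices):
    """Return the configuration with the lowest modeled step time."""
    return min(candidate_configs(num_devices), key=lambda c: toy_step_time(*c))


best = best_config(8)
```

With eight devices this toy model selects `(2, 4, 1)`. Exhaustive enumeration is only viable for tiny spaces like this one; once per-operator sharding dimensions are added, the space grows far too large, which is the motivation for learned or heuristic search.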

3. Quantitative Impact and Performance Results

Domain parallelism, when tightly coupled with algorithm-hardware co-design, yields large reported improvements:

"Learn to Shard" on H100 clusters (MoE models up to 1.6T params) achieves up to 3.5× higher throughput than metaheuristic baselines and 1.06× over Megatron heuristics (Yin et al., 29 Aug 2025).
Dalorex surpasses prior PIM systems (e.g. Tesseract) by 221× in runtime and 325× in energy on sparse/graph codes, with tile utilization >90% due to well-aligned data and task placement (Orenes-Vera et al., 2022).
Neo/ZionEX (RecSys): 4D domain-wise and batch sharding gives up to 40× faster time-to-solution on 12T-param DLRMs and reaches peak throughput (3.4M QPS, scaling at 75%) (Mudigere et al., 2021).
Quantum Compiler/Hardware Co-Design: Parallel instruction issue and reordering yield up to 16.5× average and 56.2× peak speedup on benchmarks, demonstrating the benefit of clustering parallel same-parameter gates (Ronde et al., 18 Nov 2025).
FPGA streaming (SPD): The optimal config for LBM fluid solver (1 pipeline×4-deep) attains >94 GFlop/s at 2.4 GFlop/s/W; overprovisioned spatial replication is bandwidth-limited (Sano, 2015).
Genome analysis (GenASM, BitMAc): Systolic domain parallel architectures yield 92–761× CPU/GPU speedups, 41–539× over baseline for sequence-to-graph alignment, and up to 34× improvements in throughput/W or mm² for ASIC (Cali, 2021).
Trireme: XR pipelines and DLA applications see 3.4–27× speedups (LLP), 6–21× (TLP/PP), 2–3× over prior one-level HW selection (Zacharopoulos et al., 2022).
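
The bandwidth limit on spatial replication noted in the FPGA streaming result can be illustrated with a simple roofline-style estimate. The model below is an assumption made for illustration and is not the performance model of the SPD work: it supposes that each spatial replica needs its own share of external bandwidth, while stages cascaded in time reuse data already on chip. All parameter values are hypothetical.

```python
"""Roofline-style estimate for spatial replication vs. temporal cascading."""


def sustained_gflops(spatial, temporal, stage_gflops, bytes_per_flop, bandwidth_gbs):
    """Estimate sustained GFlop/s for `spatial` replicas, each `temporal` stages deep.

    Assumption: external traffic scales with the number of spatial replicas
    only, so one level of the design is capped by
    min(compute roof, bandwidth roof) and cascading multiplies that level.
    """
    compute_roof = spatial * stage_gflops
    bandwidth_roof = bandwidth_gbs / bytes_per_flop
    return temporal * min(compute_roof, bandwidth_roof)


# Hypothetical device: 10 GFlop/s per stage, 2 bytes of external traffic
# per flop in a single stage, 30 GB/s of external bandwidth.
params = dict(stage_gflops=10.0, bytes_per_flop=2.0, bandwidth_gbs=30.0)

wide = sustained_gflops(spatial=4, temporal=1, **params)      # bandwidth-bound
balanced = sustained_gflops(spatial=2, temporal=2, **params)  # still bandwidth-bound
deep = sustained_gflops(spatial=1, temporal=4, **params)      # compute-bound
```

All three designs instantiate four stages, yet under these assumed numbers the deep cascade sustains the most work because it never exceeds the external bandwidth roof.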

4. Methodological Patterns and Architectures

The central architectures underpinning domain-parallel co-design include:

Process Grids and SUMMA: For DNN training, a P_r×P_c process grid supports arbitrary combinations of model/domain and batch parallelism, with communication complexity balanced across axes, and communication-efficient reduction of weight and activation partitions (Gholami et al., 2017).
Attention-over-History Networks: RL-driven policy networks use transformer architectures to attend over elite historical configurations, improving the exploration/exploitation balance across the joint space of sharding dimensions and parallelism degrees (Yin et al., 29 Aug 2025).
Task Scheduling Units (TSU): In Dalorex, occupancy-based queue ratios drive high core utilization and balance irregular, memory-bound kernels (Orenes-Vera et al., 2022).
Broadcast/DAG Clustering for Quantum: Hierarchical controller blocks issue broadcast instructions to clusters of node controllers; compiler reordering forms wide operand clusters (Ronde et al., 18 Nov 2025).
Hierarchical Sharding in DLRM: Simultaneous table-, row-, and column-wise sharding, combined with data-parallel replication, minimizes per-node memory, balances communication, and matches interconnect and batch shape (Mudigere et al., 2021).
SPD/Design Automation: DSL-based design flows automate the instantiation and cascade of hardware pipelines over spatial and temporal domains, subject to silicon and bandwidth constraints (Sano, 2015).
Photonic Tensor Cores: Meshes and multiplexing within photonic chips directly map neural computations over domain-parallel optical channels (Zhou et al., 31 Dec 2025).
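
To make the process-grid pattern concrete, the sketch below maps flat ranks onto a P_r×P_c grid and computes which slice of the model/domain axis and which slice of the batch each rank owns. The convention that grid rows index the model/domain axis and grid columns index the batch axis is an assumption made here, and the helper names are hypothetical rather than taken from the cited work.

```python
"""Index bookkeeping for a 2D (model/domain x batch) process grid."""


def grid_coords(rank, p_r, p_c):
    """Map a flat rank to (row, col) in a row-major p_r x p_c grid."""
    if not 0 <= rank < p_r * p_c:
        raise ValueError("rank out of range for this grid")
    return divmod(rank, p_c)


def block_range(index, parts, total):
    """Half-open [start, stop) owned by block `index` of `parts` near-equal blocks."""
    base, extra = divmod(total, parts)
    start = index * base + min(index, extra)
    return start, start + base + (1 if index < extra else 0)


def local_slices(rank, p_r, p_c, model_dim, batch_size):
    """Return the model-axis and batch-axis ranges held by `rank`."""
    row, col = grid_coords(rank, p_r, p_c)
    return {
        "model": block_range(row, p_r, model_dim),
        "batch": block_range(col, p_c, batch_size),
    }


# Six ranks arranged as a 2 x 3 grid over an 8-wide model axis and a batch of 10.
layout = [local_slices(r, 2, 3, model_dim=8, batch_size=10) for r in range(6)]
```

Ranks in the same grid row hold the same model slice and different batch slices, so they behave like a data-parallel group, while ranks in the same grid column split the model axis for one batch slice. Setting `p_r = 1` or `p_c = 1` recovers pure batch or pure model/domain parallelism.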

5. Communication, Scheduling, and Load-Balance Considerations

The efficacy of domain-parallel co-design is critically governed by:

Communication Complexity: Balanced process/partition grids minimize all-gather/all-reduce costs; nearest-neighbor (halo) exchanges for domain-parallel convolution are communication-optimal for wide images (Gholami et al., 2017).
Task/Data Placement: Uniform chunking, load-scrambling, and adaptive queue management prevent hot spots and keep hardware utilization near ideal (Orenes-Vera et al., 2022).
Scheduling Algorithms: Greedy, reinforcement-learning, and levelization-based methods efficiently explore vast configuration spaces, with more than 10⁹ options in LLM sharding (Yin et al., 29 Aug 2025).
Memory Bandwidth and Locality: FPGAs and PIMs expose domain parallelism only up to the point where memory bandwidth or on-chip resources saturate; algorithmic tunings (windowing, divide-and-conquer) ensure DP sections fit in fast local SRAM (Cali, 2021, Sano, 2015).
Trade-Off Analysis: Increasing spatial parallelism can overrun external bandwidth, while temporal pipelining is limited by pipeline fill time and on-chip resource pressure (Sano, 2015, Cali, 2021). Photonic parallelism is limited by phase/SNR variation and domain crosstalk as channel count increases (Zhou et al., 31 Dec 2025).
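
The halo exchange mentioned under communication complexity can be shown on a one-dimensional stencil, which is a convolution with an all-ones kernel of width 3. In this minimal sketch each partition borrows a single boundary element from each neighbor, and the concatenated partition outputs match the unpartitioned result. The partitions are simulated as list slices, and the function names are hypothetical.

```python
"""Halo exchange for a 1D, 3-point stencil over a partitioned domain."""


def stencil(values):
    """Sum each interior point with its two neighbors (no padding)."""
    return [values[i - 1] + values[i] + values[i + 1] for i in range(1, len(values) - 1)]


def split(values, parts):
    """Cut the domain into `parts` equal contiguous chunks."""
    if len(values) % parts != 0:
        raise ValueError("domain length must be divisible by parts")
    size = len(values) // parts
    return [values[p * size:(p + 1) * size] for p in range(parts)]


def halo_stencil(values, parts):
    """Apply the stencil chunk by chunk, exchanging one-element halos."""
    chunks = split(values, parts)
    out = []
    for p, chunk in enumerate(chunks):
        # Each chunk receives only the nearest boundary value from each
        # neighbor; the outer edges of the global domain have no halo.
        left = chunks[p - 1][-1:] if p > 0 else []
        right = chunks[p + 1][:1] if p < parts - 1 else []
        out.extend(stencil(left + chunk + right))
    return out


domain = [1, 2, 3, 4, 5, 6, 7, 8]
reference = stencil(domain)
partitioned = halo_stencil(domain, parts=4)
```

The data exchanged per partition depends on the stencil width rather than on the partition size, which is why nearest-neighbor exchange stays cheap relative to compute as partitions grow larger.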

6. Lessons and Future Directions

Collective results from these co-design efforts yield several broad conclusions:

Co-optimizing both coarse (parallelism degrees) and fine (operator/dimension sharding) strategies uncovers non-trivial configurations with superior throughput and load balance, beyond the reach of static heuristics.
Automated, learning-based, or combinatorial search methods (RL, greedy, DSE) are essential to rapidly locate near-optimal sharding and partitioning in the ultra-large design spaces characteristic of modern distributed, domain-parallel systems (Yin et al., 29 Aug 2025, Agrawal et al., 2023).
Cross-layer co-design, as instantiated in frameworks like SimPhony, enables accurate joint evaluation of hardware and algorithms for photonic processors, closing the gap between device physics and ML workload mapping (Zhou et al., 31 Dec 2025).
Hierarchical parallelism extraction (loop/task/pipeline), particularly for domain-specific accelerators, raises achievable speedup significantly over simpler block-level or single-axis parallelism (Zacharopoulos et al., 2022).
Domain parallelism generalizes to a wide spectrum of computational modalities (logical/continuous, digital/analog, classical/quantum), often requiring domain-specific analyses of partitioning shapes, communication topologies, and task heterogeneity.

A plausible implication is that as models, datasets, and hardware become increasingly heterogeneous and large-scale, research and practice in automated, adaptable, and domain-parallel co-design frameworks will continue to gain central importance for achieving both performance and efficiency at system scale.