ParallelLinear: LP & SMoE Primitives
- ParallelLinear is a computational primitive that parallelizes linear operations in both parametric LP and SMoE architectures.
- In parametric LP, it efficiently partitions the parameter space into polyhedral regions using task-based parallelization and lock-free data structures.
- For SMoE models, it fuses scatter/gather with batched GEMM to reduce memory usage and boost inference and training speed.
ParallelLinear refers to a family of computational primitives designed to parallelize linear operations in both classical parametric linear programming and modern sparse Mixture-of-Experts (SMoE) neural architectures. Although they share a common name, these components address disparate problem domains: high-dimensional multiparametric optimization for polyhedral computations (Coti et al., 2019) and efficient large-scale neural inference and training on heterogeneous GPU architectures (Tan et al., 2024). Both implementations fuse logical operations with parallel scheduling, enabling high-throughput execution and a reduced memory footprint.
1. Definition and Core Functionality
ParallelLinear in parametric linear programming is a C++ component used to solve multiparametric linear programs (LPs). It implements a task-based parallelization strategy for computing the partition of the parameter space into polyhedral regions, each corresponding to a distinct LP optimal basis. The ultimate goal is to enumerate these regions efficiently given the LP's affine parameter dependence in both objective and constraints.
In SMoE deep learning, ParallelLinear refers to a PyTorch/Triton primitive that fuses scatter/gather data movement with batched general matrix-matrix multiplication (GEMM), supporting operations across a large set of model "experts" simultaneously. This enables memory- and compute-efficient inference/training for models with hundreds or thousands of experts, often with sparse routing.
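As a concrete reference for the layer-level semantics (not the actual Triton kernel), the following pure-Python sketch computes what a fused expert-wise linear layer produces for each token under top-k routing. All names here (`parallel_linear_ref`, the flat `order`/`gates` layout) are illustrative assumptions, not the ScatterMoE API.

```python
# Illustrative reference for what a fused expert-wise linear layer computes.
# Names and argument layout are hypothetical, not the actual ScatterMoE API;
# real implementations fuse this loop into a single GPU kernel.

def matvec(W, x):
    """Dense matrix-vector product: W is d_out x d_in, x has length d_in."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def parallel_linear_ref(X, W, order, gates, k):
    """For each of T tokens, combine its top-k experts' outputs.

    X:     T x d_in token inputs
    W:     E x d_out x d_in stacked expert weights
    order: flat list of T*k expert ids; entry t*k + j is token t's j-th expert
    gates: flat list of T*k routing weights, aligned with `order`
    """
    T = len(X)
    d_out = len(W[0])
    Y = [[0.0] * d_out for _ in range(T)]
    for t in range(T):
        for j in range(k):
            e = order[t * k + j]
            g = gates[t * k + j]
            y = matvec(W[e], X[t])
            Y[t] = [acc + g * v for acc, v in zip(Y[t], y)]
    return Y
```

The real kernel avoids materializing the per-expert intermediate `y` for every token; this loop only pins down the result the fused pass must reproduce.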
2. High-Level Architecture
Parametric LP (C++)
- Solver Backends: Selectable sequential, OpenMP tasking, or Intel TBB.
- Logical Modules:
  - LP solver wrappers (fast floating-point and exact rational simplex)
  - Region representations (polyhedral inequalities, basis indices, optimal point)
  - Lock-free, thread-safe task queue and region storage
- API: Core methods include initialize, add_initial_task (with a parameter seed), solve, and get_regions.
- Concurrency: Thread-safe arrays and hash tables for regions and explored bases; atomic operations ensure task/region consistency (Coti et al., 2019).
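The call sequence implied by the API bullets above can be pictured with a toy single-threaded mock (Python here for brevity; the actual component is C++ with OpenMP/TBB workers). The adjacency map standing in for "solve the LP at a seed point" is purely illustrative, and all class and attribute names are hypothetical.

```python
from collections import deque

# Toy single-threaded mock of the LP-side ParallelLinear API shape; the real
# C++ component uses worker threads and lock-free structures. The adjacency
# map stands in for actual LP solves and is purely illustrative.

class ToyParallelLinear:
    def initialize(self, adjacency):
        # adjacency: basis_id -> list of neighbouring basis_ids
        self._adj = adjacency
        self._tasks = deque()   # stands in for the lock-free task queue
        self._explored = set()  # stands in for the explored-bases hash table
        self._regions = []      # stands in for the concurrent region array

    def add_initial_task(self, seed_basis):
        self._tasks.append(seed_basis)

    def solve(self):
        while self._tasks:
            basis = self._tasks.popleft()
            if basis in self._explored:
                continue  # another "worker" already handled this region
            self._explored.add(basis)
            self._regions.append({"basis": basis})
            for nb in self._adj.get(basis, []):
                self._tasks.append(nb)  # spawn tasks for adjacent regions

    def get_regions(self):
        return self._regions
```

The point is only the control flow: seed the queue, exhaust it while deduplicating on basis identity, then read back the enumerated regions.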
SMoE GPU Kernel (PyTorch/Triton)
- Core Kernel: scatter2scatter implements fused expert-wise GEMM with index-based scatter/gather in a single GPU pass, controlled by the ParallelLinear wrapper.
- API Parameters: input X, stacked expert weights W, expert routing order o, routing weights p, plus flags for grouped input/output.
- Ordering Modes: All four combinations of input and output ordering (grouped or scatter order) are supported, minimizing or eliminating explicit tensor reordering.
- Buffer Management: In-place buffer reuse across forward/backward passes, reducing transient memory requirements (Tan et al., 2024).
3. Parallelization and Kernel Algorithms
Task-Based Parallelization (Classical LP)
Tasks represent regions of parameter space to be explored. Each thread repeatedly pops tasks, solves the LP at a parameter point (using a floating or exact solver), computes region-defining inequalities from reduced costs, spawns new tasks for adjacent regions, and records explored regions/bases in concurrent data structures. Workers continue until the region graph is fully explored. Dynamic scheduling and lockless data structures ensure high thread utilization (Coti et al., 2019).
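The worker loop just described can be sketched with Python threads and a thread-safe queue. The LP solve is stubbed out as a `solve_at` callback (here a toy adjacency lookup), and a lock guards the explored set where the C++ implementation uses lock-free structures; names are illustrative.

```python
import queue
import threading

# Threaded sketch of the task-based exploration loop (a Python stand-in for
# the OpenMP/TBB workers). solve_at(basis) abstracts "solve the LP at a
# parameter point and report adjacent bases from reduced costs".

def explore_regions(seed, solve_at, n_workers=4):
    tasks = queue.Queue()
    explored = set()
    regions = []
    lock = threading.Lock()  # the C++ version uses lock-free tables instead

    def worker():
        while True:
            basis = tasks.get()
            try:
                with lock:
                    if basis in explored:
                        continue  # already claimed by another worker
                    explored.add(basis)
                neighbours = solve_at(basis)  # LP solve + reduced-cost analysis
                with lock:
                    regions.append(basis)
                for nb in neighbours:
                    tasks.put(nb)  # spawn tasks for adjacent regions
            finally:
                tasks.task_done()

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    tasks.put(seed)
    tasks.join()  # returns once every spawned task has been processed
    return regions
```

Termination here relies on `Queue.join` task counting: each neighbour is enqueued before the parent task is marked done, so the count cannot reach zero while unexplored work remains.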
Fused Expertwise GEMM (SMoE)
The scatter2scatter kernel interleaves the following:
- Scatter reads: Index input tokens per expert routing.
- Batched expert GEMM: For each token t routed to expert e, select that expert's weights W_e and compute y_t = W_e x_t.
- Gather writes: Store results in grouped or scatter layout, as required.

Shared-memory caching and vectorized loads hide memory latency; explicit per-expert buckets and padding are avoided, minimizing transient allocations (Tan et al., 2024).
```
kernel scatter2scatter(X_ptr, W_ptr, order_ptr, out_ptr,
                       T, in, out, e_stride, k,
                       grouped_in, grouped_out):
    t_idx = program_id(0) * BLOCK_T + local_thread_id()
    if t_idx < T * k:
        expert_id = order_ptr[t_idx]
        x_ptr = X_ptr + (grouped_in ? grouped_offset[t_idx] : t_idx) * in
        w_ptr = W_ptr + expert_id * in * out
        y_val = vector_matmul(load(x_ptr, in), load(w_ptr, in * out))
        dst   = out_ptr + (grouped_out ? grouped_offset_out[t_idx] : t_idx) * out
        store(y_val, dst, out)
```
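A runnable pure-Python reference for the same computation, restricted to the scatter-in/scatter-out path (grouped offsets omitted). One assumed convention: the input row is taken as `t_idx // k`, which coincides with the pseudocode's direct `t_idx` indexing when k = 1; weight layout and names are illustrative, not the Triton kernel's.

```python
# Pure-Python reference for the scatter2scatter pseudocode, restricted to
# the scatter-in / scatter-out path (grouped_in = grouped_out = False).
# Weights are stored flat, as in the kernel: expert e occupies
# W_flat[e*d_in*d_out : (e+1)*d_in*d_out], row-major (d_in x d_out).

def scatter2scatter_ref(X_flat, W_flat, order, T, d_in, d_out, k):
    out = [0.0] * (T * k * d_out)
    for t_idx in range(T * k):       # one "program instance" per routed slot
        e = order[t_idx]
        tok = t_idx // k             # assumed convention: slot reads token t_idx // k
        x = X_flat[tok * d_in:(tok + 1) * d_in]
        w = W_flat[e * d_in * d_out:(e + 1) * d_in * d_out]
        for j in range(d_out):       # y = x @ W_e, one output feature at a time
            out[t_idx * d_out + j] = sum(
                x[i] * w[i * d_out + j] for i in range(d_in))
    return out
```

Each loop iteration corresponds to one guarded program instance in the pseudocode: read the routed token, index into the flat weight stack by expert id, and write the product at the slot's output offset.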
4. Mathematical Formulation and API Semantics
Parametric LP
The optimization problem is posed as

    min_x c(λ)ᵀx   subject to   A x ≤ b(λ),

where the cost vector c and right-hand side b depend affinely on the parameter vector λ. As λ varies, the region decomposition is driven by "sign conditions" on the reduced costs of the optimal basis, defining polyhedral cells in λ-space (Coti et al., 2019).
SMoE Forward/Backward Passes
For token x_t with top-k routing, let E_t be the set of k selected experts, p_{t,e} the routing weights, and W_e the stacked expert weight matrices. The layer computes

    y_t = Σ_{e ∈ E_t} p_{t,e} · W_e x_t,

where E_t and p_{t,e} are produced by the router.
The routine avoids explicit T×k×in input grouping and leverages in-place buffer reuse (Tan et al., 2024).
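To make the "no explicit grouping" point concrete, the following sketch checks that the naive group-compute-scatter pipeline and a direct per-token computation produce identical results, so the grouping buffer carries no information the fused kernel cannot recover on the fly. Pure Python, k = 1, all names hypothetical.

```python
# Illustrative check that explicit group-compute-scatter equals the direct
# per-token computation, so the T*k*d_in grouping buffer is redundant.
# Weights W[e] are d_in x d_out; k = 1 routing for simplicity.

def direct(X, W, order):
    # y_t = x_t @ W_{e(t)}, computed token by token with no regrouping
    return [[sum(x[i] * W[e][i][j] for i in range(len(x)))
             for j in range(len(W[e][0]))]
            for x, e in zip(X, order)]

def group_compute_scatter(X, W, order):
    E = len(W)
    # 1) group token indices by expert (the copy the fused kernel avoids)
    groups = {e: [t for t, oe in enumerate(order) if oe == e] for e in range(E)}
    Y = [None] * len(X)
    for e, toks in groups.items():
        for t in toks:  # 2) per-expert batched GEMM (unrolled here)
            Y[t] = [sum(X[t][i] * W[e][i][j] for i in range(len(X[t])))
                    for j in range(len(W[e][0]))]
    return Y            # 3) scatter results back to token order
```

Since both paths agree, the only difference is the transient grouped copy of the inputs, which is exactly what the fused scatter/gather eliminates.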
| ParallelLinear Domain | Function | Data Structure(s) |
|---|---|---|
| Parametric LP | Region enumeration, facet exploration | Regions[], Task queue, BasisID hash |
| SMoE (ScatterMoE) | Fused expertwise matrix multiply and routing | order[], routing weights, shared GPU buffer |
5. Memory and Performance Characteristics
ParallelLinear implementations are optimized for both compute and memory:
- No Padding: In SMoE, experts process only active tokens without block-sparse or padded tensors, yielding 30–50% memory reduction at high expert counts.
- No Explicit Copy: Scatter/gather is fused; the original input tensor is accessed directly, further reducing memory footprint (Tan et al., 2024).
- Buffer Reuse: Forward and backward workspaces are shared, reducing peak allocation by ~20% over standard group–compute–scatter approaches.
- Scaling and Throughput: In SMoE, throughput gains of 20–40% are observed over Megablocks baselines; in parametric LP, near-linear speedup up to the hardware thread count is reported on a 40-thread configuration for problems with many regions (Coti et al., 2019).
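The "no padding" point can be illustrated with back-of-envelope arithmetic: padded per-expert buckets must allocate every bucket at the size of the fullest one, while the fused kernel allocates exactly T·k rows regardless of routing skew. The token counts below are made up for illustration and do not reproduce the cited 30–50% figure.

```python
# Back-of-envelope arithmetic (hypothetical sizes) for the "no padding"
# claim: padded per-expert buckets allocate for the fullest bucket, while
# the fused kernel allocates exactly sum(counts) rows.

def padded_rows(counts):
    # capacity-style allocation: every expert's bucket sized to the maximum
    return len(counts) * max(counts)

def exact_rows(counts):
    # fused allocation: one row per routed token, no per-expert padding
    return sum(counts)

counts = [512, 64, 64, 64, 32, 16, 16, 12]  # tokens routed per expert (toy)
waste = 1 - exact_rows(counts) / padded_rows(counts)
```

With skewed routing like this, most of the padded allocation is wasted; the waste grows with expert count, which is why the savings are largest at high granularity.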
6. Applications
- Polyhedral Computations: ParallelLinear is integral to projection (an alternative to Fourier–Motzkin elimination) and convex hull computations via parametric LP. The component supports elimination steps by chaining region enumeration over successive parametric constraints (Coti et al., 2019).
- Multiparametric Model Predictive Control (MPC): Used for explicit controller synthesis where the state space is partitioned into affine regions.
- Sparse Mixture-of-Experts Models: ParallelLinear underpins efficient SMoE layers in ScatterMoE, supporting large models in both MLP and attention modules without intermediate padding or data reordering (Tan et al., 2024).
- Mixture-of-Attention Layer Construction: The API supports direct implementation of SMoE multi-head attention with up to 30% inference speedup at high granularity.
7. Limitations and Trade-offs
- Implementation Complexity: The fused scatter/gather plus GEMM kernels (especially in Triton) require careful tuning to avoid memory bottlenecks when expert routing is not locally clustered. However, routing algorithms in SMoE models generally favor clustered allocation, so this is rarely a practical concern (Tan et al., 2024).
- Scaling Behavior: In parametric LP, parallel efficiency is bounded by the number of distinct optimal bases (N) and the region-adjacency graph width. Smaller problems may not saturate available cores; formal complexity bounds are not generally available due to exponential scaling in the number of regions with parameter dimension (Coti et al., 2019).
References
- "Parallel parametric linear programming solving, and application to polyhedral computations" (Coti et al., 2019)
- "Scattered Mixture-of-Experts Implementation" (Tan et al., 2024)