ParallelLinear: LP & SMoE Primitives

Updated 30 January 2026
  • ParallelLinear is a computational primitive that parallelizes linear operations in both parametric LP and SMoE architectures.
  • In parametric LP, it efficiently partitions the parameter space into polyhedral regions using task-based parallelization and lock-free data structures.
  • For SMoE models, it fuses scatter/gather with batched GEMM to reduce memory usage and boost inference and training speed.

ParallelLinear refers to a family of computational primitives designed to parallelize linear operations in both classical parametric linear programming and modern sparse Mixture-of-Experts (SMoE) neural architectures. Although the two share a name, they address disparate problem domains: high-dimensional multiparametric optimization for polyhedral computations (Coti et al., 2019), and efficient large-scale neural inference and training on heterogeneous GPU architectures (Tan et al., 2024). Both implementations fuse logical operations with parallel scheduling, enabling high-throughput execution with a reduced memory footprint.

1. Definition and Core Functionality

ParallelLinear in parametric linear programming is a C++ component used to solve multiparametric linear programs (LPs). It implements a task-based parallelization strategy for computing the partition of the parameter space $\mathbb{R}^k$ into polyhedral regions, each corresponding to a distinct LP optimal basis. The ultimate goal is to enumerate these regions efficiently, given the LP's affine dependence on the parameters in both objective and constraints.

In SMoE deep learning, ParallelLinear refers to a PyTorch/Triton primitive that fuses scatter/gather data movement with batched general matrix-matrix multiplication (GEMM), supporting operations across a large set of model "experts" simultaneously. This enables memory- and compute-efficient inference/training for models with hundreds or thousands of experts, often with sparse routing.
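
The computation that ParallelLinear fuses can be written out unfused as a minimal pure-Python sketch: group tokens by expert, run one matrix multiply per expert, and scatter the results back in token order. The function names here (`unfused_expert_linear`, `matmul`) are illustrative, not part of the library API.

```python
# Unfused reference for the computation ParallelLinear performs in one pass:
# gather tokens by expert, run one matmul per expert, scatter results back.
# Pure-Python sketch with nested lists; all names are illustrative.

def matmul(x, w):
    """Row vector x (length d_in) times matrix w (d_in rows of d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def unfused_expert_linear(X, W, order):
    """X: T rows of d_in; W: E experts, each d_in x d_out; order[t]: expert of token t."""
    Y = [None] * len(X)
    for e in range(len(W)):              # one pass per expert ("grouping" step)
        for t, x in enumerate(X):
            if order[t] == e:            # gather: tokens routed to expert e
                Y[t] = matmul(x, W[e])   # stand-in for the batched expert GEMM
    return Y                             # scatter: results back in token order
```

The fused kernel produces the same outputs without materializing the per-expert groups.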

2. High-Level Architecture

Parametric LP (C++)

  • Solver Backends: Selectable sequential, OpenMP tasking, or Intel TBB.
  • Logical Modules:
    • LP solver wrappers (fast floating-point and exact rational simplex)
    • Region representations (polyhedral inequalities, basis indices, optimal point)
    • Lock-free, thread-safe task queue and region storage
  • API: Core methods include initialize, add_initial_task (with a parameter seed), solve, and get_regions.
  • Concurrency: Thread-safe arrays and hash tables for regions and explored bases; atomic operations ensure task/region consistency (Coti et al., 2019).

SMoE GPU Kernel (PyTorch/Triton)

  • Core Kernel: scatter2scatter kernel implements fused expert-wise GEMM with index-based scatter/gather in a single GPU pass, controlled by the ParallelLinear wrapper.
  • API Parameters: Input X, stacked expert weights W, expert routing order o, routing weights p, plus flags for grouped input/output.
  • Ordering Modes: All four permutations of input/output (grouped or scatter order) are supported, minimizing or eliminating explicit tensor reordering.
  • Buffer Management: In-place buffer reuse across forward/backward passes, reducing transient memory requirements (Tan et al., 2024).
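
A grouped ordering, as referenced in the modes above, can be derived from the routing order by a stable sort of slot indices by expert id, so that each expert's rows become contiguous. A minimal illustrative sketch (`grouped_permutation` is a hypothetical helper, not the library's API):

```python
# Derive a "grouped" layout from a scatter-order routing: a stable sort of
# slot indices by expert id makes each expert's rows contiguous.
# Illustrative helper, not part of the ScatterMoE API.

def grouped_permutation(order):
    """order[s] = expert id of slot s; returns slot indices in grouped order."""
    return sorted(range(len(order)), key=lambda s: order[s])  # stable sort
```

Because the kernel accepts either layout on input and output, this permutation never has to be applied to the tensors themselves.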

3. Parallelization and Kernel Algorithms

Task-Based Parallelization (Classical LP)

Tasks represent regions of parameter space to be explored. Each thread repeatedly pops tasks, solves the LP at a parameter point (using a floating or exact solver), computes region-defining inequalities from reduced costs, spawns new tasks for adjacent regions, and records explored regions/bases in concurrent data structures. Workers continue until the region graph is fully explored. Dynamic scheduling and lockless data structures ensure high thread utilization (Coti et al., 2019).
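
The per-worker loop described above can be sketched sequentially as follows; in the C++ component many such workers run this loop concurrently over lock-free structures. Here `solve_at` and `neighbors_of` are illustrative stand-ins for the LP solve and the facet-crossing step:

```python
# Sequential sketch of the pop-solve-spawn worker loop described above.
# solve_at(point) -> optimal basis (hashable) at a parameter point;
# neighbors_of(basis) -> parameter points just across each region facet.
# Both callbacks are illustrative stand-ins for the real LP machinery.
from collections import deque

def explore_regions(seed_point, solve_at, neighbors_of):
    tasks = deque([seed_point])   # task queue (shared and lock-free in the C++ version)
    explored = set()              # explored bases (a concurrent hash table in C++)
    while tasks:
        point = tasks.popleft()
        basis = solve_at(point)
        if basis in explored:     # region already recorded; drop the task
            continue
        explored.add(basis)
        for nxt in neighbors_of(basis):   # spawn one task per adjacent region
            tasks.append(nxt)
    return explored
```

On a toy problem with four "bases" on a line, seeding at one end explores all of them.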

Fused Expertwise GEMM (SMoE)

The scatter2scatter kernel interleaves the following:

  1. Scatter reads: Index input tokens per expert routing.
  2. Batched expert GEMM: For each token, select the corresponding expert's weights $W_e$ and compute $X_t \cdot W_e$.
  3. Gather writes: Store result in grouped or scatter layout, as required. Shared memory caching and vectorized loads hide memory latency; explicit per-expert buckets and padding are avoided, minimizing transient allocations (Tan et al., 2024).

kernel scatter2scatter(X_ptr, W_ptr, order_ptr, out_ptr, T, in, out, e_stride, k, grouped_in, grouped_out):
    t_idx = program_id(0) * BLOCK_T + local_thread_id()
    if t_idx < T*k:
        expert_id = order_ptr[t_idx]
        x_ptr = X_ptr + (grouped_in ? grouped_offset[t_idx] : t_idx) * in
        w_ptr = W_ptr + expert_id * in * out
        y_val = vector_matmul(load(x_ptr, in), load(w_ptr, in×out))
        dst = out_ptr + (grouped_out ? grouped_offset_out[t_idx] : t_idx) * out
        store(y_val, dst, out)
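
For the scatter-order-in/scatter-order-out case, the kernel's per-slot computation is equivalent to the following plain-Python reference (a sketch consistent with the equations in Section 4, with slot $s = t \cdot k + r$; the real kernel's addressing depends on the grouped flags):

```python
# Plain-Python reference for the scatter-in/scatter-out case of the kernel
# above: one output row per (token, slot) pair, indexed by flat slot id
# s = t*k + r. Illustrative sketch, not the Triton implementation.

def scatter2scatter_ref(X, W, order, k):
    """X: T x d_in, W: E experts of d_in x d_out, order: T*k expert ids."""
    d_out = len(W[0][0])
    Y = []
    for s in range(len(order)):          # mirrors t_idx in the kernel
        e = order[s]                     # expert_id = order_ptr[t_idx]
        x = X[s // k]                    # token feeding slot s (scatter read)
        Y.append([sum(x[i] * W[e][i][j] for i in range(len(x)))
                  for j in range(d_out)])  # this slot's expert GEMM row
    return Y                             # one output row per slot (scatter write)
```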

4. Mathematical Formulation and API Semantics

Parametric LP

The optimization problem is posed as:

$$\begin{aligned} &\min_x \; c(\lambda)^\top x \\ &\text{subject to}\quad A x = b(\lambda), \quad x \geq 0 \\ &c(\lambda) = c_0 + \sum_{i=1}^{k} \lambda_i c_i, \qquad b(\lambda) = b_0 + \sum_{i=1}^{k} \lambda_i b_i \end{aligned}$$

As $\lambda$ varies, the region decomposition is driven by "sign-conditions" on the reduced costs, defining polyhedral cells in $\lambda$-space (Coti et al., 2019).
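
As a concrete toy instance (illustrative, not drawn from the cited paper), consider a single-parameter LP whose optimal basis switches at $\lambda = 0$: minimize $(1+\lambda)x_1 + x_2$ subject to $x_1 + x_2 = 1$, $x \geq 0$. The parameter space splits into the two regions $\lambda < 0$ and $\lambda > 0$:

```python
# Toy one-parameter LP: minimize (1 + lam)*x1 + x2 subject to x1 + x2 = 1,
# x >= 0. The feasible vertices are (1, 0) and (0, 1); the optimal basis
# switches where the reduced cost changes sign, i.e. at lam = 0.
# Illustrative example, not from the cited papers.

def optimal_basis(lam):
    vertices = {"x1": (1.0, 0.0), "x2": (0.0, 1.0)}   # basic variable -> vertex
    cost = lambda v: (1 + lam) * v[0] + 1 * v[1]
    return min(vertices, key=lambda name: cost(vertices[name]))
```

For $\lambda < 0$ the vertex $(1, 0)$ is optimal; for $\lambda > 0$ the vertex $(0, 1)$ is, so the region enumeration would return the two half-lines.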

SMoE Forward/Backward Passes

For token $t$, top-$k$ expert routing, and routing weights $p_{t,r}$, the forward pass combines the per-slot expert outputs:

$$Y_t = \sum_{r=1}^{k} p_{t,r}\, \hat{Y}_{t \cdot k + r}$$

where slot $s$, routed to expert $e = \text{order}[s]$ and fed by token $t(s)$, computes

$$\hat{Y}_s = f_e(X_{t(s)}) = X_{t(s)} \cdot W_e$$

In the backward pass, each expert's weight gradient accumulates over the slots routed to it:

$$\nabla W_e = \sum_{s:\, \text{order}[s] = e} X_{t(s)}^\top \cdot \nabla \hat{Y}_s$$

The routine avoids explicit T×k×in input grouping and leverages in-place buffer reuse (Tan et al., 2024).
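
The forward combination and the backward weight gradient can be written directly from these equations. A plain-Python sketch for checking the semantics (the real implementation fuses these steps on GPU):

```python
# Forward combine Y_t = sum_r p[t][r] * Y_hat[t*k + r], and backward
# grad W_e = sum over slots routed to e of x_{t(s)}^T @ grad Y_hat_s.
# Plain-Python semantics check; illustrative, not the fused GPU kernels.

def forward_combine(Y_hat, p, k):
    """Y_hat: T*k slot outputs of d_out; p: T rows of k routing weights."""
    d = len(Y_hat[0])
    return [[sum(p[t][r] * Y_hat[t * k + r][j] for r in range(k))
             for j in range(d)]
            for t in range(len(p))]

def grad_W(X, dY_hat, order, k, n_experts):
    """Accumulate each expert's weight gradient over its routed slots."""
    d_in, d_out = len(X[0]), len(dY_hat[0])
    g = [[[0.0] * d_out for _ in range(d_in)] for _ in range(n_experts)]
    for s, e in enumerate(order):
        x = X[s // k]                      # token feeding slot s
        for i in range(d_in):
            for j in range(d_out):
                g[e][i][j] += x[i] * dY_hat[s][j]   # outer product x^T dY
    return g
```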

| ParallelLinear Domain | Function | Data Structure(s) |
|---|---|---|
| Parametric LP | Region enumeration, facet exploration | Regions[], task queue, BasisID hash |
| SMoE (ScatterMoE) | Fused expertwise matrix multiply and routing | order[], routing weights, shared GPU buffer |

5. Memory and Performance Characteristics

ParallelLinear implementations are optimized for both compute and memory:

  • No Padding: In SMoE, experts process only active tokens without block-sparse or padded tensors, yielding 30–50% memory reduction at high expert counts.
  • No Explicit Copy: Scatter/gather is fused; the original input tensor is accessed directly, further reducing memory footprint (Tan et al., 2024).
  • Buffer Reuse: Forward and backward workspaces are shared, reducing peak allocation by ~20% over standard group–compute–scatter approaches.
  • Scaling and Throughput: In SMoE, throughput gains of 20–40% are observed over MegaBlocks baselines; in parametric LP, near-linear speedup is achieved up to the hardware thread count (e.g., wall-clock time reduced from $t_1$ to $t_1/38$ on a 40-thread configuration for problems with $\sim 3{,}700$ regions) (Coti et al., 2019).

6. Applications

  • Polyhedral Computations: ParallelLinear is integral to projection (as an alternative to Fourier–Motzkin elimination) and convex hull computations via parametric LP. The component supports elimination steps by chaining region enumeration for successive parametric constraints (Coti et al., 2019).
  • Multiparametric Model Predictive Control (MPC): Used for explicit controller synthesis where the state space is partitioned into affine regions.
  • Sparse Mixture-of-Experts Models: ParallelLinear underpins efficient SMoE layers in ScatterMoE, supporting large models in both MLP and attention modules without intermediate padding or data reordering (Tan et al., 2024).
  • Mixture-of-Attention Layer Construction: The API supports direct implementation of SMoE multi-head attention with up to 30% inference speedup at high granularity.

7. Limitations and Trade-offs

  • Implementation Complexity: The fused scatter/gather plus GEMM kernels (especially in Triton) require careful tuning to avoid memory bottlenecks when expert routing is not locally clustered. However, routing algorithms in SMoE models generally favor clustered allocation, so this is rarely a practical concern (Tan et al., 2024).
  • Scaling Behavior: In parametric LP, parallel efficiency is bounded by the number of distinct optimal bases (N) and the region-adjacency graph width. Smaller problems may not saturate available cores; formal complexity bounds are not generally available due to exponential scaling in the number of regions with parameter dimension (Coti et al., 2019).

References

  • "Parallel parametric linear programming solving, and application to polyhedral computations" (Coti et al., 2019)
  • "Scattered Mixture-of-Experts Implementation" (Tan et al., 2024)