ParallelLinear: LP & SMoE Primitives
- ParallelLinear is a computational primitive that parallelizes linear operations in both parametric LP and SMoE architectures.
- In parametric LP, it efficiently partitions the parameter space into polyhedral regions using task-based parallelization and lock-free data structures.
- For SMoE models, it fuses scatter/gather with batched GEMM to reduce memory usage and boost inference and training speed.
ParallelLinear refers to a family of computational primitives designed to parallelize linear operations in both classical parametric linear programming and modern sparse Mixture-of-Experts (SMoE) neural architectures. Although they share a common name, these components address disparate problem domains: high-dimensional multiparametric optimization for polyhedral computations (Coti et al., 2019) and efficient large-scale neural inference and training on heterogeneous GPU architectures (Tan et al., 2024). Both implementations fuse logical operations with parallel scheduling, enabling high-throughput execution and a reduced memory footprint.
1. Definition and Core Functionality
ParallelLinear in parametric linear programming is a C++ component used to solve multiparametric linear programs (LPs). It implements a task-based parallelization strategy for computing the partition of the parameter space into polyhedral regions, each corresponding to a distinct LP optimal basis. The ultimate goal is to enumerate these regions efficiently given the LP's affine parameter dependence in both objective and constraints.
In SMoE deep learning, ParallelLinear refers to a PyTorch/Triton primitive that fuses scatter/gather data movement with batched general matrix-matrix multiplication (GEMM), supporting operations across a large set of model "experts" simultaneously. This enables memory- and compute-efficient inference/training for models with hundreds or thousands of experts, often with sparse routing.
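As a concrete reference for the layer-level semantics (not the actual Triton kernel), the following pure-Python sketch computes what a fused expert-wise linear layer produces for each token under top-k routing. All names here (`parallel_linear_ref`, the flat `order`/`gates` layout) are illustrative assumptions, not the ScatterMoE API.

```python
# Illustrative reference for what a fused expert-wise linear layer computes.
# Names and argument layout are hypothetical, not the actual ScatterMoE API;
# real implementations fuse this loop into a single GPU kernel.

def matvec(W, x):
    """Dense matrix-vector product: W is d_out x d_in, x has length d_in."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def parallel_linear_ref(X, W, order, gates, k):
    """For each of T tokens, combine its top-k experts' outputs.

    X:     T x d_in token inputs
    W:     E x d_out x d_in stacked expert weights
    order: flat list of T*k expert ids; entry t*k + j is token t's j-th expert
    gates: flat list of T*k routing weights, aligned with `order`
    """
    T = len(X)
    d_out = len(W[0])
    Y = [[0.0] * d_out for _ in range(T)]
    for t in range(T):
        for j in range(k):
            e = order[t * k + j]
            g = gates[t * k + j]
            y = matvec(W[e], X[t])
            Y[t] = [acc + g * v for acc, v in zip(Y[t], y)]
    return Y
```

The real kernel avoids materializing the per-expert intermediate `y` for every token; this loop only pins down the result the fused pass must reproduce.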
2. High-Level Architecture
Parametric LP (C++)
- Solver Backends: Selectable sequential, OpenMP tasking, or Intel TBB.
- Logical Modules:
  - LP solver wrappers (fast floating-point and exact rational simplex)
  - Region representations (polyhedral inequalities, basis indices, optimal point)
  - Lock-free, thread-safe task queue and region storage
- API: Core methods include initialize, add_initial_task (with a parameter seed), solve, and get_regions.
- Concurrency: Thread-safe arrays and hash tables for regions and explored bases; atomic operations ensure task/region consistency (Coti et al., 2019).
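The call sequence implied by the API bullets above can be pictured with a toy single-threaded mock (Python here for brevity; the actual component is C++ with OpenMP/TBB workers). The adjacency map standing in for "solve the LP at a seed point" is purely illustrative, and all class and attribute names are hypothetical.

```python
from collections import deque

# Toy single-threaded mock of the LP-side ParallelLinear API shape; the real
# C++ component uses worker threads and lock-free structures. The adjacency
# map stands in for actual LP solves and is purely illustrative.

class ToyParallelLinear:
    def initialize(self, adjacency):
        # adjacency: basis_id -> list of neighbouring basis_ids
        self._adj = adjacency
        self._tasks = deque()   # stands in for the lock-free task queue
        self._explored = set()  # stands in for the explored-bases hash table
        self._regions = []      # stands in for the concurrent region array

    def add_initial_task(self, seed_basis):
        self._tasks.append(seed_basis)

    def solve(self):
        while self._tasks:
            basis = self._tasks.popleft()
            if basis in self._explored:
                continue  # another "worker" already handled this region
            self._explored.add(basis)
            self._regions.append({"basis": basis})
            for nb in self._adj.get(basis, []):
                self._tasks.append(nb)  # spawn tasks for adjacent regions

    def get_regions(self):
        return self._regions
```

The point is only the control flow: seed the queue, exhaust it while deduplicating on basis identity, then read back the enumerated regions.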
SMoE GPU Kernel (PyTorch/Triton)
- Core Kernel: scatter2scatter implements fused expert-wise GEMM with index-based scatter/gather in a single GPU pass, controlled by the ParallelLinear wrapper.
- API Parameters: input X, stacked expert weights W, expert routing order o, routing weights p, plus flags for grouped input/output.
- Ordering Modes: All four combinations of input and output ordering (grouped or scatter order) are supported, minimizing or eliminating explicit tensor reordering.
- Buffer Management: In-place buffer reuse across forward/backward passes, reducing transient memory requirements (Tan et al., 2024).
3. Parallelization and Kernel Algorithms
Task-Based Parallelization (Classical LP)
Tasks represent regions of parameter space to be explored. Each thread repeatedly pops tasks, solves the LP at a parameter point (using a floating or exact solver), computes region-defining inequalities from reduced costs, spawns new tasks for adjacent regions, and records explored regions/bases in concurrent data structures. Workers continue until the region graph is fully explored. Dynamic scheduling and lockless data structures ensure high thread utilization (Coti et al., 2019).
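The worker loop just described can be sketched with Python threads and a thread-safe queue. The LP solve is stubbed out as a `solve_at` callback (here a toy adjacency lookup), and a lock guards the explored set where the C++ implementation uses lock-free structures; names are illustrative.

```python
import queue
import threading

# Threaded sketch of the task-based exploration loop (a Python stand-in for
# the OpenMP/TBB workers). solve_at(basis) abstracts "solve the LP at a
# parameter point and report adjacent bases from reduced costs".

def explore_regions(seed, solve_at, n_workers=4):
    tasks = queue.Queue()
    explored = set()
    regions = []
    lock = threading.Lock()  # the C++ version uses lock-free tables instead

    def worker():
        while True:
            basis = tasks.get()
            try:
                with lock:
                    if basis in explored:
                        continue  # already claimed by another worker
                    explored.add(basis)
                neighbours = solve_at(basis)  # LP solve + reduced-cost analysis
                with lock:
                    regions.append(basis)
                for nb in neighbours:
                    tasks.put(nb)  # spawn tasks for adjacent regions
            finally:
                tasks.task_done()

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    tasks.put(seed)
    tasks.join()  # returns once every spawned task has been processed
    return regions
```

Termination here relies on `Queue.join` task counting: each neighbour is enqueued before the parent task is marked done, so the count cannot reach zero while unexplored work remains.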
Fused Expertwise GEMM (SMoE)
The scatter2scatter kernel interleaves the following:
- Scatter reads: Index input tokens per expert routing.
- Batched expert GEMM: For each token t routed to expert e, select that expert's weights W_e and compute y_t = W_e x_t.
- Gather writes: Store results in grouped or scatter layout, as required.

Shared-memory caching and vectorized loads hide memory latency; explicit per-expert buckets and padding are avoided, minimizing transient allocations (Tan et al., 2024).
```
kernel scatter2scatter(X_ptr, W_ptr, order_ptr, out_ptr,
                       T, in, out, e_stride, k,
                       grouped_in, grouped_out):
    t_idx = program_id(0) * BLOCK_T + local_thread_id()
    if t_idx < T * k:
        expert_id = order_ptr[t_idx]
        x_ptr = X_ptr + (grouped_in ? grouped_offset[t_idx] : t_idx) * in
        w_ptr = W_ptr + expert_id * in * out
        y_val = vector_matmul(load(x_ptr, in), load(w_ptr, in * out))
        dst   = out_ptr + (grouped_out ? grouped_offset_out[t_idx] : t_idx) * out
        store(y_val, dst, out)
```
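A runnable pure-Python reference for the same computation, restricted to the scatter-in/scatter-out path (grouped offsets omitted). One assumed convention: the input row is taken as `t_idx // k`, which coincides with the pseudocode's direct `t_idx` indexing when k = 1; weight layout and names are illustrative, not the Triton kernel's.

```python
# Pure-Python reference for the scatter2scatter pseudocode, restricted to
# the scatter-in / scatter-out path (grouped_in = grouped_out = False).
# Weights are stored flat, as in the kernel: expert e occupies
# W_flat[e*d_in*d_out : (e+1)*d_in*d_out], row-major (d_in x d_out).

def scatter2scatter_ref(X_flat, W_flat, order, T, d_in, d_out, k):
    out = [0.0] * (T * k * d_out)
    for t_idx in range(T * k):       # one "program instance" per routed slot
        e = order[t_idx]
        tok = t_idx // k             # assumed convention: slot reads token t_idx // k
        x = X_flat[tok * d_in:(tok + 1) * d_in]
        w = W_flat[e * d_in * d_out:(e + 1) * d_in * d_out]
        for j in range(d_out):       # y = x @ W_e, one output feature at a time
            out[t_idx * d_out + j] = sum(
                x[i] * w[i * d_out + j] for i in range(d_in))
    return out
```

Each loop iteration corresponds to one guarded program instance in the pseudocode: read the routed token, index into the flat weight stack by expert id, and write the product at the slot's output offset.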
4. Mathematical Formulation and API Semantics
Parametric LP
The optimization problem is posed as

    min_x c(λ)ᵀx   subject to   A x ≤ b(λ),

where the cost vector c and right-hand side b depend affinely on the parameter vector λ. As λ varies, the region decomposition is driven by "sign conditions" on the reduced costs of the optimal basis, defining polyhedral cells in λ-space (Coti et al., 2019).
SMoE Forward/Backward Passes
For token x_t with top-k routing, let E_t be the set of k selected experts, p_{t,e} the routing weights, and W_e the stacked expert weight matrices. The layer computes

    y_t = Σ_{e ∈ E_t} p_{t,e} · W_e x_t,

where E_t and p_{t,e} are produced by the router.
The routine avoids explicit T×k×in input grouping and leverages in-place buffer reuse (Tan et al., 2024).
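To make the "no explicit grouping" point concrete, the following sketch checks that the naive group-compute-scatter pipeline and a direct per-token computation produce identical results, so the grouping buffer carries no information the fused kernel cannot recover on the fly. Pure Python, k = 1, all names hypothetical.

```python
# Illustrative check that explicit group-compute-scatter equals the direct
# per-token computation, so the T*k*d_in grouping buffer is redundant.
# Weights W[e] are d_in x d_out; k = 1 routing for simplicity.

def direct(X, W, order):
    # y_t = x_t @ W_{e(t)}, computed token by token with no regrouping
    return [[sum(x[i] * W[e][i][j] for i in range(len(x)))
             for j in range(len(W[e][0]))]
            for x, e in zip(X, order)]

def group_compute_scatter(X, W, order):
    E = len(W)
    # 1) group token indices by expert (the copy the fused kernel avoids)
    groups = {e: [t for t, oe in enumerate(order) if oe == e] for e in range(E)}
    Y = [None] * len(X)
    for e, toks in groups.items():
        for t in toks:  # 2) per-expert batched GEMM (unrolled here)
            Y[t] = [sum(X[t][i] * W[e][i][j] for i in range(len(X[t])))
                    for j in range(len(W[e][0]))]
    return Y            # 3) scatter results back to token order
```

Since both paths agree, the only difference is the transient grouped copy of the inputs, which is exactly what the fused scatter/gather eliminates.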
| ParallelLinear Domain | Function | Data Structure(s) |
|---|---|---|
| Parametric LP | Region enumeration, facet exploration | Regions[], Task queue, BasisID hash |
| SMoE (ScatterMoE) | Fused expertwise matrix multiply and routing | order[], routing weights, shared GPU buffer |
5. Memory and Performance Characteristics
ParallelLinear implementations are optimized for both compute and memory:
- No Padding: In SMoE, experts process only active tokens without block-sparse or padded tensors, yielding 30–50% memory reduction at high expert counts.
- No Explicit Copy: Scatter/gather is fused; the original input tensor is accessed directly, further reducing memory footprint (Tan et al., 2024).
- Buffer Reuse: Forward and backward workspaces are shared, reducing peak allocation by ~20% over standard group–compute–scatter approaches.
- Scaling and Throughput: In SMoE, throughput gains of 20–40% are observed over Megablocks baselines; in parametric LP, near-linear speedup up to the hardware thread count is reported on a 40-thread configuration for problems with many regions (Coti et al., 2019).
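The "no padding" point can be illustrated with back-of-envelope arithmetic: padded per-expert buckets must allocate every bucket at the size of the fullest one, while the fused kernel allocates exactly T·k rows regardless of routing skew. The token counts below are made up for illustration and do not reproduce the cited 30–50% figure.

```python
# Back-of-envelope arithmetic (hypothetical sizes) for the "no padding"
# claim: padded per-expert buckets allocate for the fullest bucket, while
# the fused kernel allocates exactly sum(counts) rows.

def padded_rows(counts):
    # capacity-style allocation: every expert's bucket sized to the maximum
    return len(counts) * max(counts)

def exact_rows(counts):
    # fused allocation: one row per routed token, no per-expert padding
    return sum(counts)

counts = [512, 64, 64, 64, 32, 16, 16, 12]  # tokens routed per expert (toy)
waste = 1 - exact_rows(counts) / padded_rows(counts)
```

With skewed routing like this, most of the padded allocation is wasted; the waste grows with expert count, which is why the savings are largest at high granularity.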
6. Applications
- Polyhedral Computations: ParallelLinear is integral to projection (an alternative to Fourier–Motzkin elimination) and convex hull computations via parametric LP. The component supports elimination steps by chaining region enumeration over successive parametric constraints (Coti et al., 2019).
- Multiparametric Model Predictive Control (MPC): Used for explicit controller synthesis where the state space is partitioned into affine regions.
- Sparse Mixture-of-Experts Models: ParallelLinear underpins efficient SMoE layers in ScatterMoE, supporting large models in both MLP and attention modules without intermediate padding or data reordering (Tan et al., 2024).
- Mixture-of-Attention Layer Construction: The API supports direct implementation of SMoE multi-head attention with up to 30% inference speedup at high granularity.
7. Limitations and Trade-offs
- Implementation Complexity: The fused scatter/gather plus GEMM kernels (especially in Triton) require careful tuning to avoid memory bottlenecks when expert routing is not locally clustered. However, routing algorithms in SMoE models generally favor clustered allocation, so this is rarely a practical concern (Tan et al., 2024).
- Scaling Behavior: In parametric LP, parallel efficiency is bounded by the number of distinct optimal bases (N) and the region-adjacency graph width. Smaller problems may not saturate available cores; formal complexity bounds are not generally available due to exponential scaling in the number of regions with parameter dimension (Coti et al., 2019).
References
- "Parallel parametric linear programming solving, and application to polyhedral computations" (Coti et al., 2019)
- "Scattered Mixture-of-Experts Implementation" (Tan et al., 2024)