
FlexPipe: Adaptive Pipeline Frameworks

Updated 9 February 2026
  • FlexPipe is a suite of adaptable computational frameworks that decompose pipelines for LLM inference, distributed DNN training, and fluid-structure interaction simulations.
  • The system employs fine-grained model partitioning, inflight refactoring, and topology-aware resource allocation to optimize latency and throughput in heterogeneous, bursty environments.
  • FlexPipe also offers programmable scheduling via a domain-specific language and robust simulation methods, facilitating rapid scalability and precise performance tuning.

FlexPipe denotes several advanced computational frameworks in distinct domains, each unified by the underlying concept of flexible, fine-grained pipeline decomposition and optimization. The term encompasses: (1) a dynamic LLM inference system for serverless clusters (Lin et al., 13 Oct 2025); (2) a programmable pipeline scheduling framework for distributed DNN training (Jiang et al., 27 Sep 2025); and (3) an open-source simulation package for fluid-structure interaction in vortex-induced vibration of flexible pipes (Fu et al., 9 Feb 2025). Each addresses the need for adaptability and efficiency in highly variable, heterogeneous, or physically complex environments.

1. Dynamic LLM Serving with Inflight Pipeline Refactoring

FlexPipe (Lin et al., 13 Oct 2025) targets the inefficiencies in serving LLMs in cloud-native, serverless GPU clusters, where resource fragmentation and bursty workloads are prevalent. The paradigm of static pipeline configuration is supplanted by three fundamental mechanisms:

i. Fine-Grained Model Partitioning:

LLMs are decomposed into a DAG $G=(V,E)$ of operators. The partitioning problem seeks $K$ sequential pipeline stages $(S_1,\dots,S_K)$ minimizing

$$\min_{S_1\dots S_K} \sum_{k=1}^K \left| t_c(S_k) + s_p(S_k)/B - C \right| + \lambda \cdot R(S_k)$$

subject to stage exclusivity and GPU memory constraints: $\bigcup_k S_k = V$, $S_i\cap S_j=\varnothing$, $\max_k s_p(S_k)\leq M_{GPU}$, where $t_c(S_k)$ is total stage compute, $s_p(S_k)$ is parameter size, $B$ the inter-stage bandwidth, $C$ the overlap latency target, and $R(S_k)$ a refactoring penalty. This enables pipeline cuts at natural block boundaries, supporting refactoring agility.
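A minimal sketch of this objective over a linearized operator sequence, assuming illustrative per-operator costs and a toy refactoring penalty $R(S_k)$; the paper's DAG-aware partitioner is not reproduced here:

```python
# Sketch of the stage-partitioning objective for a linearized operator
# sequence. Operator costs, B, C, lam, and the R(S_k) term are stand-ins.
from functools import lru_cache

def partition(ops, K, B, C, lam, M_gpu):
    """ops: list of (compute_time, param_size); returns (cost, cut points)."""
    n = len(ops)
    # Prefix sums give O(1) per-stage cost queries.
    pre_t, pre_s = [0.0], [0.0]
    for t, s in ops:
        pre_t.append(pre_t[-1] + t)
        pre_s.append(pre_s[-1] + s)

    def stage_cost(i, j):             # operators i..j-1 form one stage
        t_c = pre_t[j] - pre_t[i]     # total stage compute t_c(S_k)
        s_p = pre_s[j] - pre_s[i]     # stage parameter size s_p(S_k)
        if s_p > M_gpu:               # GPU memory constraint violated
            return float("inf")
        R = 1.0 / (j - i)             # toy penalty: smaller stages refactor cheaply
        return abs(t_c + s_p / B - C) + lam * R

    @lru_cache(maxsize=None)
    def best(i, k):                   # best cost splitting ops[i:] into k stages
        if k == 1:
            return stage_cost(i, n), (n,)
        out = (float("inf"), ())
        for j in range(i + 1, n - k + 2):
            tail_cost, cuts = best(j, k - 1)
            c = stage_cost(i, j) + tail_cost
            if c < out[0]:
                out = (c, (j,) + cuts)
        return out

    return best(0, K)
```

The exhaustive split search is exact for a chain of operators; the real system must additionally respect DAG cut points at block boundaries.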

ii. Inflight Pipeline Refactoring:

FlexPipe continually profiles the arrival-rate coefficient of variation ($\nu_t$) and queue state. A candidate set $\mathcal{G}$ of granularities $(\eta_k, b_k)$ (stage count, batch size) is maintained, and at each epoch the objective

$$g^* = \arg\max_{g\in\mathcal{G}} \left[\alpha (T_g/T_{max}) + (1-\alpha)(L_{min}/L_g)\right]\cdot \exp(-|\nu_t-\nu_g|/\sigma)$$

selects the optimal configuration using pre-profiled $\nu_g$, throughput $T_g$, and latency $L_g$. When $g^*\neq g_{current}$, inflight reconfiguration (parameter migration and KV-cache handoff via mask-based selective sync) is overlapped with inference, with reconfiguration time $\lesssim 5$ ms.
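The selection rule follows directly from the formula above; the candidate profiles below are illustrative stand-ins for FlexPipe's pre-profiled values, not measured data:

```python
# Granularity selection: score each candidate by a throughput/latency blend,
# discounted by how far its profiled burstiness nu_g sits from the live nu_t.
import math

def select_granularity(candidates, nu_t, alpha=0.5, sigma=1.0):
    """candidates: dict g -> (T_g, L_g, nu_g); returns the argmax g*."""
    T_max = max(T for T, _, _ in candidates.values())
    L_min = min(L for _, L, _ in candidates.values())

    def score(g):
        T, L, nu = candidates[g]
        util = alpha * (T / T_max) + (1 - alpha) * (L_min / L)
        return util * math.exp(-abs(nu_t - nu) / sigma)

    return max(candidates, key=score)

# Under a bursty arrival process, the candidate profiled at high CV wins
# even at lower raw throughput.
profiles = {"deep_pipeline": (100.0, 10.0, 1.0),   # (T_g, L_g, nu_g), hypothetical
            "shallow_pipeline": (80.0, 5.0, 4.0)}
choice = select_granularity(profiles, nu_t=4.0)
```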

iii. Topology-Aware Resource Allocation:

Pipeline-to-GPU mapping is cast as a binary optimization over placement variables $x_{ij}$, maximizing

$$\sum_{i=1}^N \sum_{j=1}^G \left[T_{ij}/m_j - \gamma(\mathrm{CV}_i)\cdot \mathbf{1}\!\left(\sum_{i'} x_{i'j} > 1\right)\right]$$

subject to memory ($\sum_i x_{ij} m_j \leq M_j$) and load-balance ($|T_{ij}-T_{i'j'}|\leq\epsilon$) constraints. The penalty $\gamma(\mathrm{CV}_i)=\gamma_0(1+\alpha\,\mathrm{CV}_i^2)$ discourages multiplexing high-CV pipelines on shared GPUs. Placement exploits integer programming and hierarchical greedy heuristics.
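A sketch of the greedy side of that placement strategy, under stated assumptions: pipelines are placed one at a time by marginal gain, the multiplexing penalty applies once a GPU is shared, and all throughput/memory/CV figures are hypothetical:

```python
# Greedy pipeline-to-GPU placement: maximize T_ij/m_i minus the gamma(CV_i)
# multiplexing penalty, subject to per-GPU memory capacity.
def place(pipelines, gpus, gamma0=1.0, a=1.0):
    """pipelines: list of {"T": {gpu: throughput}, "m": memory, "cv": CV_i}
    gpus: dict gpu -> memory capacity M_j. Returns pipeline index -> gpu."""
    used = {g: 0.0 for g in gpus}    # memory already committed per GPU
    count = {g: 0 for g in gpus}     # pipelines already resident per GPU
    assign = {}
    for i, p in enumerate(pipelines):
        penalty = gamma0 * (1 + a * p["cv"] ** 2)   # gamma(CV_i)
        best_g, best_gain = None, float("-inf")
        for g in gpus:
            if used[g] + p["m"] > gpus[g]:          # memory constraint
                continue
            gain = p["T"][g] / p["m"]               # throughput-per-memory term
            if count[g] >= 1:                       # sharing triggers the penalty
                gain -= penalty
            if gain > best_gain:
                best_g, best_gain = g, gain
        assign[i] = best_g
        if best_g is not None:
            used[best_g] += p["m"]
            count[best_g] += 1
    return assign
```

A high-CV pipeline thus prefers an empty GPU even at slightly lower raw throughput, which is the intended effect of the quadratic penalty.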

2. Evaluation in Production-Scale Clusters

FlexPipe is evaluated on a 42-server Kubernetes cluster with 82 A100 GPUs against state-of-the-art LLM serving stacks. Under stable workloads ($\mathrm{CV}=1$), FlexPipe delivers 38.3% lower end-to-end latency and 54.8% lower queue time at comparable throughput ($\approx$12k req/s) relative to AlpaServe and ServerlessLLM. For bursty workloads ($\mathrm{CV}=4$), latency improvements are 66.1% over AlpaServe and 80.6% over MuxServe with $<$2% throughput loss.

Resource efficiency (goodput per GPU-second) scales with utilization $U$: for stable loads, FlexPipe achieves $T_{max}$ at $U\approx33\%$, approximately $3\times$ more efficient than AlpaServe. Under highly bursty loads, FlexPipe sustains $T=12$k req/s at $U\approx43\%$, an $8.5\times$ improvement over Tetris's best. Recovery from pipeline stalls is rapid (9 ms median at $\mathrm{CV}=4$, versus AlpaServe's 16 ms, ServerlessLLM's 50 ms, and MuxServe's 48 ms). Production rollout reduced GPU reservation wait time by 85%, instance initialization latency by 72%, and always-on GPU reservation from 45% to 30% of peak (Lin et al., 13 Oct 2025).

3. Programmable Pipeline Parallelism for DNN Training

A distinct FlexPipe framework (Jiang et al., 27 Sep 2025) addresses flexible pipeline-parallel training for DNNs. Traditional frameworks constrain the user to a small set of hand-coded schedules (e.g., 1F1B, interleaved, V-shape), limiting adaptability and requiring extensive manual development for new architectures or schedules.

Key components:

  • Programmable DSL:

The FlexPipe DSL allows users to specify model partitioning, stage mapping, and scheduling by declaring priorities (Computation-Type Traversal Priority, Stage-Traversal Priority) and check functions.

  • Automated Scheduler:

Internally encodes scheduling as a CSSR (Computation Schedule Space Representation), managing an instruction pool, per-actor reorder queues, and a dynamic dependency resolver. Scheduling follows user-specified priorities, emulating known and novel microbatch orderings.

  • Extensible Operations:

Users can register new instruction types (e.g., cross-modal sync), attach them to pipeline stages, and map them to custom PyTorch operations.

  • Overridable Controls:

Functions such as config_inflight_micros() and register_new_priority() allow stage-local or global scheduling invariants to be specified or replaced.
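A hypothetical sketch of priority-driven schedule generation in the spirit of these components: instructions are drawn from a pool and emitted in user-declared priority order, gated by a dependency check (the analogue of the dynamic dependency resolver). The concrete DSL syntax and CSSR encoding are not shown in this summary, so all names here are illustrative:

```python
# Toy priority-driven pipeline scheduler: a computation-type priority and a
# stage/microbatch traversal order jointly pick the next ready instruction.
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    stage: int        # pipeline stage (actor) executing this instruction
    micro: int        # microbatch index
    kind: str         # "F" (forward) or "B" (backward)

def schedule(stages, micros, kind_priority, dependency_ok):
    """Emit all F/B instructions in priority order, respecting dependencies."""
    pool = [Instr(s, m, k) for s in range(stages)
            for m in range(micros) for k in ("F", "B")]
    done, order = set(), []
    while pool:
        ready = [i for i in pool if dependency_ok(i, done)]
        # Computation-type priority first, then microbatch/stage traversal.
        nxt = min(ready, key=lambda i: (kind_priority[i.kind], i.micro, i.stage))
        pool.remove(nxt)
        done.add((nxt.stage, nxt.micro, nxt.kind))
        order.append(nxt)
    return order

# Linear-pipeline dependencies: a forward needs the previous stage's forward;
# a backward needs this stage's forward and the next stage's backward.
def dep_ok(i, done, stages=2):
    if i.kind == "F":
        return i.stage == 0 or (i.stage - 1, i.micro, "F") in done
    return ((i.stage, i.micro, "F") in done and
            (i.stage == stages - 1 or (i.stage + 1, i.micro, "B") in done))
```

Swapping the `kind_priority` map or the traversal key reproduces different microbatch orderings without rewriting the scheduler loop, which is the point of declaring priorities rather than hand-coding schedules.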

4. Experimental Comparisons and Scalability

FlexPipe (Jiang et al., 27 Sep 2025) demonstrates superior schedule searchability, programmability, and DNN training throughput across transformer and multimodal models. Compared to Megatron-LM and Tessel:

  • Schedule Search:

FlexPipe explores all candidate pipeline placements and priorities in seconds to minutes even for $\geq 8$ GPUs, where Tessel requires up to 729 s or times out at 16–32 GPUs.

  • End-to-End Throughput:

For GPT-5B models, FlexPipe achieves up to $1.91\times$ the throughput of Megatron-LM and $1.49\times$ that of Tessel for large-vocabulary cases (1M tokens). On 16.1B GPT and multimodal (CLIP-style) models, similar gains are observed, with FlexPipe remaining feasible where Megatron-LM exhausts GPU memory.

  • Scaling:

Near-linear throughput scaling is achieved from 16 to 64 GPUs. FlexPipe reduces pipeline “bubble” time (waits due to data or gradient dependencies) by up to 60%. Debugging facilities include reorder-queue tracing, profiling, and offline replay for auto-tuning.

5. Fluid-Structure Interaction Simulation in Flexible Pipes

Another independent FlexPipe system (Fu et al., 9 Feb 2025) targets simulation of vortex-induced vibration (VIV) in flexible pipes under steady flow. The computational approach is characterized by:

  • Fluid Solver:

Incompressible URANS with the SST $k$-$\omega$ closure is solved in OpenFOAM for each thin longitudinal strip of the pipe.

  • Structural Solver:

Euler–Bernoulli beam theory is discretized via FEM, with two transverse DOF per node.

  • Strip-Theory Decomposition:

The pipe is divided into $N_s=20$ strips, each modeled as a 2D rigid cylinder for fluid-force computation, which couples to the beam model through nodal loads.

  • Coupling Algorithm:

Weak (partitioned) coupling advances fluid and structure alternately at each time step in MATLAB, with mesh displacements propagated through OpenFOAM’s dynamic mesh system.

  • Validation:

Simulations for uniform, linear-shear, and bidirectional-shear flow regimes across $\mathrm{Re}=10^4$–$10^5$ match experimental amplitudes ($A^*$ error $<$12%), frequencies ($\Delta St\approx0.02$), and dominant vibration modes. Wavelet analysis and spatio-temporal plots robustly distinguish standing- and traveling-wave response patterns.
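The partitioned advance described above can be sketched as a loop that alternates a fluid evaluation at the frozen structural state with a structural step under the fresh loads. The real package couples URANS strips in OpenFOAM to an FEM beam in MATLAB; both sides here are deliberately toy stand-ins (a quadratic drag-like force and independent mass-spring nodes):

```python
# Toy weak (partitioned) coupling loop per strip: fluid force at the current
# state, then one semi-implicit Euler structural step per time step.
import numpy as np

def weak_coupling(n_strips=20, steps=200, dt=1e-3, m=1.0, c=0.5, k=50.0):
    """Returns the transverse displacement per strip after `steps` steps."""
    y = np.zeros(n_strips)          # transverse displacement per strip
    v = np.zeros(n_strips)          # transverse velocity per strip
    for _ in range(steps):
        # 1) Fluid step at frozen structure state (stand-in force model).
        f = 1.0 - c * v * np.abs(v)
        # 2) Structure step with the fresh fluid loads.
        a = (f - k * y) / m
        v += dt * a
        y += dt * v                 # displacement fed back as mesh motion
    return y
```

In the real solver, step 2 is the FEM beam update and the displacement feedback drives OpenFOAM's dynamic mesh rather than a scalar state.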

6. Implications, Extensions, and Research Directions

FlexPipe’s architecture for dynamic LLM inference generalizes to other model-serving domains, including transformer variants, diffusion models, GNNs, and heterogeneous hardware (edge GPU/TPU environments). Potential research avenues include online learning of optimal CV thresholds, exploitation of hardware-level fine-grained memory (per-tensor NVRAM), cost/power-aware pipeline co-optimization, advanced queuing theory for dynamic pipelines, and microbatch+CV adaptation for multi-modal/multi-tenant inference (Lin et al., 13 Oct 2025). Programmable pipeline training as introduced by FlexPipe is compatible with rapid model evolution and operator heterogeneity.

The VIV simulation system lays a foundation for more complex FSI analyses and is openly available for reproduction and extension (Fu et al., 9 Feb 2025).

7. Summary Table: Key FlexPipe Domains

| Application Domain | Primary FlexPipe Paper | Core Innovation |
|---|---|---|
| Dynamic LLM inference in cloud | (Lin et al., 13 Oct 2025) | Adaptive inflight pipeline refactoring, topology-aware allocation |
| Programmable DNN pipeline training | (Jiang et al., 27 Sep 2025) | DSL-enabled schedule space exploration, automated scheduling |
| VIV fluid-structure simulation | (Fu et al., 9 Feb 2025) | URANS–FEM strip coupling, open-source code |

FlexPipe thus encapsulates advanced, high-efficiency paradigms for both deep learning system software and computational physics, each exemplifying domain-tailored, adaptable pipeline parallelism and resource allocation.
