
FlexGen: Scalable, Flexible Generation

Updated 23 January 2026
  • FlexGen is a dual-purpose framework that enables scalable generative computation for both large language models and multi-view image synthesis.
  • It employs techniques like resource-aware offloading, block-zigzag scheduling, and LP-based optimization to achieve high-throughput inference on commodity hardware.
  • The framework integrates GPT-4V-based 3D-aware captioning with adaptive dual-control attention to deliver controllable and consistent multi-view image outputs.

FlexGen refers to two distinct frameworks, each representing significant advances in high-throughput generative inference and controllable multi-view image synthesis. The first, originating in the context of LLMs, addresses efficient inference on commodity hardware through resource-aware offloading and quantization. The second is a text/image-conditioned diffusion framework for multi-view synthesis, leveraging 3D-aware caption generation to enable structured, controllable outputs. Both instantiate the core principle of flexible generative computation under hardware and modality constraints, albeit in different modalities and technical settings.

FlexGen enables inference for transformer-based LLMs with hundreds of billions of parameters on a single commodity GPU by orchestrating tensor placement and compute across a hierarchical memory system comprising GPU RAM, CPU DRAM, and NVMe disk. This design extends LLM inference to hardware previously incapable of supporting such model scales; for example, the 175B-parameter OPT model becomes deployable on a 16 GB NVIDIA T4 GPU.

Memory Hierarchy and Offloading

The FlexGen system abstracts device resources into a three-level hierarchy:

  • GPU memory: Fastest, but typically the most constrained in size.
  • CPU DRAM: Intermediate speed and capacity.
  • NVMe disk: Largest, slowest tier.

At runtime, all significant LLM state—weights, intermediate activations, and the growing KV attention cache—is partitioned and dynamically offloaded between these tiers. Scheduling overlaps I/O and compute via concurrent CUDA streams and CPU threads, sometimes executing attention on the CPU to avoid transferring the large KV cache to the GPU.
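The overlap of loading and compute can be illustrated with a toy double-buffering loop: while layer i runs, a background thread fetches layer i+1. This is only a sketch of the idea; the function names and string stand-ins are illustrative, not FlexGen's actual API, which uses CUDA streams rather than Python threads.

```python
# Toy sketch of overlapping "I/O" (loading the next layer's weights) with
# compute. load_weights and compute_layer are illustrative stand-ins.
import threading

def load_weights(layer):            # stand-in for a disk/CPU -> GPU transfer
    return f"weights[{layer}]"

def compute_layer(layer, weights):  # stand-in for the GPU kernel
    return f"out[{layer}] from {weights}"

def run_pipeline(num_layers):
    outputs = []
    # Prefetch layer 0, then always load layer i+1 while computing layer i.
    current = load_weights(0)
    for i in range(num_layers):
        nxt = {}
        t = None
        if i + 1 < num_layers:
            t = threading.Thread(
                target=lambda: nxt.setdefault("w", load_weights(i + 1)))
            t.start()
        outputs.append(compute_layer(i, current))
        if t is not None:
            t.join()               # next layer's weights are now resident
            current = nxt["w"]
    return outputs
```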

Inference is formulated as scheduling tensor operations on a two-dimensional grid in which rows enumerate transformer layers (1…ℓ) and columns represent token positions (prompt plus generated, 1…s+n). Rather than a naïve row- or column-major traversal, FlexGen employs a “block-zigzag” schedule: it processes blocks of prompts at once (block size bls) to amortize repeated weight loads across sub-batches, interleaving loads of weights, activations, and KV cache with computation so that memory transfers and compute overlap with minimal idle time.
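The traversal order can be sketched as a simple generator over the (layer, token) grid. This is a minimal illustration of the block structure only—each layer's weights are loaded once per (token, layer) step and reused across the bls prompts in the block—and omits the I/O overlap and cache management of the real scheduler.

```python
# Minimal sketch of the block schedule over the (layer, token) grid: within
# each block of `bls` prompts, columns (token positions) are swept, and each
# layer's weights are reused across all prompts in the block. Illustrative only.
def block_schedule(num_layers, num_tokens, bls, num_prompts):
    order = []
    for block_start in range(0, num_prompts, bls):
        block = range(block_start, min(block_start + bls, num_prompts))
        for token in range(num_tokens):          # columns: token positions
            for layer in range(num_layers):      # rows: transformer layers
                for prompt in block:             # amortize the weight load
                    order.append((prompt, token, layer))
    return order
```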

A linear programming (LP) formulation determines optimal data placement across memory tiers—the fractions of weights (w), activations (h), and KV cache (c) resident on each device—and schedules per-block operations while observing all device capacity constraints. Given inference hyperparameters and hardware characteristics, the LP minimizes average processing time per prompt under hardware limits:

$\begin{array}{rl} \min\limits_{w,h,c} & \dfrac{T_{\mathrm{pre}}\cdot\ell + T_{\mathrm{gen}}\cdot(n-1)\cdot\ell}{\mathrm{bls}} \\[6pt] \text{s.t.} & \text{device memory capacity constraints} \\ & w_g+w_c+w_d=1,\quad h_g+h_c+h_d=1,\quad c_g+c_c+c_d=1 \end{array}$
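The shape of this optimization can be sketched with a toy policy search: instead of an LP solver, brute-force the weight-placement fractions on a coarse grid and keep the cheapest policy that respects device capacities. The cost constants and capacities below are made-up illustrative numbers, not measured hardware characteristics, and only the weight fractions (w) are modeled.

```python
# Toy stand-in for the LP policy search: grid-search GPU/CPU/disk shares of
# the weights under capacity constraints, minimizing a linear access cost.
def search_policy(weight_gb, gpu_gb, cpu_gb, step=0.05):
    # per-GB access cost: GPU fastest, disk slowest (illustrative constants)
    cost = {"gpu": 1.0, "cpu": 5.0, "disk": 50.0}
    best = None
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            wg, wc = i * step, j * step
            wd = 1.0 - wg - wc                  # fractions sum to one
            if wg * weight_gb > gpu_gb + 1e-9 or wc * weight_gb > cpu_gb + 1e-9:
                continue                        # violates a capacity constraint
            t = weight_gb * (wg * cost["gpu"] + wc * cost["cpu"] + wd * cost["disk"])
            if best is None or t < best[0]:
                best = (t, {"gpu": wg, "cpu": wc, "disk": wd})
    return best
```

With 100 GB of weights, a 10 GB GPU, and 40 GB of DRAM, the search fills both fast tiers to capacity and spills the remainder to disk, mirroring the behavior the LP formalizes.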

Quantization and Compression

FlexGen utilizes group-wise asymmetric quantization for both model weights and the KV attention cache. Each contiguous group of g elements (g = 64) is quantized to 4 bits per element. This compression, applied along output channels and vector dimensions, yields a ≤0.2% accuracy drop (LAMBADA benchmark) and, when combined with I/O-efficient scheduling, enables up to a 16× reduction in memory footprint.
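The scheme above can be sketched in a few lines: each group stores its own minimum and scale, and every value is mapped to a 4-bit integer relative to them. This is a pure-Python illustration of group-wise asymmetric quantization, not FlexGen's packed CUDA implementation.

```python
# Sketch of group-wise asymmetric quantization: each group of `group_size`
# contiguous values gets its own (min, scale) and 4-bit integer codes.
def quantize_group(values, bits=4):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # avoid zero scale
    q = [round((v - lo) / scale) for v in values]
    return q, lo, scale

def dequantize_group(q, lo, scale):
    return [lo + qi * scale for qi in q]

def quantize(tensor, group_size=64, bits=4):
    groups = [tensor[i:i + group_size] for i in range(0, len(tensor), group_size)]
    return [quantize_group(g, bits) for g in groups]
```

Because the (min, scale) pair is per-group, the worst-case rounding error is half a quantization step of that group's own range, which is what keeps the accuracy loss small.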

Empirical Performance

On a single T4 (16 GB) with 1.5 TB NVMe, FlexGen achieves:

  • 0.69 tokens/s for OPT-175B (prompt length 512, gen length 32) without compression; 1.12 tokens/s with 4-bit quantization.
  • Baseline comparison: 100× throughput improvement over HuggingFace Accelerate and DeepSpeed Zero-Inference at similar latency.
  • Batch size up to 256 (vs 1–2 for baselines).
  • Super-linear pipeline parallel scaling across multiple GPUs, e.g., 3.86 tokens/s for OPT-175B on 4 GPUs.
  • On HELM (OPT-IML-30B), a full 7-subtask run on a single T4 in 21 hours (~7k sequences).

Implementation and Usage

The system is implemented in PyTorch 1.13+ with key modules for LP policy search, runtime scheduling, memory paging, quantization, and model wrapping. Tensors are managed with mmap'd files and Linux page cache management to avoid host-level caching artifacts. Users can access FlexGen via a CLI or Python API, configuring key parameters to trade off latency versus throughput, e.g.:

from flexgen.runtime import run_generation
config = {
  "model_name": "facebook/opt-175b",
  "prompt_file": "prompts.txt",
  "output_file": "outputs.txt",
  "gpu_batch_size": 32,
  "num_gpus": 1,
  "offload_policy": "auto",
  "compression_bits": 4
}
run_generation(**config)

A subsequent FlexGen framework addresses the problem of controlled, consistent multi-view image synthesis from arbitrary text prompts, single-view images, or both. This system integrates 3D-aware captioning with compositional conditioning for image generation, built atop a Stable Diffusion 2.1 backbone.

Multi-Input Conditioning and 3D-Aware Annotation

The framework supports three conditioning modes: I→Multi-View (image-only input), T→Multi-View (text-only input), and I+T→Multi-View (both). Output is a 2×2 tiled image containing four orthogonal 512×512 renderings (front, left, back, right; fixed elevation 5°, 90° azimuth increments).
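Assembling the four views into the output grid is straightforward; a minimal sketch, assuming a row-major front/left/back/right arrangement (the exact tile order is an assumption here), with views as H×W nested lists standing in for image tensors:

```python
# Sketch of tiling four views into a 2x2 grid: front/left on the top row,
# back/right on the bottom (assumed ordering). Views are H x W nested lists.
def tile_views(front, left, back, right):
    top = [f_row + l_row for f_row, l_row in zip(front, left)]
    bottom = [b_row + r_row for b_row, r_row in zip(back, right)]
    return top + bottom
```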

A distinctive component is offline generation of 3D-aware captions via GPT-4V:

  • For each 3D object (from Objaverse), four rendered views are tiled and provided to GPT-4V, which emits both a “global caption” (overall object description, material) and structured “local captions” (attributes and geometry of sub-regions, e.g., “top left knob is silver”).
  • These captions are merged to form composite prompts used as conditional input during diffusion model training.

Adaptive Dual-Control Module

At each UNet attention layer, FlexGen fuses (i) keys/values from a reference image (self-attention path) and (ii) keys/values from CLIP-encoded prompt embeddings (cross-attention path). An adaptive gating network modulates the mixture coefficient α per block, producing fused keys and values:

$K_{\mathrm{fuse}} = [\sqrt{\alpha}\,K_{\mathrm{img}};\;\sqrt{1-\alpha}\,K_{\mathrm{txt}}]$

$V_{\mathrm{fuse}} = [\sqrt{\alpha}\,V_{\mathrm{img}};\;\sqrt{1-\alpha}\,V_{\mathrm{txt}}]$

Attention is computed over this fused space, balancing textual and visual control flexibly.
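The fusion above can be sketched in pure Python for a single query: image and text keys/values are concatenated with √α and √(1−α) weights, and softmax attention runs over the fused set. Shapes here are tiny lists-of-lists for clarity; this is an illustration of the mechanism, not the model's batched multi-head implementation.

```python
# Sketch of dual-control fusion: concatenate sqrt(alpha)-weighted image and
# text keys/values, then run softmax attention over the fused set.
import math

def fuse(alpha, img, txt):
    a, b = math.sqrt(alpha), math.sqrt(1 - alpha)
    return [[a * x for x in row] for row in img] + \
           [[b * x for x in row] for row in txt]

def attention(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                              # numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]

def dual_control_attention(query, alpha, k_img, v_img, k_txt, v_txt):
    return attention(query, fuse(alpha, k_img, k_txt),
                     fuse(alpha, v_img, v_txt))
```

Note that α scales contributions smoothly rather than masking them: even at α = 1, text positions still receive attention mass, but their (zero-scaled) values contribute nothing to the output.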

A “condition switcher” during training randomly ablates image or text, enabling a unified model to handle all conditioning configurations.

Training and Inference Methodology

The diffusion process is standard, defined as:

  • Forward pass adds Gaussian noise to the image; conditioning is injected through fused attention at all layers.
  • The denoising score-matching loss: $\mathcal{L} = \mathbb{E}_{t,x_0,\epsilon}\left[\left\Vert\epsilon - \epsilon_\theta(x_t, t, c)\right\Vert^2\right]$, where the condition $c$ includes the image reference and/or the text embedding, as present.
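The objective above can be illustrated with scalars: noise a clean sample via the closed-form forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, predict the noise with a stand-in network, and take the squared error. The `eps_theta` callable is a placeholder for the conditioned UNet, which in reality operates on image latents.

```python
# Toy illustration of the denoising loss: forward-noise a scalar sample and
# score a stand-in noise predictor with squared error.
import math, random

def training_loss(x0, alpha_bar_t, eps_theta):
    eps = random.gauss(0.0, 1.0)                       # sampled noise
    xt = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps
    pred = eps_theta(xt)                               # stand-in for eps_theta(x_t, t, c)
    return (eps - pred) ** 2
```

A predictor that algebraically inverts the forward process drives the loss to zero, which is the sense in which the trained network learns to recover ε.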

Training is performed on 147k Objaverse models (24 rendered views each; 4 used for captioning), using 8×A800 GPUs, batch size 32, and the Adam optimizer at lr = 10⁻⁵ for 180k iterations. Sampling uses DDIM with 75 steps; condition probabilities balance the three modes (image+text, image only, text only), with 10% unconditional training for robustness.

Controllability Mechanisms

FlexGen supports extensive controllability via direct prompt/caption editing:

  • Unseen-part control: Modifying or appending local captions directs generation of occluded geometry (“the back right panel has a red stripe”).
  • Material and texture control: Explicit material tags (e.g., “high metallic”, “low roughness”) ensure physically consistent renderings aligned with Objaverse mesh attributes.
  • Part-level control: Local captions allow part-specific edits (e.g. “the handle is matte black”), realized through cross-attention mechanisms.
  • View consistency: Emanates from the joint grid-based diffusion architecture and adaptive fusion; no auxiliary loss is used.

Experimental Results

On Google Scanned Objects (GSO), FlexGen surpasses prior baselines (Zero123++, Era3D, SyncDreamer, MVDream) on:

  • Single-view→multi-view: PSNR 22.31, LPIPS 0.12, Chamfer Distance 0.076, F1 score 0.928.
  • Text→multi-view: FID 35.56, Inception Score 13.41, CLIP score 0.83.

Ablation studies indicate that GPT-4V-based captions are critical for effective unseen-part synthesis and overall quantitative metrics. Removing adaptive text/image fusion significantly degrades controllability, particularly for material edits.

Implementation and Availability

The reference implementation targets PyTorch with the latent UNet backbone from Stable Diffusion 2.1, GPT-4V for offline captioning, and CLIP text encoders for embedding prompts. Training leverages data augmentation by random masking of conditioning modalities; inference accepts images, text, or both and outputs coherent view-grids in seconds to minutes, scalable with hardware.

Comparative Summary Table

| Aspect | FlexGen (LLM Inference) (Sheng et al., 2023) | FlexGen (Multi-View Synthesis) (Xu et al., 2024) |
| --- | --- | --- |
| Primary Domain | LLM inference on constrained GPUs | Controllable multi-view image synthesis |
| Core Technical Approach | Resource-aware offloading, quantization, LP scheduling | Diffusion with adaptive text/image fusion, GPT-4V captions |
| Input Modalities | Text prompts | Image, text, or both |
| Output | Generated text tokens | 2×2 multi-view image grid (front/left/back/right) |
| Main Innovations | Three-tier offloading, block-zigzag schedule, LP optimization | 3D-aware captioning via GPT-4V, adaptive dual-control attention |
| Implementation | PyTorch, CUDA streams, mmap, quantization routines | PyTorch, Stable Diffusion 2.1, CLIP, GPT-4V |

Significance in Resource-Constrained and Controllable Generation

FlexGen exemplifies advanced strategies for scalable generative modeling in settings where direct deployment of large models would otherwise be infeasible. In LLM inference, offloading plus quantization prolongs the utility of commodity hardware, democratizing access to high-complexity models previously restricted by memory cost. In multi-view synthesis, explicit 3D-awareness via cross-modal captioning coupled with fused attention extends the state of the art in controllable and semantically consistent scene generation.

Future Directions and Implications

The approaches instantiated in both FlexGen variants suggest generalizable principles for scalable generative modeling: hardware-adaptive scheduling, memory-efficient quantization, and multi-modal control with structured natural language prompts. Potential expansions include online or incremental captioning, streaming inference for either modality, and further unification of adaptive control strategies across text, image, and 3D generation, particularly as modality-agnostic generative models become more prevalent. These advances have implications not only for exceptional throughput and flexibility in human-computer interaction (LLMs) but also for rapid content prototyping in animation, game development, and virtual/augmented reality contexts.


FlexGen’s frameworks, in both domains, demonstrate that principled system-level and algorithmic innovations can overcome resource bottlenecks and control limitations in modern generative modeling (Sheng et al., 2023, Xu et al., 2024).
