FlexGen: Scalable, Flexible Generation
- FlexGen is the name of two frameworks that enable scalable generative computation: one for large language model inference and one for multi-view image synthesis.
- The LLM variant employs techniques like resource-aware offloading, block-zigzag scheduling, and LP-based optimization to achieve high-throughput inference on commodity hardware.
- The synthesis variant integrates GPT-4V-based 3D-aware captioning with adaptive dual-control attention to deliver controllable and consistent multi-view image outputs.
FlexGen refers to two distinct frameworks, each representing significant advances in high-throughput generative inference and controllable multi-view image synthesis. The first, originating in the context of LLMs, addresses efficient inference on commodity hardware through resource-aware offloading and quantization. The second is a text/image-conditioned diffusion framework for multi-view synthesis, leveraging 3D-aware caption generation to enable structured, controllable outputs. Both instantiate the core principle of flexible generative computation under hardware and modality constraints, albeit in different modalities and technical settings.
1. High-Throughput Inference for LLMs (Sheng et al., 2023)
FlexGen enables inference for transformer-based LLMs with hundreds of billions of parameters on a single commodity GPU by orchestrating tensor placement and compute across a hierarchical memory system comprising GPU memory, CPU DRAM, and NVMe disk. This design extends LLM inference to hardware previously incapable of supporting such model scales; for example, the 175B-parameter OPT model becomes deployable on a 16 GB NVIDIA T4 GPU.
Memory Hierarchy and Offloading
The FlexGen system abstracts device resources into a three-level hierarchy:
- GPU memory: Fastest, but typically the most constrained in size.
- CPU DRAM: Intermediate speed and capacity.
- NVMe disk: Largest, slowest tier.
At runtime, all significant LLM state (weights, intermediate activations, and the growing KV attention cache) is partitioned and dynamically offloaded across these tiers. The scheduler overlaps I/O with compute via concurrent CUDA streams and CPU threads, sometimes executing attention on the CPU to avoid transferring the large KV cache to the GPU.
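This overlap can be sketched in miniature: while layer i computes, a background worker (standing in for a CUDA copy stream) prefetches layer i+1's weights. The `load_weights` and `compute` callables below are hypothetical placeholders, not FlexGen's actual API:

```python
import concurrent.futures

def run_layers_with_prefetch(num_layers, load_weights, compute):
    """Overlap weight I/O with compute: while layer i runs, layer i+1's
    weights are fetched on a background thread (a stand-in for a CUDA
    copy stream in the real system)."""
    events = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, 0)              # prefetch first layer
        for i in range(num_layers):
            weights = pending.result()                    # wait for layer i's weights
            if i + 1 < num_layers:
                pending = io.submit(load_weights, i + 1)  # prefetch next layer
            events.append(compute(i, weights))            # compute overlaps the fetch
    return events

# Toy usage: "loading" returns a tag, "compute" records what ran with which weights.
log = run_layers_with_prefetch(
    3,
    load_weights=lambda i: f"W{i}",
    compute=lambda i, w: f"layer{i}:{w}",
)
print(log)  # ['layer0:W0', 'layer1:W1', 'layer2:W2']
```

The same double-buffering pattern generalizes to activations and KV cache segments; the real scheduler juggles several such streams at once.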
Block-Zigzag Scheduling and LP-Based Policy Search
Inference is formulated as scheduling tensor operations on a two-dimensional grid in which rows enumerate transformer layers (1…ℓ) and columns represent token positions (prompt plus generated, 1…s+n). Rather than a naïve row- or column-major traversal, FlexGen employs a “block-zigzag” schedule: it processes a block of prompts at once (block size bls) to amortize repeated weight loads across sub-batches, and interleaves loads of weights, activations, and KV cache with computation so that memory transfers overlap compute rather than stalling it.
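The amortization at the heart of the block schedule can be sketched as follows (order of operations only; the real scheduler also overlaps these steps and zigzags across the layer-token grid):

```python
def block_schedule(num_layers, num_tokens, num_batches_per_block):
    """Enumerate (token, layer, batch) compute steps in a block schedule:
    for each token column and layer row, every micro-batch in the block
    reuses the same loaded weights, so one weight load is amortized over
    num_batches_per_block compute steps (traversal-order sketch only)."""
    steps, weight_loads = [], 0
    for token in range(num_tokens):
        for layer in range(num_layers):
            weight_loads += 1                        # load layer weights once...
            for batch in range(num_batches_per_block):
                steps.append((token, layer, batch))  # ...reused by every batch
    return steps, weight_loads

steps, loads = block_schedule(num_layers=2, num_tokens=3, num_batches_per_block=4)
print(loads)  # 6 weight loads for 24 compute steps; per-prompt order would need 24
```

Without blocking, each of the 24 compute steps would trigger its own weight load; blocking cuts that to one load per (token, layer) cell.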
A linear programming (LP) formulation determines optimal data placement across memory tiers (the fractions of weights, activations, and KV cache resident on each device) and schedules per-block operations while observing all device capacity constraints. Given inference hyperparameters and hardware characteristics, the LP minimizes the average processing time per prompt under hardware limits: $\begin{array}{rl} \min_{w,h,c} & \dfrac{T_{\mathrm{pre}}\,\ell + T_{\mathrm{gen}}\,(n-1)\,\ell}{\mathrm{bls}} \\[6pt] \text{s.t.} & \text{device memory constraints} \\ & w_g+w_c+w_d=1,\quad h_g+h_c+h_d=1,\quad c_g+c_c+c_d=1 \end{array}$ where $w$, $h$, and $c$ denote the placement fractions of weights, activations, and KV cache, and the subscripts $g$, $c$, $d$ indicate residence on GPU, CPU, and disk, respectively.
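The flavor of this policy search can be illustrated with a tiny brute-force stand-in for the LP. Everything below (the bandwidth figures, capacities, and the one-term transfer-time cost model) is invented for illustration and is far simpler than FlexGen's actual cost model:

```python
from itertools import product

def search_policy(weight_gb, gpu_gb, cpu_gb, bw_cpu=20.0, bw_disk=2.0):
    """Grid-search weight placement fractions (wg, wc, wd) over
    GPU/CPU/disk, minimizing a toy per-pass transfer-time estimate
    subject to device capacity constraints."""
    grid = [i / 10 for i in range(11)]
    best = None
    for wg, wc in product(grid, grid):
        wd = round(1.0 - wg - wc, 10)
        if wd < 0:
            continue                                      # fractions must sum to 1
        if wg * weight_gb > gpu_gb or wc * weight_gb > cpu_gb:
            continue                                      # capacity constraints
        # Time to stream non-GPU-resident weights each pass (GB / (GB/s)).
        t = wc * weight_gb / bw_cpu + wd * weight_gb / bw_disk
        if best is None or t < best[0]:
            best = (t, wg, wc, wd)
    return best

# Toy instance loosely shaped like OPT-175B (~350 GB of FP16 weights)
# on a 16 GB GPU with ~200 GB of usable CPU DRAM.
t, wg, wc, wd = search_policy(weight_gb=350.0, gpu_gb=14.0, cpu_gb=200.0)
print(wg, wc, wd)  # pushes as much as capacity allows off the slow disk tier
```

The real LP solves this jointly for weights, activations, and KV cache, with a latency model calibrated from hardware measurements rather than a grid search.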
Quantization and Compression
FlexGen utilizes group-wise asymmetric quantization for both model weights and the KV attention cache. Each contiguous group of elements is quantized to 4 bits per element. This compression, applied along output channels and vector dimensions, yields only a ~0.2% accuracy drop on the LAMBADA benchmark and, combined with I/O-efficient scheduling, substantially reduces the memory footprint of weights and KV cache.
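A minimal sketch of group-wise asymmetric quantization on a single group (pure Python; group selection and tensor layout are omitted):

```python
def quantize_group(xs, bits=4):
    """Asymmetric quantization of one contiguous group: store a per-group
    (min, scale) pair and map each element to an integer in [0, 2^bits - 1]."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # avoid div-by-zero on flat groups
    q = [round((x - lo) / scale) for x in xs]
    return q, lo, scale

def dequantize_group(q, lo, scale):
    """Invert the mapping: reconstruct approximate floats from the codes."""
    return [lo + v * scale for v in q]

group = [0.0, 0.5, 1.0, 1.5]                     # one contiguous group of elements
q, lo, scale = quantize_group(group)
restored = dequantize_group(q, lo, scale)
err = max(abs(a - b) for a, b in zip(group, restored))
print(q)  # [0, 5, 10, 15] -- 4-bit codes in [0, 15]
```

"Asymmetric" here means the zero point floats with the group minimum rather than being pinned to zero, which suits the skewed value distributions of weights and KV cache entries.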
Empirical Performance
On a single T4 (16 GB) with 1.5 TB NVMe, FlexGen achieves:
- 0.69 tokens/s for OPT-175B (prompt length 512, gen length 32) without compression; 1.12 tokens/s with 4-bit quantization.
- Baseline comparison: up to 100× throughput improvement over HuggingFace Accelerate and DeepSpeed ZeRO-Inference at similar latency.
- Batch size up to 256 (vs 1–2 for baselines).
- Super-linear pipeline parallel scaling across multiple GPUs, e.g., 3.86 tokens/s for OPT-175B on 4 GPUs.
- On HELM with OPT-IML-30B, a full 7-subtask run (7k sequences) completes on a single T4 in 21 hours.
Implementation and Usage
The system is implemented in PyTorch 1.13+ with key modules for LP policy search, runtime scheduling, memory paging, quantization, and model wrapping. Tensors are managed with mmap'd files and Linux page cache management to avoid host-level caching artifacts. Users can access FlexGen via a CLI or Python API, configuring key parameters to trade off latency versus throughput, e.g.:
```python
from flexgen.runtime import run_generation

config = {
    "model_name": "facebook/opt-175b",
    "prompt_file": "prompts.txt",
    "output_file": "outputs.txt",
    "gpu_batch_size": 32,
    "num_gpus": 1,
    "offload_policy": "auto",
    "compression_bits": 4,
}
run_generation(**config)
```
2. Flexible Multi-View Image Synthesis from Text and Images (Xu et al., 2024)
A subsequent FlexGen framework addresses the problem of controlled, consistent multi-view image synthesis from arbitrary text prompts, single-view images, or both. This system integrates 3D-aware captioning with compositional conditioning for image generation, built atop a Stable Diffusion 2.1 backbone.
Multi-Input Conditioning and 3D-Aware Annotation
The framework supports three conditioning modes: I→Multi-View (image-only input), T→Multi-View (text-only input), and I+T→Multi-View (both). Output is a tiled image containing four orthogonal 512×512 renderings (front, left, back, right) at a fixed elevation of 5° and 90° azimuth increments.
A distinctive component is offline generation of 3D-aware captions via GPT-4V:
- For each 3D object (from Objaverse), four rendered views are tiled and provided to GPT-4V, which emits both a “global caption” (overall object description, material) and structured “local captions” (attributes and geometry of sub-regions, e.g., “top left knob is silver”).
- These captions are merged to form composite prompts used as conditional input during diffusion model training.
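How the merge might look in code (the `compose_prompt` helper and the period-joined template are hypothetical illustrations; the paper's exact prompt format is not reproduced here):

```python
def compose_prompt(global_caption, local_captions):
    """Merge a GPT-4V global caption with structured local captions into
    one composite conditioning prompt (illustrative template only)."""
    parts = [global_caption.rstrip(".")]
    parts += [c.rstrip(".") for c in local_captions]
    return ". ".join(parts) + "."

prompt = compose_prompt(
    "A wooden cabinet with brass fittings",
    ["top left knob is silver", "the back panel is unpainted"],
)
print(prompt)
```

Because the local captions are free-form text, editing or appending one (as in the controllability experiments below) simply changes the conditioning string, with no retraining required.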
Adaptive Dual-Control Module
At each UNet attention layer, FlexGen fuses (i) keys/values derived from a reference image (via self-attention) with (ii) keys/values from CLIP-encoded prompt embeddings (via cross-attention). An adaptive gating network modulates the mixture coefficient per block, producing fused keys/values; attention is then computed over this fused space, flexibly balancing textual and visual control.
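A scalar-gate sketch of the fusion (pure Python, single query, interpolated keys/values; FlexGen's gating network predicts the coefficient per attention block and the real module operates on batched multi-head tensors):

```python
import math

def gated_fusion_attention(q, k_img, v_img, k_txt, v_txt, gate):
    """Single-query attention over keys/values fused from an image branch
    and a text branch; `gate` in [0, 1] is the mixture coefficient (here a
    fixed scalar standing in for the adaptive gating network's output)."""
    keys = [[gate * a + (1 - gate) * b for a, b in zip(ki, kt)]
            for ki, kt in zip(k_img, k_txt)]
    vals = [[gate * a + (1 - gate) * b for a, b in zip(vi, vt)]
            for vi, vt in zip(v_img, v_txt)]
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                                  # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, vals)) for j in range(len(vals[0]))]

out = gated_fusion_attention(
    q=[1.0, 0.0],
    k_img=[[1.0, 0.0], [0.0, 1.0]], v_img=[[1.0, 1.0], [0.0, 0.0]],
    k_txt=[[1.0, 0.0], [0.0, 1.0]], v_txt=[[1.0, 1.0], [0.0, 0.0]],
    gate=0.5,
)
```

With gate near 1 the output follows the reference image's keys/values; near 0 it follows the text branch, which is how a single model can serve all three conditioning modes.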
A “condition switcher” during training randomly ablates image or text, enabling a unified model to handle all conditioning configurations.
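A sketch of such a switcher (the function name and branch probabilities are illustrative; the training section below notes that the three conditioned modes are balanced with 10% unconditional samples):

```python
def switch_conditions(image_emb, text_emb, r, p_image=0.3, p_text=0.3, p_both=0.3):
    """Training-time condition switcher: depending on a uniform draw
    r in [0, 1), keep the image only, the text only, both, or neither
    (classifier-free-guidance-style dropout)."""
    if r < p_image:
        return image_emb, None               # image-conditioned step
    if r < p_image + p_text:
        return None, text_emb                # text-conditioned step
    if r < p_image + p_text + p_both:
        return image_emb, text_emb           # jointly conditioned step
    return None, None                        # remaining mass: unconditional

# During training, r would be drawn per example, e.g. r = random.random().
```

Dropping each modality at training time forces the shared UNet to remain useful under every conditioning configuration seen at inference.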
Training and Inference Methodology
The diffusion process is standard:
- The forward pass adds Gaussian noise to the image; conditioning is injected through fused attention at all layers.
- The denoising score-matching loss is the usual $\epsilon$-prediction objective, $\mathcal{L} = \mathbb{E}_{x_0,\,c,\,t,\,\epsilon \sim \mathcal{N}(0,I)}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]$, where the condition $c$ includes both the image reference and the text embedding, where present.
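A scalar toy of one training step under this objective (`eps_model` is a stand-in for the conditioned UNet; all names and values are illustrative):

```python
import math
import random

def training_step(x0, eps_model, alpha_bar, cond, rng):
    """One denoising score-matching step: corrupt x0 with Gaussian noise
    at cumulative noise level alpha_bar, then penalize the model's noise
    prediction with mean squared error."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                 # true noise
    x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1 - alpha_bar) * e
           for a, e in zip(x0, eps)]                         # noised sample
    pred = eps_model(x_t, alpha_bar, cond)                   # predicted noise
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)

# A perfect model returning the true noise would drive the loss to 0;
# this zero-predicting stub instead pays the noise's mean square.
rng = random.Random(0)
loss = training_step([0.2, -0.4, 1.0], lambda x, a, c: [0.0] * len(x),
                     alpha_bar=0.5, cond="composite prompt", rng=rng)
```

The condition `c` flows into `eps_model` exactly as the fused image/text embeddings flow into the UNet's attention layers.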
Training is performed on 147k Objaverse models (24 rendered views each, 4 of which are used for captioning) on 8×A800 GPUs with batch size 32, using the Adam optimizer for 180k iterations. Sampling uses DDIM with 75 steps; during training, condition probabilities balance the three modes (image+text, image only, text only), with 10% unconditional samples for robustness.
Controllability Mechanisms
FlexGen supports extensive controllability via direct prompt/caption editing:
- Unseen-part control: Modifying or appending local captions directs generation of occluded geometry (“the back right panel has a red stripe”).
- Material and texture control: Explicit material tags (e.g., “high metallic”, “low roughness”) ensure physically consistent renderings aligned with Objaverse mesh attributes.
- Part-level control: Local captions allow part-specific edits (e.g. “the handle is matte black”), realized through cross-attention mechanisms.
- View consistency: Emanates from the joint grid-based diffusion architecture and adaptive fusion; no auxiliary loss is used.
Experimental Results
On Google Scanned Objects (GSO), FlexGen surpasses prior baselines (Zero123++, Era3D, SyncDreamer, MVDream) on:
- Single-view→multi-view: PSNR 22.31, LPIPS 0.12, Chamfer Distance 0.076, F-score 0.928.
- Text→multi-view: FID 35.56, Inception Score 13.41, CLIP score 0.83.
Ablation studies indicate that GPT-4V-based captions are critical for effective unseen-part synthesis and overall quantitative metrics. Removing adaptive text/image fusion significantly degrades controllability, particularly for material edits.
Implementation and Availability
The reference implementation targets PyTorch with the latent UNet backbone from Stable Diffusion 2.1, GPT-4V for offline captioning, and CLIP text encoders for embedding prompts. Training leverages data augmentation by random masking of conditioning modalities; inference accepts images, text, or both and outputs coherent view-grids in seconds to minutes, scalable with hardware.
3. Comparative Summary Table
| Aspect | FlexGen (LLM Inference) (Sheng et al., 2023) | FlexGen (Multi-View Synthesis) (Xu et al., 2024) |
|---|---|---|
| Primary Domain | LLM inference on constrained GPU | Controllable multi-view image synthesis |
| Core Technical Approach | Resource-aware offloading, quantization, LP scheduling | Diffusion with adaptive text/image fusion, GPT-4V captions |
| Input Modalities | Text prompts | Image, text, or both |
| Output | Generated text tokens | Multi-view image grid (front/left/back/right) |
| Main Innovations | Three-tier offloading, block-zigzag policy, LP optimization | 3D-aware captioning via GPT-4V, adaptive dual-control attention |
| Implementation | PyTorch, CUDA streams, mmap, quantization routines | PyTorch, Stable Diffusion 2.1, CLIP, GPT-4V |
4. Significance in Resource-Constrained and Controllable Generation
FlexGen exemplifies advanced strategies for scalable generative modeling in settings where direct deployment of large models would otherwise be infeasible. In LLM inference, offloading plus quantization prolongs the utility of commodity hardware, democratizing access to high-complexity models previously restricted by memory cost. In multi-view synthesis, explicit 3D-awareness via cross-modal captioning coupled with fused attention extends the state of the art in controllable and semantically consistent scene generation.
5. Future Directions and Implications
The approaches instantiated in both FlexGen variants suggest generalizable principles for scalable generative modeling: hardware-adaptive scheduling, memory-efficient quantization, and multi-modal control with structured natural language prompts. Potential expansions include online or incremental captioning, streaming inference for either modality, and further unification of adaptive control strategies across text, image, and 3D generation, particularly as modality-agnostic generative models become more prevalent. These advances have implications not only for exceptional throughput and flexibility in human-computer interaction (LLMs) but also for rapid content prototyping in animation, game development, and virtual/augmented reality contexts.
FlexGen’s frameworks, in both domains, demonstrate that principled system-level and algorithmic innovations can overcome resource bottlenecks and control limitations in modern generative modeling (Sheng et al., 2023, Xu et al., 2024).