FastGen Framework: Optimized ML & LLM Systems
- The name FastGen covers two distinct systems: a fast, generic C++ optimization framework that underpins the mlpack machine learning library, and DeepSpeed-FastGen, a scalable LLM serving system built on DeepSpeed.
- The former relies on compile-time template techniques and the latter on dynamic prompt scheduling; both aim to maximize performance and reduce latency across diverse machine learning tasks.
- Together they support a wide range of applications, from supervised learning and matrix completion to real-time text generation for interactive AI deployments.
The FastGen Framework refers to two distinct systems in the research literature: (1) the generic and fast C++ optimization framework that underpins the mlpack machine learning library (1711.06581), and (2) DeepSpeed-FastGen, a high-throughput text generation system for LLMs built atop DeepSpeed-MII and DeepSpeed-Inference (2401.08671). Each serves a different technical purpose, but both are unified by the goal of efficiently supporting scalable, flexible, high-performance computation in applied machine learning settings.
1. Architectures and Core Design Principles
mlpack FastGen (C++ Optimization Framework)
The mlpack optimization framework is structured around a policy-based, template-metaprogramming architecture. This design facilitates generic programming in C++, allowing the compiler to generate highly optimized code paths. The primary interfaces are the FunctionType and Optimizer APIs:
- FunctionType API: Objective functions implement methods such as Evaluate(), Gradient(), and (for separable/partially differentiable cases) PartialGradient(), NumFunctions(), and NumFeatures().
- Optimizer API: Every optimizer provides a single entry point via the Optimize() method.
The system supports a broad class of objective functions, including separable/nonseparable, sparse/dense, constrained/unconstrained, and partially differentiable functions. This architecture yields high extensibility and runtime efficiency by exploiting static (compile-time) resolution over dynamic polymorphism, eliminating the overhead of virtual calls.
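To make the compile-time dispatch concrete, the following minimal sketch (not mlpack's actual implementation; the class name, step size, and iteration count are illustrative) shows how templatizing an optimizer on the objective type lets the compiler resolve Evaluate() and Gradient() statically, with no virtual-call overhead:

```cpp
#include <armadillo>

// Hypothetical sketch: a fixed-step gradient descent optimizer that is
// generic over any FunctionType satisfying the API described above.
template<typename FunctionType>
class SimpleGradientDescent
{
 public:
  double Optimize(FunctionType& function, arma::mat& parameters)
  {
    arma::mat gradient;
    for (size_t i = 0; i < 1000; ++i)
    {
      // Resolved at compile time: no virtual dispatch, fully inlinable.
      function.Gradient(parameters, gradient);
      parameters -= 0.01 * gradient;
    }
    return function.Evaluate(parameters);
  }
};
```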
DeepSpeed-FastGen (High-Throughput LLM Serving)
DeepSpeed-FastGen is a serving system that combines DeepSpeed-MII, which offers access to an extensive HuggingFace model zoo and tokenizers, with DeepSpeed-Inference, which supplies advanced inference kernels and optimized memory management. The architecture leverages continuous batching, high hardware utilization, and an efficient blocked key-value (KV) caching scheme to serve LLMs at production scale. Its central innovation is Dynamic SplitFuse, a prompt-and-generation composition strategy that orchestrates prompt handling and token generation across variable input lengths and batch conditions.
2. Methodological Innovations
mlpack FastGen
- Generic Objective Function-Optimizer Pairing: Through template metaprogramming, arbitrary user-defined objective functions and optimizers can be paired, as long as API contracts are respected. This facilitates rapid experimentation and prototyping.
- Compile-Time Interface Checking: Techniques such as SFINAE and static_assert guarantee that required methods (e.g., Gradient()) are present at compile time rather than failing at runtime, which reduces debugging complexity and carries no performance cost (see the sketch after this list).
- Separable and Nonseparable Support: The dual Evaluate() interface (with and without batching parameters) bridges algorithms expecting either separable or nonseparable loss formulations.
- Sparse and Dense Gradients: Templates allow optimizers to accept both dense (e.g., arma::mat) and sparse (e.g., arma::sp_mat) gradients, extending applicability to memory- and speed-sensitive contexts.
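A minimal sketch of this style of compile-time checking, using the standard C++17 detection idiom (illustrative only, not mlpack's actual traits machinery):

```cpp
#include <armadillo>
#include <type_traits>
#include <utility>

// Detect whether T provides Gradient(const arma::mat&, arma::mat&).
template<typename T, typename = void>
struct HasGradient : std::false_type { };

template<typename T>
struct HasGradient<T, std::void_t<decltype(
    std::declval<T&>().Gradient(std::declval<const arma::mat&>(),
                                std::declval<arma::mat&>()))>>
    : std::true_type { };

// An optimizer can then fail with a readable diagnostic at compile time.
template<typename FunctionType>
void RequireDifferentiable()
{
  static_assert(HasGradient<FunctionType>::value,
      "FunctionType must implement Gradient(const arma::mat&, arma::mat&)");
}
```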
DeepSpeed-FastGen
- Dynamic SplitFuse: The method dynamically splits long prompts into smaller chunks across multiple forward passes, while short prompts are fused to reach a target token budget. Only the final pass performs generation, minimizing outlier latency for long inputs and maintaining optimally-sized workloads for high-throughput operation.
- Scheduling Insight: The framework exploits the concavity of forward-pass efficiency (tokens processed per second) as a function of the number of tokens in a pass. For a concave efficiency function $f$ and token counts $x_1, x_2$,

$$\frac{1}{2}\left(f(x_1) + f(x_2)\right) \;\le\; f\!\left(\frac{x_1 + x_2}{2}\right),$$

so a fixed token budget is processed most efficiently when split evenly across forward passes. This property justifies even splitting for prompt processing and underpins the SplitFuse strategy (see the sketch below).
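The following Python pseudocode is a hedged sketch of the batch composition logic only (names like token_budget and the queue handling are hypothetical; the real scheduler also manages the KV cache and batching state):

```python
def compose_forward_pass(running, pending, token_budget):
    """Sketch of Dynamic SplitFuse batch composition.

    running: sequences already in generation; each contributes one token.
    pending: list of (seq, tokens_left) prompts awaiting prefill.
    Returns a list of (seq, num_tokens) pairs filling the token budget.
    """
    # Generation tokens are scheduled first, one per running sequence.
    batch = [(seq, 1) for seq in running]
    budget = token_budget - len(batch)

    for seq, tokens_left in pending:
        if budget <= 0:
            break
        # Long prompts are split into chunks across multiple passes;
        # short prompts are fused together to fill the budget exactly.
        chunk = min(tokens_left, budget)
        batch.append((seq, chunk))
        budget -= chunk
        # The caller moves seq into `running` once its entire prompt has
        # been processed; only then does it begin generating tokens.

    return batch
```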
3. Implementation Details
mlpack FastGen
Objective functions require at minimum the implementation of an Evaluate() method:

```cpp
double Evaluate(const arma::mat& parameters);                                  // Nonseparable
double Evaluate(const arma::mat& parameters, size_t start, size_t batchSize);  // Separable
```
Differentiable functions additionally supply a Gradient() method; the gradient is passed by reference so the result is returned to the caller:

```cpp
template<typename GradType>
void Gradient(const arma::mat& parameters, GradType& gradient);
```
Optimizers expose the single Optimize() entry point, templatized on the objective; on return, parameters holds the optimum found:

```cpp
template<typename FunctionType>
double Optimize(FunctionType& function, arma::mat& parameters);
```
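Putting the pieces together, here is a hedged end-to-end sketch pairing a hand-written objective with a stock optimizer. It uses the ensmallen library, the present-day home of mlpack's optimization framework; the objective and all names are illustrative:

```cpp
#include <iostream>
#include <armadillo>
#include <ensmallen.hpp>

// A simple nonseparable objective: f(x) = ||x - target||^2.
class SquaredDistance
{
 public:
  explicit SquaredDistance(const arma::mat& target) : target(target) { }

  // FunctionType API: objective value at the given parameters.
  double Evaluate(const arma::mat& parameters)
  { return arma::accu(arma::square(parameters - target)); }

  // FunctionType API: gradient at the given parameters.
  void Gradient(const arma::mat& parameters, arma::mat& gradient)
  { gradient = 2.0 * (parameters - target); }

 private:
  arma::mat target;
};

int main()
{
  SquaredDistance f(arma::ones<arma::mat>(10, 1));
  arma::mat parameters(10, 1, arma::fill::randu);

  // Any optimizer satisfying the Optimizer API can be swapped in here.
  ens::L_BFGS optimizer;
  const double minimum = optimizer.Optimize(f, parameters);

  parameters.print("optimum (should be all ones):");
  std::cout << "objective at optimum: " << minimum << std::endl;
  return 0;
}
```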
DeepSpeed-FastGen
Deployment supports two principal modes:
- Non-persistent pipeline: Quick, interactive usage for session-based or exploratory tasks—model instantiation is handled within the Python script’s lifetime.
- Persistent deployment: Suited to production, with a lightweight gRPC server and built-in load balancing for scalable, long-running services. Persistent serving accommodates concurrency and multiple clients.
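The two modes look roughly as follows; the API names reflect the DeepSpeed-MII interface as of the FastGen release (mii.pipeline, mii.serve), and the model identifier is just an example, so treat this as a sketch rather than a definitive recipe:

```python
import mii

# Non-persistent pipeline: the model lives only for this script's lifetime.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is"], max_new_tokens=64))

# Persistent deployment: starts a long-running gRPC server with built-in
# load balancing; other processes can attach via mii.client(...).
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=64))
client.terminate_server()
```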
The code base is accessible via GitHub, supporting installation through standard Python tooling. FastGen integrates seamlessly with HuggingFace model families such as LLaMA, LLaMA-2, Mistral, and Facebook OPT.
4. Performance and Comparative Evaluation
mlpack FastGen
Benchmarks demonstrate performance on par with hand-optimized implementations, with negligible runtime overhead due to compile-time code generation. The framework is compared against TensorFlow, Caffe, scikit-learn, SciPy, and MATLAB, supporting a broader spectrum of optimizers (gradient-based and population-based) and objective function types. Static compile-time checks further reduce runtime error incidence.
DeepSpeed-FastGen
Key empirical claims, all relative to vLLM, include:
- Up to 2.3x higher effective throughput.
- 2x lower latency on average.
- Up to 3.7x lower (token-level) tail latency.
Benchmarking adopts two main strategies:
- Throughput-latency measurement: Latency and throughput are measured while varying client concurrency and request size, across Llama-2 7B/13B/70B models and hardware platforms including A100, H100, and A6000 GPUs.
- Effective throughput: End-to-end prompt and generation Service Level Agreements (SLAs) define real-time usability from initial token to sustained streaming. Load balancing experiments using multiple replicas (e.g., 16x) confirm near-linear scalability.
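As an illustration of the effective-throughput metric, the sketch below counts only requests that meet both SLAs toward throughput; the record fields and SLA thresholds are hypothetical, not names from the paper's benchmark harness:

```python
def effective_throughput(requests, first_token_sla_s, per_token_sla_s):
    """Generated tokens per second, counting only SLA-compliant requests.

    Each request is a dict with (hypothetical) fields: start_time, end_time,
    first_token_latency, per_token_latency, and generated_tokens.
    """
    compliant = [
        r for r in requests
        if r["first_token_latency"] <= first_token_sla_s
        and r["per_token_latency"] <= per_token_sla_s
    ]
    window = (max(r["end_time"] for r in requests)
              - min(r["start_time"] for r in requests))
    return sum(r["generated_tokens"] for r in compliant) / window
```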
5. Applications and Deployment Scenarios
mlpack FastGen
The framework is applicable across a spectrum of optimization tasks in machine learning, including:
- Supervised learning (e.g., logistic regression, neural networks).
- Unsupervised methods (e.g., manifold and metric learning).
- Matrix completion, reinforcement learning, sparse modeling (autoencoders, SVMs).
Its modular design simplifies the introduction of new function types and constraints: a new objective need only implement a small number of API methods, as the sketch below illustrates for a separable function.
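A hedged sketch of a separable objective (the class, data layout, and loss are illustrative; the method signatures follow the separable interface described in Section 3):

```cpp
#include <armadillo>

// Hypothetical separable objective: f(x) = sum_i ||x - data_i||^2, where
// each column of `data` is one point and contributes one term f_i.
class SeparableSquaredError
{
 public:
  explicit SeparableSquaredError(const arma::mat& data) : data(data) { }

  // Number of separable terms f_i.
  size_t NumFunctions() const { return data.n_cols; }

  // Evaluate the sum of terms in [start, start + batchSize).
  double Evaluate(const arma::mat& parameters, size_t start, size_t batchSize)
  {
    double loss = 0.0;
    for (size_t i = start; i < start + batchSize; ++i)
      loss += arma::accu(arma::square(parameters - data.col(i)));
    return loss;
  }

  // Gradient of the same mini-batch of terms.
  void Gradient(const arma::mat& parameters, size_t start,
                arma::mat& gradient, size_t batchSize)
  {
    gradient.zeros(parameters.n_rows, parameters.n_cols);
    for (size_t i = start; i < start + batchSize; ++i)
      gradient += 2.0 * (parameters - data.col(i));
  }

 private:
  arma::mat data;
};
```

An SGD-family optimizer can then sample mini-batches of these terms through NumFunctions() and the batched overloads.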
DeepSpeed-FastGen
The serving system is designed for use cases demanding high-throughput, low-latency text generation from LLMs. Applications encompass:
- Interactive chat and real-time AI assistants.
- Batch or streaming inference for large-scale document processing.
- Scalable deployment for LLM-powered web services.
Robust deployment flexibility enables researchers and practitioners to match resource footprint and latency guarantees to application requirements.
6. Ongoing Development and Community Engagement
mlpack FastGen
The generic design is conducive to rapid extension by the research community, especially as new optimization paradigms or function forms emerge. User-written code integrates seamlessly as long as it complies with the minimal interface requirements.
DeepSpeed-FastGen
Development is ongoing, with planned enhancements in:
- Improved performance via kernel and scheduling optimizations.
- Additional model family support beyond current HuggingFace architectures.
- Broader hardware backend compatibility.
- Expanded benchmarking for external validation and transparency.
The open-source repository enables active community participation, with contributions, issue reports, and feedback encouraged.
7. Mathematical Formulation and Optimization Criteria
Both frameworks are grounded in classic continuous optimization, formalized as

$$\operatorname*{arg\,min}_{x} f(x).$$

For separable objectives, the function decomposes into a sum of individual terms:

$$f(x) = \sum_{i=1}^{n} f_i(x).$$
These formulations guide implementation, particularly the handling of mini-batches and distributed computation in both systems. In the case of DeepSpeed-FastGen, concavity properties of latency functions inform the scheduling and batching mechanisms that yield observed throughput and latency gains.
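For example, under the separable form, a standard mini-batch stochastic gradient step over a sampled batch $B_t$ (one common convention averages over the batch) is

$$x_{t+1} \;=\; x_t \;-\; \eta_t \,\frac{1}{|B_t|} \sum_{i \in B_t} \nabla f_i(x_t),$$

which is exactly the access pattern the batched Evaluate()/Gradient() overloads expose.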
In summary, the FastGen frameworks embodied by mlpack’s C++ infrastructure (1711.06581) and DeepSpeed-FastGen’s LLM serving system (2401.08671) exemplify generically extensible, high-performance computational design for modern machine learning applications. Each achieves its respective goals through careful architectural choices, well-defined interfaces, and a consistent focus on both runtime efficiency and deployment flexibility.