FastGen Framework: Optimized ML & LLM Systems

Updated 5 July 2025
  • The name FastGen covers two distinct systems: a fast, generic C++ optimization framework that underpins the mlpack machine learning library, and DeepSpeed-FastGen, a scalable LLM serving system built on DeepSpeed.
  • The former relies on compile-time template techniques to maximize performance; the latter uses dynamic prompt scheduling to raise throughput and reduce latency at serving time.
  • Together, the two systems span applications from supervised learning and matrix completion to real-time text generation for interactive AI deployments.

The FastGen Framework refers to two distinct systems in the research literature: (1) the generic and fast C++ optimization framework that underpins the mlpack machine learning library (1711.06581), and (2) DeepSpeed-FastGen, a high-throughput text generation system for LLMs built atop DeepSpeed-MII and DeepSpeed-Inference (2401.08671). Each serves a different technical purpose, but both are unified by the goal of efficiently supporting scalable, flexible, high-performance computation in applied machine learning settings.

1. Architectures and Core Design Principles

mlpack FastGen (C++ Optimization Framework)

The mlpack optimization framework is structured around a policy-based and template metaprogramming architecture. This design facilitates generic programming in C++, allowing the compiler to generate highly optimized code paths. The primary interfaces are the FunctionType and Optimizer APIs:

  • FunctionType API: Objective functions implement methods such as Evaluate(), Gradient(), and (for separable/partially differentiable cases) PartialGradient(), NumFunctions(), and NumFeatures().
  • Optimizer API: Every optimizer provides a single entry point via the Optimize() method.

The system supports a broad class of objective functions, including separable/nonseparable, sparse/dense, constrained/unconstrained, and partially differentiable functions. This architecture yields high extensibility and runtime efficiency by exploiting static (compile-time) resolution over dynamic polymorphism, eliminating the overhead of virtual calls.
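
To make the API contract concrete, here is a minimal sketch of a conforming objective function; the least-squares objective and the class name are illustrative choices, not types shipped with mlpack:

#include <armadillo>

// Illustrative FunctionType: f(x) = 0.5 * || A x - b ||^2.
// Conformance is purely structural: the class implements Evaluate() and
// Gradient(), and no base class or virtual function is involved.
class LeastSquaresFunction
{
 public:
  LeastSquaresFunction(const arma::mat& A, const arma::vec& b) : A(A), b(b) { }

  // Nonseparable objective value at the given parameters.
  double Evaluate(const arma::mat& parameters)
  {
    const arma::vec residual = A * parameters - b;
    return 0.5 * arma::dot(residual, residual);
  }

  // Dense gradient: A^T (A x - b).  A templated overload could additionally
  // accept sparse gradient types such as arma::sp_mat.
  void Gradient(const arma::mat& parameters, arma::mat& gradient)
  {
    gradient = A.t() * (A * parameters - b);
  }

 private:
  const arma::mat& A;
  const arma::vec& b;
};

Because the optimizer receives this type as a template parameter, the calls to Evaluate() and Gradient() are resolved statically and can be inlined, with no virtual dispatch.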

DeepSpeed-FastGen (High-Throughput LLM Serving)

DeepSpeed-FastGen is a serving system that combines DeepSpeed-MII, which provides access to an extensive HuggingFace model zoo and its tokenizers, with DeepSpeed-Inference, which supplies optimized inference kernels and memory management. The architecture leverages continuous batching, high sustained hardware utilization, and an efficient blocked key-value caching scheme (sketched below) to serve LLMs at production scale. A central innovation is the Dynamic SplitFuse prompt-and-generation composition strategy, which orchestrates prompt handling and token generation across variable input lengths and batch conditions.
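
As a rough illustration of the blocked caching idea, consider the following conceptual sketch; the block size, data structures, and pool bookkeeping are assumptions made for exposition, not DeepSpeed-Inference internals:

#include <cstddef>
#include <vector>

// Conceptual blocked KV cache: token positions of a sequence are mapped onto
// fixed-size blocks drawn from a shared pool, so memory is allocated in
// uniform chunks instead of one contiguous region per sequence.
class BlockedKVCache
{
 public:
  explicit BlockedKVCache(const std::size_t numBlocks)
  {
    for (std::size_t b = 0; b < numBlocks; ++b)
      freeBlocks.push_back(b);
  }

  // Append one token to a sequence's block table, grabbing a new block from
  // the pool whenever the current one is full.  Returns false when the pool
  // is exhausted (a scheduler would then pause or evict a sequence).
  bool AppendToken(std::vector<std::size_t>& blockTable,
                   const std::size_t seqLength)
  {
    if (seqLength % kBlockSize == 0)  // Current block full, or first token.
    {
      if (freeBlocks.empty())
        return false;
      blockTable.push_back(freeBlocks.back());
      freeBlocks.pop_back();
    }
    return true;
  }

  static constexpr std::size_t kBlockSize = 16;  // Tokens per block (assumed).

 private:
  std::vector<std::size_t> freeBlocks;  // Indices of unused blocks.
};

Allocating the cache in uniform blocks avoids reserving one large contiguous region per sequence, so fragmentation stays low and the scheduler can admit more concurrent sequences.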

2. Methodological Innovations

mlpack FastGen

  • Generic Objective Function-Optimizer Pairing: Through template metaprogramming, arbitrary user-defined objective functions and optimizers can be paired, as long as API contracts are respected. This facilitates rapid experimentation and prototyping.
  • Compile-Time Interface Checking: Techniques such as SFINAE and static_assert guarantee that required methods (e.g., Gradient()) are implemented at compile time rather than discovered missing at runtime, which reduces debugging complexity and incurs no performance cost (a simplified sketch of the idiom follows this list).
  • Separable and Nonseparable Support: The dual Evaluate() interface (with/without batching parameters) bridges algorithms expecting either separable or nonseparable loss formulations.
  • Sparse and Dense Gradients: Templates allow optimizers to accept both dense (e.g., arma::mat) and sparse (e.g., arma::sp_mat) gradients, extending applicability to various memory and speed-sensitive contexts.
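
The checking idiom can be illustrated as follows; this is a simplified sketch using the standard C++17 detection pattern, not mlpack's actual traits machinery:

#include <armadillo>
#include <type_traits>
#include <utility>

// Detection idiom: HasGradient<T> is true when T exposes a method callable
// as Gradient(const arma::mat&, arma::mat&).
template<typename T, typename = void>
struct HasGradient : std::false_type { };

template<typename T>
struct HasGradient<T, std::void_t<decltype(
    std::declval<T&>().Gradient(std::declval<const arma::mat&>(),
                                std::declval<arma::mat&>()))>>
    : std::true_type { };

template<typename FunctionType>
double OptimizeWithGradient(FunctionType& function, arma::mat& parameters)
{
  // Fails at compile time, with a readable message, if the user's type does
  // not implement the required Gradient() method.
  static_assert(HasGradient<FunctionType>::value,
      "FunctionType must implement Gradient(const arma::mat&, arma::mat&).");

  arma::mat gradient;
  function.Gradient(parameters, gradient);
  // ... a real optimizer would iterate: evaluate, step, repeat ...
  return 0.0;
}

A type that lacks a conforming Gradient() method now produces a single readable diagnostic at the static_assert, instead of a long template instantiation error deep inside the optimizer.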

DeepSpeed-FastGen

  • Dynamic SplitFuse: The method dynamically splits long prompts into smaller chunks across multiple forward passes, while short prompts are fused together to reach a target token budget (see the sketch below). Only the final pass of a prompt performs generation, minimizing outlier latency for long inputs and keeping workloads optimally sized for high-throughput operation.
  • Scheduling Insight: The framework exploits the concavity of latency as a function of token length. The mathematical property—

0 \geq \lim_{h \to 0} \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} \implies 2f(x) \geq f(x+h) + f(x-h)

—justifies even splitting for prompt processing, underpinning the SplitFuse strategy.
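
The pass-composition logic can be sketched as follows; the token budget, Request structure, and greedy chunking policy here are illustrative assumptions rather than the actual DeepSpeed-FastGen scheduler:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical request state: how many prompt tokens still await a forward
// pass (0 once the prompt has been fully processed).
struct Request
{
  std::size_t remainingPromptTokens;
};

// Compose one forward pass of at most `tokenBudget` tokens: long prompts are
// split into chunks across passes, while short prompts and decode steps are
// fused together so every pass runs near the fixed, throughput-optimal size.
std::vector<std::size_t> ComposeForwardPass(std::vector<Request>& requests,
                                            const std::size_t tokenBudget)
{
  std::vector<std::size_t> tokensScheduled(requests.size(), 0);
  std::size_t used = 0;
  for (std::size_t i = 0; i < requests.size() && used < tokenBudget; ++i)
  {
    if (requests[i].remainingPromptTokens > 0)
    {
      // Prompt phase: take a chunk, splitting if the prompt is too long.
      const std::size_t chunk = std::min(requests[i].remainingPromptTokens,
                                         tokenBudget - used);
      requests[i].remainingPromptTokens -= chunk;
      tokensScheduled[i] = chunk;
      used += chunk;
    }
    else
    {
      // Decode phase: one generated token per sequence per pass.
      tokensScheduled[i] = 1;
      used += 1;
    }
  }
  return tokensScheduled;
}

Because per-pass latency is concave in the token count, filling every pass up to the same fixed budget, with long prompts split into equal chunks, is never worse than running unevenly sized passes.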

3. Implementation Details

mlpack FastGen

Objective functions require at minimum the implementation of an Evaluate() method:

double Evaluate(const arma::mat& parameters); // Nonseparable
double Evaluate(const arma::mat& parameters, size_t start, size_t batchSize); // Separable

For differentiable objectives:

template<typename GradType>
void Gradient(const arma::mat& parameters, GradType& gradient);

Partial derivatives, sparse gradients, and batch-oriented processing are supported by overloading the corresponding methods. An optimizer typically exposes:

template<typename FunctionType>
double Optimize(FunctionType& function, arma::mat& parameters);

A prototypical optimizer loop involves parameter initialization, batch-wise evaluation and gradient computation, parameter updates, and periodic reshuffling if the objective is separable.
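
Sketched in code (a simplified SGD-style loop; the batched Gradient() and Shuffle() signatures follow the batching convention above but are assumptions of this sketch, not verbatim mlpack code):

#include <algorithm>
#include <armadillo>
#include <cstddef>

// Schematic mini-batch optimizer over a separable FunctionType exposing
// Evaluate(params, start, batchSize), a batched Gradient(), NumFunctions(),
// and Shuffle().
template<typename SeparableFunctionType>
double Optimize(SeparableFunctionType& function,
                arma::mat& parameters,
                const std::size_t batchSize = 32,
                const std::size_t maxEpochs = 100,
                const double stepSize = 0.01)
{
  double objective = 0.0;
  for (std::size_t epoch = 0; epoch < maxEpochs; ++epoch)
  {
    objective = 0.0;
    for (std::size_t start = 0; start < function.NumFunctions();
         start += batchSize)
    {
      const std::size_t effectiveBatchSize =
          std::min(batchSize, function.NumFunctions() - start);

      // Batch-wise evaluation and gradient computation.
      arma::mat gradient;
      objective += function.Evaluate(parameters, start, effectiveBatchSize);
      function.Gradient(parameters, start, gradient, effectiveBatchSize);

      // Parameter update (plain gradient descent step).
      parameters -= stepSize * gradient;
    }

    // Periodic reshuffling of the separable objective's terms.
    function.Shuffle();
  }
  return objective;
}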

DeepSpeed-FastGen

Deployment supports two principal modes:

  • Non-persistent pipeline: Quick, interactive usage for session-based or exploratory tasks—model instantiation is handled within the Python script’s lifetime.
  • Persistent deployment: Suited to production, with a lightweight gRPC server and built-in load balancing for scalable, long-running services. Persistent serving accommodates concurrency and multiple clients.

The code base is accessible via GitHub, supporting installation through standard Python tooling. FastGen integrates seamlessly with HuggingFace model families such as LLaMA, LLaMA-2, Mistral, and Facebook OPT.

4. Performance and Comparative Evaluation

mlpack FastGen

Benchmarks demonstrate performance on par with hand-optimized implementations, with negligible runtime overhead due to compile-time code generation. The framework is compared against TensorFlow, Caffe, scikit-learn, SciPy, and MATLAB, supporting a broader spectrum of optimizers (gradient-based and population-based) and objective function types. Static compile-time checks further reduce runtime error incidence.

DeepSpeed-FastGen

Key empirical claims, all relative to vLLM, include:

  • Up to 2.3x higher effective throughput.
  • Up to 2x lower average latency.
  • Up to 3.7x lower (token-level) tail latency.

Benchmarking adopts two main strategies:

  • Throughput-latency measurement: Latency and throughput are measured across varying client concurrency and request sizes, for Llama-2 7B/13B/70B models on hardware platforms including A100, H100, and A6000 GPUs.
  • Effective throughput: End-to-end prompt and generation Service Level Agreements (SLAs) define real-time usability from the first token through sustained streaming; one way to formalize the metric appears below. Load-balancing experiments using multiple replicas (e.g., 16x) confirm near-linear scalability.
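
One way to formalize this SLA-based metric (a paraphrase for concreteness; the thresholds and exact definition used in the paper's experiments may differ):

\text{EffectiveThroughput} = \frac{\left| \{\, r : T_{\mathrm{first}}(r) \le \tau_{\mathrm{first}} \ \wedge\ T_{\mathrm{token}}(r) \le \tau_{\mathrm{token}} \,\} \right|}{T_{\mathrm{total}}}

where T_{\mathrm{first}}(r) is the time to first token for request r, T_{\mathrm{token}}(r) is its average per-token generation latency, \tau_{\mathrm{first}} and \tau_{\mathrm{token}} are the SLA thresholds, and T_{\mathrm{total}} is the measurement window.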

5. Applications and Deployment Scenarios

mlpack FastGen

The framework is applicable across a spectrum of optimization tasks in machine learning, including:

  • Supervised learning (e.g., logistic regression, neural networks).
  • Unsupervised methods (e.g., manifold and metric learning).
  • Matrix completion, reinforcement learning, sparse modeling (autoencoders, SVMs).

Its modular design makes it straightforward to introduce new function types and constraints by implementing a small number of API methods.

DeepSpeed-FastGen

The serving system is designed for use cases demanding high-throughput, low-latency text generation from LLMs. Applications encompass:

  • Interactive chat and real-time AI assistants.
  • Batch or streaming inference for large-scale document processing.
  • Scalable deployment for LLM-powered web services.

Robust deployment flexibility enables researchers and practitioners to match resource footprint and latency guarantees to application requirements.

6. Ongoing Development and Community Engagement

mlpack FastGen

The generic design is conducive to rapid extension by the research community, especially as new optimization paradigms or function forms emerge. User-written code integrates seamlessly as long as it complies with the minimal interface requirements.

DeepSpeed-FastGen

Development is ongoing, with planned enhancements in:

  • Improved performance via kernel and scheduling optimizations.
  • Additional model family support beyond current HuggingFace architectures.
  • Broader hardware backend compatibility.
  • Expanded benchmarking for external validation and transparency.

The open-source repository enables active community participation, with contributions, issue reports, and feedback encouraged.

7. Mathematical Formulation and Optimization Criteria

Both frameworks are grounded in classic continuous optimization, formalized as:

\operatorname{argmin}_x f(x)

For separable objectives:

f(x) = \sum_{i} f_i(x)

These formulations guide implementation, particularly the handling of mini-batches and distributed computation in both systems. In the case of DeepSpeed-FastGen, concavity properties of latency functions inform the scheduling and batching mechanisms that yield observed throughput and latency gains.
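
In the separable case, mini-batch processing rests on the standard unbiased estimators over a uniformly sampled batch B \subseteq \{1, \dots, n\} (a textbook construction stated here for completeness, not quoted from either paper):

f(x) \approx \frac{n}{|B|} \sum_{i \in B} f_i(x), \qquad \nabla f(x) \approx \frac{n}{|B|} \sum_{i \in B} \nabla f_i(x)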


In summary, the FastGen frameworks embodied by mlpack’s C++ infrastructure (1711.06581) and DeepSpeed-FastGen’s LLM serving system (2401.08671) exemplify generically extensible, high-performance computational design for modern machine learning applications. Each achieves its respective goals through careful architectural choices, well-defined interfaces, and a consistent focus on both runtime efficiency and deployment flexibility.

References (2)

  1. R. R. Curtin, S. Bhardwaj, M. Edel, and Y. Mentekidis. A generic and fast C++ optimization framework. arXiv:1711.06581.
  2. C. Holmes et al. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671.