DeepOBS Suite

Updated 16 May 2026

DeepOBS is an open-source Python suite that standardizes benchmarking protocols for stochastic optimizers, enabling reproducible comparison across diverse deep learning tasks.
It automates experiment orchestration, data management, and output generation, including publication-ready LaTeX figures and tables.
Supporting both TensorFlow and PyTorch, DeepOBS integrates hyperparameter grid searches and precomputed baselines for optimizers like SGD, Momentum, and Adam.

DeepOBS is an open-source Python suite designed to provide a standardized, reproducible protocol for benchmarking stochastic optimization algorithms in deep learning. It addresses three challenges inherent in optimizer benchmarking: generalization performance, optimizer tunability, and the stochasticity induced by mini-batch sampling and random initialization. DeepOBS automates most benchmarking steps, supplies a wide and extensible range of test problems (from toy quartics to large-scale residual networks), and provides realistic, precomputed baselines for prominent optimizers. It is implemented in TensorFlow with support for PyTorch and integrates output backends producing publication-ready LaTeX figures and tables (Schneider et al., 2019).

1. Motivation and Benchmarking Challenges

Deep learning optimizer benchmarking involves three core challenges: generalization, stochasticity, and tunability. Generalization concerns the fact that test-set performance, not training loss, is what matters, since practitioners ultimately care about the model's out-of-sample accuracy. Stochasticity arises from mini-batch selection and random weight initialization, requiring statistical treatment of run-to-run variability. Tunability captures the fact that optimizers expose significantly different hyperparameter spaces, complicating fair and quantitative comparison. DeepOBS was developed to provide:

Systematic protocols for reproducible, fair benchmarking of stochastic optimizers, including multiple runs with independent random seeds and reporting mean ± standard deviation for all core performance metrics.
Automated routines for hyperparameter search on fixed grids.
Built-in measurement of optimizer speed relative to a stochastic gradient descent (SGD) baseline, reporting both wall-clock and effective iteration costs.
A turn-key package for experiment orchestration, data/model management, experiment logging, and output generation (Schneider et al., 2019).

2. Evaluation Protocol and Metrics

In DeepOBS, each benchmark problem is formalized by a parameter vector $\theta \in \mathbb{R}^d$ and a stochastic loss $L(\theta) = \mathbb{E}_x[\ell(\theta; x)]$, with $\ell(\theta; x)$ as the pointwise loss function. The evaluation protocol reports four curves for each optimizer/problem pair:

Training loss: $L(\theta_t)$ computed on a dedicated "train-eval" set.
Test loss: $L(\theta_t)$ on the held-out test set.
Training accuracy: $\text{acc}_\text{train}(\theta_t)$.
Test accuracy: $\text{acc}_\text{test}(\theta_t)$.

DeepOBS runs $R = 10$ independent repeats (default) with different random seeds per optimizer/problem, reporting both mean and standard deviation: $\mu(t) = \frac{1}{R} \sum_{i=1}^{R} y_i(t),\quad \sigma(t) = \sqrt{\frac{1}{R-1} \sum_{i=1}^{R} (y_i(t)-\mu(t))^2}$ for $y \in \{\text{train\_loss}, \text{test\_loss}, \text{train\_acc}, \text{test\_acc}\}$.
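The aggregation above can be sketched in a few lines of standard-library Python. This is an illustrative helper, not part of the DeepOBS API: the function name `aggregate_runs` and the example curves are assumptions made for the sketch.

```python
import statistics

def aggregate_runs(runs):
    """Return per-step mean and sample standard deviation across repeats.

    `runs` is a list of R equally long metric curves, one per random seed.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs for a sample standard deviation")
    mu, sigma = [], []
    for values in zip(*runs):  # regroup values by evaluation step t
        mu.append(statistics.mean(values))
        sigma.append(statistics.stdev(values))  # uses the 1/(R-1) normalization
    return mu, sigma

# Three hypothetical test-loss curves (R = 3) evaluated at two checkpoints.
curves = [[1.0, 0.5], [1.2, 0.7], [0.8, 0.6]]
mu, sigma = aggregate_runs(curves)
```

`statistics.stdev` is the sample standard deviation, so it matches the $\frac{1}{R-1}$ factor in the formula; `statistics.pstdev` would divide by $R$ instead.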

Speed is quantified as a wall-clock ratio $\tau$ of the optimizer's per-iteration cost to that of SGD (default: measured on MLP/MNIST), with performance also reported as a function of "effective iterations" ($\text{iterations} \cdot \tau$). Hyperparameter tuning employs a log-uniform grid search over the learning rate with the same budget for all optimizers, and performance spread is reported to characterize tunability (Schneider et al., 2019).
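As a rough sketch of these two conventions, the snippet below builds a learning-rate grid that is evenly spaced on a log scale and rescales an iteration axis by a cost ratio $\tau$ relative to SGD. The helper names, the grid range, the grid size, and the value of $\tau$ are all assumptions chosen for illustration; they are not taken from DeepOBS.

```python
import math

def log_grid(low, high, num):
    """Return `num` points spaced evenly on a log10 scale from low to high."""
    if num < 2:
        raise ValueError("num must be at least 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]

def effective_iterations(iterations, tau):
    """Rescale raw iteration counts by the per-iteration cost ratio tau."""
    return [it * tau for it in iterations]

# Hypothetical range and budget, chosen only for this example.
learning_rates = log_grid(1e-4, 1e0, 5)
# An optimizer assumed to cost 1.5x as much as SGD per step.
x_axis = effective_iterations([0, 100, 200], tau=1.5)
```

Plotting against `x_axis` rather than raw iterations penalizes an optimizer whose individual steps are more expensive than those of SGD.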

3. Package Structure and Workflow

The DeepOBS suite is organized into modular components: data/ (loading and preprocessing), models/ (architectures and loss definitions), runners/ (training and metrics logging), baselines/ (precomputed SGD, Momentum, Adam results), and visualize/ (generation of pgfplots and LaTeX summary tables).

Installation is available via PyPI or directly from GitHub. A typical usage pattern involves selecting a test problem and optimizer, specifying hyperparameters (e.g., a learning rate obtained from grid search), and running $R$ repeats. Results are stored as metrics.npz in subdirectories for each run, which can be post-processed by the visualization toolkit to generate LaTeX-compatible plots and tables (Schneider et al., 2019).
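The repeated-runs pattern can be illustrated without the DeepOBS package. The sketch below is a stand-in, not the DeepOBS runner API: it minimizes a toy loss $\frac{1}{2}\theta^2$ with noisy gradients, and the function `run_sgd`, the noise level, and the step count are assumptions made for the example. What it shows is that a single seed controls both the initialization and the gradient noise, so each repeat is independent yet reproducible.

```python
import random

def run_sgd(seed, learning_rate, num_steps=200):
    """Minimize the toy loss 0.5 * theta**2 with noisy gradients.

    Not a DeepOBS test problem; a stand-in to show the seeding pattern.
    """
    rng = random.Random(seed)        # one seed drives init and gradient noise
    theta = rng.uniform(-1.0, 1.0)   # random initialization
    for _ in range(num_steps):
        grad = theta + rng.gauss(0.0, 0.1)  # true gradient plus sampling noise
        theta -= learning_rate * grad
    return 0.5 * theta ** 2          # final noise-free loss

R = 10  # number of independent repeats
final_losses = [run_sgd(seed, learning_rate=0.1) for seed in range(R)]
```

Calling `run_sgd` again with the same seed returns the same loss, which is what makes a reported mean ± standard deviation reproducible.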

4. Benchmark Problems and Precomputed Baselines

DeepOBS ships with 20+ test problems, with a core subset widely used for optimizer design iteration and realistic benchmarking. The "small set" (P1–P4) allows rapid prototyping; the "large set" (P5–P8) provides more realistic or computationally intensive tasks.

Problem	Description
P1	2D nonconvex "Quartic" function
P2	MNIST VAE (fully connected)
P3	Fashion-MNIST CNN (small convnet)
P4	CIFAR-10 small CNN
P5	CIFAR-100 small ResNet
P6	ImageNet ResNet-50
P7	Character-level LSTM on PTB
P8	GAN on CelebA

Baseline results (mean ± σ) are precomputed for SGD, Momentum, and Adam. These are supplied in the baselines folder and loaded automatically during visualization. For example, on the small test set:

Optimizer	P1 Test Loss	P2 Test Loss	P3 Test Acc	P4 Test Acc
SGD	87.11 ± 0.04	27.73 ± 0.06	63.3% ± 0.2	91.44% ± 0.09
Momentum	87.02 ± 0.05	52.92 ± 0.00	72.85% ± 0.15	92.54% ± 0.06
Adam	87.24 ± 0.03	52.92 ± 0.00	71.63% ± 0.10	92.26% ± 0.05

(Schneider et al., 2019)

5. Output Generation and Reporting

DeepOBS provides two ready-to-use output backends:

pgfplots/LaTeX for rendering learning curves.
LaTeX tables for summarizing final performance, speed, and tunability.

The LaTeX output is designed for direct inclusion in academic publications, containing commands such as \addplot for pgfplots and tables with mean and standard deviation entries; small_baselines.tex is one such generated table (Schneider et al., 2019).
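To make the table backend concrete, here is a minimal sketch of how mean ± standard deviation entries might be rendered as a LaTeX tabular. The function `latex_table` and its layout are hypothetical and do not reproduce the files DeepOBS actually generates.

```python
def latex_table(problem_names, results):
    """Build a LaTeX tabular with one row per optimizer.

    `results` maps optimizer name -> list of (mean, std) pairs, one per problem.
    """
    header = " & ".join(["Optimizer"] + problem_names) + r" \\"
    lines = [r"\begin{tabular}{l" + "c" * len(problem_names) + "}", header, r"\hline"]
    for optimizer, stats in results.items():
        cells = [f"${mean:.2f} \\pm {std:.2f}$" for mean, std in stats]
        lines.append(" & ".join([optimizer] + cells) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)

# Values copied from the SGD and Momentum rows of the small-set table above.
table = latex_table(
    ["P1 Test Loss", "P2 Test Loss"],
    {"SGD": [(87.11, 0.04), (27.73, 0.06)],
     "Momentum": [(87.02, 0.05), (52.92, 0.00)]},
)
print(table)
```

The returned string can be written to a .tex file and pulled into a manuscript with \input.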

6. Extensibility and Customization

DeepOBS is designed for extensibility:

To register a new optimizer, a subclass of tf.train.Optimizer or the PyTorch equivalent must be written and registered in deepobs/optimizers.py.
To add a new problem, a model file defining the architecture and loss should be placed in deepobs/models/, and referenced in deepobs/testproblems.py.
Hyperparameter grid search configurations can be customized via YAML files within deepobs/config/.

These features ensure that new optimizers and realistic tasks can be rapidly integrated and consistently benchmarked alongside established baselines (Schneider et al., 2019).
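The registration step can be pictured as a name-to-class mapping that a runner consults. The sketch below shows that general pattern in plain Python; the `OPTIMIZERS` registry, the `register` decorator, and the toy `SignSGD` class are all hypothetical and are not the contents of deepobs/optimizers.py.

```python
OPTIMIZERS = {}  # hypothetical registry: name -> optimizer class

def register(name):
    """Class decorator that records an optimizer under `name`."""
    def decorator(cls):
        if name in OPTIMIZERS:
            raise KeyError(f"optimizer {name!r} is already registered")
        OPTIMIZERS[name] = cls
        return cls
    return decorator

@register("sign_sgd")
class SignSGD:
    """Toy optimizer: steps by the sign of each gradient entry."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def step(self, params, grads):
        sign = lambda g: (g > 0) - (g < 0)  # -1, 0, or +1
        return [p - self.learning_rate * sign(g) for p, g in zip(params, grads)]

# Look the optimizer up by name, as a benchmark runner might.
optimizer = OPTIMIZERS["sign_sgd"](learning_rate=0.5)
new_params = optimizer.step([1.0, -2.0, 3.0], [0.3, -4.0, 0.0])
```

A real integration would subclass the framework's optimizer base class instead of this toy interface; the registry idea stays the same.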

7. Recommended Practices and Protocol Compliance

Best practices established by DeepOBS for optimizer comparison include:

Running at least 5–10 independent seeds and reporting metrics as mean ± standard deviation.
Reporting all four primary curves: training and test loss, training and test accuracy.
Tuning learning rates via log-uniform grid searches within a fixed budget shared across optimizers.
Benchmarking per-iteration computational cost (τ) relative to SGD, and plotting performance as a function of effective iterations (iterations·τ).
Avoiding cherry-picking by performing the analysis within the fixed DeepOBS test suite.
Reporting hyperparameter sensitivity (tunability), including sensitivity plots.

Adherence to these guidelines supports the reliability, comparability, and reproducibility of results, addressing significant gaps in prior stochastic optimizer benchmarking literature (Schneider et al., 2019).
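One simple way to report tunability numerically, alongside sensitivity plots, is to summarize how the seed-averaged final loss varies across the learning-rate grid. The sketch below is an assumption about how such a summary could look; the function `tunability_summary` and the loss values are made up for illustration and are not DeepOBS output.

```python
import statistics

def tunability_summary(grid_results):
    """Summarize how final performance varies across a learning-rate grid.

    `grid_results` maps learning rate -> list of final test losses (one per seed).
    """
    mean_loss = {lr: statistics.mean(losses) for lr, losses in grid_results.items()}
    best_lr = min(mean_loss, key=mean_loss.get)
    return {
        "best_lr": best_lr,
        "best_loss": mean_loss[best_lr],
        # gap between the worst and best grid point; a large gap suggests
        # the optimizer is sensitive to this hyperparameter
        "spread": max(mean_loss.values()) - min(mean_loss.values()),
    }

# Made-up final losses for three learning rates and two seeds each.
summary = tunability_summary({
    1e-3: [0.90, 0.94],
    1e-2: [0.40, 0.44],
    1e-1: [0.70, 0.78],
})
```

Here `summary["best_lr"]` is the grid point with the lowest mean loss, and `summary["spread"]` gives a single number to compare across optimizers tuned with the same budget.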


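References

Schneider, F., Balles, L., and Hennig, P. (2019). DeepOBS: A Deep Learning Optimizer Benchmark Suite. International Conference on Learning Representations (ICLR).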
