DeepOBS Benchmark Suite

Updated 22 June 2026


DeepOBS is an open-source Python package that implements a complete, reproducible, and extensible benchmark suite for stochastic optimization in deep learning. It automates the assessment and comparison of deep learning optimizers, addresses major reproducibility and fairness challenges, and provides a collection of realistic, ready-to-use test problems, pre-tuned baseline results, fair evaluation protocols, and publication-ready output. DeepOBS supports major frameworks, such as TensorFlow and PyTorch, and is designed to accelerate empirical optimizer research by establishing common standards for rigorous benchmarking (Schneider et al., 2019).

1. Motivation and Objectives

Deep learning optimizer research has historically lacked a standard, reproducible protocol for fair evaluation. Typical challenges include the stochastic nature of deep learning (random seeds, minibatches), the discrepancy between training and generalization performance, the need to properly tune baselines, and risks of cherry-picking results. DeepOBS was created to automate best practices: ensure unbiased assessment, supply a diverse suite of test problems, provide carefully tuned baselines, and generate publication-ready outputs. The primary objectives are to enable quantitative, reproducible evaluation of stochastic optimizers and support transparent scientific comparison (Schneider et al., 2019).

2. System Architecture

DeepOBS is organized into six principal layers, each encapsulating a key aspect of the benchmarking workflow:

Layer	Functionality	Example Components
Data Loading	Download, preprocess, batch, and augment datasets	MNIST, Fashion-MNIST, CIFAR-10, Tolstoi (character-level text)
Models	Define "test problems" (dataset + architecture)	MLPs, CNNs (VGG, Wide ResNet), VAE, RNNs, toy problems
Runners	Orchestrate training/evaluation, manage seeds	SingleRunBenchmark, multiple-seed execution
Baselines	Precomputed baseline results for optimizers	SGD, SGD+Momentum, Adam
Runtime Est.	Measure optimizer runtime relative to an SGD reference	Wall-clock time ratio, per-epoch metrics
Visualization	Generate .tex output for learning curves/tables	pgfplots code, ready for publication

Test problems span four domains: image classification (e.g., a fully connected network on MNIST, a CNN on Fashion-MNIST, VGG on CIFAR-10, Wide ResNets on SVHN and CIFAR-100, Inception-v3 on ImageNet), generative modeling (VAE on MNIST), natural language modeling (a character-level RNN on Tolstoy's War and Peace), and synthetic or toy problems (e.g., 2-D nonconvex functions and the 100-dimensional "Quadratic Deep" problem) (Schneider et al., 2019).

3. Evaluation Protocol

Stochasticity and Seeds

Each optimizer–problem pair is run $R \geq 5$ times with different random seeds. For each run, four performance curves are tracked: training loss $L(\theta)$, training accuracy $a_{\text{train}}$, test loss $L(\theta)_{\text{test}}$, and test accuracy $a_{\text{test}}$.
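The multi-seed protocol can be illustrated with a minimal sketch. The function run_once below is a hypothetical stand-in for a real training run (it is not part of DeepOBS) and produces synthetic curves; the point is only that each run owns its random number generator, so a given seed reproduces the same four curves.

```python
import random


def run_once(seed, num_epochs=5):
    """Hypothetical stand-in for one training run; returns the four tracked curves."""
    rng = random.Random(seed)  # private RNG, so runs cannot interfere with each other
    curves = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}
    for epoch in range(1, num_epochs + 1):
        noise = rng.uniform(-0.01, 0.01)  # seed-dependent synthetic noise
        curves["train_loss"].append(1.0 / epoch + noise)
        curves["train_acc"].append(1.0 - 0.5 / epoch)
        curves["test_loss"].append(1.2 / epoch + noise)
        curves["test_acc"].append(0.95 - 0.5 / epoch)
    return curves


def run_with_seeds(seeds):
    """Run the same optimizer-problem pair once per seed and collect all curves."""
    return {seed: run_once(seed) for seed in seeds}


results = run_with_seeds(range(42, 47))  # R = 5 seeds (the seed values are arbitrary)
```

Keying the results by seed makes it straightforward to re-run or audit any single run later.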

Aggregation and Metrics

Aggregate statistics are computed as follows:

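Mean test error, where $e_{\text{test}}^{(r)}$ denotes the test error of run $r$ (for classification, $1 - a_{\text{test}}^{(r)}$):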

$\bar{e}_{\text{test}} = \frac{1}{R} \sum_{r=1}^R e_{\text{test}}^{(r)}$

Standard deviation:

$\sigma_e = \sqrt{\frac{1}{R-1} \sum_{r=1}^R \left(e_{\text{test}}^{(r)} - \bar{e}_{\text{test}}\right)^2}$
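The two formulas above translate directly into code. The following is a minimal sketch in plain Python; the function name mean_and_std and the sample error values are illustrative assumptions, not DeepOBS identifiers or measured results.

```python
import math


def mean_and_std(errors):
    """Return the mean and the sample standard deviation (R - 1 denominator)."""
    R = len(errors)
    if R < 2:
        raise ValueError("at least two runs are needed for a sample standard deviation")
    mean = sum(errors) / R
    variance = sum((e - mean) ** 2 for e in errors) / (R - 1)
    return mean, math.sqrt(variance)


# Illustrative test errors from R = 5 hypothetical runs.
mean_error, std_error = mean_and_std([0.10, 0.12, 0.11, 0.09, 0.13])
```

The R - 1 denominator matches the formula for $\sigma_e$ given above, which is the unbiased sample variance rather than the population variance.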

Additional metrics: speed (epochs or gradient evaluations needed to reach a given performance threshold), runtime (wall-clock and relative to SGD), and tunability (spread of the best hyperparameters, visualized as error vs. $\log_{10}\alpha$).
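As an illustration of the threshold-based speed metric, the sketch below returns the first epoch at which a test accuracy curve reaches a target value. The function name and the convention of returning None when the threshold is never reached are assumptions made for this example.

```python
def epochs_to_threshold(test_acc_curve, threshold):
    """Return the first (1-indexed) epoch whose accuracy reaches the threshold, else None."""
    for epoch, accuracy in enumerate(test_acc_curve, start=1):
        if accuracy >= threshold:
            return epoch
    return None  # the threshold was never reached within the recorded epochs


# Usage with an illustrative accuracy curve.
first_epoch = epochs_to_threshold([0.50, 0.70, 0.90, 0.92], threshold=0.90)
```

Returning None instead of a large sentinel value keeps runs that never converge from being silently averaged into the speed statistic.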

Hyperparameter Tuning

By default, DeepOBS sweeps the learning rate $\alpha$ over a logarithmic grid: $\alpha \in [10^{-5}, 10^{2}]$, using 36 grid points. The optimal $\alpha$ (maximal test accuracy) is selected and reported, typically over $R$ runs. Tunability visualization is produced by plotting final test error as a function of $\log_{10}\alpha$ (Schneider et al., 2019).
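A grid of 36 points between $10^{-5}$ and $10^{2}$ spaced evenly in $\log_{10}\alpha$ corresponds to steps of 0.2 decades. The sketch below builds such a grid and picks the learning rate with the highest test accuracy; the function names are illustrative assumptions and not part of the DeepOBS API.

```python
def log_grid(low_exp=-5.0, high_exp=2.0, num_points=36):
    """Learning rates spaced evenly in log10 between 10**low_exp and 10**high_exp."""
    step = (high_exp - low_exp) / (num_points - 1)  # 0.2 decades for the defaults
    return [10.0 ** (low_exp + i * step) for i in range(num_points)]


def select_best(accuracy_by_alpha):
    """Return the learning rate whose run achieved the highest test accuracy."""
    return max(accuracy_by_alpha, key=accuracy_by_alpha.get)


grid = log_grid()
# Illustrative accuracies for two grid points; real values would come from training runs.
best_alpha = select_best({0.1: 0.80, 0.01: 0.90})
```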

4. Workflow and User Interfaces

DeepOBS supports multiple paradigms for usage:

Python API (TensorFlow example):

[Code listing not preserved in the source.]

Python API (PyTorch example):

[Code listing not preserved in the source.]

Command-Line Interface:

deepobs-run --problem CIFAR10SmallCNN --opt SGD --lr 0.1 --runs 10

Batch Configuration:

Complex experiments (multiple optimizers and problems) may be specified via JSON or YAML configuration files.
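One way such a batch file could be expanded into individual runs is sketched below. The JSON schema (the keys problems, optimizers, and seeds) and the problem and optimizer names are hypothetical, chosen only for illustration; the actual DeepOBS configuration format may differ.

```python
import itertools
import json

# Hypothetical configuration; the keys and names are assumptions for this sketch.
CONFIG_TEXT = """
{
  "problems": ["mnist_mlp", "cifar10_cnn"],
  "optimizers": ["sgd", "adam"],
  "seeds": [42, 43, 44]
}
"""


def expand_config(text):
    """Expand a batch configuration into one run specification per combination."""
    config = json.loads(text)
    combinations = itertools.product(
        config["problems"], config["optimizers"], config["seeds"]
    )
    return [
        {"problem": problem, "optimizer": optimizer, "seed": seed}
        for problem, optimizer, seed in combinations
    ]


runs = expand_config(CONFIG_TEXT)  # 2 problems x 2 optimizers x 3 seeds
```

Expanding the full Cartesian product guarantees that every optimizer sees the same problems and the same seeds, which is what a fair comparison requires.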

5. Output Generation and Publication Integration

DeepOBS's output back-ends generate .tex files containing pgfplots code for learning curves (loss/accuracy vs. epochs) and tabulated summary statistics. Example .tex output for comparing optimizers on CIFAR-10:

[pgfplots listing not preserved in the source.]

These outputs are intended for direct integration into academic manuscripts (Schneider et al., 2019).
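To give a sense of what pgfplots code for a learning curve looks like, the sketch below assembles a small .tex fragment from per-epoch values. It is not DeepOBS's actual output back-end; the function names, axis labels, and number formatting are assumptions made for this example.

```python
def curve_to_pgfplots(label, values):
    """Render one learning curve as a pgfplots \\addplot with a legend entry."""
    coords = " ".join(
        f"({epoch},{value:.4f})" for epoch, value in enumerate(values, start=1)
    )
    return f"\\addplot coordinates {{{coords}}};\n\\addlegendentry{{{label}}}"


def figure(curves, xlabel="Epoch", ylabel="Test accuracy"):
    """Wrap several curves (label -> list of values) in a tikzpicture/axis pair."""
    body = "\n".join(curve_to_pgfplots(label, vals) for label, vals in curves.items())
    return (
        "\\begin{tikzpicture}\n"
        f"\\begin{{axis}}[xlabel={{{xlabel}}}, ylabel={{{ylabel}}}]\n"
        f"{body}\n"
        "\\end{axis}\n"
        "\\end{tikzpicture}\n"
    )


# Illustrative values only; real curves would come from benchmark runs.
tex = figure({"SGD": [0.50, 0.75], "Adam": [0.60, 0.80]})
```

The resulting string can be written to a .tex file and included in a manuscript that loads the pgfplots package.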

6. Baseline Results and Comparative Insights

DeepOBS offers precomputed, hyperparameter-tuned baseline results for commonly used optimizers (SGD, SGD+Momentum, Adam) across both "small" and "large" test problems. Key empirical findings:

No single optimizer performs best on every problem.
Adam typically requires less tuning (its optimal $\alpha$ falls within a comparatively narrow interval); however, on CIFAR-100, Momentum outperforms Adam.
Ranking by training loss does not reliably predict generalization ranking.
Trade-offs between speed (epochs or wall-clock time) and generalization performance can be systematically quantified using DeepOBS protocols (Schneider et al., 2019).

7. Extensibility and Fair Evaluation Standards

Researchers may extend DeepOBS with custom optimizers or new test problems:

Adding an Optimizer:

Subclass the TensorFlow/PyTorch API or provide a custom update function, register defaults, and (optionally) recompute baselines or submit results.
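The "custom update function" route can be pictured with a framework-free sketch. The momentum update below operates on plain Python lists and is checked on a toy quadratic; the function name and signature are illustrative assumptions and do not reflect the interface DeepOBS expects.

```python
def momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One heavy-ball update: v <- momentum * v - lr * g, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy check: minimize f(x) = sum(x_i ** 2), whose gradient is 2 * x.
params = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = [2.0 * p for p in params]
    params, velocity = momentum_step(params, grads, velocity)
```

Returning new lists instead of mutating the inputs keeps the update easy to test in isolation before it is wrapped in a TensorFlow or PyTorch optimizer class.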

Adding a Test Problem:

Implement a DataSet class and a Model class defining forward pass and loss; register in the testproblems module.

When performing comparative studies, DeepOBS recommends:

Uniform random seed protocol and equal number of runs.
Consistent hyperparameter search budgets (e.g., identical grid size or Bayesian optimization limits).
Runtime always measured relative to SGD on a reference task.
Reporting all four canonical learning curves and aggregated summary statistics.
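The relative-runtime recommendation amounts to timing the same number of steps for the optimizer under test and for SGD, then reporting the ratio. The sketch below shows that arithmetic; the function names and the zero-argument step callables are assumptions made for this example.

```python
import time


def time_steps(step_fn, num_steps=1000):
    """Wall-clock seconds needed to call step_fn num_steps times."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return time.perf_counter() - start


def relative_runtime(optimizer_seconds, sgd_seconds):
    """Runtime of an optimizer expressed as a multiple of the SGD reference."""
    if sgd_seconds <= 0:
        raise ValueError("the SGD reference time must be positive")
    return optimizer_seconds / sgd_seconds


# Usage with illustrative timings: an optimizer taking 3.0 s where SGD takes 2.0 s.
ratio = relative_runtime(3.0, 2.0)
```

Reporting a ratio rather than absolute seconds makes results from different machines comparable, provided both timings are taken on the same hardware and the same reference task.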

A plausible implication is that adherence to these standards, coupled with fast feedback on "small" problems and scalability to "large" tasks, provides an efficient, fair platform for optimizer development and evaluation (Schneider et al., 2019).


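References

Schneider, F., Balles, L., and Hennig, P. (2019). DeepOBS: A Deep Learning Optimizer Benchmark Suite. International Conference on Learning Representations (ICLR).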
