
DeepOBS Suite

Updated 16 May 2026
  • DeepOBS is an open-source Python suite that standardizes benchmarking protocols for stochastic optimizers, enabling reproducible comparison across diverse deep learning tasks.
  • It automates experiment orchestration, data management, and output generation, including publication-ready LaTeX figures and tables.
  • Supporting both TensorFlow and PyTorch, DeepOBS integrates hyperparameter grid searches and precomputed baselines for optimizers like SGD, Momentum, and Adam.

DeepOBS is an open-source Python suite designed to provide a standardized, reproducible protocol for benchmarking stochastic optimization algorithms in deep learning. It addresses challenges inherent in optimizer benchmarking—specifically generalization performance, optimizer tunability, and stochasticity induced by mini-batch sampling and random initialization. DeepOBS automates most benchmarking steps, supplies a wide and extensible range of test problems (from toy quartics to large-scale residual networks), and provides realistic, precomputed baselines for prominent optimizers. It is implemented in TensorFlow with support for PyTorch and integrates output backends producing publication-ready LaTeX figures and tables (Schneider et al., 2019).

1. Motivation and Benchmarking Challenges

Deep learning optimizer benchmarking involves three core challenges: generalization, stochasticity, and tunability. Generalization refers to the fact that test-set performance, rather than training loss, is the relevant quantity, since practitioners ultimately care about the model's out-of-sample accuracy. Stochasticity arises from mini-batch selection and random weight initialization, requiring statistical treatment of run-to-run variability. Tunability captures the fact that optimizers expose substantially different hyperparameter spaces, complicating fair quantitative comparison. DeepOBS was developed to provide:

  • Systematic protocols for reproducible, fair benchmarking of stochastic optimizers, including multiple runs with independent random seeds and reporting mean ± standard deviation for all core performance metrics.
  • Automated routines for hyperparameter search on fixed grids.
  • Built-in measurement of optimizer speed relative to a stochastic gradient descent (SGD) baseline, reporting both wall-clock and effective iteration costs.
  • A turn-key package for experiment orchestration, data/model management, experiment logging, and output generation (Schneider et al., 2019).

2. Evaluation Protocol and Metrics

In DeepOBS, each benchmark problem is formalized by a parameter vector $\theta \in \mathbb{R}^d$ and a stochastic loss $L(\theta) = \mathbb{E}_x[\ell(\theta; x)]$, where $\ell(\theta; x)$ is the pointwise loss function. The evaluation protocol reports four curves for each optimizer/problem pair:

  • Training loss: $L(\theta_t)$ computed on a dedicated "train-eval" set.
  • Test loss: $L(\theta_t)$ on the held-out test set.
  • Training accuracy: $\text{acc}_\text{train}(\theta_t)$.
  • Test accuracy: $\text{acc}_\text{test}(\theta_t)$.

DeepOBS runs $R = 10$ independent repeats (default) with different random seeds per optimizer/problem, reporting both mean and standard deviation:

$\mu(t) = \frac{1}{R} \sum_{i=1}^{R} y_i(t), \quad \sigma(t) = \sqrt{\frac{1}{R-1} \sum_{i=1}^{R} (y_i(t) - \mu(t))^2}$

for $y \in \{\text{train\_loss}, \text{test\_loss}, \text{train\_acc}, \text{test\_acc}\}$.
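The per-step mean and sample standard deviation above can be sketched in a few lines of NumPy. This is an illustration only, not DeepOBS code: the function name `summarize_runs` and the example accuracies are assumptions made for the sketch.

```python
import numpy as np

def summarize_runs(curves):
    """Return (mu, sigma) per evaluation step.

    curves: array-like of shape (R, T), one row per random seed
    and one column per evaluation point t.
    """
    y = np.asarray(curves, dtype=float)
    mu = y.mean(axis=0)
    # ddof=1 gives the 1/(R-1) normalization used in the formula above.
    sigma = y.std(axis=0, ddof=1)
    return mu, sigma

# Hypothetical test accuracies from R = 3 seeds at T = 2 evaluation points.
runs = [[0.80, 0.90],
        [0.82, 0.92],
        [0.84, 0.94]]
mu, sigma = summarize_runs(runs)
```

Note that `ddof=1` matters for small $R$: NumPy's default (`ddof=0`) divides by $R$ and would underestimate the spread.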

Speed is quantified as a wall-clock ratio $\tau$ of the optimizer's per-iteration cost to that of SGD (default: measured on MLP/MNIST), with performance also reported as a function of "effective iterations" (iterations $\cdot\, \tau$). Hyperparameter tuning employs a log-uniform grid search (e.g., over the learning rate) with the same budget for all optimizers, and performance spread is reported to characterize tunability (Schneider et al., 2019).
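A log-spaced grid search with a fixed budget might look like the following sketch. The grid range (1e-5 to 1e2), the budget of 8 points, and the toy objective `toy_final_loss` are all assumptions chosen for illustration; in practice the objective would be a full training run returning the final test loss.

```python
import numpy as np

def log_grid(low, high, num):
    """Return `num` values spaced evenly in log10 between low and high."""
    return np.logspace(np.log10(low), np.log10(high), num)

def grid_search(final_loss, grid):
    """Evaluate every grid point; return the best value and all losses."""
    losses = np.array([final_loss(lr) for lr in grid])
    best = int(np.argmin(losses))
    return grid[best], losses

def toy_final_loss(lr):
    # Stand-in for "train with this learning rate, return final test loss".
    # Its minimum sits at lr = 1e-2 by construction.
    return (np.log10(lr) + 2.0) ** 2

grid = log_grid(1e-5, 1e2, 8)          # same budget for every optimizer
best_lr, losses = grid_search(toy_final_loss, grid)
spread = losses.max() - losses.min()   # crude indicator of tunability
```

Keeping the full `losses` array, rather than only the best value, is what allows the spread across the grid to be reported alongside the tuned result.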

3. Package Structure and Workflow

The DeepOBS suite is organized into modular components: data/ (loading and preprocessing), models/ (architectures and loss definitions), runners/ (training and metrics logging), baselines/ (precomputed SGD, Momentum, Adam results), and visualize/ (generation of pgfplots and LaTeX summary tables).

Installation is available via PyPI and directly from the GitHub repository. A typical usage pattern involves selecting a test problem and optimizer, specifying hyperparameters (e.g., a learning rate obtained from grid search), and running $R$ repeats. Results are stored as metrics.npz in subdirectories for each run, which can be post-processed by the visualization toolkit to generate LaTeX-compatible plots and tables (Schneider et al., 2019).
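The shape of this workflow (several seeded repeats, one metrics.npz per run directory, later aggregation) can be sketched without DeepOBS itself. The code below deliberately does not use the DeepOBS API: `toy_run`, `run_repeats`, and the `seed_<n>` directory naming are hypothetical stand-ins, and the "training" is noisy gradient descent on $f(x) = x^2$.

```python
import os
import tempfile
import numpy as np

def toy_run(seed, lr, num_epochs=5):
    """Stand-in for one training run: noisy gradient descent on f(x) = x^2."""
    rng = np.random.default_rng(seed)
    x = rng.normal()                          # random initialization
    losses = []
    for _ in range(num_epochs):
        grad = 2.0 * x + 0.1 * rng.normal()   # noisy gradient
        x -= lr * grad
        losses.append(x ** 2)
    return {"test_loss": np.array(losses)}

def run_repeats(out_dir, lr, num_repeats=10):
    """Run `num_repeats` seeds, writing one metrics.npz per run directory."""
    paths = []
    for seed in range(num_repeats):
        run_dir = os.path.join(out_dir, f"seed_{seed}")
        os.makedirs(run_dir, exist_ok=True)
        path = os.path.join(run_dir, "metrics.npz")
        np.savez(path, **toy_run(seed, lr))
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
paths = run_repeats(out_dir, lr=0.1, num_repeats=3)
# Post-processing: stack the per-run curves into an (R, T) array.
stacked = np.stack([np.load(p)["test_loss"] for p in paths])
```

Writing each run to its own directory keeps the repeats independent, so a failed or interrupted seed can be rerun without touching the others.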

4. Benchmark Problems and Precomputed Baselines

DeepOBS ships with 20+ test problems, with a core subset widely used for optimizer design iteration and realistic benchmarking. The "small set" (P1–P4) allows rapid prototyping; the "large set" (P5–P8) provides more realistic or computationally intensive tasks.

Problem | Description
P1 | 2D nonconvex "Quartic" function
P2 | MNIST VAE (fully connected)
P3 | Fashion-MNIST CNN (small convnet)
P4 | CIFAR-10 small CNN
P5 | CIFAR-100 small ResNet
P6 | ImageNet ResNet-50
P7 | Character-level LSTM on PTB
P8 | GAN on CelebA

Baseline results (mean ± σ) are precomputed for SGD, Momentum, and Adam. These are supplied in the baselines folder and loaded automatically during visualization. For example, on the small test set:

Optimizer | P1 Test Loss | P2 Test Loss | P3 Test Acc | P4 Test Acc
SGD | 87.11 ± 0.04 | 27.73 ± 0.06 | 63.3% ± 0.2 | 91.44% ± 0.09
Momentum | 87.02 ± 0.05 | 52.92 ± 0.00 | 72.85% ± 0.15 | 92.54% ± 0.06
Adam | 87.24 ± 0.03 | 52.92 ± 0.00 | 71.63% ± 0.10 | 92.26% ± 0.05

(Schneider et al., 2019)

5. Output Generation and Reporting

DeepOBS provides two ready-to-use output backends:

  • pgfplots/LaTeX for rendering learning curves.
  • LaTeX tables for summarizing final performance, speed, and tunability.

The LaTeX output is designed for direct inclusion in academic publications, containing commands such as \addplot for pgfplots and tables with mean and standard deviation entries; the generated table small_baselines.tex is one example (Schneider et al., 2019).
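To give a sense of what a mean ± standard deviation table looks like in LaTeX, here is a small sketch that formats such rows as a `tabular` environment. It is not the DeepOBS table generator: the helper `latex_table` and the exact layout are assumptions, and the two example rows reuse the P1 test-loss values quoted in the baseline table above.

```python
def latex_table(header, rows):
    """Build a LaTeX tabular from rows of (mean, std) pairs.

    header: list of column titles, first one for the row label.
    rows:   dict mapping a row label to a list of (mean, std) tuples.
    """
    lines = [
        "\\begin{tabular}{l" + "c" * (len(header) - 1) + "}",
        " & ".join(header) + " \\\\",
        "\\hline",
    ]
    for name, cells in rows.items():
        entries = [f"${m:.2f} \\pm {s:.2f}$" for m, s in cells]
        lines.append(" & ".join([name] + entries) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

table = latex_table(
    ["Optimizer", "P1 Test Loss"],
    {"SGD": [(87.11, 0.04)], "Adam": [(87.24, 0.03)]},
)
print(table)
```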

6. Extensibility and Customization

DeepOBS is designed for extensibility:

  • To register a new optimizer, a subclass of tf.train.Optimizer or the PyTorch equivalent must be written and registered in deepobs/optimizers.py.
  • To add a new problem, a model file defining the architecture and loss should be placed in deepobs/models/, and referenced in deepobs/testproblems.py.
  • Hyperparameter grid search configurations can be customized via YAML files within deepobs/config/.

These features ensure that new optimizers and realistic tasks can be rapidly integrated and consistently benchmarked alongside established baselines (Schneider et al., 2019).
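The general idea of registering optimizers under a name, so that a benchmark loop can instantiate any of them uniformly, can be illustrated with a small registry. Everything here is hypothetical and independent of DeepOBS: the `OPTIMIZERS` dictionary, the `register` decorator, the NumPy optimizer classes, and the quadratic test function are assumptions made for the sketch, not the contents of deepobs/optimizers.py.

```python
import numpy as np

OPTIMIZERS = {}  # hypothetical name -> class registry

def register(name):
    """Class decorator that records an optimizer under `name`."""
    def decorator(cls):
        OPTIMIZERS[name] = cls
        return cls
    return decorator

@register("sgd")
class SGD:
    def __init__(self, lr):
        self.lr = lr

    def step(self, params, grad):
        return params - self.lr * grad

@register("momentum")
class Momentum:
    def __init__(self, lr, mu=0.9):
        self.lr, self.mu, self.v = lr, mu, None

    def step(self, params, grad):
        if self.v is None:
            self.v = np.zeros_like(params)
        self.v = self.mu * self.v - self.lr * grad   # heavy-ball velocity
        return params + self.v

def minimize_quadratic(name, steps=100, **hyperparams):
    """Run a registered optimizer on f(theta) = ||theta||^2."""
    opt = OPTIMIZERS[name](**hyperparams)
    theta = np.array([1.0, -2.0])
    for _ in range(steps):
        theta = opt.step(theta, 2.0 * theta)         # exact gradient
    return float(theta @ theta)

sgd_loss = minimize_quadratic("sgd", lr=0.1)
momentum_loss = minimize_quadratic("momentum", lr=0.1, mu=0.9)
```

Because the benchmark loop only depends on the `step` interface, a new optimizer needs one decorated class and no changes elsewhere.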

Best practices established by DeepOBS for optimizer comparison include:

  • Running at least 5–10 independent seeds and reporting metrics as mean ± standard deviation.
  • Reporting all four primary curves: training and test loss, training and test accuracy.
  • Tuning learning rates via log-uniform grid searches within a fixed budget shared across optimizers, and reporting hyperparameter sensitivity (tunability), including sensitivity plots.
  • Benchmarking per-iteration computational cost (τ) relative to SGD, and plotting performance as a function of effective iterations (iterations·τ).
  • Avoiding cherry-picking by performing the analysis within the fixed DeepOBS test suite.
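The relative cost τ and the effective-iteration axis can be estimated with a simple timing harness. This is a sketch under assumptions, not the DeepOBS speed benchmark: `time_per_step`, `relative_cost`, and the two dummy step functions are hypothetical, and a real measurement would time actual optimizer steps on the same problem and hardware.

```python
import time
import numpy as np

def time_per_step(step_fn, num_steps=200):
    """Average wall-clock seconds per call of `step_fn`."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return (time.perf_counter() - start) / num_steps

def relative_cost(opt_step, sgd_step, num_steps=200):
    """tau: per-iteration cost of an optimizer relative to SGD."""
    return time_per_step(opt_step, num_steps) / time_per_step(sgd_step, num_steps)

def effective_iterations(iterations, tau):
    """Rescale the iteration axis by the relative cost tau."""
    return np.asarray(iterations, dtype=float) * tau

# Dummy workloads standing in for one SGD step and one costlier step.
def sgd_step():
    sum(range(1000))

def costly_step():
    sum(range(3000))

tau = relative_cost(costly_step, sgd_step)
x_axis = effective_iterations([0, 100, 200], tau=1.5)
```

Plotting against `x_axis` instead of raw iteration counts penalizes optimizers whose steps are more expensive than SGD's, which is the point of the effective-iteration convention.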

Adherence to these guidelines supports the reliability, comparability, and reproducibility of results, addressing significant gaps in the prior stochastic optimizer benchmarking literature (Schneider et al., 2019).
