DeepOBS Suite
- DeepOBS is an open-source Python suite that standardizes benchmarking protocols for stochastic optimizers, enabling reproducible comparison across diverse deep learning tasks.
- It automates experiment orchestration, data management, and output generation, including publication-ready LaTeX figures and tables.
- Supporting both TensorFlow and PyTorch, DeepOBS integrates hyperparameter grid searches and precomputed baselines for optimizers like SGD, Momentum, and Adam.
DeepOBS is an open-source Python suite designed to provide a standardized, reproducible protocol for benchmarking stochastic optimization algorithms in deep learning. It addresses three challenges inherent in optimizer benchmarking: generalization performance, optimizer tunability, and the stochasticity induced by mini-batch sampling and random initialization. DeepOBS automates most benchmarking steps, supplies a wide and extensible range of test problems (from toy quartics to large-scale residual networks), and provides realistic, precomputed baselines for prominent optimizers. It is implemented in TensorFlow with support for PyTorch and includes output backends that produce publication-ready LaTeX figures and tables (Schneider et al., 2019).
1. Motivation and Benchmarking Challenges
Deep learning optimizer benchmarking involves three core challenges: generalization, stochasticity, and tunability. Generalization concerns the fact that test-set performance, rather than training loss, is what practitioners ultimately care about. Stochasticity arises from mini-batch selection and random weight initialization, so run-to-run variability must be treated statistically. Tunability captures the fact that optimizers expose significantly different hyperparameter spaces, which complicates fair, quantitative comparison. DeepOBS was developed to provide:
- Systematic protocols for reproducible, fair benchmarking of stochastic optimizers, including multiple runs with independent random seeds and reporting mean ± standard deviation for all core performance metrics.
- Automated routines for hyperparameter search on fixed grids.
- Built-in measurement of optimizer speed relative to a stochastic gradient descent (SGD) baseline, reporting both wall-clock and effective iteration costs.
- A turn-key package for experiment orchestration, data/model management, experiment logging, and output generation (Schneider et al., 2019).
2. Evaluation Protocol and Metrics
In DeepOBS, each benchmark problem is formalized by a parameter vector $\theta$ and a stochastic loss $L(\theta)$, obtained by averaging a pointwise loss function $\ell$ over mini-batches of data. The evaluation protocol reports four curves for each optimizer/problem pair:
- Training loss: computed on a dedicated "train-eval" set.
- Test loss: on the held-out test set.
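- Training accuracy: the fraction of correctly classified examples in the train-eval set.
- Test accuracy: the fraction of correctly classified examples in the held-out test set.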
DeepOBS runs a default number of independent repeats with different random seeds for each optimizer/problem pair and reports both the mean and the standard deviation of every metric across those runs.
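As a rough illustration of this mean ± standard deviation reporting, the sketch below aggregates per-seed metric curves at each evaluation point. It is not DeepOBS code: the function name `summarize_runs` and the accuracy values are hypothetical, and the choice of the sample standard deviation (dividing by n - 1) is an assumption.

```python
import statistics


def summarize_runs(curves):
    """Aggregate per-seed metric curves into (mean, std) per evaluation point.

    curves: list of equal-length lists, one list per random seed.
    Uses the sample standard deviation (an assumption, not taken from DeepOBS).
    """
    if len(curves) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all runs must share the same number of evaluation points")
    # zip(*curves) groups the values of all seeds at the same evaluation point.
    return [(statistics.mean(v), statistics.stdev(v)) for v in zip(*curves)]


# Hypothetical test-accuracy curves from three seeds, three evaluation points each.
runs = [
    [0.50, 0.70, 0.80],
    [0.52, 0.68, 0.82],
    [0.48, 0.72, 0.78],
]
summary = summarize_runs(runs)
for epoch, (mean, std) in enumerate(summary):
    print(f"epoch {epoch}: {mean:.3f} ± {std:.3f}")
```

Plotting the mean curve with a shaded band of one standard deviation is a common way to present such a summary.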
Speed is quantified as a wall-clock ratio $\tau$ of the optimizer's per-iteration cost relative to SGD (by default measured on MLP/MNIST), with performance also reported as a function of "effective iterations" (iterations $\cdot\,\tau$). Hyperparameter tuning employs a log-uniform grid search (e.g., over the learning rate) with the same budget for all optimizers, and the spread of performance across the grid is reported to characterize tunability (Schneider et al., 2019).
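The two quantities in this paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DeepOBS implementation: the helper names, the toy step functions standing in for optimizer updates, and the grid bounds are all hypothetical.

```python
import math
import time


def log_grid(low, high, num):
    """Return `num` >= 2 values evenly spaced on a log10 scale from low to high."""
    if num < 2:
        raise ValueError("num must be at least 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


def time_per_step(step_fn, num_steps=1000):
    """Average wall-clock seconds per call of step_fn."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return (time.perf_counter() - start) / num_steps


def effective_iterations(iterations, tau):
    """Rescale iteration counts by the cost ratio tau relative to SGD."""
    return [it * tau for it in iterations]


# Toy stand-ins for one update of SGD and of a costlier optimizer (hypothetical).
def sgd_like_step():
    sum(i * i for i in range(50))


def costlier_step():
    sum(i * i for i in range(100))


tau = time_per_step(costlier_step) / time_per_step(sgd_like_step)
grid = log_grid(1e-4, 1e-1, 4)  # hypothetical learning-rate range
scaled = effective_iterations([0, 100, 200], 1.5)
print(f"tau = {tau:.2f}, grid = {grid}, scaled iterations = {scaled}")
```

Plotting against the rescaled axis penalizes an optimizer whose individual steps cost more than an SGD step, even if it needs fewer of them.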
3. Package Structure and Workflow
The DeepOBS suite is organized into modular components: data/ (loading and preprocessing), models/ (architectures and loss definitions), runners/ (training and metrics logging), baselines/ (precomputed SGD, Momentum, Adam results), and visualize/ (generation of pgfplots and LaTeX summary tables).
DeepOBS can be installed from PyPI (pip install deepobs) or directly from its GitHub repository.
A typical usage pattern involves selecting a test problem and an optimizer, specifying hyperparameters (e.g., a learning rate obtained from grid search), and running 3 repeats.
Results are stored as metrics.npz in a separate subdirectory for each run and can be post-processed by the visualization toolkit, which generates LaTeX-compatible plots and tables (Schneider et al., 2019).
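The workflow described in this section (tune on a grid, then repeat the chosen setting with several seeds) can be illustrated without the package itself. The sketch below is a stand-in, not the DeepOBS API: `run_once`, the toy loss 0.5·w², the noise level, the grid, and the seed values are all assumptions, and results are kept in memory rather than written to metrics.npz.

```python
import random
import statistics


def run_once(learning_rate, seed, num_steps=200):
    """Run noisy SGD on the toy loss 0.5 * w**2 and return the final loss."""
    rng = random.Random(seed)       # the seed fixes both init and gradient noise
    w = rng.uniform(-5.0, 5.0)      # random initialization
    for _ in range(num_steps):
        grad = w + rng.gauss(0.0, 0.1)  # exact gradient plus mini-batch-like noise
        w -= learning_rate * grad
    return 0.5 * w * w


# Step 1: grid search over a small, hypothetical learning-rate grid.
grid = [1e-3, 1e-2, 1e-1]
tuning_scores = {lr: run_once(lr, seed=0) for lr in grid}
best_lr = min(tuning_scores, key=tuning_scores.get)

# Step 2: repeat the best setting with three independent seeds.
final_losses = [run_once(best_lr, seed=s) for s in (42, 43, 44)]

print(f"best learning rate: {best_lr}")
print(
    f"final loss: {statistics.mean(final_losses):.4f} "
    f"± {statistics.stdev(final_losses):.4f}"
)
```

Using a separate seed for tuning and for the final repeats keeps the reported spread from being biased by the seed that selected the learning rate.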
4. Benchmark Problems and Precomputed Baselines
DeepOBS ships with 20+ test problems, with a core subset widely used for optimizer design iteration and realistic benchmarking. The "small set" (P1–P4) allows rapid prototyping; the "large set" (P5–P8) provides more realistic or computationally intensive tasks.
| Problem | Description |
|---|---|
| P1 | 2D nonconvex "Quartic" function |
| P2 | MNIST VAE (fully connected) |
| P3 | Fashion-MNIST CNN (small convnet) |
| P4 | CIFAR-10 small CNN |
| P5 | CIFAR-100 small ResNet |
| P6 | ImageNet ResNet-50 |
| P7 | Character-level LSTM on PTB |
| P8 | GAN on CelebA |
Baseline results (mean ± σ) are precomputed for SGD, Momentum, and Adam. These are supplied in the baselines folder and loaded automatically during visualization. For example, on the small test set:
| Optimizer | P1 Test Loss | P2 Test Loss | P3 Test Acc | P4 Test Acc |
|---|---|---|---|---|
| SGD | 87.11 ± 0.04 | 27.73 ± 0.06 | 63.3% ± 0.2 | 91.44% ± 0.09 |
| Momentum | 87.02 ± 0.05 | 52.92 ± 0.00 | 72.85% ± 0.15 | 92.54% ± 0.06 |
| Adam | 87.24 ± 0.03 | 52.92 ± 0.00 | 71.63% ± 0.10 | 92.26% ± 0.05 |
5. Output Generation and Reporting
DeepOBS provides two ready-to-use output backends:
- pgfplots/LaTeX for rendering learning curves.
- LaTeX tables for summarizing final performance, speed, and tunability.
The LaTeX output is designed for direct inclusion in academic publications: it contains commands such as \addplot for pgfplots, and tables with mean and standard deviation entries. One such generated table is small_baselines.tex (Schneider et al., 2019).
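To give a sense of what a table with mean and standard deviation entries might look like, the sketch below renders a plain LaTeX tabular from a few of the baseline values quoted in the previous section. The function `latex_table` and the chosen layout are hypothetical and do not reproduce the file DeepOBS actually generates.

```python
def latex_table(header, rows):
    """Render a LaTeX tabular from a header and (label, [(mean, std), ...]) rows."""
    lines = [
        "\\begin{tabular}{l" + "c" * (len(header) - 1) + "}",
        "\\hline",
        " & ".join(header) + " \\\\",
        "\\hline",
    ]
    for label, cells in rows:
        # Each cell becomes a math-mode "mean \pm std" entry with two decimals.
        formatted = [f"${mean:.2f} \\pm {std:.2f}$" for mean, std in cells]
        lines.append(" & ".join([label] + formatted) + " \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


header = ["Optimizer", "P1 Test Loss", "P2 Test Loss"]
rows = [
    ("SGD", [(87.11, 0.04), (27.73, 0.06)]),
    ("Momentum", [(87.02, 0.05), (52.92, 0.00)]),
]
table = latex_table(header, rows)
print(table)
```

Generating the table from the stored run statistics, rather than typing numbers by hand, keeps the reported values consistent with the logged results.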
6. Extensibility and Customization
DeepOBS is designed for extensibility:
- To register a new optimizer, a subclass of tf.train.Optimizer or the PyTorch equivalent must be written and registered in deepobs/optimizers.py.
- To add a new problem, a model file defining the architecture and loss should be placed in deepobs/models/ and referenced in deepobs/testproblems.py.
- Hyperparameter grid search configurations can be customized via YAML files within deepobs/config/.
These features ensure that new optimizers and realistic tasks can be rapidly integrated and consistently benchmarked alongside established baselines (Schneider et al., 2019).
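The general idea of registering an optimizer under a name so that a benchmark runner can look it up can be sketched in plain Python. Everything here is hypothetical: the `OPTIMIZERS` registry, the `register` decorator, and the `step` interface are illustrative stand-ins, and the real package expects a framework optimizer subclass instead.

```python
OPTIMIZERS = {}  # hypothetical registry: name -> optimizer class


def register(name):
    """Class decorator that stores an optimizer class under `name`."""
    def decorator(cls):
        OPTIMIZERS[name] = cls
        return cls
    return decorator


@register("momentum")
class Momentum:
    """Heavy-ball momentum on a flat list of parameters (illustrative only)."""

    def __init__(self, learning_rate, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(params)
        # v <- momentum * v + g, then p <- p - lr * v
        self.velocity = [self.momentum * v + g for v, g in zip(self.velocity, grads)]
        return [p - self.learning_rate * v for p, v in zip(params, self.velocity)]


# A runner would look the optimizer up by name and drive it in a loop.
optimizer = OPTIMIZERS["momentum"](learning_rate=0.1)
params = [1.0, -2.0]
for _ in range(3):
    grads = params[:]  # gradient of 0.5 * ||params||^2 is params itself
    params = optimizer.step(params, grads)
print(params)
```

Looking optimizers up by name lets the same benchmark loop run any registered optimizer without changes to the runner.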
7. Recommended Practices and Protocol Compliance
Best practices established by DeepOBS for optimizer comparison include:
- Running at least 5–10 independent seeds and reporting metrics as mean ± standard deviation.
- Reporting all four primary curves: training and test loss, training and test accuracy.
- Tuning learning rates via log-uniform grid searches within a fixed budget shared across optimizers.
- Benchmarking per-iteration computational cost (τ) relative to SGD and plotting performance as a function of effective iterations (iterations·τ).
- Avoiding cherry-picking by performing the analysis within the fixed DeepOBS test suite.
- Including hyperparameter-sensitivity plots to demonstrate optimizer tunability.
Adherence to these guidelines ensures reliability, comparability, and reproducibility of results, addressing significant gaps in prior stochastic optimizer benchmarking literature (Schneider et al., 2019).