ParaDnn Benchmark Suite

Updated 13 September 2025
  • ParaDnn Benchmark Suite is a parameterized framework that generates thousands of deep neural network models with configurable architectures and hyperparameters.
  • It supports fully connected, convolutional, and recurrent neural networks to evaluate performance trade-offs across diverse workload types and hardware platforms.
  • The suite aids in identifying hardware bottlenecks and guiding optimization efforts by analyzing metrics like FLOPS utilization, memory usage, and inference time.

The ParaDnn Benchmark Suite is a parameterized framework designed to systematically evaluate deep learning models and hardware platforms through extensive sweeps of configurable model architectures and hyperparameters. Unlike static benchmarking suites containing only fixed workloads, ParaDnn enables the generation of end-to-end models encompassing fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks—thus supporting comprehensive performance characterization of both present and future deep learning paradigms.

1. Design Principles and Parameterization

ParaDnn is structured around the principle of exhaustive parameter sweeps, allowing the generation of thousands of neural network models with variable computational footprints. The suite defines architectural templates using formal notation and implements them to be flexible along multiple axes:

  • Fully-Connected Models (FC): Constructed as sequential layers, each layer parameterized by the number of nodes, input/output sizes, and overall model depth. The architectural template is expressed as Input → [Layer[Node]] → Output.
  • Convolutional Neural Networks (CNN): Based on modern residual architectures. Models have four groups of residual/bottleneck blocks with parameters controlling the number of blocks per group and the number of channels/filters (which typically increase by a factor of two in each group), ending with a fully connected layer.
  • Recurrent Neural Networks (RNN): Configurable sequences of RNN, LSTM, or GRU cells. The architecture allows adjustment of input sequence length, embedding size, vocabulary size, and the number of layers.

All model types support a wide range of batch sizes and are built with tunable hyperparameter ranges, as specified in comprehensive tables in the source paper (Wang et al., 2019). This design enables ParaDnn to systematically generate benchmarks spanning up to six orders of magnitude in parameter count.
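To make the parameterization concrete, the following is a minimal sketch of how such an FC template could be generated in TensorFlow/Keras. The function name and argument names are illustrative, not ParaDnn's actual API; they simply mirror the axes named above (layer count, nodes per layer, input/output sizes).

```python
# Hypothetical sketch of a ParaDnn-style FC model generator
# (Input -> [Layer[Node]] -> Output); not the suite's actual source code.
import tensorflow as tf

def make_fc_model(num_layers, nodes_per_layer, input_size, output_size):
    """Build a fully connected model from the swept parameter axes."""
    inputs = tf.keras.Input(shape=(input_size,))
    x = inputs
    for _ in range(num_layers):
        x = tf.keras.layers.Dense(nodes_per_layer, activation="relu")(x)
    outputs = tf.keras.layers.Dense(output_size)(x)  # linear output layer
    return tf.keras.Model(inputs, outputs)

# One point of a sweep: a 16-layer, 4096-node model.
model = make_fc_model(num_layers=16, nodes_per_layer=4096,
                      input_size=2000, output_size=1000)
```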

2. Supported Workload Classes and Benchmark Coverage

ParaDnn's model generation covers three principal workload classes in deep learning:

| Model Type | Parameter Axes | Representative Architecture |
|---|---|---|
| Fully-Connected (FC) | Layers, nodes, batch size | Input → [Layer[Node]] → Output |
| CNN | Blocks, filters, batch size | Input → [Residual/Bottleneck Block]×4 → FC → Output |
| RNN | Layers, embedding, sequence length | Input → [RNN/LSTM/GRU Cell] → Output |

This parameterization allows researchers to study scalability, resource utilization, and computational efficiency across diverse neural architectures and operating regimes.
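A sweep over these axes amounts to enumerating the Cartesian product of per-axis value lists. The sketch below illustrates the idea for the FC axes; the specific value ranges are examples, not the exact tables from Wang et al. (2019).

```python
# Illustrative enumeration of an FC parameter sweep; the value ranges are
# examples only, not the ranges tabulated in Wang et al. (2019).
from itertools import product

layer_counts = [4, 8, 16, 32, 64, 128]
node_counts  = [32, 128, 512, 2048, 8192]
batch_sizes  = [64, 256, 1024, 4096, 16384]

configs = [
    {"num_layers": l, "nodes_per_layer": n, "batch_size": b}
    for l, n, b in product(layer_counts, node_counts, batch_sizes)
]
print(len(configs))  # 6 * 5 * 5 = 150 FC benchmark points
```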

3. Hardware Platforms Evaluated

ParaDnn's benchmarking campaign covers the following hardware accelerators:

  • Google Cloud TPU v2/v3: Custom ASICs featuring peak compute of 180 TFLOPS (v2) and 420 TFLOPS (v3), with high memory bandwidth (2400 GB/s per board for v2). The v3 revision provides a 1.5× improvement in memory bandwidth and doubles memory per core. FLOPS utilization is maximized with wider models and large batch sizes, but diminished by multi-core communication overhead (e.g., up to ~38% reduction in FC model performance across 8 cores).
  • NVIDIA V100 GPU: 16 GB HBM2 memory, 900 GB/s bandwidth, up to 125 TFLOPS. Excels in flexible execution, especially for small batch sizes and irregular computation structures (e.g., non-MatMul operations).
  • Intel Skylake CPU: n1-standard-32 (Google Cloud), 16 physical cores, 120 GB DDR4. Although it has the lowest peak compute (~2 TFLOPS), it offers the largest memory capacity, supporting oversized models that would exceed accelerator memory limits.

TPU utilization is shown to increase rapidly with model width and batch size, but only marginally with model depth, indicating underutilized parallelism along the depth dimension. GPU platforms outperform TPUs for small-batch workloads and handle large FC models more efficiently, while CPUs provide competitive FLOPS utilization for some RNN workloads, mainly because of their large memory capacity.
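The width/batch sensitivity follows from a back-of-the-envelope FLOP count: per training step, the multiply-accumulate work of an FC model grows linearly with batch size and roughly quadratically with layer width, so wide, large-batch models are what fill the matrix units. The sketch below estimates utilization under the common (approximate) rule of thumb that a training step costs about 3× the forward-pass FLOPs; the step time is a hypothetical measurement, not a figure from the paper.

```python
# Rough FLOP estimate for one training step of an FC model (sketch).
# The 3x forward-pass factor for forward+backward is an approximation.
def fc_training_flops(num_layers, nodes, input_size, output_size, batch):
    dims = [input_size] + [nodes] * num_layers + [output_size]
    fwd = sum(2 * batch * dims[i] * dims[i + 1] for i in range(len(dims) - 1))
    return 3 * fwd

peak_tpu_v2 = 180e12   # FLOPS per TPU v2 board, as quoted above
step_time_s = 0.5      # hypothetical measured step time

achieved = fc_training_flops(num_layers=32, nodes=4096,
                             input_size=2000, output_size=1000,
                             batch=16384) / step_time_s
print(f"FLOPS utilization ~ {achieved / peak_tpu_v2:.1%}")
```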

4. Performance Indices and Analytical Methodology

ParaDnn facilitates detailed benchmarking by quantifying key performance indices:

  • Recognition Accuracy: Not a primary focus in synthetic model experiments, but relevant in coupled evaluations with real-world models (e.g., ResNet-50, MobileNet).
  • Model Complexity: Parameterized via model depth, width, and total parameter count; benchmarks sweep from low to extremely high values (up to six orders of magnitude).
  • Computational Complexity: Reported as FLOPS utilization, operation types, and arithmetic intensity. Roofline analyses are conducted to categorize compute-bound vs. memory-bound workloads.
  • Memory Usage: Quantified in terms of static allocation (parameter size) and dynamic allocation (batch dependence).
  • Throughput/Inference Time: Measured in milliseconds and converted to frames per second (FPS) for comparison. Performance is strongly affected by software stack optimizations.

Special emphasis is placed on platform-specific bottlenecks: for the TPU, inter-core communication and infeed bandwidth; for the GPU, memory management and warp scheduling; for the CPU, limited peak throughput but scalability in model size.
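In the roofline analysis, a workload's arithmetic intensity (FLOPs per byte moved to/from memory) is compared against the platform's ridge point, peak FLOPS divided by peak memory bandwidth: below the ridge point the workload is memory-bound, above it compute-bound. A minimal sketch follows, using the TPU v2 board figures quoted in this article; the two example workloads and their intensities are hypothetical.

```python
# Roofline classification sketch: compute-bound vs. memory-bound.
# Platform numbers are the board-level figures quoted in this article;
# the example workload intensities are hypothetical.
def attainable_flops(arith_intensity, peak_flops, mem_bw):
    """Roofline model: min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, mem_bw * arith_intensity)

peak_flops = 180e12            # TPU v2 board, FLOPS
mem_bw     = 2400e9            # TPU v2 board, bytes/s
ridge      = peak_flops / mem_bw   # FLOPs/byte where the roof flattens (75)

for name, intensity in [("small-batch FC", 20.0), ("wide CNN", 300.0)]:
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"{name}: {bound}, attainable "
          f"{attainable_flops(intensity, peak_flops, mem_bw)/1e12:.0f} TFLOPS")
```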

5. Real-World Model Integration and Applicability

Beyond synthetic sweeps, ParaDnn incorporates six canonical models—Transformer, ResNet-50, RetinaNet, DenseNet, MobileNet, and SqueezeNet—to anchor benchmarks in realistic settings. Performance comparisons reveal:

  • TPUs achieve maximal throughput in compute-dense, regular workloads (high batch size and width), especially with v3's improvements (~3× or more in certain operations).
  • GPUs provide the highest performance for irregular computations, small batch sizes, and oversized FC models. Memory management is more robust compared to TPUs.
  • CPUs support the largest models due to memory headroom, although with lower maximum throughput.

Efficiency gains from software stacks are pronounced: TensorFlow's XLA compiler yields up to 7× speedup for FC models and 2.5× for RNNs over seven months of releases; CUDA updates similarly improve GPU performance, and bfloat16 support reduces memory pressure on the TPU.
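These throughput and speedup figures derive directly from measured step times: examples per second is batch size divided by step latency, and the software speedup is the ratio of old to new latency for the same model and batch size. A small helper is sketched below; the latencies are hypothetical values chosen to illustrate a 7× ratio, not measurements from the paper.

```python
# Converting a measured step/inference time to throughput (examples/s, "FPS")
# and software-stack speedup; the latencies below are hypothetical.
def throughput_fps(batch_size, latency_ms):
    return batch_size * 1000.0 / latency_ms

old_ms, new_ms = 84.0, 12.0   # same model and batch, before/after compiler update
print(throughput_fps(batch_size=1024, latency_ms=new_ms))  # examples per second
print(f"speedup: {old_ms / new_ms:.1f}x")                  # 7.0x
```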

6. Benchmarking Insights and Design Implications

ParaDnn reveals significant insights regarding system co-design:

  • Accelerator FLOPS are best exploited with large batch sizes and wide models.
  • Multi-chip overheads and memory bandwidth constraints impose limitations, especially on FC and RNN workloads that span multiple TPU cores.
  • GPUs’ flexibility mitigates small-batch performance degradation and supports irregular model structures.
  • Software enhancements can dramatically alter performance post-hardware deployment, underscoring the need for continuous benchmarking as frameworks evolve.

A plausible implication is that future hardware designs should address memory-bandwidth bottlenecks and inter-core communication efficiency to further boost deep learning throughput.

7. Significance, Availability, and Future Directions

ParaDnn serves as both a probe for hardware-software systems and a comprehensive benchmarking toolkit. Its exhaustive parameterization allows deep insights into compute/memory trade-offs and platform-specific behaviors across FC, CNN, and RNN workloads. Versatility in model synthesis positions it as an invaluable resource for hardware architects, framework developers, and applied researchers.

The suite, coupled with real-world models and systematic hyperparameter sweeps, enables identification of hardware bottlenecks, measurement of software stack improvements, and guidance for platform selection based on application requirements. As the landscape evolves toward ever-larger models and specialized accelerators, ParaDnn’s parameterized methodology provides a scalable basis for future benchmarking studies.

All benchmarking models and detailed methodologies are publicly documented and reproducible, facilitating rapid iteration and cross-platform investigation as deep learning frameworks and hardware architectures advance (Wang et al., 2019).

References

1. Wang, Y. E., Wei, G.-Y., & Brooks, D. (2019). Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv preprint.