FD-Bench: Modular Benchmarking for Fluid Simulation
- FD-Bench is a modular benchmarking framework for data-driven fluid simulation that decomposes neural PDE solvers into spatial, temporal, and loss modules for clear, fair evaluation.
- The framework provides a unified, open-source codebase with 10 flow datasets and 85 baseline models, enabling direct comparisons between machine learning and classical CFD solvers.
- FD-Bench supports detailed performance metrics and ablation studies, offering actionable insights into the trade-offs among self-attention, Fourier, and graph-based architectures.
FD-Bench refers to three distinct, state-of-the-art benchmarking frameworks in different computational research domains: (1) a modular and fair benchmark for data-driven fluid simulation (Wang et al., 25 May 2025), (2) an automated benchmarking framework for digital forensic tool validation ("AutoDFBench 1.0") (Wickramasekara et al., 18 Dec 2025), and (3) a full-duplex benchmarking pipeline for spoken dialogue systems (Peng et al., 25 Jul 2025). Each framework provides a domain-specific methodology for reproducible, granular, and extensible evaluation, addressing core limitations in its respective subfield. The subsequent sections focus on the comprehensive and rigorous FD-Bench for data-driven fluid simulation (Wang et al., 25 May 2025), with brief notes on the other variants at the end.
1. Motivation and Design Principles
FD-Bench was introduced to unify and standardize the assessment of neural PDE solvers for fluid dynamics, motivated by three persistent limitations: fragmented PDE datasets, entangled architecture innovations (spatial, temporal, loss), and the absence of systematic evaluation protocols—especially in comparison to classical CFD solvers. Its design is characterized as fair, modular, comprehensive, and reproducible. The framework addresses these gaps with four primary contributions:
- Modular decomposition of solver architectures into spatial, temporal, and loss axes, enabling direct ablation and apples-to-apples evaluation.
- A framework for direct comparison to traditional numerical solvers at matched error regimes.
- Fine-grained generalization analysis over spatial resolution, initial/boundary conditions, and rollout horizons.
- An open-source, extensible codebase encompassing 10 flow datasets and 85 baseline re-implementations under a unified API (Wang et al., 25 May 2025).
2. Modular Architecture: Spatial, Temporal, and Loss Decomposition
A salient innovation is the modularization of neural PDE solvers into three orthogonal components:
Spatial Module ($\mathcal{S}_\theta$): encodes the flow field $u_t$ at a fixed time $t$ into a latent representation $z_t = \mathcal{S}_\theta(u_t)$. FD-Bench enumerates:
- Fourier/spectral mixing: $z = \mathcal{F}^{-1}\big(R_\theta \cdot \mathcal{F}(u)\big)$ with a learned spectral filter $R_\theta$ (see the sketch after this list)
- Self-attention: $z = \operatorname{softmax}\big(QK^{\top}/\sqrt{d_k}\big)\,V$
- Spatial convolutions, graph convolutions, reduced-order models (POD), and implicit neural representations
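As a concrete illustration of the Fourier/spectral mixing option, the following is a minimal PyTorch sketch of $z = \mathcal{F}^{-1}(R_\theta \cdot \mathcal{F}(u))$ with low-mode truncation; the class name and truncation details are illustrative assumptions, not FD-Bench's actual implementation.

```python
import torch
import torch.nn as nn

class SpectralMixing2d(nn.Module):
    """Applies z = F^{-1}(R . F(u)) with a learned low-frequency filter R."""
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes  # number of retained Fourier modes per axis
        scale = 1.0 / (channels * channels)
        # complex weights mixing channels on the retained low modes
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, channels, H, W) real-valued flow field
        u_hat = torch.fft.rfft2(u)                      # F(u)
        z_hat = torch.zeros_like(u_hat)
        m = self.modes
        # multiply the lowest m x m modes by the learned filter R
        z_hat[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", u_hat[:, :, :m, :m], self.weight
        )
        return torch.fft.irfft2(z_hat, s=u.shape[-2:])  # F^{-1}(R . F(u))
```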
Temporal Module ($\mathcal{T}_\phi$): aggregates latent states across time into a dynamics or trajectory representation:
- Autoregression (AR), next-step rollout, temporal bundling (joint prediction of $k$ future steps), temporal self-attention, and neural ODEs (the sketch after this list contrasts AR rollout with bundling)
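To make the temporal options concrete, the following sketch contrasts plain autoregressive rollout with temporal bundling; the `model` call signatures and tensor shapes are assumptions for illustration, not the benchmark's actual interface.

```python
import torch

def rollout_autoregressive(model, u0, n_steps):
    """One network call per step: u_{t+1} = model(u_t)."""
    states, u = [], u0
    for _ in range(n_steps):
        u = model(u)                 # (batch, channels, H, W)
        states.append(u)
    return torch.stack(states, dim=1)

def rollout_bundled(model, u0, n_steps, k):
    """One network call per k steps: (u_{t+1}, ..., u_{t+k}) = model(u_t)."""
    states, u = [], u0
    for _ in range(n_steps // k):
        bundle = model(u)            # (batch, k, channels, H, W)
        states.append(bundle)
        u = bundle[:, -1]            # condition the next call on the last predicted step
    return torch.cat(states, dim=1)
```

Bundling reduces the number of network calls per rollout by a factor of $k$, which is one reason it fares well at fixed compute in the leaderboard findings below.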
Loss Module: Defines the training and evaluation objective, including:
- Physical MSE: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\|\hat{u}_i - u_i\|_2^2$
- Diffusion denoising, flow matching losses, and PINN residuals
By systematically crossing the 5 spatial, 5 temporal, and 4 loss options, and instantiating 85 of the resulting combinations as baseline models, FD-Bench ensures methodological consistency and allows module-level ablation (Wang et al., 25 May 2025). A schematic enumeration of this grid follows.
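The grid can be pictured as below; the option names mirror the text above, but the registry and the validity filter are assumptions (the paper reports 85 instantiated baselines, fewer than the 100 raw combinations).

```python
from itertools import product

# Hypothetical option registries following the module taxonomy above.
SPATIAL = ["fourier", "attention", "conv", "graph", "inr"]
TEMPORAL = ["autoregressive", "next_step", "bundling", "temporal_attention", "neural_ode"]
LOSS = ["mse", "diffusion", "flow_matching", "pinn_residual"]

def enumerate_baselines(is_valid=lambda s, t, l: True):
    """Yield every (spatial, temporal, loss) triple that passes a validity filter."""
    for s, t, l in product(SPATIAL, TEMPORAL, LOSS):
        if is_valid(s, t, l):
            yield s, t, l

print(sum(1 for _ in enumerate_baselines()))  # 100 raw combinations; 85 are instantiated
```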
3. Scope: Datasets and Problem Coverage
FD-Bench encompasses 10 canonical 2D flow scenarios spanning a range of PDE structures and complexity:
| Scenario | Data Type | Spatial Resolutions |
|---|---|---|
| Incompressible Navier–Stokes (NS) | Grid (vorticity form) | – |
| Compressible NS | Grid | – |
| Stochastic NS (white noise) | Grid | – |
| Kolmogorov flow (forced) | Grid | – |
| Diffusion–Reaction (FitzHugh–Nagumo) | Grid | – |
| Taylor–Green vortex | SPH particle | Varies |
| Reverse Poiseuille flow | SPH particle | Varies |
| Advection (linear PDE) | Grid | – |
| Lid-driven cavity | SPH particle | Varies |
| Burgers’ equation | Grid | – |
Each dataset provides high-resolution, multi-condition simulations (varying Reynolds number, viscosity, or Mach number) with 100–1000 time steps per trajectory. Train/val/test splits are fixed for comparability (Wang et al., 25 May 2025).
4. Baseline Models, Comparison Methodology, and Classical Solver Integration
FD-Bench standardizes the re-implementation of 85 baselines across all major neural PDE solver families and modules:
- Spatial: Fourier-based (FNO, AFNO, Geo-FNO), graph-based (MeshGraphNets, GNS), convolutional (U-Net, CNN), self-attention (Transolver, HAMLET), implicit neural solvers (DINo), ROM, MLP-based architectures (DeepONet).
- Temporal: Autoregressive, next-step, temporal bundling, ODEs, attention across time.
- Loss: MSE, residual, diffusion/noise, flow matching losses.
Traditional CFD solvers are included via pseudo-spectral or finite-volume solvers (e.g., semi-implicit Heun for incompressible NS), calibrated to match the one-step error regime of neural baselines, enabling normalized, direct accuracy/performance comparisons.
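The matched-error calibration can be pictured as follows: coarsen the classical solver until its one-step error matches the neural baseline's, so that runtime comparisons happen at equal accuracy. Here `solve_one_step` is a hypothetical stand-in for a pseudo-spectral or finite-volume step, and the coarse-to-fine search is an assumed strategy, not the paper's exact procedure.

```python
import numpy as np

def rmse(pred: np.ndarray, true: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def calibrate_resolution(solve_one_step, u0, u1_true, neural_rmse, resolutions):
    """Return the coarsest grid whose one-step RMSE is at or below the neural model's."""
    for res in sorted(resolutions):                 # search coarse -> fine
        pred = solve_one_step(u0, resolution=res)
        if rmse(pred, u1_true) <= neural_rmse:
            return res
    return max(resolutions)                         # fall back to the finest grid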
5. Evaluation Protocols and Metrics
Training is performed on 8×A6000 GPUs with the Adam optimizer, a cosine annealing schedule, and grid-searched hyperparameters. Evaluation splits are fixed per dataset. The following metrics are standardized:
- RMSE: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\|\hat{u}_i - u_i\|_2^2}$
- Normalized RMSE (nRMSE): RMSE divided by the $L_2$ norm of the true field
- fRMSE: RMSE restricted to low-, mid-, and high-frequency Fourier bands
- Efficiency: inference time, compute (GFLOPs), and memory footprint (GB)
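A minimal NumPy sketch of the error metrics, assuming 2D fields of shape (H, W); the fRMSE band edges are illustrative choices, not the benchmark's fixed values.

```python
import numpy as np

def rmse(pred, true):
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def nrmse(pred, true):
    # RMSE normalized by the RMS magnitude (L2 norm) of the true field
    return rmse(pred, true) / float(np.sqrt(np.mean(true ** 2)))

def frmse(pred, true, bands=((0, 4), (4, 16), (16, None))):
    """RMSE of Fourier coefficients restricted to low/mid/high radial wavenumber bands."""
    err = np.fft.fft2(pred - true)
    kx = np.fft.fftfreq(pred.shape[0]) * pred.shape[0]   # integer wavenumbers
    ky = np.fft.fftfreq(pred.shape[1]) * pred.shape[1]
    radius = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    out = []
    for lo, hi in bands:
        mask = (radius >= lo) & (radius < (hi if hi is not None else np.inf))
        out.append(float(np.sqrt(np.mean(np.abs(err[mask]) ** 2))))
    return out  # [low, mid, high]
```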
Generalization analysis is performed along three axes: (a) initial-condition shifts (zero-shot turbulence/forcing), (b) resolution transfer (training at one grid resolution and evaluating at another), and (c) long-horizon rollouts. Quantitatively, self-attention + temporal bundling + MSE achieves the lowest RMSE for compressible NS (0.057) (Wang et al., 25 May 2025).
6. Empirical Results and Leaderboard Findings
FD-Bench ranks baseline and hybrid configurations for each scenario by RMSE, fRMSE, and runtime. Key results:
- Compressible NS: Self-attention + bundling + MSE is optimal
- Diffusion–Reaction: Fourier + next-step + MSE ranks highest
- Kolmogorov flow: Self-attention + bundling excels
Empirical findings:
- Self-attention modules provide the highest accuracy at higher computational cost
- Fourier modules yield strong accuracy vs. efficiency trade-offs
- Temporal bundling outperforms AR/ODE for fixed compute
- Neural ODEs, although suboptimal at fixed compute, support irregular sampling
- Classical solvers are consistently outperformed by Fourier-based neural operators in both speed and accuracy at matched error regimes, with 10–50× speedups
- Eulerian (grid/mesh) discretizations exhibit superior long-horizon performance compared to Lagrangian (particle-based) methods
The public leaderboard and modular codebase support reproducibility, extension, and robust evaluation standards (Wang et al., 25 May 2025).
7. Future Directions and Impact
FD-Bench sets a foundational reproducibility and comparability standard for data-driven fluid dynamics. Recommendations include:
- Hybridization of Fourier priors with self-attention modules
- Efficient temporal models (sparse/linear attention)
- Hierarchical Eulerian–Lagrangian couplings to mitigate roll-out error
- Broader integration of stochastic and flow-matching losses for uncertainty quantification
The framework enables principled selection, ablation, and improvement of machine learning-based PDE solvers. Its public codebase and extensible API facilitate experimental rigor and extension by the broader community.
8. Notes on Alternate Frameworks Named "FD-Bench"
- AutoDFBench 1.0 (Wickramasekara et al., 18 Dec 2025): A modular benchmarking suite for digital forensic tool validation, integrating five CFTT (Computer Forensics Tool Testing) areas. Uses deterministic, per-test-case F1-score-based evaluation, a RESTful API, and comprehensive ground truth for reproducible, cross-tool assessment.
- FD-Bench (Full-Duplex Dialogue) (Peng et al., 25 Jul 2025): A benchmarking pipeline for evaluating spoken dialogue systems in full-duplex scenarios. Features simulation pipelines including GPT-4o dialogue synthesis, TTS with noise control, and interruption/timing-aware metrics such as SRR, SIR, SRIR, EIR, and various latency/quality measures.
The three frameworks are independently developed and unrelated; citations should establish context to make clear which is meant.