NNPDF Framework: Neural PDF Determination
- NNPDF Framework is a machine-learning system that determines proton, nuclear, and photon parton distribution functions using neural network ensembles and Monte Carlo replicas.
- It employs ensemble training with GPU acceleration and Bayesian hyperparameter optimization to enhance accuracy and efficiency in PDF extraction.
- Recent advancements include integrated nuclear fits, approximate N³LO evolution, and significant computational cost and energy reductions for large-scale QCD analyses.
The NNPDF (Neural Network Parton Distribution Function) framework is a comprehensive machine-learning–based system for the nonperturbative determination of proton, nuclear, and photon parton distribution functions (PDFs) from high-energy scattering data, with minimal theoretical bias and robust, data-driven uncertainty quantification. Originating in collider QCD phenomenology, NNPDF employs an ensemble-of-neural-networks approach for PDF parametrization, a Monte Carlo method to propagate all sources of experimental and theoretical uncertainty, and a statistically rigorous workflow for cross-validation and hyperparameter optimization. Recent methodological advances include GPU-accelerated ensemble training, Bayesian hyperparameter optimization, integrated treatment of proton and nuclear PDFs, and support for approximate N³LO DGLAP evolution.
1. Theoretical Foundations
At the core of NNPDF is the view of PDF determination as an inference problem in function space: the object of inference is not a finite-dimensional parameter vector but a set of distributions $f_i(x, Q^2)$, one for each parton flavor $i$, as functions of the momentum fraction $x$ and energy scale $Q^2$. The extraction is accomplished via an ensemble of Monte Carlo replicas, each representing a possible realization of the PDFs consistent with experimental data and their uncertainties:

$$\{\, f_i^{(k)}(x, Q_0^2) \,\}_{k=1}^{N_{\rm rep}} .$$
The ensemble mean and variance give, respectively, the PDF central value and one-sigma uncertainty:

$$\langle f_i(x, Q^2) \rangle = \frac{1}{N_{\rm rep}} \sum_{k=1}^{N_{\rm rep}} f_i^{(k)}(x, Q^2), \qquad
\sigma_{f_i}^2(x, Q^2) = \frac{1}{N_{\rm rep}-1} \sum_{k=1}^{N_{\rm rep}} \left[ f_i^{(k)}(x, Q^2) - \langle f_i(x, Q^2) \rangle \right]^2 .$$
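As a concrete illustration, these estimators reduce to a simple mean and standard deviation over the replica index. The sketch below uses a randomly generated stand-in for the replica ensemble; array names and shapes are illustrative, not the NNPDF code.

```python
import numpy as np

# Stand-in replica ensemble: N_rep replicas of one PDF flavor tabulated on an x-grid.
# In a real analysis these values would come from the fitted replica set.
rng = np.random.default_rng(0)
n_rep, n_x = 100, 50
replicas = rng.normal(loc=0.5, scale=0.05, size=(n_rep, n_x))

# Central value: mean over replicas at each x point.
central = replicas.mean(axis=0)

# One-sigma PDF uncertainty: (unbiased) standard deviation over replicas.
sigma = replicas.std(axis=0, ddof=1)
```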
PDF evolution with $Q^2$ is performed by solving the DGLAP equations, up to NNLO or, in recent extensions, approximate N³LO (aN³LO) using constraint-based parametrizations of the splitting functions. The aN³LO evolution includes both missing-higher-order uncertainties (MHOU) from renormalization/factorization scale variation and incomplete-higher-order uncertainties (IHOU) from the ansatz spread, combined in the theory covariance matrix for full error propagation (Hekhorn et al., 2023).
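Schematically, this error propagation amounts to adding the theory contributions to the covariance matrix entering the fit; the expression below is a sketch of the structure rather than the exact prescription of the reference:

$$\mathrm{cov}_{ij} = \mathrm{cov}^{\rm exp}_{ij} + \mathrm{cov}^{\rm MHOU}_{ij} + \mathrm{cov}^{\rm IHOU}_{ij} .$$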
In the integrated framework for nuclear and proton PDFs, the atomic mass number $A$ is included as an explicit network input, allowing a single neural network to learn the $x$- and $A$-dependence of the distributions up to lead ($A = 208$) (Rabemananjara, 2023).
2. Neural Network PDF Parametrization
Each PDF is represented at the input scale $Q_0$ as

$$x f_i(x, Q_0^2) = A_i\, x^{1-\alpha_i} (1-x)^{\beta_i}\, \mathrm{NN}_i(x),$$
where $A_i$ is a normalization and $\alpha_i$, $\beta_i$ are preprocessing exponents (possibly dependent on the mass number $A$ in the nuclear case), and $\mathrm{NN}_i(x)$ is a feed-forward neural network, sketched schematically after the list below:
- Input: $x$ and $\ln x$ (and, optionally, the nuclear mass number $A$),
- Hidden layers: typically two (25 and 20 nodes in NNPDF4.0), with $\tanh$ activation,
- Output: linear activation,
- The total number of weights and biases is modest, typically $\mathcal{O}(100)$ per PDF.
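A minimal sketch of this parametrization, with illustrative layer sizes, preprocessing exponents, and names (plain NumPy rather than the TensorFlow-based code actually used by NNPDF):

```python
import numpy as np

def nn_forward(x, params):
    """Tiny feed-forward network: (x, ln x) inputs, tanh hidden layers, linear output."""
    h = np.stack([x, np.log(x)], axis=-1)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)            # hidden layers
    W_out, b_out = params[-1]
    return h @ W_out + b_out              # linear output layer

def pdf_parametrization(x, params, norm=1.0, alpha=1.1, beta=3.0):
    """x f(x, Q0) = A * x^(1-alpha) * (1-x)^beta * NN(x); exponents are illustrative."""
    return norm * x ** (1.0 - alpha) * (1.0 - x) ** beta * nn_forward(x, params).squeeze(-1)

# Illustrative 2-25-20-1 architecture (a single output flavor for simplicity).
rng = np.random.default_rng(1)
shapes = [(2, 25), (25, 20), (20, 1)]
params = [(rng.normal(size=s), np.zeros(s[1])) for s in shapes]

x = np.logspace(-4, -0.01, 10)
print(pdf_parametrization(x, params))
```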
Preprocessing exponents are chosen randomly within broad intervals and can be further optimized as hyperparameters. Sum rules (momentum conservation, valence) and PDF positivity are enforced via Lagrange-multiplier penalty terms appended to the loss function:

$$\mathcal{L} = \chi^2 + \sum_k \lambda_k\, P_k[f],$$

where the $P_k[f]$ penalize violations of positivity and of the sum rules with strengths $\lambda_k$.
The penalty strengths themselves are subject to hyperoptimization (Cruz-Martinez et al., 21 Oct 2024).
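A hedged sketch of how such penalty terms could be appended to the figure of merit; the hinge-like positivity term and the simple quadrature for the momentum sum rule are illustrative choices, not the NNPDF implementation:

```python
import numpy as np

def positivity_penalty(observable_values, lambda_pos=100.0):
    """Hinge-like penalty on negative values of the positivity observables (illustrative)."""
    return lambda_pos * np.sum(np.maximum(0.0, -observable_values))

def momentum_sum_rule_penalty(x_grid, xf_total, lambda_msr=100.0):
    """Quadratic penalty on deviations of the momentum integral int dx x f(x) from one."""
    integral = np.sum(0.5 * (xf_total[1:] + xf_total[:-1]) * np.diff(x_grid))  # trapezoid rule
    return lambda_msr * (integral - 1.0) ** 2

def total_loss(chi2, x_grid, xf_total, positivity_values):
    """chi^2 plus Lagrange-multiplier penalty terms, mirroring the expression above."""
    return (chi2
            + positivity_penalty(positivity_values)
            + momentum_sum_rule_penalty(x_grid, xf_total))
```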
3. Statistical and Computational Methodology
Monte Carlo Replica Procedure and Loss Function
NNPDF generates artificial data replicas $D^{(k)}$, $k = 1, \dots, N_{\rm rep}$, each constructed by fluctuating the experimental measurements according to the full covariance matrix (statistical + systematic + normalization errors). For each replica, a separate network is trained by minimizing a regularized loss,

$$\chi^{2\,(k)} = \frac{1}{N_{\rm dat}} \sum_{i,j} \left( D_i^{(k)} - T_i[f^{(k)}] \right) (\mathrm{cov}^{-1})_{ij} \left( D_j^{(k)} - T_j[f^{(k)}] \right),$$

where $T_i[f^{(k)}]$ is the theoretical prediction for the relevant observables, obtained by convoluting the neural-net-parametrized PDFs with precomputed FastKernel (FK) tables for rapid evaluation of DGLAP evolution and hard cross-sections (Ball et al., 2021).
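Schematically, the replica generation and per-replica loss can be sketched as follows; the dense matrix standing in for the FK tables and all array names are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative inputs: central data, covariance matrix, and a linear "FK table"
# mapping a PDF sampled on an x-grid onto the data points.
n_dat, n_x = 30, 50
data = rng.uniform(1.0, 2.0, size=n_dat)
cov = np.diag(rng.uniform(0.01, 0.05, size=n_dat) ** 2)
fk_table = rng.uniform(0.0, 1.0, size=(n_dat, n_x))

# Monte Carlo replica: fluctuate the data according to the full covariance matrix.
data_replica = rng.multivariate_normal(data, cov)

def chi2(pdf_on_grid, data_k):
    """Per-replica chi^2: theory = FK x PDF, compared with the fluctuated data."""
    residual = data_k - fk_table @ pdf_on_grid
    return residual @ np.linalg.solve(cov, residual) / len(data_k)

print(chi2(rng.uniform(0.0, 1.0, size=n_x), data_replica))
```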
After minimization, postfit filtering discards outlier replicas (e.g., the worst 10% by $\chi^2$), stabilizing ensemble statistics.
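For example, discarding the worst decile by $\chi^2$ could be sketched as below; names are placeholders rather than the NNPDF postfit code:

```python
import numpy as np

def postfit_filter(replicas, chi2_values, keep_fraction=0.9):
    """Keep only the replicas whose chi^2 lies below the given quantile."""
    chi2_values = np.asarray(chi2_values)
    cutoff = np.quantile(chi2_values, keep_fraction)
    return replicas[chi2_values <= cutoff]
```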
Cross-Validation, Early Stopping, and Ensemble Metrics
Crucial to avoiding over- and underfitting is cross-validation. Each replica's data are split into training and validation sets (e.g., via $k$-fold partitioning with $k = 4$), and training is halted when the validation loss ceases to improve. Model selection employs ensemble-based statistical estimators:
- First-moment metric (generalization $\chi^2$): averages both fit quality and PDF spread.
- Second-moment metric ($\varphi^2$): measures PDF uncertainty in data space, particularly for extrapolation.
- $\varphi^2$ is defined as the difference between the mean replica $\chi^2$ and the $\chi^2$ of the mean prediction, and can be expressed through the prediction covariance projected onto data space.
Optimal hyperparameters minimize the generalization $\chi^2$ and maximize $\varphi^2$, with several equally good hyperparameter sets selected for replica ensemble training (Cruz-Martinez et al., 21 Oct 2024).
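A minimal sketch of this second-moment estimator, computed directly from per-replica predictions in data space (the arrays are random stand-ins; the real estimator uses the experimental covariance of the fitted datasets):

```python
import numpy as np

def phi_squared(theory_replicas, data, cov):
    """phi^2 = <chi^2[T^(k)]>_rep - chi^2[<T>_rep], with T^(k) the per-replica
    predictions compared against the central experimental data."""
    def chi2(theory):
        r = data - theory
        return r @ np.linalg.solve(cov, r) / len(data)

    mean_of_chi2 = np.mean([chi2(t) for t in theory_replicas])
    chi2_of_mean = chi2(theory_replicas.mean(axis=0))
    return mean_of_chi2 - chi2_of_mean

# Illustrative usage with random stand-ins for data and predictions.
rng = np.random.default_rng(5)
data = rng.uniform(1.0, 2.0, size=20)
cov = np.diag((0.05 * data) ** 2)
theory_replicas = data + rng.normal(0.0, 0.05 * data, size=(100, 20))
print(phi_squared(theory_replicas, data, cov))
```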
Hyperparameter Optimisation
The NNPDF hyperparameter set includes network sizes, optimizer type, clipnorm, learning rate, penalty strengths, and early-stopping criteria. Bayesian optimization is performed via the Hyperopt package, guided by the aforementioned ensemble metrics, and is fully parallelized across hardware accelerators. Hyperparameter scans are distributed asynchronously over multiple GPUs (e.g., via MongoDB and Hyperopt), achieving near-linear walltime reductions as worker throughput is increased.
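For orientation, a minimal Hyperopt scan with the Tree-structured Parzen Estimator looks as follows; the search space and the dummy objective are placeholders rather than the NNPDF configuration:

```python
from hyperopt import fmin, tpe, hp, Trials

# Placeholder search space loosely mirroring the hyperparameters listed above.
space = {
    "nodes_per_layer": hp.choice("nodes_per_layer", [[25, 20], [35, 25], [50, 35]]),
    "learning_rate": hp.loguniform("learning_rate", -9, -4),
    "clipnorm": hp.loguniform("clipnorm", -7, -2),
    "positivity_multiplier": hp.loguniform("positivity_multiplier", 0, 7),
}

def objective(hyperparams):
    """In the real workflow this would train a fold of replicas and return the
    ensemble-based figure of merit; here a dummy scalar loss is returned."""
    return hyperparams["learning_rate"]

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```

In the distributed setup described above, the `Trials` object would be replaced by Hyperopt's `MongoTrials`, so that multiple GPU workers can draw trials asynchronously from a shared MongoDB instance.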
4. GPU Implementation and Scaling
The major computational breakthrough is the GPU-optimized, stacked-replica workflow:
- All neural networks are embedded in a single large model, sharing a single pass through the FK-table convolution and evolution; distinctions between training and validation sets are enforced via per-replica masks in the computation graph (see the sketch after this list).
- FastKernel tables are restructured via PineAPPL to unify the $x$-grid points across all datasets, substantially reducing memory requirements by sharing a single set of interpolation nodes.
- Tensor contractions are reordered for memory efficiency, with replica indices kept contiguous and the contraction order optimized.
- The net result is a large increase in training throughput: a single H100 GPU processes many more replicas per hour than a 16-core CPU node.
- GPU memory usage saturates at a fixed footprint for large replica counts, a substantial reduction over previous implementations, with accompanying drops in energy consumption and compute-hour cost as the number of replicas is increased.
- This architecture is generalizable to any ensemble-based deep learning task requiring parallelized hyperparameter and model training (Cruz-Martinez et al., 21 Oct 2024).
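The stacked-replica idea can be sketched with a single tensor contraction over all replicas at once; NumPy's `einsum` stands in for the actual TensorFlow computation graph, and all shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n_rep, n_x, n_dat = 100, 50, 30

# Stacked replicas: one tensor holds every replica's PDF on the shared x-grid.
pdfs = rng.uniform(0.0, 1.0, size=(n_rep, n_x))

# A single FK table shared by all replicas (unified x-grid, cf. the PineAPPL step).
fk_table = rng.uniform(0.0, 1.0, size=(n_dat, n_x))

# One convolution pass produces predictions for every replica at once; keeping the
# replica index leftmost keeps it contiguous in memory.
theory = np.einsum("rx,dx->rd", pdfs, fk_table)

# Per-replica boolean masks implement each replica's own training/validation split.
tr_masks = rng.random(size=(n_rep, n_dat)) < 0.75
training_theory = np.where(tr_masks, theory, 0.0)
```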
5. Uncertainty Quantification and Validation
Uncertainties are interpreted strictly in terms of replica ensemble variance, with no implicit linear-Gaussian assumptions:
- For any physical observable or PDF combination $\mathcal{O}[f]$, the per-replica predictions $\mathcal{O}[f^{(k)}]$ define the empirical mean and standard deviation.
- Correlated variations (e.g., parameter scans, theory systematics) are built by correlated seeding and shared data partitions.
- Filtering of non-convergent or outlier fits (e.g., discarding the highest decile) prior to ensemble averaging further stabilizes uncertainty bands without introducing bias.
- Closure tests, where pseudo-data generated from a known PDF set are re-fitted, confirm that coverage and bias are under precise control (a toy illustration follows this list).
- For integrated proton/nuclear fits, the ensemble captures interpolation and extrapolation errors across the $x$ and $A$ domains; oscillatory features or larger errors in some regions reflect data density and rank deficiency, not methodological artifacts.
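A toy illustration of the closure-test coverage check; no actual fit is performed, and the "fitted" replicas are simulated directly around the known truth:

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep, n_dat = 100, 40

# Known "truth" predictions and the uncertainties used to generate pseudo-data.
truth = rng.uniform(1.0, 2.0, size=n_dat)
sigma = 0.05 * truth

# Stand-in for fitted replica predictions: in a real closure test these would come
# from re-fitting pseudo-data generated from the truth.
replica_preds = truth + rng.normal(0.0, sigma, size=(n_rep, n_dat))

central = replica_preds.mean(axis=0)
band = replica_preds.std(axis=0, ddof=1)

# Fraction of points where the truth lies inside the one-sigma band
# (close to 0.68 if the uncertainties are faithful).
coverage = np.mean(np.abs(central - truth) <= band)
print(f"one-sigma coverage: {coverage:.2f}")
```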
6. Physical Results, Generalisation, and Impact
- The NNPDF4.0 hyperparameter-optimized fits (nnpdf40_newhyperopt) deliver global fit quality comparable to the $1.16$ $\chi^2$ per data point of baseline NNPDF4.0, with essentially identical breakdowns for the various process subsets (DIS, Drell–Yan, jets, and other hadronic processes).
- Training and validation losses are unchanged; the average training length (in epochs) is noticeably reduced.
- The ensemble estimator $\varphi$ increases slightly, indicating a modest increase in PDF uncertainties. In the most constrained $x$ regions, light-quark uncertainties grow at the $5\%$ level or somewhat above, while gluon and charm remain unchanged.
- LHC quark–antiquark and quark–gluon luminosity uncertainties grow somewhat at low masses, but central values are statistically unchanged.
- Computational throughput gains and energy/cost savings are substantial for typical global-fit-scale production runs.
- The fully ensemble-based strategy and GPU implementation are transparent and reproducible, supported by open-source code, detailed documentation, and robust validation pipelines (Ball et al., 2021).
7. Extensions and Future Directions
Recent and ongoing NNPDF research programs include:
- Support for approximate N³LO evolution via constraint-based four-loop splitting kernels and uncertainty band prescriptions (MHOU/IHOU) (Hekhorn et al., 2023).
- Simultaneous, integrated proton-deuteron-nucleus PDF fits, placing the $A$-dependence on the same footing as the $x$-dependence and extending the neural network to composite $(x, A)$ inputs (Rabemananjara, 2023); global minimizations in this space are ongoing.
- General applicability of the hyperparameter-tuned, GPU-parallel framework to other deep-learning tasks involving large ensembles and bias/variance trade-offs.
- Systematic inclusion of resummation (small-$x$, threshold), mixed QCD-QED effects, theory covariance from missing higher orders, and direct fits to lattice and other nonperturbative data.
The NNPDF framework constitutes a scalable, flexible, and transparently validated solution for large-scale, uncertainty-aware inference in QCD and particle physics, with modularity and generalisability for other ensemble-based statistical problems in computational science and deep learning.