NNPDF Framework: Neural PDF Determination

Updated 14 November 2025
  • NNPDF Framework is a machine-learning system that determines proton, nuclear, and photon parton distribution functions using neural network ensembles and Monte Carlo replicas.
  • It employs ensemble training with GPU acceleration and Bayesian hyperparameter optimization to enhance accuracy and efficiency in PDF extraction.
  • Recent advancements include integrated nuclear fits, approximate N³LO evolution, and significant computational cost and energy reductions for large-scale QCD analyses.

The NNPDF (Neural Network Parton Distribution Function) framework is a comprehensive machine-learning–based system for the nonperturbative determination of proton, nuclear, and photon parton distribution functions (PDFs) from high-energy scattering data, with minimal theoretical bias and robust, data-driven uncertainty quantification. Originating in collider QCD phenomenology, NNPDF employs an ensemble-of-neural-network approach for PDF parametrization, a Monte Carlo method to propagate all sources of experimental and theoretical uncertainty, and a statistically rigorous workflow for cross-validation and hyperparameter optimization. Recent methodological advances include GPU-accelerated ensemble training, Bayesian hyperparameter optimization, integrated treatment of proton and nuclear PDFs, and support for approximate N³LO DGLAP evolution.

1. Theoretical Foundations

At the core of NNPDF is the view of PDF determination as an inference problem in function space: the object to be inferred is not a finite-dimensional parameter vector but a function $f_\ell(x,Q^2)$ for each parton flavor $\ell$, depending on the momentum fraction $x$ and the energy scale $Q^2$. The extraction is accomplished via an ensemble of $N_{\text{rep}}$ Monte Carlo replicas, each representing a possible realization of the PDFs consistent with experimental data and their uncertainties:

$$f_\ell^{(k)}(x,Q^2), \quad k = 1, \dots, N_{\text{rep}}$$

The ensemble mean and variance give, respectively, the PDF central value and one-sigma uncertainty:

$$\langle f_\ell(x,Q^2) \rangle = \frac{1}{N_{\text{rep}}} \sum_k f_\ell^{(k)}(x,Q^2)$$

$$\sigma_\ell(x,Q^2) = \sqrt{\frac{1}{N_{\text{rep}}-1} \sum_k \left( f_\ell^{(k)}(x,Q^2) - \langle f_\ell(x,Q^2) \rangle \right)^2}$$
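In practice, the two formulas above reduce to simple ensemble statistics over the replica array. Below is a minimal NumPy sketch, assuming the replica PDFs have already been evaluated on an $x$-grid at fixed flavor and $Q^2$ (the random placeholder ensemble is purely illustrative):

```python
import numpy as np

# Placeholder ensemble standing in for f_l^(k)(x, Q^2): one row per replica,
# one column per x-grid point.
rng = np.random.default_rng(0)
replicas = rng.normal(loc=0.5, scale=0.1, size=(100, 50))

central = replicas.mean(axis=0)             # ensemble mean -> central PDF value
uncertainty = replicas.std(axis=0, ddof=1)  # unbiased std -> one-sigma uncertainty
```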

PDF evolution with $Q^2$ is performed by solving the DGLAP equations, up to NNLO or, in recent extensions, approximate N³LO (aN³LO) using constraint-based parametrizations of the $\mathcal{O}(\alpha_s^4)$ splitting functions. The aN³LO evolution includes both missing-higher-order uncertainties (MHOU) from renormalization/factorization scale variation and incomplete-higher-order uncertainties (IHOU) from ansatz spread, combined in the theory covariance matrix for full error propagation (Hekhorn et al., 2023).
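As a concrete illustration of how such shift-based uncertainties enter the fit, the sketch below builds a theory covariance matrix from a set of shifted predictions and adds it to the experimental one; this is a schematic outline under stated assumptions (the normalization convention and the `shifts` array are illustrative), not the NNPDF implementation:

```python
import numpy as np

def theory_covmat(shifts: np.ndarray) -> np.ndarray:
    """Schematic S_ij = (1/N) sum_k Delta_i^(k) Delta_j^(k), where shifts[k, i]
    is the difference between the i-th prediction under the k-th variation
    (scale choice or splitting-function ansatz) and the central prediction."""
    n_var = shifts.shape[0]
    return shifts.T @ shifts / n_var

# Illustrative usage: the theory covariance is added to the experimental
# covariance before the chi^2 is built.
n_var, n_dat = 7, 200
shifts_mhou = np.random.default_rng(1).normal(scale=0.01, size=(n_var, n_dat))
cov_exp = np.eye(n_dat)
cov_total = cov_exp + theory_covmat(shifts_mhou)
```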

In the integrated framework for nuclear and proton PDFs, the atomic mass number $A$ is included as an explicit network input, allowing a single neural network to learn the $x$- and $A$-dependence of the distributions up to lead ($A = 208$) (Rabemananjara, 2023).

2. Neural Network PDF Parametrization

Each PDF is represented at the input scale Q0Q_0 as

$$f_\ell(x, Q_0) = A_\ell\, x^{\alpha_\ell} (1-x)^{\beta_\ell}\, N_\ell(x; \theta)$$

where $\{A_\ell,\, \alpha_\ell,\, \beta_\ell\}$ are normalization and preprocessing exponents (possibly $A$-dependent in the nuclear case), and $N_\ell(x; \theta)$ is a feed-forward neural network (a minimal sketch follows the list below):

  • Input: $x$ (and optionally $\ln x$, $A$),
  • Hidden layers: typically $n_1 = 25$, $n_2 = 20$ per flavor in NNPDF4.0, with $\tanh$ activation,
  • Output: linear activation,
  • The total number of weights and biases per PDF is typically $\sim 200$.
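For concreteness, a minimal Keras-style sketch of a single-flavor parametrization is given below; the layer sizes (25, 20) and the tanh/linear activations follow the numbers quoted above, while the input features, toy preprocessing exponents, and function names are illustrative assumptions rather than the NNPDF code:

```python
import numpy as np
import tensorflow as tf

def make_flavour_net():
    """Feed-forward net N_l(x; theta): two tanh hidden layers, linear output."""
    x_in = tf.keras.Input(shape=(2,))  # features: (x, ln x); A could be appended for nuclear fits
    h = tf.keras.layers.Dense(25, activation="tanh")(x_in)
    h = tf.keras.layers.Dense(20, activation="tanh")(h)
    out = tf.keras.layers.Dense(1, activation="linear")(h)
    return tf.keras.Model(x_in, out)

def pdf_at_q0(net, x, A=1.0, alpha=-0.5, beta=3.0):
    """f(x, Q0) = A * x^alpha * (1 - x)^beta * N(x; theta), with toy exponents."""
    feats = np.stack([x, np.log(x)], axis=-1).astype("float32")
    return A * x**alpha * (1.0 - x) ** beta * net(feats).numpy().ravel()

net = make_flavour_net()
x_grid = np.geomspace(1e-4, 0.9, 50)
f_q0 = pdf_at_q0(net, x_grid)
```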

Preprocessing exponents $\alpha_\ell, \beta_\ell$ are chosen randomly within broad intervals and can be further optimized as hyperparameters. Sum rules (momentum conservation, valence) and PDF positivity are enforced via Lagrange-multiplier penalty terms appended to the loss function:

$$E^{(k)}(\theta) = \frac{1}{n_{\text{dat}}} \sum_{i,j=1}^{n_{\text{dat}}} \left( D_i^{(k)} - T_i^{(k)}(\theta) \right) [\operatorname{Cov}^{-1}]_{ij} \left( D_j^{(k)} - T_j^{(k)}(\theta) \right) + \sum_i \lambda_i \Phi_i(\theta)$$

The penalty strengths $\lambda_i$ themselves are subject to hyperoptimization (Cruz-Martinez et al., 21 Oct 2024).
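A compact sketch of this loss, assuming a generic `theory_fn` standing in for the FK-table convolution and a list of hypothetical penalty callables `penalties` for the sum-rule and positivity constraints, might look as follows:

```python
import numpy as np

def replica_loss(theta, data_k, cov_inv, theory_fn, penalties, lambdas):
    """Covariance-weighted chi^2 per data point plus Lagrange-multiplier penalties."""
    residual = data_k - theory_fn(theta)
    chi2 = residual @ cov_inv @ residual / len(data_k)
    return chi2 + sum(lam * phi(theta) for lam, phi in zip(lambdas, penalties))
```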

3. Statistical and Computational Methodology

Monte Carlo Replica Procedure and Loss Function

NNPDF generates $N_{\text{rep}}$ artificial data replicas $D^{(k)}$, each constructed by fluctuating the experimental measurements according to the full covariance matrix (statistical + systematic + normalization errors). For each replica, a separate network is trained by minimizing a regularized loss,

$$\chi^2(\theta^{(k)}) = [D^{(k)} - T(\theta^{(k)})]^{\top} \operatorname{Cov}^{-1} [D^{(k)} - T(\theta^{(k)})] + \sum_\ell \lambda_\ell\, R_\ell(\theta^{(k)}),$$

where $T(\theta)$ is the theoretical prediction for the relevant observables, obtained by convoluting the neural-net parametrized PDFs with precomputed FastKernel (FK) tables for rapid evaluation of DGLAP evolution and hard cross-sections (Ball et al., 2021).

After minimization, postfit filtering discards outlier replicas (e.g., the worst 10% by $\chi^2$), stabilizing ensemble statistics.
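A schematic version of the replica generation and postfit filtering steps, assuming Gaussian fluctuations from the full covariance matrix (the dedicated multiplicative treatment of normalization uncertainties used in NNPDF is not reproduced here), is:

```python
import numpy as np

def generate_replicas(central_values, covariance, n_rep, seed=0):
    """Fluctuate the measured central values according to the full covariance matrix."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(central_values, covariance, size=n_rep)

def postfit_filter(replica_pdfs, chi2_values, keep_fraction=0.9):
    """Discard the worst (1 - keep_fraction) of replicas by chi^2 before averaging."""
    cutoff = np.quantile(chi2_values, keep_fraction)
    return replica_pdfs[chi2_values <= cutoff]
```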

Cross-Validation, Early Stopping, and Ensemble Metrics

Cross-validation is crucial to avoiding both over- and underfitting. Each replica's data are split into training and validation sets (e.g., via $k$-fold cross-validation with $K \sim 4$), and training is halted when the validation loss ceases to improve. Model selection employs ensemble-based statistical estimators:

  • First-moment metric (generalization $\chi^2_{\text{pdf}}$): averages both fit quality and PDF spread.
  • Second-moment metric ($\varphi^2$): measures PDF uncertainty in data space, particularly for extrapolation.
  • $\varphi^2_{\chi^2}$ is defined as the difference between the mean replica $\chi^2$ and the $\chi^2$ of the mean prediction, and can be expressed through the prediction covariance projected onto data space.

Optimal hyperparameters minimize $L^{(\chi^2_{\text{pdf}})}$ and maximize $L^{(\varphi^2)}$, with $n_{\text{hopt}} \approx 10$ equally good hyperparameter sets selected for replica ensemble training (Cruz-Martinez et al., 21 Oct 2024).
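Concretely, the $\varphi^2_{\chi^2}$ estimator from the list above can be computed directly from the per-replica predictions; the sketch below uses a per-data-point normalization of the $\chi^2$, which is an illustrative convention:

```python
import numpy as np

def phi2(preds, data, cov_inv):
    """Mean replica chi^2 minus the chi^2 of the mean prediction.
    preds[k, i] is the k-th replica prediction for data point i."""
    def chi2(theory):
        r = data - theory
        return r @ cov_inv @ r / len(data)

    mean_chi2 = np.mean([chi2(p) for p in preds])
    return mean_chi2 - chi2(preds.mean(axis=0))
```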

Hyperparameter Optimisation

The NNPDF hyperparameter set $\eta$ includes network sizes, optimizer type, clipnorm, learning rate, penalty strengths, and early-stopping criteria. Bayesian optimization is performed via the Hyperopt package, guided by the aforementioned ensemble metrics, and is fully parallelized across hardware accelerators. Hyperparameter scans are distributed asynchronously over multiple GPUs (e.g., via MongoDB and Hyperopt), achieving near-linear walltime reductions with increased throughput.
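The sketch below shows what such a Hyperopt-driven scan can look like; the search-space ranges and the quadratic placeholder objective are illustrative assumptions, not the NNPDF configuration (in a real scan the objective would run a reduced $k$-fold fit and return the ensemble figure of merit discussed above):

```python
from hyperopt import fmin, hp, tpe

# Illustrative search space over a few of the hyperparameters listed above.
space = {
    "n_units_1": hp.quniform("n_units_1", 10, 40, 1),
    "n_units_2": hp.quniform("n_units_2", 10, 40, 1),
    "learning_rate": hp.loguniform("learning_rate", -9, -4),
    "clipnorm": hp.loguniform("clipnorm", -8, 0),
}

def objective(params):
    # Placeholder figure of merit; a real scan would train a reduced k-fold fit
    # with these hyperparameters and return the ensemble-based loss.
    return (params["n_units_1"] - 25) ** 2 + (params["n_units_2"] - 20) ** 2

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
```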

4. GPU Implementation and Scaling

The major computational breakthrough is the GPU-optimized, stacked-replica workflow:

  • All $N_{\text{rep}}$ neural networks are embedded in a single large model, sharing one pass through the FK-table convolution and $\alpha_s$ evolution; distinctions between training and validation sets are enforced via per-replica masks in the computation graph (see the sketch after this list).
  • FastKernel tables are restructured via PineAPPL to unify $x$-grid points across all datasets, reducing memory requirements by a factor $\sim 80$ (e.g., $n_x \sim 50$ rather than $\sim 4000$ interpolation nodes).
  • Tensor contractions are reordered for memory efficiency, with replica indices kept contiguous and the contraction order optimized.
  • The net result is a performance increase of $\mathcal{O}(100)$: $\sim 0.5$ replicas/hour (16-CPU) versus $\sim 60$ replicas/hour (single H100 GPU).
  • GPU memory usage saturates at $\sim 9$ GB for $N_{\text{rep}} = 100$, a factor $\sim 2$ reduction over previous implementations, with energy-consumption reductions of $78\%$–$91\%$ and compute-hour cost reductions of $45\%$–$55\%$ as $N_{\text{rep}}$ is increased.
  • This architecture is generalizable to any ensemble-based deep learning task requiring parallelized hyperparameter and model training (Cruz-Martinez et al., 21 Oct 2024).
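A minimal NumPy analogue of the stacked-replica FK convolution, with illustrative index conventions (d = data point, f = flavor, x = grid node, r = replica), is shown below; on a GPU the same contraction runs as one large batched operation instead of $N_{\text{rep}}$ separate ones:

```python
import numpy as np

n_dat, n_fl, n_x, n_rep = 200, 9, 50, 100
rng = np.random.default_rng(0)
fk_table = rng.random((n_dat, n_fl, n_x))   # precomputed theory weights
pdf_grid = rng.random((n_rep, n_fl, n_x))   # network output for every replica

# Predictions for all replicas in a single tensor contraction.
predictions = np.einsum("dfx,rfx->rd", fk_table, pdf_grid)

# Per-replica training/validation splits are imposed with boolean masks over
# the data axis rather than by building N_rep separate models.
tr_mask = rng.random((n_rep, n_dat)) < 0.75
```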

5. Uncertainty Quantification and Validation

Uncertainties are interpreted strictly in terms of replica ensemble variance, with no implicit linear-Gaussian assumptions:

  • For any physical observable or PDF combination $\mathcal{O}$, the $N_{\text{rep}}$ predictions define the empirical mean and standard deviation.
  • Correlated variations (e.g., $\alpha_s$ scans, theory systematics) are built by correlated seeding and shared data partitions.
  • Filtering of non-convergent or outlier fits (e.g., discarding the highest-$\chi^2$ decile) prior to ensemble averaging further stabilizes uncertainty bands without introducing bias.
  • Closure tests, in which pseudo-data generated from a known PDF set are refitted, confirm that coverage and bias are under precise control (a toy coverage check is sketched after this list).
  • For integrated proton/nuclear fits, the ensemble captures interpolation and extrapolation errors across the $A$ and $x$ domains; oscillatory features or larger errors in some regions reflect data density and rank deficiency, not methodological artifacts.
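The toy coverage check referenced above, for a closure-test ensemble fitted to pseudo-data drawn from a known truth, could be written as follows (the array shapes are illustrative assumptions):

```python
import numpy as np

def one_sigma_coverage(replicas, truth):
    """Fraction of grid points where the known truth lies inside the one-sigma band."""
    central = replicas.mean(axis=0)
    sigma = replicas.std(axis=0, ddof=1)
    return np.mean(np.abs(central - truth) <= sigma)
```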

6. Physical Results, Generalisation, and Impact

  • The NNPDF4.0 hyperparameter-optimized fits (nnpdf40_newhyperopt) deliver global fit quality with $\chi^2 \approx 1.15$ (to be compared with $1.16$ for baseline NNPDF4.0) and essentially identical breakdowns for the various process subsets (DIS, Drell–Yan, jets, $W/Z$, $t\bar{t}$).
  • Training and validation losses are unchanged; the average training length (in epochs) is reduced by $\sim 15\%$.
  • The ensemble $\varphi^2_{\chi^2}$ estimator increases by $\sim 10\%$, indicating a modest increase in PDF uncertainties. In the most constrained regions ($x \sim 10^{-3}$–$10^{-1}$), light-quark uncertainties grow by $5$–$10\%$, while gluon and charm remain unchanged.
  • LHC quark–antiquark and quark–gluon luminosity uncertainties grow by $\sim 10\%$ at low masses, but central values are statistically unchanged.
  • Computational throughput gains and energy/cost savings are of order $> 50\%$ for typical global-fit-scale production runs.
  • The fully ensemble-based strategy and GPU implementation are transparent and reproducible, supported by open-source code, detailed documentation, and robust validation pipelines (Ball et al., 2021).

7. Extensions and Future Directions

Recent and ongoing NNPDF research programs include:

  • Support for approximate N³LO evolution via constraint-based four-loop splitting kernels and uncertainty band prescriptions (MHOU/IHOU) (Hekhorn et al., 2023).
  • Simultaneous, integrated proton–deuteron–nucleus PDF fits, treating the $A$-dependence on the same footing as $x$ and extending the neural network to the composite input $(x, A)$ (Rabemananjara, 2023); global minimizations in this space are ongoing.
  • General applicability of the hyperparameter-tuned, GPU-parallel framework to other deep-learning tasks involving large ensembles and bias/variance trade-offs.
  • Systematic inclusion of resummation (small-$x$, threshold), mixed QCD–QED effects, theory covariance from missing higher orders, and direct fits to lattice and other nonperturbative data.

The NNPDF framework constitutes a scalable, flexible, and transparently validated solution for large-scale, uncertainty-aware inference in QCD and particle physics, with modularity and generalisability for other ensemble-based statistical problems in computational science and deep learning.
