LibTorch-based Coupling for Weather Prediction
- The method integrates TorchScript models directly into legacy Fortran codes using a C++ shared library and ISO_C_BINDING, bypassing Python overhead.
- It optimizes data exchange and memory management by reshaping Fortran arrays and reusing buffers, achieving up to 8× speedup in operational settings.
- Stability safeguards such as output renormalization, clipping, and a flux–heating-rate consistency check maintain physical realism, ensuring reliable forecasts during extended integrations.
A LibTorch-based coupling method enables direct and efficient integration of TorchScript-serialized deep learning models into high-performance scientific software ecosystems, particularly those implemented in Fortran. In the context of operational numerical weather prediction, such methods facilitate the replacement of computational bottlenecks—such as physical radiation schemes—with neural network emulators without spawning extraneous Python processes. This approach is exemplified in the embedding of a deep-learning-based radiation parameterization within the China Meteorological Administration’s Global Forecast System (CMA-GFS), yielding significant computational acceleration while maintaining accuracy and long-term stability (Jing et al., 20 Jan 2026).
1. Architectural Overview and Software Integration
The LibTorch-based coupling method for CMA-GFS addresses the intrinsic challenges of integrating modern ML components—trained and archived as TorchScript modules—into predominantly Fortran-based legacy codes. The workflow begins with off-line ML model training, archiving the inference graph using TorchScript. The serialized model is then compiled into a C++ shared library utilizing LibTorch, exposing C-ABI interface functions for initialization and inference. These functions are invoked from the Fortran host physics via ISO_C_BINDING, completely bypassing Python and minimizing runtime dependencies.
The directory structure is as follows:
| Directory | Purpose | Key Files |
|---|---|---|
| include/ | C++–Fortran glue-interface headers | torch_adapter.hpp |
| src/ | C++ routines for ML inference | torch_adapter.cpp, ml_inference.cpp |
| fortran/ | Fortran stubs, physics wrappers | rrtmg_ml.f90 |
| lib/TorchScript/ | Serialized TorchScript archive | rrtmg_ml.pt |
| build/ | Out-of-source CMake build | (build artifacts) |
CMake configuration includes linking against Torch libraries and integrating interface headers. The Fortran build further links against the ML library and includes necessary module interfaces for robust and portable coupling (Jing et al., 20 Jan 2026).
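A minimal CMake sketch of this linkage might look as follows; target names and paths are illustrative and follow the directory table above, while `find_package(Torch)` and `${TORCH_LIBRARIES}` are the standard hooks LibTorch ships in its `TorchConfig.cmake`:

```cmake
cmake_minimum_required(VERSION 3.18)
project(rrtmg_ml_coupling LANGUAGES CXX Fortran)

# LibTorch provides TorchConfig.cmake; point CMAKE_PREFIX_PATH at its install.
find_package(Torch REQUIRED)

# C++ adapter: the only component that links against Torch directly.
add_library(torch_adapter SHARED src/torch_adapter.cpp src/ml_inference.cpp)
target_include_directories(torch_adapter PUBLIC include)
target_link_libraries(torch_adapter PRIVATE ${TORCH_LIBRARIES})
target_compile_features(torch_adapter PRIVATE cxx_std_17)

# Fortran side links only against the adapter, keeping Torch out of the
# host model's link line.
add_library(rrtmg_ml fortran/rrtmg_ml.f90)
target_link_libraries(rrtmg_ml PRIVATE torch_adapter)
```

Keeping Torch confined to the adapter target is what lets the Fortran build remain free of Python and of most LibTorch build flags.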
2. Data Exchange, Array Layout, and Memory Management
The numerical weather prediction model dispatches batches of vertical atmospheric columns per physics step to the ML emulator. Each column is characterized by inputs stacked into 22 (shortwave, SW) or 20 (longwave, LW) channels across 89 vertical levels:
- For SW: 22 × 89 = 1958 input floats per column (for LW: 20 × 89 = 1780).
- Outputs (fluxes/heating rates): a fixed-size float vector per column.
Given Fortran’s column-major array layout, raw data is reshaped into two-dimensional arrays. These are passed to C++ as contiguous float32 memory buffers without explicit copy or transpose, matched via torch::from_blob with the appropriate strides.
For inference, buffer reuse is enforced: host memory for inputs and outputs is allocated once and maintained throughout multi-day integrations. When GPU inference is selected at initialization, paired CUDA tensors are created, and host–device transfer proceeds via pinned memory to minimize copy latency, achieving zero per-timestep heap allocation (Jing et al., 20 Jan 2026).
3. C++/Fortran Interface and Inference Control
The bridge between Fortran and C++ is realized via pure C interface definitions in torch_adapter.hpp:
```cpp
extern "C" {
    void rrtmg_ml_init(const char* model_path, int use_gpu_flag);
    void rrtmg_ml_infer(int ncol, const float* features, float* outputs);
}
```
The associated Fortran module binds these procedures with type-safe signatures via ISO_C_BINDING. Initialization is performed at host-model startup, specifying the device (CPU/CUDA). For every physics step, three calls manage the workflow:
```fortran
call rrtmg_ml_pack_inputs(ncol, model_state, features)
call rrtmg_ml_infer(ncol, features, outputs)
call rrtmg_ml_unpack_outputs(ncol, outputs, rad_state)
```
If the ML-based radiation scheme is enabled, the traditional RRTMG Fortran call is bypassed.
Within the C++ backend, inference disables autograd with torch::NoGradGuard, and input data is wrapped, forwarded through module.forward, and output tensors are copied back to host memory by direct pointer transfer. GPU-mode inference leverages resource pre-initialization and multi-threading to maximize throughput (Jing et al., 20 Jan 2026).
4. Stability Safeguards and Physical Constraints
To guarantee consistency and prevent unphysical outputs during long-term integration, each output vector from the ML emulator is renormalized and clipped in physical space:

$$y_c = \operatorname{clip}\!\left(\sigma_c\,\hat{y}_c + \mu_c,\; y_c^{\min},\; y_c^{\max}\right),$$

where $\sigma_c\,\hat{y}_c + \mu_c$ denotes re-scaling of the raw network output $\hat{y}_c$ by the per-channel standard deviations $\sigma_c$ and means $\mu_c$ from training. Each channel $c$ is constrained to its interval $[y_c^{\min}, y_c^{\max}]$, maintaining positive flux directions and bounding maximum values to suppress outliers.
A secondary consistency check recomputes the heating rate from the predicted flux divergence in pressure coordinates:

$$\mathrm{HR}_{\mathrm{derived}} = \frac{g}{c_p}\,\frac{\partial F_{\mathrm{net}}}{\partial p}$$

If the discrepancy between this derived value and the network-predicted heating rate exceeds a 10% relative tolerance, the physically derived value overwrites the network output. This safeguard reduced the crash rate from approximately 18% to zero over ten-day integrations, demonstrating its critical importance for operational stability (Jing et al., 20 Jan 2026).
5. Performance Optimization and Computational Profiling
The coupling method is engineered for computational efficiency. Batch inference is performed on up to 512 columns per call, amortizing per-call dispatch overhead and improving device utilization. CPU threading (torch::set_num_threads) is aligned with the host’s OpenMP settings; in GPU mode, pinned host memory (allocated via LibTorch’s pinned-memory tensor option) decreases host–device transfer time by approximately 30%.
Profiling on a 12-core Intel Xeon node with a single V100 GPU (10-day, 12.5-km grid, 60 TB I/O) yields the following operation breakdown:
| Step | % of Time |
|---|---|
| Data packing (Fortran→C) | 8 |
| Host–Device (H2D/D2H) transfer | 12 |
| Inference (module.forward+eval) | 68 |
| Unpacking & clipping | 12 |
Empirically, the ML-based emulator attains a speedup of approximately 8× over standard Fortran RRTMG, with end-to-end coupling overhead kept modest. Device context is pre-warmed at launch, and all memory allocations occur before the forecast time loop (Jing et al., 20 Jan 2026).
6. Operational Robustness and Integration Outcomes
This LibTorch-based coupling strategy provides a fully encapsulated Fortran interface, abstracting C++ interoperation, array-stride adaptation, and device-management entirely from the host model. The resulting system functions as a “drop-in” replacement for the RRTMG radiation scheme, supporting extended (10-day) integrations with zero runtime crashes and no degradation in physical forecast realism. The hybrid Fortran–C++–TorchScript design has proven compatible with operational requirements for real-time reforecasting, supporting stringent production constraints and large-scale data throughput. The approach is broadly extensible to other physics parameterizations and numerical model architectures using similar Fortran-centric codebases (Jing et al., 20 Jan 2026).