
PIMfused: Physics-Informed Fusion Architectures

Updated 18 November 2025
  • PIMfused is a class of architectures that fuse physics principles with hardware, algorithms, and diagnostics to overcome key performance bottlenecks.
  • It integrates near-bank DRAM-PIM for CNN acceleration, physics-informed multi-instrument fusion for plasma diagnostics, and specialized accelerator designs for LLMs.
  • The paradigm leverages co-design and analytical modeling to optimize memory cycles, energy, area, and stability across diverse fusion and computational tasks.

PIMfused refers to a class of architectures, methodologies, and diagnostic paradigms in which physics-informed fusion—either of data, computational layers, or hardware resources—is employed to overcome fundamental bottlenecks in physics experiments, deep learning acceleration, and plasma-based fusion systems. The specific term "PIMfused" appears in three distinct domains: near-bank DRAM Processing-in-Memory (PIM) for neural network acceleration, physics-informed multi-instrument data fusion for plasma diagnostics, and as a conceptual shorthand for the spherically symmetric compression of plasma targets in Plasma-Jet-Induced Magneto-Inertial Fusion (PJMIF). In each context, PIMfused represents tight co-design or physical integration to achieve higher fidelity, throughput, or stability than traditional, loosely coupled approaches.

1. PIMfused in Near-Bank DRAM-PIM for Neural Network Acceleration

Near-bank DRAM-PIM with fused-layer dataflow, designated as "PIMfused," directly targets the memory-bandwidth bottleneck in CNN acceleration. Traditional architectures optimized with layer-by-layer dataflow are frequently limited by inter-bank data transfers, which impose substantial energy and latency overheads as intermediate feature maps traverse between banks after each layer.

PIMfused introduces several key architectural elements:

  • Augmented bank-level PIMcores: Beyond supporting CONV+BN+ReLU, PIMcores in PIMfused integrate pooling and element-wise operations, with each core augmented by a small local buffer (e.g., 128–256 B) to maximize intra-tile data reuse and minimize bank-to-bank communication.
  • A channel-level global buffer (GBUF) and auxiliary core (GBcore): These allow for the collation and reduced-cost processing of cross-bank data when absolutely necessary.
  • New command set: Data movement and compute are orchestrated with fine-grained control primitives for buffer–bank and buffer–core transfers.

PIMfused employs a fused-layer dataflow: groups of consecutive CNN layers are "fused" into super-kernels, which are spatially tiled and distributed among PIMcores. Each PIMcore processes its own spatial tile end-to-end through the fused layers, eliminating the need for intermediate data movement between layers for those tiles. When deeper layers (with larger channel dimension and smaller spatial dimension) are reached, the system automatically reverts to layer-by-layer dataflow.
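
The scheduling difference can be made concrete with a toy transfer-count model. The sketch below is illustrative only; the tile counts and the one-write-back-plus-one-refetch cost per layer are simplifying assumptions, not the paper's command-level dataflow:

```python
# Toy transfer-count model contrasting the two schedules. All quantities
# (tile counts, one write-back plus one refetch per layer) are simplifying
# assumptions, not the actual PIMfused command sequence.

def transfers_layer_by_layer(num_tiles: int, num_layers: int) -> int:
    """Every layer writes its full output back to a bank and refetches it."""
    transfers = 0
    for _ in range(num_layers):
        transfers += 2 * num_tiles   # write-back + refetch for every tile
    return transfers

def transfers_fused(num_tiles: int, num_layers: int) -> int:
    """Each PIMcore keeps its tile's intermediates in its local buffer and
    writes back only after the last layer of the fused group."""
    return num_tiles                 # a single write-back per tile

# Example: 64 spatial tiles through a 4-layer fused group.
print(transfers_layer_by_layer(64, 4))   # 512 inter-bank transfers
print(transfers_fused(64, 4))            # 64 inter-bank transfers
```

Even this crude count shows why fusion pays off: inter-bank traffic scales with tiles × layers in the layer-by-layer schedule, but only with tiles in the fused one.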

Empirical results on end-to-end ResNet-18 show that, with 4-bank PIMcores and a 32 KB GBUF, PIMfused reduces memory cycles, energy, and area to 30.6%, 83.4%, and 76.5% of a GDDR6-AiM-like baseline, respectively, balancing parallelism against buffer provisioning. Increasing the local buffer per PIMcore beyond 256 B yields diminishing returns. Small local buffers, a moderate global buffer, fused-layer support in hardware, and a careful hybrid dataflow thus mark the Pareto-optimal point for throughput, energy, and area (Yang et al., 11 Nov 2025).

2. PIMfused as Physics-Informed Meta-Instrument Fusion in Plasma Diagnostics

In the context of integrated fusion diagnostics, PIMfused denotes the physics-informed fusion of heterogeneous meta-instruments—typically encompassing X-ray imagers, neutron detectors, and radiographic probes—into a unified inference system for high-fidelity plasma state reconstruction. This paradigm is instantiated in the PiMiX system.

The PiMiX ("Physics-Informed Meta-Instrument for eXperiments") workflow implements PIMfused by incorporating physics priors at two levels: generating synthetic training data from kinetic or MHD simulations, and embedding hard or soft physics constraints into the neural network loss function. The data fusion architecture processes multiple calibrated signals into feature representations which are fused by end-to-end neural networks, optimized with a composite loss balancing data fidelity and physics-informed regularization.

The mathematical foundation relies on traditional inverse-problem notation,

$$\mathscr{Y} = M\,\mathscr{X} + \mathscr{B},$$

where each instrument's measurement operator $M$ and noise statistics $\mathscr{B}$ are fused implicitly by the learned network. The loss function is

$$L(\theta) = \sum_{i=1}^{N} \left\|\mathcal{F}(x_i;\theta) - y_i\right\|_2^2 + \lambda\,\mathcal{R}_{\text{physics}}(\theta),$$

where $\mathcal{R}_{\text{physics}}$ enforces conservation laws or response-function constraints.
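
As a concrete sketch, this composite loss translates directly into a few lines of PyTorch. The specific regularizer below (a total-signal conservation penalty) and the weighting `lam` are hypothetical placeholders for PiMiX's actual conservation-law and response-function constraints:

```python
import torch

def physics_informed_loss(model: torch.nn.Module,
                          x: torch.Tensor,
                          y: torch.Tensor,
                          lam: float = 0.1) -> torch.Tensor:
    """Composite loss: data fidelity + lam * physics regularizer.

    The regularizer below (predicted and measured totals should agree,
    a stand-in conservation constraint) and lam = 0.1 are hypothetical
    placeholders for PiMiX's actual physics terms.
    """
    pred = model(x)
    data_term = torch.sum((pred - y) ** 2)     # sum_i ||F(x_i; theta) - y_i||^2
    physics_term = (pred.sum(dim=-1) - y.sum(dim=-1)).pow(2).mean()
    return data_term + lam * physics_term

# Usage with any torch.nn.Module and a batch (x, y):
#   loss = physics_informed_loss(model, x, y, lam=0.1)
#   loss.backward()
```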

PIMfused delivers empirical gains in several tasks: neutron detection spatial RMSE improves from ≈1 pixel with centroiding to 0.2 pixel; X-ray energy resolution achieves <0.5 keV mean error; multi-instrument geometric inversion reduces radius estimation errors from 3.4%–4.8% (single modality) to 2.7% (fused). Extensions include simulation–experiment data fusion, real-time inference integration, and reinforcement-learning-based experimental design (Wang et al., 16 Jan 2024).

3. PIMfused in Accelerator Design for Transformer and Post-Transformer LLMs

In processing-in-memory architectures for LLM serving, "PIM-fused" signifies a hardware design in which specialized compute engines—State-update Processing Units (SPUs)—are shared between DRAM banks to interleave data access and computation. This approach addresses the inherent memory-bandwidth limit in both transformer attention (requiring all key/value memory) and post-transformer state updates (requiring read and write of full state matrices).

The key innovation in the PIMfused architecture (as exemplified by Pimba) is that each SPU is shared by a pair of DRAM banks, alternating between read and write operations each cycle, thereby sustaining throughput at substantially reduced area. The SPU incorporates a State-update Processing Engine (SPE) that implements multipliers and adders in quantized MX8 format, enabling Pareto-optimal trade-offs between area and inference accuracy.
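
A toy cycle-level schedule illustrates the bank-pair sharing; this is an assumption-laden sketch of the interleaving idea, not Pimba's actual microarchitecture:

```python
# Toy schedule for one SPU shared by a pair of DRAM banks. While the updated
# state chunk for one bank is written back, the next chunk is read from the
# other bank, so the single engine never idles (pipeline fill ignored).
# Purely illustrative; not Pimba's actual timing.

def spu_schedule(num_cycles: int) -> None:
    banks = ("A", "B")
    for cycle in range(num_cycles):
        read_bank = banks[cycle % 2]
        write_bank = banks[(cycle + 1) % 2]   # commit the previous cycle's update
        print(f"cycle {cycle}: read state from bank {read_bank}, "
              f"write updated state to bank {write_bank}")

spu_schedule(4)
```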

Performance modeling indicates >4× speedup over baseline GPUs and >2× over traditional time-multiplexed per-bank PIM, with up to 6.3× reduction in attention latency and substantial reductions in state-update latency (Kim et al., 14 Jul 2025).

4. Optimization, Analytical Models, and Trade-off Principles

A characteristic feature across PIMfused systems is the explicit analytical modeling of memory, computation, buffer sizing, and energy trade-offs. For CNN acceleration, cycle count models contrast the conventional layer-by-layer schedule with fused-kernel tiling:

  • Layer-by-layer memory cycles:

$$C_{\text{L2L}}(\ell) \simeq \frac{S_{\text{in}}(\ell)}{B_{\text{bus}}}\,(N_{\text{banks}} - 1) + \frac{S_w(\ell)}{B_{\text{bank}}} + \frac{S_{\text{out}}(\ell)}{B_{\text{bank}}}$$

  • Fused-layer cycles:

$$C_{\text{fused}}(K) \simeq T\left[\frac{S_{\text{in,tile}}}{B_{\text{bank}}} + \sum_{\ell=s}^{e} \frac{S_{w,\text{tile}}(\ell)}{B_{\text{bank}}}\right] + \text{reorg}_{\text{cost}}(K)$$

Reusable weights and fast on-core SRAM dramatically elevate the weight-reuse metric $R_w(K)$. In hardware, the Pareto frontier is traced by varying the degree of bank-PIMcore mapping, local/global buffer dimension, and dataflow. For LLM acceleration, the bank-interleaving principle (one SPU per two banks) is shown to maximize throughput per area under the constraint that each state update requires non-simultaneous read and write access.
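
Under simplifying assumptions (uniform tiles, bandwidths in bytes per cycle, hypothetical sizes), the two models above translate into a back-of-the-envelope estimator:

```python
# Back-of-the-envelope evaluation of the two cycle models above. S_* are data
# sizes in bytes, B_* are bandwidths in bytes/cycle; every numeric value below
# is hypothetical and chosen only to demonstrate usage.

def cycles_layer_by_layer(S_in, S_w, S_out, B_bus, B_bank, N_banks):
    """C_L2L(l): input broadcast over the shared bus plus weight/output bank traffic."""
    return S_in / B_bus * (N_banks - 1) + S_w / B_bank + S_out / B_bank

def cycles_fused(S_in_tile, S_w_tiles, B_bank, T, reorg_cost=0.0):
    """C_fused(K): per-tile input fetch plus the fused group's weights, over T tiles."""
    return T * (S_in_tile / B_bank + sum(S_w_tiles) / B_bank) + reorg_cost

# One fused group of three conv layers, 16 tiles:
l2l = sum(cycles_layer_by_layer(S_in=64e3, S_w=32e3, S_out=64e3,
                                B_bus=16, B_bank=32, N_banks=16)
          for _ in range(3))
fused = cycles_fused(S_in_tile=4e3, S_w_tiles=[32e3, 32e3, 32e3],
                     B_bank=32, T=16, reorg_cost=2e3)
print(f"layer-by-layer: {l2l:.0f} cycles, fused: {fused:.0f} cycles")
```

With these made-up numbers the fused schedule comes out roughly 3.6× cheaper in memory cycles, qualitatively consistent with the >3× reductions cited for ResNet-18.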

5. Physical Foundations and Instability in PJMIF "PIMfused" Targets

In plasma-jet-induced magnetoinertial fusion, the "PIMfused" scenario refers to the spherically convergent implosion of a plasma liner, assembled from multiple supersonic jets, onto a magnetized target. The process is characterized by:

  • Plasma jets (Mach ~60, v ~ 100 km/s) merging via oblique shocks to form a nearly spherical liner;
  • the liner compressing the target, with front-tracking numerical methods enforcing interfacial continuity.

Instabilities limiting achievable compression are traced to two mechanisms: oblique-shock-induced nonuniformities and Rayleigh–Taylor growth at the liner–target interface. Discrete jets generate up to 30% pressure modulations at the liner edge, seeding perturbations. Fragmentation occurs when the bubble-to-target-radius ratio $h_b/R_{\text{target}}$ reaches $\mathcal{O}(10^{-2})$. Uniformity improvements scale as $1/\sqrt{N_{\text{jets}}}$; for the geometries studied, gains saturate beyond ~16 jets.
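
A quick numerical check of the $1/\sqrt{N_{\text{jets}}}$ scaling shows why returns diminish near 16 jets; the anchor point (30% modulation at 4 jets) is an assumption for illustration, not a value from the paper:

```python
import math

# Relative liner nonuniformity under the 1/sqrt(N_jets) scaling. The anchor
# (30% edge-pressure modulation at 4 jets) is an illustrative assumption,
# not a value taken from the paper.

def nonuniformity(n_jets: int, anchor_n: int = 4, anchor_level: float = 0.30) -> float:
    return anchor_level * math.sqrt(anchor_n / n_jets)

for n in (4, 9, 16, 36, 64):
    print(f"{n:3d} jets -> ~{100 * nonuniformity(n):.1f}% modulation")
```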

Full MHD effects are not yet included but are projected to provide stabilizing tension, motivating magnetic pre-conditioning and more regular jet standoffs in future experimental campaigns (Samulyak et al., 2015).

6. Applications, Metrics, and Design Guidelines

PIMfused approaches yield practical advances across heterogeneous domains:

  • In near-bank PIM, >3× memory-cycle reductions and substantial area/energy savings for CNNs.
  • In plasma diagnostics, super-resolved spatial localization and multi-modal inversion errors below 3%.
  • In LLM serving, throughput and latency gains that close the bandwidth–compute gap on both transformer and post-transformer models.
  • In fusion design, rapid mapping of fusion scaling laws via neural emulators, efficient parameter sweeps, and feedback control.

Three consistent design insights are established for near-bank PIM:

  • Moderate-sized global buffers (≥8–16 KB) are necessary for cross-layer reuse in fused dataflows.
  • Small (~128–256 B) local buffers per PIMcore suffice for most intra-tile activation reuse.
  • The best performance/area point generally involves fewer, larger PIMcores (e.g., 4-bank mapping) with fused-layer support (Yang et al., 11 Nov 2025).

7. Outlook and Unification

PIMfused represents an evolving paradigm of tight, physics-informed integration in hardware, algorithm, and experimental inference workflows. Whether applied to data-driven diagnostics, deep learning acceleration, or plasma physics simulation, the underlying principle is to fuse resources along physical, informational, and computational dimensions using models and constraints derived from underlying physics, resulting in systems that achieve superior efficiency and scalability relative to traditional modular approaches.
