Trunk Net: Global Aggregation Architecture

Updated 9 April 2026

Trunk Net is a specialized subnetwork that extracts and aggregates global features, forming the backbone of hybrid deep learning systems.
It decouples shared low-dimensional representations from localized details, leading to faster convergence and improved accuracy in tasks like PINNs and segmentation.
Its design facilitates independent optimization of trunk and branch networks, enabling effective transfer learning and reduced computational costs.

A Trunk Net is a specialized subnetwork or architectural component designed to extract, aggregate, or represent global or dominant features, typically serving as a common representation layer in hybrid neural frameworks. The "trunk" concept is leveraged across a range of deep learning domains—physics-informed neural networks (PINNs), operator learning, instance segmentation, video object segmentation, and multi-view action recognition—to decouple the learning of global, low-dimensional, or shared representations from the learning of local, field-specific, or detailed outputs. Trunk networks are often complemented by one or more branch or collateral networks, with each architectural element optimized for independent aspects of the problem. This separation of concerns yields superior convergence properties, accuracy, transferability, and interpretability compared to monolithic or flat architectures.

1. Core Architectural Paradigms

Trunk Nets are most frequently encountered in three structural motifs:

(a) Trunk-Branch (TB) Architectures:

A global trunk network takes as input low-dimensional spatial (or non-spatial) coordinates and produces a compact feature representation. Multiple branch networks, one per physical field, object, or output modality, then decode this representation into output fields with local or specialized detail. This is the principal paradigm in physics-informed neural networks such as TB-net PINN, and operator learning methods including DeepONet (Xing et al., 21 Jan 2025, Kiyani et al., 2024).
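
The trunk-branch layout can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the TB-net implementation: the layer sizes, the field names `u`, `v`, `p`, the random initialization, and the helper names `dense` and `layer_forward` are all hypothetical. The sine-then-tanh ordering in the trunk follows the description in Section 2.

```python
import math
import random

def dense(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def layer_forward(layer, x, act=None):
    """Affine map followed by an optional elementwise activation."""
    w, b = layer
    z = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [act(v) for v in z] if act else z

class TrunkBranchNet:
    def __init__(self, n_coords=2, width=16, n_features=8,
                 fields=("u", "v", "p"), seed=0):
        rng = random.Random(seed)
        # Trunk: sine on the first hidden layer, tanh afterwards.
        self.trunk = [dense(n_coords, width, rng),
                      dense(width, width, rng),
                      dense(width, n_features, rng)]
        # One small branch per output field, all reading the same features.
        self.branches = {name: [dense(n_features, width, rng),
                                dense(width, 1, rng)] for name in fields}

    def features(self, coords):
        h = layer_forward(self.trunk[0], coords, math.sin)
        h = layer_forward(self.trunk[1], h, math.tanh)
        return layer_forward(self.trunk[2], h, math.tanh)

    def forward(self, coords):
        phi = self.features(coords)            # computed once, shared by all branches
        out = {}
        for name, (hidden, head) in self.branches.items():
            h = layer_forward(hidden, phi, math.tanh)
            out[name] = layer_forward(head, h)[0]   # scalar field value
        return out

net = TrunkBranchNet()
print(net.forward([0.25, -0.5]))   # one value per field at a single point
```

The point of the design is visible in `forward`: the shared features are evaluated once per coordinate, and each field only pays for its own small decoder.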

(b) Trunk-Collateral and Divide-and-Conquer Designs:

In tasks with multiple modalities (e.g., motion and appearance in video), a shared trunk network encodes information common to all modalities, while a collateral branch adapts to modality-specific nuances via lightweight adapters (e.g., LoRA in motion streams) (Zheng et al., 8 Apr 2025). A similar theme appears in unite-divide-unite networks for segmentation, where a trunk decoder processes deep, semantic feature maps, while parallel branches manage high-resolution or structure-rich representation (Pei et al., 2023).
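
A low-rank adapter on a shared trunk weight can be sketched as follows. The sketch assumes the standard LoRA formulation (frozen weight plus a scaled product of two small matrices, with one factor initialized to zero); the class name `LoRALinear`, the rank, the scaling, and the `use_adapter` switch are illustrative choices and are not taken from the cited work.

```python
import random

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

class LoRALinear:
    """Frozen shared weight W plus a trainable update (alpha / r) * B @ A."""
    def __init__(self, w_shared, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        n_out, n_in = len(w_shared), len(w_shared[0])
        self.w = w_shared                               # shared trunk weight, kept frozen
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(n_out)]   # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x, use_adapter=True):
        y = matvec(self.w, x)
        if use_adapter:
            delta = matvec(self.b, matvec(self.a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y

layer = LoRALinear([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
x = [1.0, 0.5, -1.0]
appearance = layer.forward(x, use_adapter=False)   # plain trunk path
motion = layer.forward(x)                           # trunk path plus adapter
```

Because `b` starts at zero, the adapted path initially reproduces the trunk output exactly, so the modality-specific stream begins from the shared representation and departs from it only as the adapter is trained.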

(c) Trunk-Branch Contrastive Networks:

For multi-view or multi-instance recognition, the trunk block aggregates cross-view features to build a global embedding, while branch blocks capture fine-grained, view-specific details. Trunk-branch contrastive losses promote complementary learning between global and local representations (Yang et al., 23 Feb 2025).
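
The text above does not spell out the form of the trunk-branch contrastive loss. As a stand-in, the sketch below uses a generic InfoNCE-style objective that treats the trunk and branch embeddings of the same sample as a positive pair and other samples in the batch as negatives. The function names and the temperature are assumptions, and the actual loss in the cited work may differ.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(trunk_emb, branch_emb, temperature=0.1):
    """Mean cross-entropy of picking the matching branch embedding for each trunk embedding."""
    t = [normalize(v) for v in trunk_emb]
    b = [normalize(v) for v in branch_emb]
    loss = 0.0
    for i, ti in enumerate(t):
        logits = [sum(x * y for x, y in zip(ti, bj)) / temperature for bj in b]
        m = max(logits)                                   # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(t)

trunk_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(trunk_emb, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(trunk_emb, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each branch embedding matches its own trunk embedding and large when the pairing is swapped.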

2. Mathematical Foundations and Layer Design

Trunk Nets are typically dense neural networks (MLPs), convolutional decoder stacks, or transformer-based feature extractors, depending on the modality. Their primary task is to map coordinates or shared input to a latent feature space, parameterizing global basis functions for operator learning or capturing spatially coarse semantic context for segmentation.

For example, in PINN-based TB-nets, the trunk network implements the map $\mathcal{T}(\tilde x, \tilde y; \sigma_{\mathcal{T}}):\, (\tilde x, \tilde y)\mapsto \Phi(\tilde x, \tilde y) \in \mathbb{R}^{100}$, using a sinusoidal activation in the first hidden layer (to aid convergence in high-frequency regimes) and tanh in subsequent layers (Xing et al., 21 Jan 2025). In DeepONet, the trunk net parametrizes coordinate-dependent basis functions,

$\mathcal{G}(u)(y) \approx \sum_{k=1}^p b_k(u) \tau_k(y)$

where $b_k(u)$ is from the branch and $\tau_k(y)$ from the trunk (Kiyani et al., 2024).
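
The sum over $k$ is a plain inner product between branch coefficients and trunk basis values. The sketch below makes that concrete with hand-picked stand-ins instead of trained networks: the trunk returns fixed sine modes and the branch projects sensor samples of $u$ onto them, so the "operator" being represented is simply the identity. All function names, the sensor layout, and $p = 4$ are assumptions chosen so the result can be checked by hand.

```python
import math

def trunk_basis(y, p):
    """Stand-in trunk: tau_k(y) = sin(k*pi*y). A real trunk is a trained network."""
    return [math.sin(k * math.pi * y) for k in range(1, p + 1)]

def branch_coeffs(u_sensors, p):
    """Stand-in branch: project sensor samples of u onto the same sine modes."""
    m = len(u_sensors)
    xs = [(i + 0.5) / m for i in range(m)]      # midpoint sensor locations on [0, 1]
    return [2.0 / m * sum(u * math.sin(k * math.pi * x) for u, x in zip(u_sensors, xs))
            for k in range(1, p + 1)]

def deeponet(u_sensors, y, p=4):
    """G(u)(y) ~ sum_k b_k(u) * tau_k(y)."""
    b = branch_coeffs(u_sensors, p)
    tau = trunk_basis(y, p)
    return sum(bk * tk for bk, tk in zip(b, tau))

m = 50
u = [math.sin(math.pi * (i + 0.5) / m) for i in range(m)]   # u(x) = sin(pi x)
value = deeponet(u, 0.3)                                     # close to sin(0.3 * pi)
```

Note that the trunk never sees the input function and the branch never sees the query coordinate; the two only meet in the final inner product.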

In segmentation, the trunk decoder in UDUN receives a set of deep, channel-reduced feature maps and fuses them through successive upsampling and elementwise summation with convolutional transformations, producing a "trunk probability map" via a sigmoid-activated 3x3 convolution (Pei et al., 2023).
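
The upsample-and-sum cascade can be illustrated on single-channel maps stored as nested lists. This is a deliberately reduced sketch: the convolutional transformations and the final 3x3 convolution are omitted, nearest-neighbour upsampling is assumed (the interpolation mode is not stated above), and the function names are hypothetical.

```python
import math

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def trunk_decode(pyramid):
    """Fuse maps from deepest (smallest) to shallowest, then squash to (0, 1)."""
    fused = pyramid[0]
    for finer in pyramid[1:]:
        fused = add_maps(upsample2x(fused), finer)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fused]

pyramid = [
    [[1.0]],                                  # deepest, 1x1
    [[0.0, 0.0], [0.0, 0.0]],                 # 2x2
    [[0.0] * 4 for _ in range(4)],            # shallowest, 4x4
]
prob_map = trunk_decode(pyramid)              # 4x4 map of probabilities
```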

When implemented as a transformer backbone (e.g., SegFormer) in video segmentation, the trunk encodes common structure across modalities with multi-level attention and is optionally modulated by adapters in the collateral branch (Zheng et al., 8 Apr 2025).

3. Roles: Global Structure, Optimization, and Transfer

The trunk network has three primary roles:

Global Feature Aggregation: The trunk is responsible for learning low-dimensional, global, and often physically meaningful structures such as pressure gradients or streamlines in fluid mechanics (Xing et al., 21 Jan 2025), or globally consistent embeddings in multi-view recognition (Yang et al., 23 Feb 2025).
Decoupling Representations: By isolating global structure in the trunk and local detail in branches, architectures avoid interference and optimization entanglement, resulting in faster convergence, more stable loss curves, and lower generalization error. For instance, TB-net architectures achieve lower $\ell_2$ and maximum relative errors compared to monolithic FNN PINNs (e.g., $2 \times 10^{-5}$ vs. $10^{-4}$ for pressure fields) (Xing et al., 21 Jan 2025).
Transfer Learning and Data Efficiency: In operator learning and scientific PINNs, the trunk can serve as a reusable feature extractor; freezing a trained trunk net permits rapid, accurate adaptation to new boundary conditions or problem settings via branch retraining, yielding 2–3× speed-up and improved accuracy in transfer tasks (Xing et al., 21 Jan 2025).
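
The freeze-and-retrain pattern behind the transfer role can be sketched with a toy problem. The trunk parameters below are made-up constants standing in for a pretrained trunk, the new task is synthetic, and only a linear branch head is trained by plain gradient descent; none of this reproduces the cited experiments or their reported speed-ups.

```python
import math

# Pretend these (weight, bias) pairs came from pretraining; they are never updated.
TRUNK_PARAMS = ((1.0, 0.0), (2.0, 0.5), (0.5, -1.0))

def trunk(y, params):
    """Frozen trunk: a fixed set of features of the coordinate y."""
    return [math.tanh(w * y + b) for w, b in params]

def mse(samples, params, head):
    return sum((sum(h * f for h, f in zip(head, trunk(y, params))) - t) ** 2
               for y, t in samples) / len(samples)

def fit_branch(samples, params, lr=0.1, epochs=500):
    """Train only the linear branch head on squared error."""
    head = [0.0] * len(params)
    for _ in range(epochs):
        grad = [0.0] * len(head)
        for y, target in samples:
            phi = trunk(y, params)
            err = sum(h * f for h, f in zip(head, phi)) - target
            for k, f in enumerate(phi):
                grad[k] += 2.0 * err * f / len(samples)
        head = [h - lr * g for h, g in zip(head, grad)]
    return head

# Synthetic "new boundary condition": a different target on the same coordinates.
new_task = [(y, 1.5 * math.tanh(y) - 0.5 * math.tanh(0.5 * y - 1.0))
            for y in [i / 10.0 - 1.0 for i in range(21)]]
before = mse(new_task, TRUNK_PARAMS, [0.0] * 3)
head = fit_branch(new_task, TRUNK_PARAMS)
after = mse(new_task, TRUNK_PARAMS, head)
```

Only three head weights are optimized here, which is the source of the cost saving: the expensive shared representation is reused as is.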

4. Specialized Trunk Variants and Modes of Operation

MLP Trunks:

Standard in PINN, DeepONet, and TB-nets, with depth and width tuned per input dimensionality and problem complexity (e.g., 4–7 hidden layers × 100–1001 neurons, tanh/sin activations) (Xing et al., 21 Jan 2025, Kiyani et al., 2024).

KAN (Kolmogorov-Arnold Network) Trunks:

Replace dense affine layers and activations by edge-wise learnable univariate function modules (e.g., spline or Chebyshev polynomial expansions); enable superior spectral properties and compact representations, especially in sharp-interface problems (Kiyani et al., 2024). Practical deployment requires hyperparameter tuning for stability.
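
One way to picture an edge-wise learnable function is a short Chebyshev expansion per input-output edge, summed over inputs. The sketch below is an assumption-laden illustration: the `tanh` squashing that keeps arguments in $[-1, 1]$ is a common convention rather than something stated above, and the coefficient layout and function names are hypothetical.

```python
import math

def chebyshev(x, degree):
    """Values T_0(x)..T_degree(x) via T_{k+1} = 2x T_k - T_{k-1}."""
    t = [1.0, x]
    for _ in range(2, degree + 1):
        t.append(2.0 * x * t[-1] - t[-2])
    return t[:degree + 1]

def kan_layer(x, coeffs):
    """coeffs[j][i][k] is the coefficient of T_k on the edge from input i to output j."""
    out = []
    for edges in coeffs:
        total = 0.0
        for xi, c in zip(x, edges):
            basis = chebyshev(math.tanh(xi), len(c) - 1)   # tanh keeps the argument in [-1, 1]
            total += sum(ck * tk for ck, tk in zip(c, basis))
        out.append(total)
    return out

# Two inputs, one output; the two edges use expansions of different degree.
y = kan_layer([0.0, 0.0], [[[1.0, 2.0], [0.5, 0.0, 3.0]]])
```

Unlike a dense layer, there is no separate weight matrix and fixed activation: the learnable parameters are the expansion coefficients on each edge.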

Physics-Informed Trunk Nets:

Incorporate physical loss terms (e.g., phase-field energy functionals) into trunk network optimization, enforcing compliance with the underlying PDEs or conservation laws. This approach reduces data requirements (e.g., from 45 down to 10 high-fidelity specimens for fracture), at the expense of greater per-epoch computational cost (Kiyani et al., 2024).
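
The general shape of a physics-informed objective is a data misfit plus a weighted penalty on the governing-equation residual. The sketch below uses a toy ODE, $u'' + u = 0$, with a finite-difference residual purely for illustration; it is not the phase-field energy functional mentioned above, and the function names, step size, and weighting are assumptions.

```python
import math

def pde_residual(u, x, h=1e-3):
    """Residual of u'' + u = 0 at x, using a central finite difference."""
    u_xx = (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)
    return u_xx + u(x)

def physics_informed_loss(u, data, collocation, weight=1.0):
    """Mean squared data misfit plus weighted mean squared PDE residual."""
    data_term = sum((u(x) - y) ** 2 for x, y in data) / len(data)
    phys_term = sum(pde_residual(u, x) ** 2 for x in collocation) / len(collocation)
    return data_term + weight * phys_term

data = [(0.0, 0.0)]                  # a single labelled point
collocation = [0.5, 1.0, 1.5]        # unlabelled points where the equation is enforced

loss_sin = physics_informed_loss(math.sin, data, collocation)
loss_lin = physics_informed_loss(lambda x: x, data, collocation)
```

Both candidates fit the single data point exactly, yet only `math.sin` satisfies the equation, so the physics term separates them. This is the mechanism by which such losses can substitute for labelled data, and the extra residual evaluations are where the added per-epoch cost comes from.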

Convolutional and Transformer Trunks:

Applied in vision settings, where the trunk operates either as a decoder for deep, low-resolution feature maps (e.g., successive upsample+fusion cascades in UDUN (Pei et al., 2023), wavelet-enhanced pipelines in WaveInst (Fan et al., 3 May 2025)), or as a transformer-based backbone sharing parameters between modalities (Zheng et al., 8 Apr 2025).

5. Quantitative Performance and Comparative Analysis

Comparative results across application domains demonstrate the efficacy of trunk-based approaches.

Domain & Model	Trunk Type	Convergence & Accuracy	Relative Performance
PINN TB-net (Xing et al., 21 Jan 2025)	4-layer FNN (sin+tanh)	Rel. $\ell_2 \sim 10^{-5}$ (flow), $\sim 10^{-3}$ (temp.), stable errors	Faster, more stable, more accurate than vanilla FNN PINN
DeepONet (Kiyani et al., 2024)	MLP/KAN/phys-informed	Test MAE as low as $1.6 \times 10^{-4}$ (u), […] (α)	KAN trunks: fewer params, better u; physics trunks: less data needed
Segmentation UDUN (Pei et al., 2023)	Cascade upsampling CNN	[…], HCE = 977	Outperforms upsample-and-sum/deep-seg baselines
WaveInst Trunk (Fan et al., 3 May 2025)	CNN+DWT+AGFM	mAP = 49.6 (mature), 24.3 (juvenile, PoplarDataset)	+9.9 mAP over previous SOTA; edge localization improved
SMTC-Net (Zheng et al., 8 Apr 2025)	Transformer trunk	89.2% J&F (DAVIS-16, UVOS)	State-of-the-art w/ intrinsic saliency integration
TBCNet (Yang et al., 23 Feb 2025)	Multi-view agg. (GAM+CRPB)	State-of-the-art multi-view action recognition	Superior to uni/bi-encoder and non-divided approaches

These results are consistent with the hypothesis that explicitly separating global and local representations improves performance. Because the tasks, datasets, and metrics differ from row to row, the figures are not comparable across domains.
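
For reference, the relative $\ell_2$ and maximum relative errors quoted for the PINN results can be computed as below. The normalizations are assumptions, since the cited works may define them slightly differently: the $\ell_2$ error is divided by the norm of the reference field, and the maximum error by the largest reference magnitude.

```python
import math

def relative_l2(pred, ref):
    """||pred - ref||_2 / ||ref||_2 over a set of sample points."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den

def max_relative(pred, ref):
    """Largest pointwise error, scaled by the largest reference magnitude."""
    scale = max(abs(r) for r in ref)
    return max(abs(p - r) for p, r in zip(pred, ref)) / scale

ref = [3.0, 4.0]
pred = [3.0, 4.5]
err_l2 = relative_l2(pred, ref)     # 0.5 / 5.0
err_max = max_relative(pred, ref)   # 0.5 / 4.0
```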

6. Advanced Applications and Extensions

Operator Learning and Scientific Computing:

Trunk nets underpin neural operator frameworks, enabling data-efficient and transferable surrogate models for PDE-governed systems, including nontrivial sharp-interface phenomena such as crack branching (Kiyani et al., 2024).

Image and Video Segmentation:

Trunk decoders in unified networks efficiently model dominant object regions, enhance trunk completeness (global recall), and coordinate with structure decoders to refine boundaries. Wavelet-based trunk nets further augment edge recovery and detail preservation (Fan et al., 3 May 2025).

Multi-View and Multi-Modal Fusion:

Trunk nets aggregate cross-view or cross-modal features via deformable, attention-based, or gating modules, improving holistic task performance in action recognition and segmentation (Yang et al., 23 Feb 2025, Zheng et al., 8 Apr 2025).

Phenotypic Regression and Forest Inventory:

WaveInst-based Trunk Nets facilitate regression on tree biophysical parameters (e.g., diameter at breast height, plant height) from RGB segmentation masks, enabling rapid digital phenotyping (Fan et al., 3 May 2025).

7. Limitations and Prospects

Empirical analyses indicate that trunk nets:

Can introduce additional computational overhead (notably when fusing multiple scales or in physics-informed learning).
Require domain- and task-specific hyperparameter tuning (particularly for KAN trunks or when balancing loss terms).
May underperform under extreme data sparsity, occlusion, or divergent input distributions unless properly regularized or augmented.

A plausible implication is the increasing need for adaptive, hybrid trunk designs capable of handling multi-modal, multi-resolution and multi-task regimes. Future developments are expected to integrate trunk architectures with advanced multi-modal fusion, lightweight functional parameterizations, and end-to-end optimized training protocols for complex, cross-domain scientific and perception applications.

References

Xing et al., 21 Jan 2025
Pei et al., 2023
Zheng et al., 8 Apr 2025
Fan et al., 3 May 2025
Yang et al., 23 Feb 2025
Kiyani et al., 2024