Multi-Input Convolutional Neural Networks
- Multi-input CNNs are neural architectures that integrate multiple data streams—such as different imaging views or signal modalities—through dedicated branches and fusion techniques.
- They employ various fusion strategies, including early, mid, and late fusion, to optimally combine features from heterogeneous inputs and enhance network performance.
- These models have demonstrated improved accuracy and generalization in applications like biomedical signal processing, computer vision, and video analysis despite an increase in computational cost.
A multi-input convolutional neural network (CNN) is an architectural paradigm in which two or more input streams (corresponding to different views, modalities, scales, representations, or instances) are processed in parallel or through specialized branches. The resulting intermediate representations are then fused, typically via concatenation, addition, weighted summation, or specialized feature-aggregation mechanisms. Multi-input CNNs address the limitations of single-stream models in scenarios where the joint exploitation of heterogeneous or multi-view data yields improved statistical efficiency, feature expressivity, generalization, and downstream performance.
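To make the pattern concrete, the following minimal PyTorch sketch (all layer sizes, branch names, and input shapes are illustrative, not drawn from any cited work) processes two input streams through independent convolutional branches and fuses the resulting embeddings by concatenation before a shared classifier:

```python
import torch
import torch.nn as nn

class TwoBranchCNN(nn.Module):
    """Minimal mid-fusion multi-input CNN: two branches, concatenation fusion."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # One independent convolutional branch per input stream.
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 32)
            )
        self.branch_a = branch(3)   # e.g., an RGB view
        self.branch_b = branch(1)   # e.g., a single-channel auxiliary map
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, x_a, x_b):
        # Concatenate the per-branch embeddings, then classify jointly.
        z = torch.cat([self.branch_a(x_a), self.branch_b(x_b)], dim=1)
        return self.classifier(z)

model = TwoBranchCNN()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```

Swapping the `torch.cat` for an elementwise sum or a learned weighted combination reproduces the other fusion operators listed above.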
1. Architectural Taxonomy and Design Patterns
Multi-input CNNs encompass a heterogeneous spectrum of designs, distinguished by their input modalities, branching and fusion strategies, and the semantics of the input streams. Key variations include:
- Early-fusion models: Multiple inputs are stacked as distinct channels and processed through shared convolutional layers from the first layer onward. For example, in multi-lead ECG analysis, dual-lead signals are combined into a (batch, 1, 512, 2) tensor, analogous to color channels in images, enabling the same convolutional kernels to integrate cross-lead information throughout the network depth (Tung et al., 2020); a channel-stacking sketch follows this list.
- Parallel-branch (mid-fusion) models: Separate convolutional branches operate independently on different input streams or input transformations up to a certain network depth, with feature-wise fusion (concatenation, summation) at latent representations or directly before final prediction layers. Cuboid-Net exemplifies this approach by slicing video cuboids along temporal, vertical, and horizontal axes, each branch operating on one slice direction, followed by multi-branch 3D reconstruction and quality enhancement modules (Fu et al., 24 Jul 2024).
- Late-fusion/ensemble-like models: Entire subnetworks are duplicated, each acting on a stochastic or semantic variant of the input, and outputs are fused at the embedding or prediction layer. MultiPodNet ("TripodNet" in its three-pod configuration) uses independently parameterized ResNet branches (pods), each receiving an identically sourced image with different data augmentations, and concatenates the learned features prior to the final classification layer (Pan et al., 2022). Soft-adaptive late fusion also appears in methods that combine penultimate activations from multiple pre-trained ConvNets via loss-based weighting (Akilan et al., 2017).
- Hybrid and adaptive fusion: Some architectures dynamically modulate feature fusion with attention, information-based weighting, or explicit scale invariance. For instance, the Information-based Selective Excitation (ISE) block in multi-lead ECG networks computes attention weights via per-channel entropy and calibrates channel importance throughout the residual stack (Tung et al., 2020).
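As a complement to the taxonomy, here is a minimal early-fusion sketch for the dual-lead ECG case described above. The (batch, 1, 512, 2) layout is framework-dependent; this sketch uses PyTorch's (batch, channels, length) convention with the two leads stacked as Conv1d input channels, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

# Early fusion for dual-lead ECG: stack the two leads as input channels so
# the very first convolution already mixes cross-lead information.
lead_a = torch.randn(8, 1, 512)  # 512-sample segment from one lead
lead_b = torch.randn(8, 1, 512)  # same segment from a second lead
x = torch.cat([lead_a, lead_b], dim=1)           # (8, 2, 512)

early_fusion = nn.Sequential(
    nn.Conv1d(2, 32, kernel_size=7, padding=3),  # kernels span both leads
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(32, 64, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 5),                            # e.g., 5 beat classes
)
print(early_fusion(x).shape)  # torch.Size([8, 5])
```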
2. Mathematical Formalisms and Fusion Mechanisms
Multi-input CNNs employ mathematically explicit strategies for information fusion across input streams:
- Stacked-channel fusion: For inputs of compatible dimensions, channel-wise stacking yields a composite tensor processed by standard 2D/1D convolutions. In dual-lead 1D ECG classification, leads $x^{(1)}, x^{(2)} \in \mathbb{R}^{512}$ are stacked into a tensor $X \in \mathbb{R}^{B \times 1 \times 512 \times 2}$, enabling the extraction of joint representations from multi-lead signals using conventional kernel operations (Tung et al., 2020).
- Parallel-branch feature extraction and fusion: For parallel subnetworks $f_1, \dots, f_m$ with inputs $x_1, \dots, x_m$, each branch computes $z_i = f_i(x_i)$, followed by concatenation $z = [z_1; z_2; \dots; z_m]$, as in MultiPodNet (Pan et al., 2022). Adaptive weighting can be introduced by computing branch-specific cross-entropy losses $L_i$ and normalizing branch contributions, e.g., $w_i = \exp(-L_i) / \sum_j \exp(-L_j)$, and feeding the weighted combination $\sum_i w_i z_i$ to a classifier (Akilan et al., 2017).
- Late fusion via global pooling and regression: Multi-view networks like TumorNet combine 2D projections by median intensity along each plane and stack as channels, producing a 3-channel input that is processed by a monolithic CNN, with downstream regression (e.g., via Gaussian Processes) on the resultant high-level embedding (Hussein et al., 2017).
- Attention-based channel calibration: In ISE blocks, the per-channel feature entropy $H_c = -\sum_i p_{c,i} \log p_{c,i}$ is computed, where $p_{c,i}$ is the normalized activation distribution of channel $c$, and the network learns attention weights from these statistics to amplify or inhibit channels accordingly (Tung et al., 2020). A sketch of this entropy-driven recalibration follows this list.
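The following sketch illustrates entropy-driven channel recalibration in the spirit of the ISE block. Only the entropy-based squeeze is specified above; the gating network (two linear layers with a sigmoid, squeeze-and-excitation style) and all layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntropyExcitation(nn.Module):
    """Entropy-driven channel attention sketch: squeeze each channel to a
    scalar entropy, then learn per-channel attention weights from it."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, L)
        # Normalize activations per channel into a distribution, then
        # compute its Shannon entropy as the "information" summary.
        p = F.softmax(x, dim=-1)                 # (B, C, L)
        h = -(p * torch.log(p + 1e-8)).sum(-1)   # (B, C) per-channel entropy
        s = self.fc(h)                           # (B, C) attention weights
        return x * s.unsqueeze(-1)               # rescale channels

x = torch.randn(8, 64, 128)
print(EntropyExcitation(64)(x).shape)  # torch.Size([8, 64, 128])
```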
3. Applications Across Domains
The multi-input CNN paradigm demonstrates notable effectiveness in diverse scientific and engineering domains:
- Biomedical signal processing: Multi-lead ECG classification employs dual-lead input, fused via channel-wise convolutional kernels and information-based attention, achieving superior sensitivity and precision, especially for ventricular and supraventricular ectopic beat (VEB and SVEB) classification, demonstrating that multi-lead inputs increase discriminative dimensionality and generalization for arrhythmia detection (Tung et al., 2020).
- Computer vision and remote sensing: Mid- and late-fusion frameworks aggregate information from multiple ConvNets (e.g., AlexNet, VGG-16, Inception) to realize competitive or superior object and scene classification accuracy across datasets, with fusion weights dynamically adapted by per-branch loss (Akilan et al., 2017). Integration of derived auxiliary representations (e.g., pixel gradients, texture maps) with raw images via parameter sharing or additive fusion results in increased test accuracy across MNIST, CIFAR-10, and CIFAR-100 benchmarks, with minimal parameter overhead (Pandey et al., 2020).
- Video processing: Directional slicing of video tensors and multi-branch feature extraction enable joint spatial-temporal super-resolution that surpasses prior approaches in quantitative performance (e.g., PSNR gains on Vimeo-90K and Vid4 test sets), indicating that directional multi-input feature streams capture orthogonal spatio-temporal regularities (Fu et al., 24 Jul 2024).
- Astronomical time series and image-event fusion: Multi-input models ingest both convolutionally encoded images (e.g., light-curve phase folds, difference images) and scalar/tabular features, yielding classification pipelines for variable stars (Szklenár et al., 2022) and transient astronomical events (Rehemtulla et al., 2023) that outperform single-modality baselines in completeness, speed, and interpretive disambiguation of visually confounded classes. A minimal image-plus-tabular sketch follows this list.
- Multi-scale and multi-instance learning: The scale-diverse regime is addressed by dedicated subnets for each quantized input scale and a shared deep trunk, as realized in the Multi-scale Unified Network, with scale-invariant loss enforcing representation consistency across scale (Liu et al., 27 Mar 2024). Flexible multi-instance grouping permits sub-bag–aware embeddings and robust handling of missing data (Stec et al., 2018).
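A minimal sketch of the image-plus-tabular pattern used by the astronomical pipelines above might look as follows; the branch widths, feature count, and class count are hypothetical, not taken from the cited works:

```python
import torch
import torch.nn as nn

class ImageTabularNet(nn.Module):
    """Sketch of an image-plus-tabular multi-input classifier: a CNN branch
    encodes the image, an MLP branch encodes scalar features, and the two
    embeddings are concatenated before the classification head."""
    def __init__(self, n_tabular: int = 12, num_classes: int = 8):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B, 32)
        )
        self.tabular_branch = nn.Sequential(            # scalar features
            nn.Linear(n_tabular, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, image, tabular):
        z = torch.cat([self.image_branch(image),
                       self.tabular_branch(tabular)], dim=1)
        return self.head(z)

net = ImageTabularNet()
out = net(torch.randn(4, 1, 64, 64), torch.randn(4, 12))
print(out.shape)  # torch.Size([4, 8])
```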
4. Performance and Comparative Analyses
Systematic evaluation of multi-input CNNs reveals consistent advantages in classification, detection, or regression metrics over their single-input counterparts, though often at the cost of increased parameter count or computational footprint:
- Incremental accuracy gains: TripodNet (3-pod MultiPodNet, m=3) improves CIFAR-10 test accuracy from 91.66% (ResNet-20) to 92.47%, and ImageNet top-1 accuracy from 69.99% (ResNet-18) to 70.89%, with parameter cost scaling linearly with m (Pan et al., 2022).
- Enhanced generalization: In the multi-lead ECG setting, dual-lead input yields higher sensitivity and precision for clinically relevant arrhythmic event detection. The incorporation of information-based channel attention further increases both VEB and SVEB metrics, with ISEnet-14 exceeding or matching state-of-the-art performance on standardized AAMI-compliant data splits (Tung et al., 2020).
- Complementarity and error correction: Fusing morphological (image) and tabular (scalar) features in variable star classification leads to smoother learning curves, reduced validation loss, and marked empirical gains for previously confounded subtypes (Szklenár et al., 2022). The multi-modal approach in BTSbot increases completeness to 99.1%, compared to 95% for human experts, and reduces classification latency by approximately 7 hours on ZTF survey alerts (Rehemtulla et al., 2023).
- Computational efficiency: When inputs are naturally at multiple spatial scales, multi-scale subnets with a unified trunk (MSUN) reduce FLOPs by 7–16% during inference at small scales, while substantially improving average and small-scale top-1 accuracy (+6.4% over vanilla baselines) (Liu et al., 27 Mar 2024).
- Ablation and saturation: In MultiPodNet, three parallel pods yield the best trade-off between accuracy and parameter cost; extra pods produce diminishing returns, while single- or dual-branch set-ups underexploit available representational diversity (Pan et al., 2022). Similar diminishing return patterns are observed in multi-scale subnetwork count for MSUN (Liu et al., 27 Mar 2024).
5. Robustness, Regularization, and Data Efficiency
Multi-input CNNs confer several robustness and generalization benefits fundamental to modern deep learning pipelines:
- Cross-modal and multi-view regularization: Shared-weight architectures for multiple representations (e.g., image + gradient) force convolutional filters to represent generalizable structure, mitigating overfitting and enhancing robustness to input-specific artifacts (Pandey et al., 2020).
- Missing data handling: Nested multi-instance architectures with group-level aggregation and missing-instance fill-in (optimized neutral input, sub-bag dropout) sustain high accuracy even with incomplete input groups, a common occurrence in real-world sensor and medical imaging pipelines (Stec et al., 2018); a simplified fill-in sketch follows this list.
- Auxiliary supervision and attention: Explicit channel-wise recalibration based on information-theoretic measures (entropy, mean square deviation) adaptively amplifies salient inputs per instance, improving discrimination between fine-grained classes and providing diagnostics on class separability (Tung et al., 2020).
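The missing-data strategy can be sketched as a branch that substitutes a learned neutral embedding whenever its input stream is absent. This is a simplified stand-in for the optimized-neutral-input and sub-bag-dropout mechanisms of Stec et al. (2018); all names and sizes are hypothetical:

```python
import torch
import torch.nn as nn

class FillInBranch(nn.Module):
    """Branch that replaces the embedding of a missing input stream with a
    trainable 'neutral' vector rather than zeros."""
    def __init__(self, in_ch: int = 1, dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, dim),
        )
        # Trainable stand-in embedding used whenever the input is missing.
        self.neutral = nn.Parameter(torch.zeros(dim))

    def forward(self, x, present):               # present: (B,) bool mask
        z = self.encoder(x)                      # (B, dim)
        return torch.where(present.unsqueeze(1), z,
                           self.neutral.expand_as(z))

branch = FillInBranch()
x = torch.randn(4, 1, 32, 32)
mask = torch.tensor([True, False, True, True])
print(branch(x, mask).shape)  # torch.Size([4, 32])
```

During training, randomly zeroing entries of the presence mask (akin to sub-bag dropout) lets the neutral embedding be optimized jointly with the rest of the network.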
6. Practical Implementation Considerations
Implementation of multi-input CNNs requires attention to several system-level and methodological factors:
- Parameter sharing versus independence: Some frameworks share kernel weights across input representations (e.g., image and gradient (Pandey et al., 2020)), minimizing overhead, while others replicate full weight sets for maximal feature diversity (e.g., MultiPodNet (Pan et al., 2022)); a toy comparison follows this list.
- Fusion point and method: The fusion strategy (early, mid, or late; summation, concatenation, or learned adaptive weighting) affects both computational efficiency and representational synergy (Akilan et al., 2017).
- Training protocol and data augmentation: Multi-input streams may necessitate input-specific normalization, augmentation pipelines, or synthetic generation (e.g., GP-based phase-curve upsampling for class balancing (Szklenár et al., 2022)), and training must synchronize optimization across all input-dependent branches.
- Modular extensibility: Several multi-input CNNs (e.g., MultiPodNet, MSUN) are compatible with standard backbone architectures, enabling straightforward scaling to ever-larger or more heterogeneous sets of input views or modalities.
- Deployment considerations: Models such as dual-lead ISEnet are shown to remain lightweight and suitable for real-time, resource-constrained applications (e.g., Holter ECG monitoring), with only marginal increase in FLOPs over single-input variants (Tung et al., 2020).
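The sharing-versus-independence trade-off from the list above can be made concrete with a toy parameter count; this is a sketch under the assumption of identical branch topologies, not any cited implementation:

```python
import torch
import torch.nn as nn

def make_branches(share_weights: bool, n_branches: int = 2):
    """Either reuse one convolutional stack for every input representation
    (minimal overhead) or replicate it per branch (maximal diversity)."""
    def make_trunk():
        return nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
    if share_weights:
        trunk = make_trunk()
        # The same module object appears n times: parameters are shared,
        # and PyTorch deduplicates them when iterating .parameters().
        return nn.ModuleList([trunk] * n_branches)
    return nn.ModuleList([make_trunk() for _ in range(n_branches)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(make_branches(True)), count(make_branches(False)))
# Shared branches cost half the parameters of independent ones here.
```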
7. Open Problems and Future Directions
Despite empirical success, several questions and challenges remain in multi-input CNN research:
- Optimal fusion strategies: Identifying universal principles for selecting between early, mid, and late fusion—and for weighting, attention, and adaptation mechanisms—remains an open area, especially in cross-domain or heterogeneous modality scenarios.
- Sample efficiency and parameter efficiency: While increased input streams improve accuracy, parameter efficiency trade-offs are significant in paradigms like MultiPodNet. Scalable alternatives, such as feature-sharing or joint embedding learning, are active research directions (Pan et al., 2022, Pandey et al., 2020).
- Understanding feature complementarity: While improvements are attributed to increased "diversity" and representational capacity, quantitative metrics for measuring complementarity and its relation to generalization are still underdeveloped.
- Robustness to missing/partial inputs: Further research into principled fill-in, dropout, and aggregation for incomplete input groups can substantially improve deployment in real-world sensor networks and variable data integrity domains (Stec et al., 2018).
- Extending to continual and lifelong learning: Systematic fusion strategies for dynamically evolving or growing input modalities, as well as tasks requiring online adaptation to new modalities, represent future directions for neural architecture research.
References:
- Akilan et al., 2017
- Fu et al., 24 Jul 2024
- Hussein et al., 2017
- Liu et al., 27 Mar 2024
- Pan et al., 2022
- Pandey et al., 2020
- Rehemtulla et al., 2023
- Stec et al., 2018
- Szklenár et al., 2022
- Tung et al., 2020