
Dual-Stream Streamline Classification

Updated 1 December 2025
  • Dual-stream streamline classification is a multimodal approach that fuses geometric and functional data to resolve ambiguities in tractography.
  • It leverages a pretrained dMRI-specific geometric backbone alongside an fMRI-driven auxiliary stream to capture detailed tract and endpoint features.
  • Logit-level fusion of dual streams yields statistically significant improvements in weighted F1-score, enhancing neuroanatomical parcellation accuracy.

A Dual-Stream Streamline Classification Framework refers to a neural architecture in which two distinct information pathways are used in parallel, each dedicated to a different input modality or feature source, whose outputs are subsequently fused to inform classification. In the context of tractography streamline classification using diffusion MRI (dMRI) and functional MRI (fMRI) data, this approach jointly leverages geometric properties of streamlines and local functional brain activity at streamline endpoints to achieve enhanced anatomical and functional resolution in white matter tract parcellation. By structurally decoupling geometric encoding and functional signal processing, and fusing their decision-level predictions, dual-stream frameworks can resolve ambiguities that arise when structurally similar pathways support distinct cortical areas or functions.

1. Architectural Principles

The dual-stream design comprises two main computational trunks: (1) a geometric backbone and (2) an auxiliary functional pathway. The geometric backbone is instantiated as a pretrained dMRI-specific deep network (e.g., TractCloud or PointNet), which encodes full streamline trajectories and their local spatial context. Each target streamline, resampled at fixed intervals (e.g., 25 equidistant points), is processed together with features from its nearest neighbors, capturing local tract topology and spatial variability. The per-neighbor context is encoded via an MLP, aggregated (e.g., by max-pooling), and globally summarized (e.g., via a PointNet module) to produce a high-dimensional geometric feature vector, typically 1024 dimensions. An MLP-based head produces logits corresponding to candidate tract subregions (e.g., four somatotopic divisions of the corticospinal tract).
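The fixed-interval resampling step can be sketched in NumPy. This is a minimal illustration of equidistant arc-length resampling (the function name and the use of linear interpolation are assumptions for illustration; production pipelines typically use library resamplers):

```python
import numpy as np

def resample_streamline(points: np.ndarray, n: int = 25) -> np.ndarray:
    """Resample a polyline of shape (k, 3) to n points equidistant in arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)                   # equidistant arc-length targets
    # Linearly interpolate each coordinate against cumulative arc length
    return np.stack([np.interp(targets, s, points[:, d]) for d in range(3)], axis=1)

# Example: an unevenly sampled straight line, resampled to 25 equidistant points
line = np.array([[0, 0, 0], [1, 0, 0], [4, 0, 0]], dtype=float)
out = resample_streamline(line, n=25)
```

After resampling, every streamline contributes a fixed-size (25, 3) array, which is what allows batched processing by the MLP and PointNet modules described above.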

The functional auxiliary stream processes only streamline endpoints. It consists of a geometric encoder for the (x, y, z) coordinates of each endpoint (6-dimensional input, transformed to a learned embedding), and a 1D-CNN encoder for the endpoint BOLD fMRI timeseries. The geometric and functional embeddings for both endpoints are concatenated and mapped, via MLP, to an auxiliary feature vector, which is classified by an additional MLP head.
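The tensor shapes flowing through this auxiliary stream can be checked with a NumPy forward-pass sketch using random weights. The embedding widths, convolution kernel size, timeseries length (T = 100), and global average pooling are illustrative assumptions, not reported hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Inputs for one streamline (dimensions are illustrative assumptions)
endpoints = rng.standard_normal((2, 3))   # (x, y, z) of both endpoints -> 6-d input
bold = rng.standard_normal((2, 100))      # BOLD timeseries at each endpoint, T = 100

# Geometric encoder: 6-d endpoint coordinates -> 32-d learned embedding
W_geo = rng.standard_normal((6, 32)) * 0.1
geo_emb = relu(endpoints.reshape(-1) @ W_geo)            # shape (32,)

# 1D-CNN encoder: one conv layer (8 filters, kernel 5) + global average pooling
K = rng.standard_normal((8, 5)) * 0.1
def conv1d_gap(x):                                       # x: (T,)
    windows = np.lib.stride_tricks.sliding_window_view(x, 5)  # (T-4, 5)
    return relu(windows @ K.T).mean(axis=0)                   # (8,)
func_emb = np.concatenate([conv1d_gap(bold[0]), conv1d_gap(bold[1])])  # (16,)

# Concatenate both embeddings and classify into 4 candidate tract subregions
fused = np.concatenate([geo_emb, func_emb])              # (48,)
W_head = rng.standard_normal((48, 4)) * 0.1
aux_logits = fused @ W_head                              # (4,)
```

The key structural point is that this stream never sees the full trajectory: only endpoint coordinates and endpoint BOLD signals reach the auxiliary head.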

Decision-level fusion is performed by summing the logits from both the geometric and auxiliary heads prior to softmax. By freezing geometric backbone parameters during joint training, the pipeline ensures stable geometric priors, while gradient flow is confined to the functional branch, preventing noisy fMRI updates from destabilizing well-trained geometric features (Yan et al., 24 Nov 2025).

2. Fusion Strategies

The framework adopts logit-level (score) fusion rather than feature-level or early fusion. For a given streamline, the predicted class probabilities are computed as

$$\text{logits}_{\text{final}} = \text{logits}_{\text{backbone}} + \text{logits}_{\text{auxiliary}},$$

with the final class label given by

$$\hat{y} = \arg\max_c \left(\text{logits}_{\text{final},\,c}\right).$$

This fusion architecture was selected based on ablation studies, which demonstrated that feature-concatenation (early fusion) led to performance decrease, while logit-level fusion combining both endpoint information and fMRI yielded consistent, statistically significant improvement in weighted F1-score over geometric baselines (p < 0.01). This decision-level integration respects the differing statistical properties and noise models of geometric and BOLD-fMRI inputs (Yan et al., 24 Nov 2025).
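The fusion rule reduces to a logit sum followed by an argmax. A toy example (all numbers invented for illustration) shows how an auxiliary correction can flip a geometrically ambiguous prediction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative logits for one streamline over 4 CST subdivisions
logits_backbone  = np.array([2.0, 1.9, 0.1, -1.0])  # geometry alone: classes 0/1 ambiguous
logits_auxiliary = np.array([-0.5, 1.2, 0.0, 0.0])  # fMRI evidence favours class 1

logits_final = logits_backbone + logits_auxiliary   # decision-level (score) fusion
y_hat = int(np.argmax(logits_final))                # fused prediction: class 1
probs = softmax(logits_final)                       # probabilities for reporting
```

Geometry alone would predict class 0 by a narrow margin; the summed logits resolve the ambiguity in favour of class 1 without the auxiliary stream ever touching the geometric features.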

A similar paradigm is found in two-stream CNNs for video classification, where spatial and temporal streams are fused at the softmax score level, yielding state-of-the-art performance. Simpler score-level fusion avoids high-dimensional joint feature spaces that increase the risk of overfitting and inefficient parameterization, especially in heterogeneous multimodal problems (Ye et al., 2015).

3. Training Objectives, Losses, and Implementation

Supervision is via a single weighted cross-entropy loss,

$$L_{\mathrm{cls}}(y, \hat{y}) = -\sum_{c=1}^{4} w_c\, y_c \log\bigl(\mathrm{softmax}(\text{logits}_{\text{final}})_c\bigr),$$

where $w_c = \frac{N}{C\,n_c}$ balances class frequencies, $N$ is the total sample count, $C$ the number of classes, and $n_c$ the sample count for class $c$. No additional constraints or regularizers are required beyond this objective.
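A minimal NumPy sketch of this objective for a single sample, with $w_c = N/(C\,n_c)$ computed from per-class counts (the counts and logits below are made-up illustrations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_weights(counts):
    """w_c = N / (C * n_c) for per-class sample counts n_c."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_ce(logits, label, w):
    """Weighted cross-entropy for a single sample with integer label index."""
    p = softmax(logits)
    return -w[label] * np.log(p[label])

counts = [4000, 1000, 2500, 2500]   # hypothetical per-class streamline counts
w = class_weights(counts)           # rare classes get proportionally larger weight
loss = weighted_ce(np.array([2.0, 0.5, 0.1, -0.3]), label=1, w=w)
```

Because the minority class (here class 1, with 1,000 samples) receives weight 2.5 versus 0.625 for the majority class, misclassifying rare somatotopic subdivisions is penalized more heavily.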

For backbone pretraining, the protocol fixes the geometric weights during fusion-stage training, consistent with best practices in multimodal learning where base feature extractors (e.g., TractCloud) are optimized on large-scale data and then frozen for downstream multimodal specialization. Typical training employs the Adam optimizer with a cosine annealing scheduler, batch sizes of 512, and 20–30 training epochs per phase (Yan et al., 24 Nov 2025).
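The cosine annealing schedule mentioned above follows the standard SGDR-style formula; a short sketch (lr_max and lr_min are placeholder values, not reported hyperparameters, and the 30-epoch horizon matches the upper end of the reported range):

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate: starts at lr_max, decays smoothly to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

schedule = [cosine_lr(e, total_epochs=30) for e in range(30)]
```

The smooth decay avoids the abrupt loss spikes of step schedules, which matters here because only the small functional branch is being trained against a frozen backbone.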

Inference is efficient: classifying ~200,000 streamlines per subject requires only seconds on commodity GPUs (NVIDIA RTX 3090), owing to batch processing and the limited number of trainable parameters relative to traditional fully-learned multimodal fusion (Yan et al., 24 Nov 2025).

4. Applications and Empirical Performance

The dual-stream streamline classification framework was applied to the somatotopic parcellation of the corticospinal tract (CST) using a cohort of 100 Human Connectome Project - Young Adult (HCP-YA) subjects. White matter streamlines were extracted using TractSeg tractography, and assigned to four functional subdivisions (leg, trunk, hand, face) based on endpoint cortical ROIs.

Ablation studies compared geometry-only, endpoint-only, fMRI-only, early-fusion, and full dual-stream logit-fusion variants on PointNet and TractCloud backbones. Results indicated that only the complete dual-stream fusion (using both endpoint geometry and fMRI via logit summation) achieved statistically significant gains over baseline, with mean weighted F1-score improving to 0.9015 ± 0.0098 (TractCloud, 5-fold CV). Feature-level (early) fusion degraded performance, evidencing the importance of decision-level integration.

Comparison with state-of-the-art approaches, including DeepWMA, DCNN, and DMVFC, showed clear advantage for the dual-stream framework:

| Model    | Fusion Strategy | F1 (mean ± std) |
|----------|-----------------|-----------------|
| DeepWMA  | geometry only   | 0.8559 ± 0.0089 |
| DCNN     | geometry only   | 0.8785 ± 0.0039 |
| DMVFC    | multi-view      | 0.8705 ± 0.0086 |
| Proposed | logit fusion    | 0.9015 ± 0.0098 |

Logits-level fusion of endpoint-focused fMRI with frozen geometric backbones yields parcellations that better respect cortical functional topography, correcting geometric misassignments that arise in structurally ambiguous regions (Yan et al., 24 Nov 2025).

5. Rationale and Theoretical Motivation

Integration of fMRI endpoint information addresses a critical failure mode of geometry-only tract classification, wherein fibers of similar trajectory support functionally distinct cortical areas (e.g., the hand and trunk divisions of CST). The geometric backbone ensures robust encoding of tract shape and context, while the auxiliary stream supplies local functional specificity that can resolve ambiguities induced by geometric degeneracy. Fusing at the logit (decision) level prevents contamination of geometric features by higher-variance BOLD data, effectively making functional corrections "on top" of geometric priors.

This modality-specialized, late-fusion strategy is in line with conclusions from dual-stream visual classification research (Ye et al., 2015), as well as from the SAR-ATR domain, where local physics-based and global visual features are fused through low-rank bilinear models for maximal discriminative synergy (Xiong et al., 6 Mar 2024). In all these cases, early fusion of heterogeneous feature spaces can induce overparameterization and degrade performance, while decision-level fusion better accommodates the distinct inductive biases and uncertainty profiles of constituent modalities.

The dual-stream paradigm generalizes to other domains, including remote sensing, where local physical (e.g., electromagnetic scattering center) and global visual patterns are fused via heterogeneous GNNs and CNNs for synthetic aperture radar object recognition (Xiong et al., 6 Mar 2024). Key similarities include (1) separation of local (structure- or physics-based) and global (visual or functional) cues, (2) the use of domain-specific backbones, and (3) adaptive, usually parameter-efficient, fusion at the decision or joint representation level. These architectures facilitate interpretability and robustness, and support ablation-driven validation of each stream’s unique contribution.

A plausible implication is that further integration of multimodal endpoints—by extending the auxiliary stream to other local functional or cellular markers—could further enhance specificity in high-dimensional neuroanatomical parcellation tasks.

6. Limitations and Design Considerations

Performance gains from dual-stream fusion are contingent on the informativeness and localization of the auxiliary signals. If BOLD-fMRI quality or endpoint localization is poor, auxiliary corrections may introduce noise rather than resolve ambiguity. Freezing the backbone is critical when auxiliary data is comparatively low signal-to-noise, but if functional signal quality improves, end-to-end fine-tuning may be beneficial. Feature-level fusion between heterogeneous modalities was empirically shown to degrade performance, underscoring the necessity of stream specialization and conservative fusion (Yan et al., 24 Nov 2025).

Moderate dropout (0.5–0.7) ensures training stability. Over-parameterization or high dropout in low-data regimes can adversely affect convergence, as shown both in neuroimaging fusion and classic dual-stream video networks (Ye et al., 2015). Fusion parameters should be cross-validated on a per-dataset basis.


For extended technical reports and implementation protocols, see "A Novel Dual-Stream Framework for dMRI Tractography Streamline Classification with Joint dMRI and fMRI Data" (Yan et al., 24 Nov 2025), and relevant comparative works "Evaluating Two-Stream CNN for Video Classification" (Ye et al., 2015), and "LDSF: Lightweight Dual-Stream Framework for SAR Target Recognition by Coupling Local Electromagnetic Scattering Features and Global Visual Features" (Xiong et al., 6 Mar 2024).
