A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

Published 25 Apr 2026 in cs.CV | (2604.23415v1)

Abstract: Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.


Authors (2)

Summary

The paper introduces the DualStreamHybrid framework that leverages modality-specific backbones and a learned projection to address the mismatch between RGB and optical flow inputs.
It compares five fusion strategies within a unified framework: cross-attention reaches 98.12% test accuracy on UCF11, and weighted fusion reaches 96.86% on UCF50.
Results indicate that even imperfect motion encoders can boost appearance-based predictions, with fusion efficacy varying based on dataset complexity.

Heterogeneous Two-Stream Fusion for Video Action Recognition: Architecture, Fusion Strategies, and Empirical Insights

DualStreamHybrid Framework and Architectural Rationale

This paper introduces DualStreamHybrid, a heterogeneous two-stream framework for video action recognition that addresses the representational mismatch between RGB appearance and optical flow motion. Unlike traditional two-stream methods, which apply the same convolutional backbone architecture to both modalities, DualStreamHybrid uses a pretrained ViT-Tiny/16 for RGB appearance and a MobileNetV2, trained from scratch, for a 20-channel stacked optical flow input. The architecture is modular: after modality-specific encoding, a learned projection maps both outputs to a common 192-dimensional space, so that any of the fusion operators can be applied downstream.
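The projection step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the encoder output widths (192 for ViT-Tiny, 1280 for MobileNetV2) are the standard values for those backbones and are assumed here, the class name `LinearProjection` is hypothetical, and the features are random stand-ins for real encoder outputs. Only the common 192-dimensional width comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

RGB_DIM = 192      # ViT-Tiny embedding width (assumed, standard configuration)
FLOW_DIM = 1280    # MobileNetV2 pooled feature width (assumed, width multiplier 1.0)
COMMON_DIM = 192   # shared feature width stated in the text


class LinearProjection:
    """A single affine layer, y = x W + b, standing in for the learned projection."""

    def __init__(self, in_dim, out_dim):
        # Scaled random initialisation; a trained model would learn these values.
        self.weight = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.weight + self.bias


project_rgb = LinearProjection(RGB_DIM, COMMON_DIM)
project_flow = LinearProjection(FLOW_DIM, COMMON_DIM)

# Stand-in encoder outputs for a batch of 4 clips (random, not real features).
rgb_features = rng.normal(size=(4, RGB_DIM))
flow_features = rng.normal(size=(4, FLOW_DIM))

z_rgb = project_rgb(rgb_features)      # shape (4, 192)
z_flow = project_flow(flow_features)   # shape (4, 192)
print(z_rgb.shape, z_flow.shape)
```

Once both streams share a width, every fusion operator can be written against the same pair of inputs, which is what keeps the two encoders free to differ.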

This design is motivated by the structural differences between the modalities. ViT's global self-attention suits the spatial context of RGB frames, while MobileNetV2's convolutions efficiently exploit the local smoothness and motion regularities of optical flow. The authors report that this heterogeneous approach outperforms conventional backbone-symmetric designs, which supports matching architectural choices to modality structure rather than to implementation convenience.

Figure 1: Overview of the proposed two-stream methodology, including separate preprocessing and encoding for RGB and flow, followed by modality fusion and classification.

Figure 2: Detailed view of DualStreamHybrid. ViT-Tiny encodes a single 224×224 RGB frame and MobileNetV2 ingests a 20-channel interleaved optical flow stack; both are mapped to 192-dim features for fusion.

Preprocessing Pipeline and Feature Extraction

The pipeline uniformly samples $N=16$ frames per clip. The center frame is passed to ViT-Tiny for the RGB stream, and dense optical flow is computed with the Farneback algorithm for $K=10$ consecutive frame pairs for the motion stream. Each pair yields a horizontal (u) and a vertical (v) displacement map, so the stacked input has $2K=20$ channels. Flow maps are normalized using TSN's canonical scheme, zero-centered, and interleaved into this compact 20-channel stack. While the RGB branch benefits from ImageNet-21k pretraining, the flow branch is trained from scratch because no large flow corpus suitable for pretraining is available.
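The sampling and stacking steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the helper names are hypothetical, the clipping bound of 20 pixels is a value commonly used with TSN-style flow and is assumed here, the choice of segment midpoints for uniform sampling is one of several reasonable conventions, and synthetic arrays stand in for real Farneback output (which could come from, for example, OpenCV's `calcOpticalFlowFarneback`).

```python
import numpy as np

N_FRAMES = 16       # frames sampled per clip (from the text)
K_PAIRS = 10        # consecutive frame pairs used for flow (from the text)
FLOW_BOUND = 20.0   # clipping bound in pixels (assumed, not stated in the text)


def sample_frame_indices(num_video_frames, n=N_FRAMES):
    """Return n uniformly spaced indices (segment midpoints) across the clip."""
    edges = np.linspace(0, num_video_frames, n + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)


def normalize_flow(flow, bound=FLOW_BOUND):
    """Clip displacements to [-bound, bound] and rescale to the zero-centred range [-1, 1]."""
    return np.clip(flow, -bound, bound) / bound


def stack_flows(flows):
    """Interleave K flow fields of shape (H, W, 2) into (2K, H, W): u1, v1, u2, v2, ..."""
    return np.concatenate([f.transpose(2, 0, 1) for f in flows], axis=0)


indices = sample_frame_indices(num_video_frames=160)
center_index = indices[N_FRAMES // 2]  # one of the two middle frames, fed to the RGB stream

# Synthetic flow fields stand in for Farneback output on K consecutive pairs.
rng = np.random.default_rng(0)
flows = [rng.normal(scale=15.0, size=(224, 224, 2)) for _ in range(K_PAIRS)]
flow_stack = stack_flows([normalize_flow(f) for f in flows])
print(indices, center_index, flow_stack.shape)
```

Which 10 of the 15 available consecutive pairs are used is not specified in this summary, so the sketch leaves that choice to the caller.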

Figure 3: Optical flow extraction pipeline demonstrating normalization, interleaving, and motion visualization for u/v channels.

Comparative Analysis of Fusion Strategies

The core empirical contribution is a controlled comparison of five fusion approaches:

Late Fusion: Each stream predicts independently and the softmax outputs are averaged; parameter-light, with no explicit inter-stream feature interaction.
Concatenation Fusion: Concatenates the two projected feature vectors and passes them through an MLP; higher capacity, but potentially prone to overfitting with limited data.
Cross-Attention Fusion: Attends from RGB to flow features via a multi-head mechanism; uses ViT's global context to weight motion cues.
Weighted Fusion: Learns two scalar weights, one per modality, normalized with a softmax to give a convex combination; highly interpretable, with minimal additional parameters.
Gated Fusion: Learns a 192-dim sigmoid gate that weights the two streams per feature; maximally flexible, but parameter-heavy.

All fusion is performed post-projection, enabling fair ablation: only the fusion operator varies between runs.
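The five operators listed above can be written against the same pair of 192-dimensional inputs. The NumPy sketch below is one plausible reading of each description, not the authors' implementation. Assumptions: layer shapes, the ReLU in the MLP, the residual connection and 4 heads in cross-attention, the use of a set of flow tokens as keys and values, and all random weights and inputs are illustrative choices that the text does not specify.

```python
import numpy as np

D = 192  # common feature width after projection (from the text)
rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def late_fusion(rgb_logits, flow_logits):
    """Average the class probabilities predicted by each stream."""
    return 0.5 * (softmax(rgb_logits) + softmax(flow_logits))


def concat_fusion(f_rgb, f_flow, w1, w2):
    """Concatenate both features, then apply a two-layer MLP (ReLU assumed)."""
    hidden = np.maximum(np.concatenate([f_rgb, f_flow], axis=-1) @ w1, 0.0)
    return hidden @ w2


def cross_attention_fusion(q_rgb, flow_tokens, wq, wk, wv, heads=4):
    """RGB feature queries a set of flow tokens (multi-head scaled dot-product)."""
    batch, tokens, _ = flow_tokens.shape
    dh = D // heads
    q = (q_rgb @ wq).reshape(batch, heads, 1, dh)
    k = (flow_tokens @ wk).reshape(batch, tokens, heads, dh).transpose(0, 2, 1, 3)
    v = (flow_tokens @ wv).reshape(batch, tokens, heads, dh).transpose(0, 2, 1, 3)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh))  # (batch, heads, 1, tokens)
    attended = (attn @ v).reshape(batch, D)                    # heads concatenated
    return q_rgb + attended  # residual connection (assumed)


def weighted_fusion(f_rgb, f_flow, stream_logits):
    """Convex combination using two softmax-normalised scalars."""
    a_rgb, a_flow = softmax(stream_logits)
    return a_rgb * f_rgb + a_flow * f_flow


def gated_fusion(f_rgb, f_flow, w_gate, b_gate):
    """Per-feature sigmoid gate computed from both streams."""
    z = np.concatenate([f_rgb, f_flow], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * f_rgb + (1.0 - gate) * f_flow


def init(*shape):
    """Random weights standing in for learned parameters."""
    return rng.normal(0.0, shape[0] ** -0.5, size=shape)


B, M = 4, 49  # batch size and number of flow tokens (both illustrative)
f_rgb, f_flow = rng.normal(size=(B, D)), rng.normal(size=(B, D))
flow_tokens = rng.normal(size=(B, M, D))

probs = late_fusion(rng.normal(size=(B, 11)), rng.normal(size=(B, 11)))
fused = {
    "concat": concat_fusion(f_rgb, f_flow, init(2 * D, D), init(D, D)),
    "cross_attention": cross_attention_fusion(
        f_rgb, flow_tokens, init(D, D), init(D, D), init(D, D)),
    "weighted": weighted_fusion(f_rgb, f_flow, np.zeros(2)),
    "gated": gated_fusion(f_rgb, f_flow, init(2 * D, D), np.zeros(D)),
}
print({name: out.shape for name, out in fused.items()}, probs.shape)
```

Every feature-level operator returns a `(B, 192)` array, so a single classifier head can follow any of them; late fusion is the exception because it combines class probabilities rather than features.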

Empirical Results and Fusion Efficacy

Experiments on UCF11 (11 classes, 1600 videos) and UCF50 (50 classes, 6681 videos) assess how fusion efficacy scales with dataset complexity:

On UCF11, all dual-stream strategies surpass the RGB-only baseline of 95.94%, with cross-attention fusion achieving the best result at 98.12% accuracy, a gain of 2.18 percentage points that the authors attribute to explicit modeling of appearance-driven motion relevance. Because repeated multi-seed runs are left to future work, the statistical significance of this gap has not been established. Weighted fusion learns near-equal stream weights ( $\alpha_{\text{RGB}}=0.507$ , $\alpha_{\text{flow}}=0.493$ ), consistent with the hypothesis that the two streams contribute complementary information.
On UCF50, the greater diversity and finer action granularity raise the importance of robust appearance encoding. Here, weighted fusion yields a test accuracy of 96.86%, the best among the fusion strategies on this dataset, and it is the most consistent strategy across both benchmarks. The learned weights shift in favor of RGB ( $\alpha_{\text{RGB}}=0.554$ , $\alpha_{\text{flow}}=0.446$ ), which arguably reflects the greater discriminability of appearance features in a more visually varied action space.

Figure 4: Sample frames from UCF11 and UCF50, illustrating differences in action diversity and visual variation.
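The reported stream weights can be related back to the two scalars that weighted fusion learns. Because a softmax is unchanged when both inputs are shifted by the same constant, only the difference between the two scalars is recoverable from the weights. The short sketch below computes that difference for both datasets; the function names are hypothetical, and the weight values are the ones reported above.

```python
import math


def stream_weights(logit_rgb, logit_flow):
    """Softmax over two learnable scalars, giving a convex pair of weights."""
    m = max(logit_rgb, logit_flow)
    e_rgb, e_flow = math.exp(logit_rgb - m), math.exp(logit_flow - m)
    total = e_rgb + e_flow
    return e_rgb / total, e_flow / total


def logit_gap(alpha_rgb, alpha_flow):
    """Difference between the two underlying scalars implied by a weight pair."""
    return math.log(alpha_rgb / alpha_flow)


reported = {"UCF11": (0.507, 0.493), "UCF50": (0.554, 0.446)}
for name, (a_rgb, a_flow) in reported.items():
    gap = logit_gap(a_rgb, a_flow)
    recovered = stream_weights(gap, 0.0)
    print(f"{name}: gap={gap:.3f}, weights={recovered[0]:.3f}/{recovered[1]:.3f}")
```

The implied gap is roughly 0.03 on UCF11 and roughly 0.22 on UCF50, so the shift toward RGB is real but modest: in both cases the flow stream keeps well over 40% of the weight.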

Figure 5: Top-K accuracy comparison across all fusion strategies shows attention and weighted fusion consistently high on both datasets.

Across both datasets, the flow-only stream remains substantially weaker (maximum test accuracy 59.99%), hampered by flow quality that degrades under video compression artifacts. In the multimodal setting, however, even this weak motion encoder measurably improves appearance-based prediction.

Practical and Theoretical Implications

Practical implications include:

The best-performing fusion strategy is dataset-dependent: cross-attention excels for small, less complex action spaces, while weighted fusion is robust when data or class count increases.
Fusion mechanisms with more parameters risk undertraining on limited data; thus, weighted fusion's parsimony is advantageous for small or moderate-sized benchmarks.
Even lightweight motion encoders can meaningfully complement powerful appearance backbones.
Interpretability arises naturally from weighted fusion, with stream weights providing post-hoc insight into modality contribution as a function of task difficulty.
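The parsimony argument can be made concrete by counting the parameters each fusion operator adds on top of the two encoders. The layer shapes below are assumptions chosen for illustration (single linear layers with bias at width 192, a two-layer MLP for concatenation, four projections for cross-attention); the paper's actual layer sizes are not given in this summary, so the counts indicate orders of magnitude only. Classifier heads and encoders are excluded.

```python
D = 192  # common feature width (from the text)


def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of one fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


# Fusion-specific parameters only, under the assumed layer shapes.
param_counts = {
    "late": 0,                                            # averages probabilities
    "weighted": 2,                                        # two scalars
    "gated": linear_params(2 * D, D),                     # gate from both streams
    "concat": linear_params(2 * D, D) + linear_params(D, D),
    "cross_attention": 4 * linear_params(D, D),           # query, key, value, output
}

for name, count in sorted(param_counts.items(), key=lambda kv: kv[1]):
    print(f"{name:>16}: {count:,}")
```

Under these assumptions, weighted fusion adds two parameters while the other feature-level operators add tens of thousands, which is the gap the parsimony point refers to.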

Theoretical implications include:

The observed synergy supports the hypothesis that motion, even when suboptimally estimated, remains a useful signal complementary to appearance.
Optimal fusion is not invariant with respect to dataset scale or modality discriminability; this cautions against overgeneralizing design patterns from one benchmark to another.
The architecture argues for modality-matched backbones in video tasks, a direction that may extend to other multimodal neural systems.

Limitations and Future Directions

The study is limited by its use of classical Farneback flow, which is known to degrade under compression. Modern learning-based flow estimators (e.g., RAFT, PWC-Net) should be investigated as replacements. The single-frame RGB input, while appropriate for ablation, does not exploit temporal dependencies that transformer sequences or spatiotemporal convolutions could model. Larger-scale experiments on UCF101, Kinetics, or HMDB51 would further stress-test how well the architecture and the fusion findings generalize. Finally, repeated runs with multiple seeds and larger batch sizes would allow assessment of statistical stability.
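To give a rough sense of why statistical stability matters here, the sketch below computes a Wilson score interval for a single-run accuracy. The test-set size is hypothetical: it assumes a 20% hold-out of UCF11's 1,600 videos (320 clips), which this summary does not state. The interval captures sampling noise from a finite test set only and says nothing about seed-to-seed variation.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for an accuracy measured on n samples."""
    denom = 1.0 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


ASSUMED_TEST_SIZE = 320  # hypothetical: 20% hold-out of 1,600 videos

baseline = wilson_interval(0.9594, ASSUMED_TEST_SIZE)         # RGB-only
cross_attention = wilson_interval(0.9812, ASSUMED_TEST_SIZE)  # best fusion on UCF11

print(f"RGB-only:        {baseline[0]:.4f} to {baseline[1]:.4f}")
print(f"cross-attention: {cross_attention[0]:.4f} to {cross_attention[1]:.4f}")
```

Under this assumed test size the two intervals overlap, which is one reason multi-seed runs are worth doing before reading too much into a gap of about two points.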

Conclusion

DualStreamHybrid contributes to video action recognition by matching architecture to modality structure, by comparing five fusion strategies under controlled conditions, and by providing evidence that motion and appearance are complementary. Its empirical findings, especially the dependence of the best fusion strategy on the dataset, offer practical guidance for system builders and theoreticians alike. Going forward, strengthening the motion stream with modern flow estimation and extending the study to longer temporal windows and larger datasets will likely yield further improvements and clarify the conditions under which two-stream fusion helps.
