ViDex: Visual Dual-path Extractor
- ViDex is a dual-path extractor that mimics primate ventral and dorsal streams to separately capture high-level semantics and fine local details.
- It integrates semantic vectors from transformer encoders with raw spatial features via dedicated fusion modules for robust multi-domain applications.
- Empirical evaluations demonstrate gains in low-dose CT image enhancement, audio-visual speech extraction, and biologically grounded vision modeling.
The Visual Dual-path Extractor (ViDex) is an architectural paradigm for extracting and fusing complementary streams of visual information—semantic context and fine local detail—drawing direct inspiration from the ventral ("what"/semantic) and dorsal ("where"/spatial-detail) pathways of the primate visual system. ViDex modules have been developed in domains including medical image enhancement, cross-modal fusion for audio-visual speech separation, and biologically grounded computer vision, serving as a unifying framework for parallel, interaction-rich perceptual processing.
1. Foundational Principles and Biological Motivation
ViDex's architectural premise is based on two independent, yet interacting, information channels:
- Semantic (Ventral-like) Path: Focuses on abstract, high-level scene understanding or object identity, analogous to the ventral "what" stream in visual neuroscience (Nabila et al., 18 Nov 2025, Choi et al., 2023).
- Local/Spatial (Dorsal-like) Path: Preserves fine-scale transitions, edges, and spatial relations, emulating the dorsal "where/how" pathway.
Implementations such as D-PerceptCT and the dual-stream neural model use explicit designs to mirror human visual system (HVS) pathways: D-PerceptCT's semantic branch exploits semantic priors from large-scale pretrained models (DINOv2), while the local branch emphasizes raw pixel detail. In (Choi et al., 2023), retinal-sampling transforms enable separation of magnocellular (periphery/global) and parvocellular (fovea/local) input patterns, enhancing biological fidelity.
2. Architectural Designs and Modular Structure
ViDex adopts explicit dual-path processing, wherein each modality or scale is handled by a dedicated pathway and merged through purposeful fusion mechanisms. Notable instantiations include:
A. D-PerceptCT (Medical Imaging) (Nabila et al., 18 Nov 2025):
- Semantic Feature Extractor Branch (SFEB):
  - Receives a replicated-channel input, divides it into non-overlapping 16×16 patches, and feeds the patches into a DINOv2-based transformer encoder to produce per-patch semantic vectors.
  - The encoder outputs are stacked and upsampled back to full spatial resolution, yielding a dense semantic feature map.
- Local Detail Extractor Branch (LDEB):
  - Processes the native grayscale input directly through two 3×3 convolutional layers, generating local feature maps without normalization layers.
- Feature Fusion Module (2FM):
  - Concatenates the semantic and local-detail features and blends channels via a 1×1 convolution, yielding a unified representation suitable for further state-space modeling.
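The following is a minimal PyTorch sketch of this dual-branch layout; the lightweight transformer stands in for the frozen DINOv2 encoder, and the channel widths, upsampling mode, and module names are illustrative assumptions rather than the published D-PerceptCT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathExtractor(nn.Module):
    """Sketch of a ViDex-style dual-path extractor (semantic + local detail)."""

    def __init__(self, embed_dim=384, local_ch=64, fused_ch=64, patch=16):
        super().__init__()
        # Semantic branch: stand-in for a frozen DINOv2 ViT.
        # Conv2d with stride=patch implements non-overlapping 16x16 patch embedding.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6, batch_first=True),
            num_layers=2,
        )
        # Local detail branch: two 3x3 convolutions, no normalization layers.
        self.local = nn.Sequential(
            nn.Conv2d(1, local_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(local_ch, local_ch, 3, padding=1),
        )
        # Fusion: concatenate channels, then blend with a 1x1 convolution.
        self.fuse = nn.Conv2d(embed_dim + local_ch, fused_ch, kernel_size=1)

    def forward(self, x):                            # x: (B, 1, H, W) grayscale slice
        B, _, H, W = x.shape
        # Semantic path on a replicated-channel (3-channel) copy of the input.
        tokens = self.patch_embed(x.repeat(1, 3, 1, 1))           # (B, D, H/16, W/16)
        h, w = tokens.shape[-2:]
        tokens = self.encoder(tokens.flatten(2).transpose(1, 2))  # (B, N, D)
        sem = tokens.transpose(1, 2).reshape(B, -1, h, w)
        sem = F.interpolate(sem, size=(H, W), mode="bilinear", align_corners=False)
        # Local path on the native grayscale input.
        det = self.local(x)
        # Channel-wise fusion.
        return self.fuse(torch.cat([sem, det], dim=1))
```

A call such as `DualPathExtractor()(torch.randn(1, 1, 256, 256))` returns a fused (1, 64, 256, 256) feature map suitable for downstream modeling.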
B. Dual-Stream Neural Network (Human Vision Modeling) (Choi et al., 2023):
- Retinal Transform Sampling: Generates dual views (global/peripheral and local/foveal) for each fixation, using parametric warping.
- Parallel CNN Streams: WhereCNN (spatial attention) and WhatCNN (object classification), both sharing a backbone but diverging at readout (saliency logits and GRU-based category recognition).
- Recurrent Interaction: Attention-driven spatial sampling and inhibition-of-return mechanisms govern iterative fixations and stream interactions.
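A schematic Python sketch of this recurrent interaction is given below. The submodules `retinal_transform`, `where_cnn`, `what_cnn`, `gru_cell`, and `classifier` are assumed callables, and the saliency grid size, greedy fixation choice, and inhibition-of-return strength are illustrative simplifications rather than the exact model of (Choi et al., 2023).

```python
import torch
import torch.nn.functional as F

def fixation_loop(image, retinal_transform, where_cnn, what_cnn, gru_cell, classifier,
                  n_fixations=5, grid=32):
    """Recurrent where/what interaction: WhereCNN proposes the next fixation from the
    peripheral view, WhatCNN accumulates object evidence from the foveal view, and an
    inhibition-of-return (IoR) map discourages revisiting previous fixations."""
    B = image.shape[0]
    fixation = image.new_zeros(B, 2)                 # start at the image center
    ior = image.new_zeros(B, grid * grid)            # accumulated inhibition of return
    hidden = None
    for _ in range(n_fixations):
        # Retinal transform yields a coarse peripheral view and a high-res foveal crop.
        periphery, fovea = retinal_transform(image, fixation)
        # "Where" stream: saliency logits over a coarse grid, suppressed by IoR.
        saliency = where_cnn(periphery).flatten(1) - ior
        idx = saliency.argmax(dim=-1)                # greedy choice of the next fixation
        row, col = (idx // grid).float(), (idx % grid).float()
        fixation = torch.stack([col, row], dim=-1) / (grid - 1) * 2 - 1   # to [-1, 1]
        ior = ior + 5.0 * F.one_hot(idx, grid * grid).float()             # mark as visited
        # "What" stream: recurrent (GRU) accumulation of category evidence.
        hidden = gru_cell(what_cnn(fovea), hidden)
    return classifier(hidden)                        # multi-object logits
```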
C. Audio-Visual Speech Extraction (Xu et al., 2022):
- Dual-Path Attention: Audio and video features are aligned at chunkwise temporal scales, interleaving intra-chunk (modality-specific, short-term) and inter-chunk (cross-modal, long-term) transformer blocks.
- FaceNet Embeddings: Visual features are extracted per frame with FaceNet and aligned to audio chunks at matching temporal rates without upsampling, enabling efficient cross-modal fusion.
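The sketch below illustrates one intra-/inter-chunk stage of this idea in PyTorch; the chunk layout, the mean-pooled chunk summaries used as queries, and the single visual embedding per chunk are assumptions for exposition rather than the exact block design of (Xu et al., 2022).

```python
import torch
import torch.nn as nn

class DualPathCrossModalBlock(nn.Module):
    """One intra-chunk / inter-chunk stage of dual-path cross-modal attention."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        # Intra-chunk: modality-specific self-attention over frames within a chunk.
        self.intra = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Inter-chunk: cross-modal attention from audio chunks to visual embeddings.
        self.inter = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, video):
        # audio: (B, S, K, D), i.e. S chunks of K audio frames each
        # video: (B, S, D), one visual embedding per chunk (e.g. FaceNet, projected to D)
        B, S, K, D = audio.shape
        # Intra-chunk path: short-term, within each chunk.
        a = self.intra(audio.reshape(B * S, K, D)).reshape(B, S, K, D)
        # Inter-chunk path: long-term, across chunks, with video as keys/values.
        q = a.mean(dim=2)                                   # (B, S, D) chunk summaries
        ctx, _ = self.inter(q, video, video)                # cross-modal attention
        return self.norm(a + ctx.unsqueeze(2))              # broadcast context into chunks
```

Several such blocks would be stacked, with the final output used to estimate a time-domain separation mask.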
3. Mathematical Formulation and Fusion Algorithms
ViDex designs utilize mathematically explicit operations to achieve dual-path separation and interaction. Core computations include:
- Patch Embedding and Transformer Encoding (Nabila et al., 18 Nov 2025): the replicated-channel input is partitioned into non-overlapping 16×16 patches, encoded by the DINOv2 transformer into per-patch semantic vectors, and upsampled into a dense semantic map.
- Local Feature Extraction: two stacked 3×3 convolutions over the native grayscale input produce local detail feature maps.
- Feature Fusion: semantic and local feature maps are concatenated along the channel dimension and blended by a 1×1 convolution.
- Audio-Visual Inter-chunk Attention (Xu et al., 2022): intra-chunk self-attention captures short-term, modality-specific structure, while inter-chunk attention lets audio chunks attend to temporally aligned visual embeddings for long-term, cross-modal context.
- WhereCNN/WhatCNN Interaction (Choi et al., 2023): saliency predicted by WhereCNN selects the next fixation, whose foveal view is classified by WhatCNN's recurrent (GRU) readout, with inhibition of return preventing revisits.
These fusion mechanisms ensure that semantic and spatial information are processed in complementary streams and merged at full spatio-temporal granularity, whether for enhancement (CT), cross-modal fusion (audio-visual speech), or scene understanding.
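As an illustration, the D-PerceptCT-style computation described above can be written compactly as follows, using generic placeholder symbols rather than the paper's own notation:

```latex
\[
\begin{aligned}
\mathbf{F}_{\text{sem}}   &= \operatorname{Upsample}\!\Big(\operatorname{Enc}_{\text{DINOv2}}\big(\operatorname{Patchify}_{16\times16}(\mathbf{x}_{\mathrm{rep}})\big)\Big),\\
\mathbf{F}_{\text{loc}}   &= \operatorname{Conv}_{3\times3}\!\big(\operatorname{Conv}_{3\times3}(\mathbf{x})\big),\\
\mathbf{F}_{\text{fused}} &= \operatorname{Conv}_{1\times1}\!\big([\,\mathbf{F}_{\text{sem}} \,\|\, \mathbf{F}_{\text{loc}}\,]\big),
\end{aligned}
\]
```

where $\mathbf{x}$ is the grayscale input, $\mathbf{x}_{\mathrm{rep}}$ its replicated-channel copy, and $[\cdot \,\|\, \cdot]$ denotes channel-wise concatenation.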
4. Training Procedures and Objective Functions
Distinct task-specific losses structurally separate the learning objectives for each path within ViDex frameworks:
- Perceptual Metrics (Image Enhancement) (Nabila et al., 18 Nov 2025):
- No ablation isolating ViDex is reported; performance is attributed to the integration of semantic and detailed representations.
- Key perceptual metrics include LPIPS (0.0104 for D-PerceptCT vs. 0.0888 for baseline), ST-LPIPS, DISTS, and PIQE, indicating improved detail retention and perceptual similarity.
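For reference, LPIPS scores of this kind can be computed with the widely used lpips Python package; the AlexNet backbone and the random stand-in tensors below are illustrative, not the evaluation setup of (Nabila et al., 18 Nov 2025).

```python
import torch
import lpips  # pip install lpips

# LPIPS expects inputs scaled to [-1, 1]; lower scores mean higher perceptual similarity.
loss_fn = lpips.LPIPS(net='alex')                 # AlexNet-based perceptual metric
enhanced = torch.rand(1, 3, 256, 256) * 2 - 1     # stand-in for an enhanced CT slice
reference = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in for the normal-dose reference
score = loss_fn(enhanced, reference)              # (1, 1, 1, 1) tensor
print(score.item())
```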
- Spatial Attention and Object Recognition (Brain Alignment) (Choi et al., 2023):
- Attention loss ($\mathcal{L}_{\text{attn}}$): pixel-wise cross-entropy on the predicted saliency map.
- Object classification loss ($\mathcal{L}_{\text{cls}}$): binary cross-entropy over multi-object labels.
- Combined loss: the attention and classification terms are combined into a single objective, $\mathcal{L} = \mathcal{L}_{\text{attn}} + \mathcal{L}_{\text{cls}}$ (up to task weighting).
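A minimal PyTorch sketch of this dual objective follows; treating the saliency target as a normalized spatial distribution and including an optional weighting factor are assumptions, not details taken from the source.

```python
import torch
import torch.nn.functional as F

def dual_stream_loss(saliency_logits, saliency_target, class_logits, class_target, w_cls=1.0):
    # Attention loss: cross-entropy between predicted and ground-truth saliency,
    # treating the saliency map as a spatial probability distribution.
    b = saliency_logits.shape[0]
    log_p = F.log_softmax(saliency_logits.view(b, -1), dim=-1)
    q = saliency_target.view(b, -1)
    q = q / q.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    l_attn = -(q * log_p).sum(dim=-1).mean()

    # Object classification loss: binary cross-entropy over multi-object labels.
    l_cls = F.binary_cross_entropy_with_logits(class_logits, class_target)

    return l_attn + w_cls * l_cls
```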
- Speech Separation (Audio-Visual Fusion) (Xu et al., 2022):
- Scale-invariant signal-to-noise ratio (SI-SNR) loss, trained on mixtures with varying numbers of speakers, using time-domain mask estimation.
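The SI-SNR term follows the standard scale-invariant formulation; a PyTorch sketch of the corresponding training loss (negated for minimization over a batch of waveforms) is shown below.

```python
import torch

def si_snr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR for (B, T) estimated and reference waveforms."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled target component.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_snr.mean()   # minimize the negative SI-SNR
```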
Training schedules, hyperparameters, and regularization techniques (e.g., Adam optimizer, learning-rate schedules, batch sizes, number of epochs) are reported in the respective papers.
5. Empirical Evaluation and Task-Specific Outcomes
ViDex designs have demonstrated performance improvements across several domains:
- CT Image Enhancement (Nabila et al., 18 Nov 2025):
- Achieves substantial gains in perceptual quality and anatomical integrity, outperforming contemporary methods across all measured metrics on the Mayo2016 dataset.
- t-SNE analysis reveals dose-invariance of DINOv2 embeddings, affirming semantic branch robustness.
- Audio-Visual Speech Extraction (Xu et al., 2022):
- SI-SNR improvements: +7 dB over audio-only and naive fusion models.
- Results generalize across mixture sizes (2–5 speakers), with real-time inference and moderate computational requirements.
- Biologically Grounded Vision (Choi et al., 2023):
- High saliency prediction accuracy and macro-F1 = 61.0% for multi-object recognition.
- fMRI encoding reveals WhereCNN and WhatCNN branches map preferentially to dorsal and ventral cortices, respectively, with ablative tests confirming the functional necessity of dual-path objectives and recurrence.
6. Context, Significance, and Implications
ViDex provides a computational framework directly inspired by functional segregation in biological vision. Its two-stream strategy enables more robust, adaptive, and interpretable scene analysis by:
- Allowing specialized extraction and fusion of semantic and spatial cues.
- Facilitating cross-modal fusion where temporal or feature resolution mismatch would otherwise undermine joint modeling (Xu et al., 2022).
- Enabling brain-inspired models to closely emulate human patterns of fixation, object recognition, and spatial attention, and aligning with human brain responses under naturalistic viewing (Choi et al., 2023).
A plausible implication is that dual-path extractors such as ViDex may enable generalized improvements in multimodal, spatially sensitive perceptual tasks beyond the domains tested. Their biologically motivated design may also offer insight into the computational differentiation of dorsal and ventral processing streams in human cortex, and into why functional segregation emerges from distinct objective functions.
7. Limitations and Variants
Existing studies do not report direct ablation of ViDex modules isolated from the larger system; thus, performance attribution is indirect. Details of internal fusion mechanisms (e.g., the 2FM layers in D-PerceptCT) are often described only at a high level, with some architectural choices dependent on downstream task demands. While empirical evidence for robustness and cross-domain applicability is strong, generalization to novel tasks or data types will require further validation. Control-stream ablations in (Choi et al., 2023) suggest that both input-sampling bias and learning-objective separation are necessary for maximal benefits; one without the other is insufficient for strict dorsal/ventral functional alignment.
Key Papers:
- "D-PerceptCT: Deep Perceptual Enhancement for Low-Dose CT Images" (Nabila et al., 18 Nov 2025)
- "Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction" (Xu et al., 2022)
- "A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains" (Choi et al., 2023)