
AVP-Fusion Frameworks

Updated 1 January 2026
  • AVP-Fusion is an adaptive multi-modal fusion framework that integrates heterogeneous sensor, feature, or modality streams to enhance accuracy, safety, and robustness across diverse domains.
  • It employs advanced techniques such as spatial/channel adaptive modules, cross-modal alignment, contrastive learning, and hybrid attention to optimize information integration.
  • Applications span collaborative BEV perception in automated valet parking, antiviral peptide identification, audio-visual pretraining, quality assessment, and adversarial attack design with significant empirical gains.

AVP-Fusion encompasses a class of adaptive multi-modal fusion frameworks designed for diverse domains, notably collaborative perception in automated valet parking, two-stage antiviral peptide identification, audio-visual pretraining for audio generation, robust AV quality prediction, and adversarial attacks on large-scale recommender systems. These frameworks share a common philosophy: integrate heterogeneous sensor, feature, or modality streams through refined fusion architectures (often with adaptive weighting, cross-modal alignment, and contrastive or attention-based learning) to deliver gains in accuracy, safety, or robustness. The most technically mature instantiations lie in three principal areas: collaborative BEV-based automotive perception, antiviral peptide sequence modeling, and hybrid-attention video/audio understanding.

1. Multi-Modal and Collaborative Perception Fusion Paradigms

AVP-Fusion frameworks in automated valet parking (AVP) exploit multi-sensor and infrastructure-assisted collaborative perception (CP) to overcome line-of-sight limitations, occlusions, and bandwidth constraints in garage environments. These systems operate primarily over unified bird’s-eye view (BEV) representations, integrating camera and LiDAR data from both onboard and roadside infrastructure. The fusion algorithm applies a combination of spatial-adaptive (S-Ada) and channel-adaptive (C-Ada) modules to preserve salient features across modalities. Explicit formulation is as follows (see (Jia et al., 2024)):

  • Camera-BEV projection: For each image pixel (u, v) and depth d, 3D coordinates are recovered and discretized into BEV grid cells.
  • LiDAR-BEV projection: Using a PointPillars scheme, vertical point cloud “pillars” are encoded and scattered to BEV.
  • Feature fusion: Multiple BEV maps (vehicle and infrastructure camera/LiDAR) are adaptively merged:

X_{\rm S} = \mathrm{Conv3D}\bigl(\bigl[X_{\rm avg}\;;\;X_{\rm max}\bigr]\bigr),\quad X_{\rm C} = \sum_{i} w_i\,X_i,\quad X_{\rm fuse} = \mathrm{C\text{-}AdaFusion}(X_{\rm S}, X_{\rm C})

where X_i are the input BEV maps and w_i are learned channel weights.

  • Compression: Channel-wise 1×1 convolutions, spatial downsampling, sparsification, and quantization collectively achieve a >380-fold size reduction, yielding per-frame BEV feature maps of 48–72 KB and enabling <1.5 MB/s per-modality transmission over NR-V2X (New Radio Vehicle-to-Everything) links.

The consolidated result is a dramatic increase in 3D detection average precision for both cars and pedestrians; infrastructure-assisted fusion raises safe maximum cruising speeds in safety-critical AVP scenarios by up to 3 m/s, confirming real-world impact (Jia et al., 2024).
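As a concrete illustration, the channel-adaptive merge and the avg/max spatial descriptor above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the function names, array shapes, and the softmax normalisation of the learned weights w_i are assumptions, and the 3D convolution of the S-Ada branch is reduced to building its stacked input.

```python
import numpy as np

def channel_adaptive_fusion(bev_maps, channel_logits):
    """Hypothetical C-Ada step: merge several BEV feature maps with
    learned per-map channel weights, softmax-normalised across maps.

    bev_maps:       list of arrays, each of shape (C, H, W)
    channel_logits: array of shape (n_maps, C), the learned logits w_i
    """
    # Softmax over the map axis so the weights for each channel sum to 1.
    w = np.exp(channel_logits) / np.exp(channel_logits).sum(axis=0, keepdims=True)
    fused = np.zeros_like(bev_maps[0])
    for i, x in enumerate(bev_maps):
        fused += w[i][:, None, None] * x  # broadcast channel weights over H, W
    return fused

def spatial_descriptor(x):
    """S-Ada input: stack channel-wise average and max maps, shape (2, H, W).
    A Conv3D over this stack (omitted here) would produce X_S."""
    return np.stack([x.mean(axis=0), x.max(axis=0)])
```

With zero logits every map receives uniform weight, so fusing an all-ones and an all-zeros map yields 0.5 everywhere; training would shift these weights toward the more informative source per channel.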

2. Adaptive Multi-Modal Fusion in Antiviral Peptide (AVP) Identification

In computational biology, AVP-Fusion denotes a two-stage framework for antiviral peptide (AVP) identification and subclass classification (Wen et al., 25 Dec 2025). The architecture first constructs a panoramic feature space from peptide sequences using the following schema:

  • Stage 1: Fuses ESM-2 transformer embeddings (deep context) with ten hand-crafted descriptors (AAC, DPC, etc.). Local motif features are extracted by a 1D CNN; global dependencies are modeled with a two-layer BiLSTM. An adaptive gating mechanism dynamically weighs CNN vs. BiLSTM outputs per sequence:

\alpha = \sigma(W_g\,[v_{\text{cnn}}\,;\,v_{\text{bilstm}}] + b_g),\quad E_{\text{final}} = \alpha \cdot v_{\text{cnn}} + (1-\alpha) \cdot v_{\text{bilstm}}

  • Contrastive learning incorporates OHEM-driven hard negative mining and BLOSUM62-based biological augmentation, sharpening margins for sequence discrimination.
  • Stage 2: Encoder weights are transferred to fine-tune heads for subclass (family, virus) tasks using focal loss and test-time augmentation.
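The adaptive gate in Stage 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed shapes (a scalar gate from a single weight vector W_g), not the published code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v_cnn, v_bilstm, W_g, b_g):
    """Sketch of the adaptive gate:
    alpha   = sigma(W_g [v_cnn; v_bilstm] + b_g)
    E_final = alpha * v_cnn + (1 - alpha) * v_bilstm
    v_cnn, v_bilstm: (D,) feature vectors; W_g: (2D,) weights; b_g: scalar.
    """
    concat = np.concatenate([v_cnn, v_bilstm])   # [v_cnn; v_bilstm]
    alpha = sigmoid(W_g @ concat + b_g)          # scalar gate in (0, 1)
    return alpha * v_cnn + (1.0 - alpha) * v_bilstm
```

With untrained (zero) gate parameters, alpha = 0.5 and the output is a plain average; training lets each sequence decide how much to trust local CNN motifs versus global BiLSTM context.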

On benchmark datasets, the method attains state-of-the-art accuracy (ACC = 0.9531, MCC = 0.9064) and interpretable peptide representations, with adaptive gating learning when to prioritize local vs. global features (Wen et al., 25 Dec 2025).

3. Audio-Visual Pretraining Fusion in Generative Multimedia

In video-to-audio (V2A) generation, AVP-Fusion arises in the context of SlowFast-Contrastive Audio-Visual Pretraining (SF-CAVP) (Yang et al., 24 Sep 2025). The approach targets the intricate semantics and swift temporal cues in multi-event scenarios:

  • Dual-stream encoding: Both audio and video are decomposed into S segments per input and run through parallel Slow (low temporal, high channel, core semantics) and Fast (high temporal, low channel, rapid dynamics) pathways.
  • Lateral fusion: At each stage, time-strided convolutions inject Fast pathway features into the Slow stream:

X_{\text{slow}}' = X_{\text{slow}} + \phi(X_{\text{fast}})

  • Global pooling and projection: Concatenated pooled features are projected to a shared embedding space.
  • Cross-modal alignment: InfoNCE contrastive loss over segment-level embeddings enforces audio/video synchrony.
  • Direct preference optimization: This AVP model serves as a reward module in AVP-RPO, directly improving semantic and temporal alignment of the generated audio.
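The lateral fusion step can be sketched as follows; as a simplification, the time-strided convolution phi is replaced here by temporal striding plus a linear channel projection, and all shapes are assumptions rather than the SF-CAVP configuration:

```python
import numpy as np

def lateral_fuse(x_slow, x_fast, stride, proj):
    """Sketch of lateral fusion: subsample the Fast pathway in time to
    match the Slow pathway, project its channels, and add residually.

    x_slow: (T_slow, C_slow)   low temporal rate, high channel count
    x_fast: (T_fast, C_fast)   high temporal rate, low channel count
    proj:   (C_fast, C_slow)   stand-in for the strided conv's channel map
    """
    x_fast_strided = x_fast[::stride]        # temporal striding, T_fast -> T_slow
    return x_slow + x_fast_strided @ proj    # residual injection phi(X_fast)
```

After this step the Slow stream carries both core semantics and the rapid dynamics surfaced by the Fast pathway, which is what the segment-level InfoNCE alignment then contrasts against audio.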

Quantitative ablations reveal up to 10.3% Fréchet Distance reduction in distribution matching and significant gains in temporal synchrony and quality vs. prior art (Yang et al., 24 Sep 2025).

4. Attention-Based AVP-Fusion for Quality Assessment

The term AVP-Fusion is also operationalized in audio-visual quality (AVQ) prediction via hybrid attention mechanisms (Salaj et al., 21 Sep 2025). Here, the approach involves:

  • Feature extraction: Pretrained GML models provide 512-dim audio embeddings; six per-frame VMAF features are pooled for video, normalized, and projected into the same 512-dimensional space.
  • Attentive fusion head: Bidirectional cross-modal attention computes context-aware features, concatenated and further refined by self-attention. The output is regressed to a scalar subjective MOS value via a 2-layer MLP.
  • Modality relevance estimation: Combines ablation sensitivity and feature-change norms to estimate per-sample audio/video importance:

I^{\text{final}}_m = \alpha \cdot \mathrm{norm}(I^{\text{abl}}_m) + \beta \cdot \bigl[1 - \mathrm{norm}(I^{\text{norm}}_m)\bigr]

  • Loss function: Combines CCC and RMSE, optimizing for both correlation and absolute error.
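The modality relevance estimate can be sketched directly from the formula above. Min-max normalisation and equal mixing coefficients are assumptions for illustration; the paper's exact normalisation and alpha/beta choices may differ:

```python
import numpy as np

def modality_relevance(i_abl, i_norm, alpha=0.5, beta=0.5):
    """Per-sample modality importance: combine normalised ablation
    sensitivity with (1 - normalised feature-change norm).

    i_abl:  ablation sensitivities across samples for modality m
    i_norm: feature-change norms across the same samples
    """
    def norm(v):  # min-max normalisation to [0, 1] (assumed)
        v = np.asarray(v, dtype=float)
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    return alpha * norm(i_abl) + beta * (1.0 - norm(i_norm))
```

A sample scores as audio-dominant, say, when ablating audio hurts the prediction most while its features change least under perturbation.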

On internal and external datasets, the method delivers markedly improved correlation and error versus simpler fusion baselines; e.g., Pearson R_p = 0.97 and RMSE = 0.22 on the internal set (Salaj et al., 21 Sep 2025).

5. AVP-Fusion in Cross-Modal Adversarial Attacks

In the context of recommender system security, AVP-Fusion denotes a visual adversarial attack strategy that aligns latent-space perturbations of images to pseudo-user preference embeddings (Ling et al., 30 Jul 2025). The pipeline:

  • Encoding: Uses (if available) multi-hop user-item interaction encoding, or item-item centroids in the absence of user data.
  • MLP-based perturbation: A learned MLP maps the preference vector to a latent perturbation, injected into the latent of a VAE-based diffusion model during the forward process:

\tilde{z}_t = z_t + \eta\,\delta

  • Adversarial constraints: Training enforces visual plausibility (CLIP, SSIM) and semantic user alignment. Ablations show that HR@5 for target-item exposure increases by 20×–100× over prior visual-only or shilling baselines.
  • Stealth: Human evaluators and feature visualizations confirm near-indistinguishability from authentic images.
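The perturbation step can be sketched as below. The two-layer ReLU MLP, its dimensions, and the single injection point are illustrative assumptions; the actual attack trains this mapping jointly with the plausibility and alignment losses:

```python
import numpy as np

def perturb_latent(z_t, pref_vec, W1, W2, eta):
    """Sketch of the MLP perturbation: map a pseudo-user preference
    embedding to a latent offset delta and inject it into the diffusion
    latent at step t, i.e. z_tilde = z_t + eta * delta.

    z_t:      (D,) latent at diffusion step t
    pref_vec: (P,) pseudo-preference embedding (multi-hop or centroid)
    W1, W2:   assumed MLP weights, shapes (H, P) and (D, H)
    eta:      perturbation strength
    """
    h = np.maximum(W1 @ pref_vec, 0.0)  # ReLU hidden layer (assumed)
    delta = W2 @ h                      # latent-space perturbation
    return z_t + eta * delta
```

Keeping eta small is what preserves the near-indistinguishability that the human evaluations report, while the preference-conditioned direction of delta is what drives exposure of the target item.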

This generalizes to settings without explicit user data by employing cluster or global popularity embeddings, enabling robust and stealthy attack variants (Ling et al., 30 Jul 2025).

6. Comparative Summary of AVP-Fusion Variants

| Use Case | Fusion Principle | Core Mechanism | Noted Results |
|---|---|---|---|
| Collaborative AVP perception | Adaptive BEV, CP, compression | S-Ada/C-Ada fusion | +21%–34% AP, +3 m/s safe speed |
| Antiviral peptide identification | Adaptive gating, contrastive | CNN/BiLSTM fusion, OHEM | 0.95 ACC, 0.91 MCC |
| Audio-visual pretraining for V2A | SlowFast dual-stream | Cross-modal InfoNCE | −10.3% FDist, +5% DeSync |
| AVQ prediction | Hybrid attention | Cross/self-modal attention | R_p = 0.97, RMSE = 0.22 |
| Adversarial visual perturbation fusion | Cross-modal MLP attack | Latent VAE + pseudo-preference | 20×–100× HR@5 gain |

Each implementation of AVP-Fusion is tightly tailored to its domain, but all share modular, context-adaptive fusion mechanisms, explicit cross-modal/statistical alignment, and empirical justification for fusion complexity.

7. Interpretability, Limitations, and Future Directions

Across domains, interpretability emerges from explicit gating weights (AVP recognition), per-modality relevance estimation (AVQ), and semantic constraint losses (adversarial perturbation). The methods demonstrate how domain-specific attention/fusion architectures enable dynamic adaptation to context, sample difficulty, or data scarcity. Nontrivial compression and bandwidth-aware design features in AVP-Fusion for AVP systems highlight the critical intersection between perception accuracy and realistic deployment constraints (Jia et al., 2024).

This suggests that AVP-Fusion methodologies will continue to propagate to additional multi-modal domains—ranging from privacy-preserving federated learning (as hinted in (Ling et al., 30 Jul 2025)) to internet-of-things sensor networks, and large-scale biological sequence modeling—driven by the need for interpretable, scalable, and robust cross-modal integration.
