Video-SALMONN: Multimodal AV Analysis
- Video-SALMONN is a family of methods for multimodal video, audio, and sonar understanding with applications in ecological fieldwork and general-purpose AV tasks.
- It combines domain-specific sonar echogram regression (e.g., ResNet-18 based fish counting) with large-scale audio-visual LLMs using fusion techniques like Q-Former.
- The framework leverages advanced techniques such as LoRA adaptation, process DPO, test-time training, and expert-in-the-loop strategies for robust and ethical deployment.
Video-SALMONN refers to a family of methods and models for multimodal video, audio, and sonar understanding, unified by a focus on audio-visual large language models (AV-LLMs), downstream perception and reasoning, and expert-in-the-loop frameworks. It plays a prominent role both in ecological/field applications (notably salmon fisheries management) and in general-purpose captioning and reasoning benchmarks. The term encompasses diverse approaches, from lightweight sonar-based escapement monitoring to large-scale, reasoning-capable multimodal LLMs, with documented deployments in both academic and field contexts (Sun et al., 2024, Brunt et al., 7 Feb 2025, Xu et al., 10 May 2025, Tang et al., 18 Jun 2025, Sun et al., 13 Oct 2025, Sun et al., 17 Feb 2025, Tang et al., 2024, Jelea et al., 20 Mar 2025, Jung et al., 27 May 2025).
1. Architecture Families and Core Methodologies
Video-SALMONN comprises two principal architecture paradigms: (i) compact, domain-specific computer vision models—e.g., those for sonar echogram-based fish counting—and (ii) large-scale, speech- and audio-augmented LLMs interfacing with multiple sensory modalities.
(A) Sonar Echogram Regression
The sonar-specific instantiation processes continuous ARIS sonar video blocks (typically thousands of frames) by converting each fixed-width window (e.g., 200 frames) into a 2D echogram image whose axes correspond to time and range, with channels encoding intensity and lateral position. This involves (see the sketch after this list):
- Background Subtraction: compute the mean background frame, subtract it from each frame, and apply connected-component analysis via multi-level thresholding.
- Collapse to Echogram: for each frame, collapse across the sonar beams to form a two-channel 1D profile over range, one channel encoding intensity and one encoding lateral position.
- Stack and Resize: stack the per-frame profiles across the window into a two-channel image and resize it to 800 × 200 for CNN input.
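A minimal NumPy sketch of this collapse-and-stack step, assuming per-frame sonar arrays of shape (range_bins, beams) and illustrative channel definitions (max-over-beams intensity, intensity-weighted mean beam index); the paper's exact formulas may differ:

```python
import numpy as np
import cv2  # used only for resizing; any image-resize routine works


def frames_to_echogram(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, range_bins, beams) sonar window into a 2-channel echogram.

    Channel 0: per-range intensity (max over beams after background subtraction).
    Channel 1: lateral position (intensity-weighted mean beam index).
    These channel definitions are illustrative, not the paper's exact formulas.
    """
    frames = frames.astype(np.float32)
    background = frames.mean(axis=0, keepdims=True)     # mean background frame
    residual = np.clip(frames - background, 0.0, None)  # background-subtracted energy

    intensity = residual.max(axis=2)                    # (T, range_bins)
    beam_idx = np.arange(frames.shape[2])[None, None, :]
    weights = residual + 1e-8                           # avoid division by zero
    lateral = (weights * beam_idx).sum(axis=2) / weights.sum(axis=2)

    echogram = np.stack([intensity, lateral], axis=-1)  # (T, range_bins, 2)
    # cv2.resize takes (width, height); output is (800, 200, 2) = time x range.
    return cv2.resize(echogram, (200, 800), interpolation=cv2.INTER_LINEAR)
```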
A ResNet-18 backbone (ImageNet-pretrained, with fixed convolutional layers) produces two outputs, the predicted upstream and downstream counts, through a ReLU-constrained linear head. The system is trained under a weak-supervision regime: strong manual count labels for a small data subset, large-scale weak labels from a prior detector+tracker pipeline, and domain-specific augmentations (vertical flip, realistic horizontal flip, echogram superposition) for generalization. The loss is mean squared error on the two counts (Brunt et al., 7 Feb 2025).
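A minimal PyTorch sketch of this counting model, assuming the 2-channel echogram is replicated/padded to 3 channels to match the ImageNet stem (that adaptation is an assumption, not stated in the source):

```python
import torch
import torch.nn as nn
from torchvision import models


class EchogramCounter(nn.Module):
    """ResNet-18 regressor: echogram image -> (upstream, downstream) counts."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        for p in backbone.parameters():
            p.requires_grad = False            # fix the pretrained conv layers
        # Replacing fc after the freeze leaves only the linear head trainable.
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.backbone(x))    # ReLU keeps counts non-negative


model = EchogramCounter()
batch = torch.randn(4, 3, 800, 200)            # echograms padded to 3 channels
targets = torch.tensor([[3.0, 0.0], [1.0, 1.0], [0.0, 0.0], [5.0, 2.0]])
loss = nn.MSELoss()(model(batch), targets)     # MSE on the two counts
```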
(B) Audio-Visual LLMs
The general-purpose video-SALMONN architecture adopts a frozen LLM (e.g., Vicuna, Qwen, Qwen2.5-VL; 7B–13B parameters) together with several pre-trained, fixed encoders:
- Visual encoder (e.g., SigLIP, ViT, InstructBLIP-ViT)
- Audio encoder (e.g., Whisper-Large-v3, BEATs)
- Fusion via Q-Former: Multi-resolution causal Q-Former (MRC-QF) concatenates and projects frame-aligned visual, audio, and speech embeddings, with sliding-window attention at multiple resolutions.
- Token interleaving: Audio and visual tokens are temporally interleaved or synchronized, then concatenated to prompt the LLM.
Low-rank adaptation (LoRA) is universally applied for parameter-efficient downstream adaptation. Upgrades include causal self-attention for temporal reasoning, a diversity loss to mitigate representational collapse in Q-Former outputs, and unpaired audio-visual mixed training to prevent modality dominance (Sun et al., 2024, Tang et al., 18 Jun 2025, Sun et al., 17 Feb 2025, Tang et al., 2024).
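As one concrete reading of the token-interleaving step, a minimal sketch that alternates synchronized visual and audio token chunks before they are concatenated into the LLM prompt (the chunk sizes and exact ordering are assumptions; the MRC-QF fusion itself is omitted):

```python
import torch


def interleave_av_tokens(visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
    """Temporally interleave synchronized visual and audio token chunks.

    visual: (T, Nv, D) -- Nv visual tokens per synchronized time step
    audio:  (T, Na, D) -- Na audio tokens per synchronized time step
    Returns (T * (Nv + Na), D), ordered [v_1, a_1, v_2, a_2, ...].
    """
    fused = torch.cat([visual, audio], dim=1)   # (T, Nv + Na, D)
    return fused.reshape(-1, fused.shape[-1])   # flattening preserves per-step order


tokens = interleave_av_tokens(torch.randn(8, 4, 4096), torch.randn(8, 2, 4096))
assert tokens.shape == (48, 4096)               # 8 steps x (4 + 2) tokens
```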
2. Task Specialization and Evaluation
Video-SALMONN systems address a broad spectrum of tasks, leveraging their flexible architecture.
(A) Field-Ecological Applications
- Fish counting and escapement: Processed echograms are fed into ResNet-18 to output upstream/downstream fish counts over 200-frame windows. On real Kenai River data, the normalized mean absolute error (nMAE) is 23% (left bank, in-distribution), rising to 30.7% (right bank, out-of-distribution); class imbalance is considerable, and the rarer downstream counts are more error-prone (Brunt et al., 7 Feb 2025).
- Detection, classification, length measurement: For video/sonar deployments, models incorporate YOLOv10, RT-DETR decoder, ResNet-50-derived segmentation (SAM2) on sonar, and DeepSORT tracking for unique-count assessment and length estimation via camera or sonar calibration equations. Representative results: video detection mAP@50 = 0.85, sonar detection mAP@50 = 0.78, length RMSE 0.07 m (Xu et al., 10 May 2025).
(B) General Audio-Visual Language Understanding
- Video question answering (QA), video captioning, reasoning: Video-SALMONN-based AV-LLMs attain leading performance on SAVE, Video-MME, MLVU, LVBench, VideoEvalPro:
- On Video-MME: 74.2% (video-SALMONN S, 8B, prompt-dependent memory), 67.4% (video-SALMONN 2, 7B, A+V)
- Caption completeness/hallucination rates (video-SALMONN 2): MissRate=10.0%, HallRate=12.9%, outperforming GPT-4o and Gemini-1.5-Pro (Sun et al., 13 Oct 2025, Tang et al., 18 Jun 2025).
- Step-level multimodal reasoning: The reasoning-enhanced video-SALMONN-o1 trains on 30K reasoning-intensive QA pairs using process direct preference optimization (pDPO), yielding +6–8% improvements over SFT; it also enables zero-shot synthetic-video detection (17.8 F1 on SynthDec, with no extra data) (Sun et al., 17 Feb 2025).
- Streaming LLMs for long-duration video: video-SALMONN S introduces test-time-training (TTT) modules for continual memory update, maintaining accuracy on 3-hour/10k-frame sequences under fixed memory budgets. The TTT_HF module uses Hessian-free conjugate-gradient optimization, and a prompt-dependent memory reader dynamically retrieves context-relevant tokens (Sun et al., 13 Oct 2025).
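The prompt-dependent memory reader can be viewed as retrieval over compressed memory tokens; a minimal sketch under that reading, using top-k cosine similarity (the function name, scoring rule, and k are assumptions, not the paper's exact mechanism):

```python
import torch
import torch.nn.functional as F


def read_memory(prompt_emb: torch.Tensor, memory: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Return the k memory tokens most relevant to the prompt, in temporal order.

    prompt_emb: (P, D) prompt token embeddings
    memory:     (M, D) compressed memory tokens accumulated while streaming
    """
    # Cosine similarity between every memory token and every prompt token.
    sims = F.normalize(memory, dim=-1) @ F.normalize(prompt_emb, dim=-1).T  # (M, P)
    scores = sims.max(dim=1).values           # best match to any prompt token
    idx = scores.topk(min(k, memory.shape[0])).indices
    return memory[idx.sort().values]          # restore temporal ordering


context = read_memory(torch.randn(16, 1024), torch.randn(10_000, 1024))
```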
3. Training Schemes and Optimization
Training in Video-SALMONN models is characterized by explicit transfer/fusion strategies, curriculum schedules, and performance-driven preference optimization.
- LoRA adaptation and multi-round DPO: Only the LoRA adapters are tuned during SFT and reinforcement learning (RL); multi-round DPO with periodic LoRA merging/re-initialization stabilizes training, avoiding reference drift and overfitting. Guided DPO (gDPO) incorporates a cross-entropy loss toward ground-truth captions, balancing RL and supervised signals (Tang et al., 18 Jun 2025, Tang et al., 2024); a minimal sketch of these preference losses follows this list.
- Process DPO (pDPO): For step-wise multimodal reasoning, pDPO applies contrastive step selection on per-step rollouts. Each reasoning path is split into steps; step-level rewards, based on empirical correctness across perturbed inputs, are used for paired preference optimization (Sun et al., 17 Feb 2025).
- Domain-specific augmentations: For echogram-based tasks, vertical/horizontal flips with channel correction and superposition of echograms simulate realistic density and directionality scenarios (Brunt et al., 7 Feb 2025). For fish segmentation, composite synthetic data pipelines apply thin-plate spline shape warping and per-channel histogram matching, followed by instance compositing into real backgrounds (Jelea et al., 20 Mar 2025).
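For reference, the preference losses above build on the standard DPO objective, which contrasts policy and reference log-probabilities of preferred vs. dispreferred outputs (whole captions for multi-round/guided DPO, individual reasoning steps for pDPO); a minimal sketch, with the gDPO cross-entropy anchor combined via an assumed weighting:

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO on (preferred, dispreferred) sequence or step log-probs.

    Each argument is a batch of summed log-probabilities under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()


def gdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              ce_to_ground_truth: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Guided DPO: preference term plus a cross-entropy anchor toward
    ground-truth captions (the weighting lam is an assumed hyperparameter)."""
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l) + lam * ce_to_ground_truth
```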
4. Advanced Model Innovations
- Streaming memory modules: video-SALMONN S addresses the scaling barrier of fixed-token LLMs, using TTT_HF to perform test-time parameter updates (SGD or Hessian-free via conjugate-gradient), producing adaptively compressed memory tokens. Prompt-dependent memory readers select the most relevant memory chunks for LLM inference, outperforming both naive offline and prior streaming methods (Sun et al., 13 Oct 2025).
- Fork-Merge Decoding (FMD): A plug-in inference-only enhancement that, without retraining, performs a forked pass through the early decoder layers on video-masked and audio-masked variants before merging the resulting hidden states for balanced reasoning. FMD consistently improves modality balance and reduces hallucination across Video-SALMONN and comparable models (Jung et al., 27 May 2025).
- Atomic event metrics: Accuracy and completeness of captions and QA responses are measured not only at the string or token level, but also via coverage/hallucination errors over a canonicalized set of atomic events extracted by a reference LLM (e.g., GPT-4o, GPT-3.5). This grounds evaluation in semantic coverage rather than shallow overlap (Tang et al., 18 Jun 2025, Tang et al., 2024); a minimal metric sketch follows this list.
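A minimal sketch of the coverage/hallucination bookkeeping over atomic events, assuming the events have already been extracted and canonicalized by the reference LLM (that extraction/matching step is the hard part and is omitted here):

```python
def event_metrics(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """MissRate: fraction of reference events absent from the caption.
    HallRate: fraction of predicted events unsupported by the reference."""
    missed = reference - predicted
    hallucinated = predicted - reference
    return {
        "MissRate": len(missed) / max(len(reference), 1),
        "HallRate": len(hallucinated) / max(len(predicted), 1),
    }


metrics = event_metrics(
    predicted={"man opens door", "dog barks", "phone rings"},
    reference={"man opens door", "dog barks", "woman waves"},
)
# {'MissRate': 0.333..., 'HallRate': 0.333...}
```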
5. Active Learning, Expert-in-the-Loop, and Ethical Dimensions
Expert-in-the-loop and active learning features are central for ecological deployments and culturally sensitive settings, notably Indigenous-led fisheries management:
- Active selection for annotation: Uncertainty- and margin-based sampling identify frames for expert review, with annotation integrated via web dashboards and feedback cycles (Xu et al., 10 May 2025); a minimal margin-sampling sketch follows this list.
- Federated learning: "FedVision" ensures raw media does not leave local deployments, preserving community data sovereignty.
- Co-governance and open dashboards: Data policies enable withdrawal rights and community oversight; model releases and visualization portals (SalmonVision) are controlled under co-governance licenses.
- Domain adaptation: Alignment/distillation pipelines facilitate cross-domain deployment, bridging domain gaps between clear-water and turbid sites via pseudo-labeling and fusion (Xu et al., 10 May 2025).
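A minimal sketch of the margin-based selection rule mentioned above, assuming per-frame softmax outputs from the detector/classifier; the budget and scoring rule are illustrative:

```python
import numpy as np


def margin_sample(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` frames whose top-2 class probabilities are closest.

    probs: (N, C) per-frame class probabilities.
    A small top-1/top-2 margin means high ambiguity, i.e., the frames
    where expert annotation is most informative.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]     # two largest probs per frame
    margin = top2[:, 1] - top2[:, 0]          # top-1 minus top-2
    return np.argsort(margin)[:budget]        # most ambiguous frames first


frame_ids = margin_sample(np.random.dirichlet(np.ones(5), size=1000), budget=50)
```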
6. Limitations and Future Directions
Across both sonar and LLM-based Video-SALMONN lines, several technical and operational limitations are noted:
- Class imbalance and rare event detection: Low downstream fish counts challenge regression models due to high relative error; rare-action and rare-object coverage in general AV-LLMs depends on sampling and augmentation.
- Overlapping trajectories in echograms: High-density fish passage creates occlusions, leading to undercounting. Attention or spatiotemporal deconvolution within CNNs is a promising direction (Brunt et al., 7 Feb 2025).
- Memory and efficiency scaling: While streaming solutions such as TTT_HF mitigate fixed-frame bottlenecks, further advances in memory management and prompt-dependent retrieval are needed for deployment on resource-constrained or low-latency hardware (Sun et al., 13 Oct 2025).
- Catastrophic forgetting and modality dominance: Rebirth tuning and unpaired mixing are implemented to prevent caption skills or modality-specific reasoning from degrading across RL rounds (Tang et al., 2024, Tang et al., 18 Jun 2025).
- Domain adaptation and generalization: Explicit domain adaptation, particularly for new river sites or previously unseen underwater lighting/visibility conditions, remains an open challenge for both sonar-based and audio-visual models (Brunt et al., 7 Feb 2025, Xu et al., 10 May 2025).
7. Summary Table: Notable Video-SALMONN Variants
| Variant | Primary Modality/Task | Core Methodology | Key Metric / Result |
|---|---|---|---|
| Echogram Regression | Sonar video, salmon escapement | ResNet-18 on temporal echogram | nMAE 23% (in-distribution, Kenai) |
| AV-LLM (v1) | Video, speech, non-speech audio | MRC Q-Former + Vicuna-13B + LoRA | +25–30pp on AV-QA over SOTA |
| Captioning (v2) | Audio-visual captioning | SFT + mrDPO + event metrics + LoRA | TotalError 22.9% vs. 31.2–38.3% baselines |
| Streaming S (SOTA) | Long-duration streaming video (A+V) | TTT_HF + prompt-dependent memory reader | Video-MME overall 74.2% |
| Reasoning-o1 | Reasoning-intensive AV-QA | pDPO step-wise reward RL | +6–8% QA, emergent zero-shot synth. det. |
Models share a focus on robust, scalable, and efficient multimodal understanding under field and benchmark conditions, with frequent use of domain-specific augmentations, preference-based optimization, and adaptation for ethical/operational deployment (Brunt et al., 7 Feb 2025, Sun et al., 2024, Sun et al., 17 Feb 2025, Xu et al., 10 May 2025, Tang et al., 18 Jun 2025, Tang et al., 2024, Jung et al., 27 May 2025, Sun et al., 13 Oct 2025, Jelea et al., 20 Mar 2025).