ARC-Hunyuan-Video-7B: Compact Multimodal Video Model
- ARC-Hunyuan-Video-7B is a compact 7B-parameter multimodal model that integrates visual, audio, and text inputs with explicit timestamp overlays for structured short video comprehension.
- It utilizes a ViT-based visual encoder, Whisper-v3 audio encoder, and a decoder-only transformer with modality alignment for precise temporal grounding and multimodal fusion.
- The multi-stage training regimen, including reinforcement learning optimization, achieves state-of-the-art performance on temporal grounding and video analytics benchmarks with efficient inference.
ARC-Hunyuan-Video-7B is a compact, 7-billion-parameter multimodal LLM (MLLM) developed for structured comprehension of real-world user-generated short videos. Targeting highly complex video content—such as material found on WeChat Channel and TikTok—the model delivers temporally structured, audio-visual-text reasoning with explicit timestamped outputs, enabling advanced downstream applications in search, recommendation, temporal grounding, and video analytics (Ge et al., 28 Jul 2025).
1. Model Architecture and Signal Fusion
ARC-Hunyuan-Video-7B is built upon the Hunyuan-7B vision-language backbone, comprising a ViT-based visual encoder, a Whisper-v3 audio encoder, and a large decoder-only transformer. This architecture is augmented with a modality alignment MLP and a timestamp overlay mechanism for fine-grained temporal awareness.
- Visual Encoder: Frames are sampled uniformly at 1 fps, with a maximum of 150 frames per video, each resized to 640×640 pixels. Each frame is overlaid with an explicit HH:MM:SS timestamp, and encoded by ViT into 112 tokens per frame.
- Audio Encoder: Audio is divided into 2-second segments; Whisper-v3 processes up to 150 such segments (in 30-second chunks), each yielding 1,500-dimensional features that are subsequently projected via an MLP to match the ViT token size.
- Audio-Visual Fusion: For each frame, the corresponding audio token sequence is zero-padded to 112 tokens and added element-wise to the visual tokens ($f_t = v_t + \tilde{a}_t$, where $\tilde{a}_t$ denotes the zero-padded audio tokens for frame $t$). The result is a synchronized sequence of multimodal embeddings that, together with positional encodings, is fed to the LLM (a minimal sketch of this fusion follows below).
- LLM Decoder: The backbone is a 32-layer transformer with hidden size 4,096, FFN size 16,384, and 32 attention heads, supporting a context of 20,000 tokens.
- Output Heads and Task Control: Task-specific prompts enable multi-granularity timestamped captioning, multi-level summarization, open- and multiple-choice Q&A, temporal grounding (start–end prediction), and chain-of-thought formatted reasoning. Generation is autoregressive from fused embeddings.
This architecture supports explicit temporal processing, with timestamp overlays directly improving grounding accuracy, as confirmed by ablation studies.
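The per-frame audio-visual fusion can be pictured with a short PyTorch-style sketch. The tensor shapes, the placeholder of 40 audio tokens per frame, and the function name are illustrative assumptions, not the released implementation:

```python
# Hedged sketch of per-frame audio-visual token fusion (illustrative only).
import torch

NUM_TOKENS_PER_FRAME = 112   # ViT tokens per 640x640 frame (per the paper)
HIDDEN_DIM = 4096            # LLM hidden size (per the paper)

def fuse_frame(visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
    """Zero-pad a frame's audio tokens to 112 and add them element-wise
    to the frame's 112 visual tokens."""
    padded_audio = torch.zeros_like(visual_tokens)
    n = min(audio_tokens.shape[0], NUM_TOKENS_PER_FRAME)
    padded_audio[:n] = audio_tokens[:n]
    return visual_tokens + padded_audio

# A video capped at 150 frames contributes at most 150 * 112 = 16,800 fused
# tokens, which fits within the 20,000-token context of the LLM decoder.
frames = [fuse_frame(torch.randn(112, HIDDEN_DIM), torch.randn(40, HIDDEN_DIM))
          for _ in range(150)]
sequence = torch.cat(frames, dim=0)   # positional encodings are added before the LLM
print(sequence.shape)                 # torch.Size([16800, 4096])
```

Because audio is fused element-wise rather than concatenated, it does not increase the sequence length beyond the visual token budget.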
2. Data Sources and Automated Annotation
The training pipeline employs an automated, bootstrapped annotation mechanism to generate high-quality multimodal supervision:
- Bootstrapped Annotations: Audio transcriptions are obtained via Whisper-v3; frame captions and optical character recognition (OCR) are provided by InternVL-2.5-8B. Additional metadata includes video titles and descriptions. Closed-source LLMs, prompted with chain-of-thought, aggregate these inputs into event-level descriptions, attitude, and audience tags, culminating in a structured video summary.
- Iterative Refinement: The initial model is used for prediction-driven annotation improvement, with subsequent rounds of LLM-based curation, reducing annotation noise and enhancing label informativeness.
- Corpora Composition:
- 4.5M proprietary shorts with detailed annotations,
- 0.2M public academic videos,
- 4.7M image–text pairs (frame captions and OCR),
- 3.2M audio–text pairs (ASR-filtered for semantic coverage),
- 0.5M temporal-grounding instances,
- 50K event- and 80K chapter-level timestamped captions.
This large-scale, structured, and multi-source supervision dataset is designed to address fast pacing, high information density, multimodal interplay, and content diversity in real-world shorts.
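The bootstrapped annotation step can be sketched as a simple aggregation of per-modality signals into a chain-of-thought prompt. The `RawSignals` container, `call_llm` placeholder, and prompt wording below are illustrative assumptions, not the production pipeline:

```python
# Hedged sketch of the bootstrapped annotation pipeline (names and prompt
# text are placeholders; the paper uses closed-source LLMs for aggregation).
from dataclasses import dataclass

@dataclass
class RawSignals:
    asr_transcript: str          # from Whisper-v3
    frame_captions: list[str]    # from InternVL-2.5-8B, one per sampled frame
    ocr_text: list[str]          # OCR results per frame
    title: str
    description: str

def build_annotation_prompt(s: RawSignals) -> str:
    """Aggregate per-modality signals into one chain-of-thought prompt asking
    for event-level descriptions, attitude/audience tags, and a structured summary."""
    return (
        "You are annotating a short video. Think step by step.\n"
        f"Title: {s.title}\nDescription: {s.description}\n"
        f"ASR transcript: {s.asr_transcript}\n"
        f"Frame captions: {' | '.join(s.frame_captions)}\n"
        f"On-screen text (OCR): {' | '.join(s.ocr_text)}\n"
        "Produce: (1) timestamped event-level descriptions, "
        "(2) attitude and audience tags, (3) a structured video summary."
    )

def annotate(signals: RawSignals, call_llm) -> str:
    # call_llm stands in for the closed-source LLM used in the paper.
    return call_llm(build_annotation_prompt(signals))
```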
3. Training Regimen
Training comprises multiple sequential stages, each targeting distinct model capacities and data distributions:
- Pre-training:
- Stage 1: Audio warm-up—adaptation to audio signals using ASR+image–text pairs. Only the MLP adapter and LLM are updated; ViT is frozen.
- Stage 2: Full multimodal pre-training with next-token prediction loss, using DeepSpeed ZeRO-1 and a 20,000-token context.
- Initial Instruction Fine-tuning: Diverse supervision on 460K open QAs, 70K MCQs, 20K domain QAs, 15K summarization samples, 12K captions, and 15K grounding instances.
- Cold-Start RL: 146K chain-of-thought (CoT) annotated instances (MCQ, grounding, open QA, summarization, captions).
- Reinforcement Learning (GRPO): Group Relative Policy Optimization, a PPO-style policy-gradient method with group-normalized advantages, trained with DeepSpeed ZeRO-3 on the objective
  $$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\big)-\beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right],\quad \rho_i=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\quad A_i=\frac{r_i-\operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},$$
  where the reward $r_i$ is MCQ correctness or intersection-over-union (IoU) for temporal grounding (a sketch of the group-normalized advantage follows this list).
- Final Instruction Fine-tuning: Aggregated 25K human-annotated QAs, 100K self-generated MCQs with CoT, and 50K grounding traces.
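A minimal sketch of the group-relative advantage and clipped surrogate, assuming the standard GRPO formulation; the KL penalty is omitted and the reward values and hyperparameters are illustrative:

```python
# Hedged sketch of GRPO-style group-normalized advantages with verifiable
# rewards (e.g., MCQ correctness or temporal IoU); not the released training code.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of G sampled responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over a group of responses (KL penalty omitted)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: G = 4 sampled answers to one MCQ prompt, scored 1 if correct else 0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_advantages(rewards)
loss = grpo_policy_loss(torch.randn(4), torch.randn(4), adv)
```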
This comprehensive training regimen enables complex video reasoning, detailed event understanding, and robust zero/few-shot transfer.
4. Evaluation, Benchmarks, and Ablation
ARC-Hunyuan-Video-7B was evaluated on both internal and public benchmarks:
- ShortVid-Bench: Multi-dimensional multiple-choice questions covering temporal reasoning, affective intent, creator intent, narrative comprehension, humor/meme deconstruction, and creative innovation. Measured by accuracy, ARC-Hunyuan-Video-7B achieves 74.3%, compared to 67.8% for Qwen2.5-VL-7B and 53.5% for Keye-VL-8B.
- Temporal Grounding (IoU-based metrics; see the temporal-IoU sketch below):
  - Charades-STA: 54.8% vs. 46.9% (Qwen2.5-VL-7B)
  - ActivityNet: 41.7% vs. 25.1% (Qwen2.5-VL-7B)
- General Video Tasks:
  - MVBench accuracy: 62.6%
  - VCR-Bench MCQ: 50.5%
  - Video-Holmes: 40.9%
- Ablation Findings:
  - Removing the timestamp overlay decreases grounding performance by ~15%.
  - Excluding GRPO post-training reduces the LLM-judged summary score from 6.99 to ~6.5.
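For reference, grounding scores of this kind are derived from the temporal overlap between predicted and ground-truth segments. A minimal sketch of temporal IoU follows; the exact thresholds and recall protocol behind the numbers above are not specified here:

```python
# Minimal sketch of temporal IoU between a predicted and a ground-truth
# segment, the overlap measure underlying grounding metrics.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g., a prediction of 00:12-00:20 against a ground truth of 00:10-00:18
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6
```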
These results demonstrate state-of-the-art comprehensiveness, especially in real-world short and grounded video understanding.
5. Efficiency and Deployment
The model exhibits strong computational efficiency suitable for large-scale production:
- Inference: Processing a 1-minute video (150 frames plus audio segments, 500 generated tokens) takes 10 seconds on an NVIDIA H20 GPU with vLLM acceleration (a back-of-envelope token-budget sketch follows at the end of this section).
- Production Deployment: Real-world A/B tests yield measurable product gains:
- Video retrieval click-through-rate (CTR): +5.88%
- Dwell time: +5.11%
- Floating layer CTR: +7.26%
- Long-click rate: +3.34%
- Fine-tuning on Downstream Tasks (1K samples/task): adaptation quality is measured by pass rate (PR) on brief summaries, detailed summaries, and browsing words.
Downstream support includes rapid zero-shot or low-resource adaptation for granular video summarization and descriptive keyword extraction.
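A back-of-envelope sketch of the per-video token budget and throughput implied by the figures above, using the frame and token counts from Section 1; actual serving throughput depends on batching and vLLM configuration:

```python
# Illustrative arithmetic only; real serving numbers depend on batching.
frames = 150                 # maximum sampled frames per video
tokens_per_frame = 112       # ViT tokens per frame
generated_tokens = 500       # output length in the reported 1-minute example
latency_s = 10.0             # reported end-to-end latency on an H20 with vLLM

# Audio is fused element-wise, so it does not lengthen the sequence.
prefill_tokens = frames * tokens_per_frame          # 16,800 multimodal tokens
total_tokens = prefill_tokens + generated_tokens    # ~17,300 tokens per request
print(prefill_tokens, total_tokens, total_tokens / latency_s)  # ~1,730 tokens/s end-to-end
```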
6. Limitations and Open Challenges
Several technical limitations are identified:
- Frame Sampling: Uniform sampling at 1 fps may not capture fast motion or events occurring between frames.
- Timestamp Overlay: Temporal annotations, while improving grounding, can visually clutter frames in some content domains and may be distracting.
- Annotation Noise: Dependence on automated annotation retains a nonzero level of labeling noise, only partially mitigated by iterative re-annotation.
- Reinforcement Learning Scope: Current GRPO stage optimizes mostly for MCQ and temporal grounding tasks; RL for creative and open-ended reasoning remains underexplored.
- Fixed Frame Budget: The cap of 150 frames can limit coverage in longer, highly dynamic content.
A plausible implication is that future iterations may require adaptive frame sampling or scalable multimodal context windows to maintain state-of-the-art generalization in highly dynamic or extended-length videos.
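To make the fixed frame budget concrete, the following sketch shows the effective sampling rate under a 150-frame cap with uniform 1 fps sampling (the exact cap-handling policy for longer videos is an assumption here):

```python
# Illustration of how the 150-frame cap dilutes temporal coverage for longer
# videos (assumes frames beyond the cap are dropped by uniform re-sampling).
MAX_FRAMES = 150

def effective_fps(duration_s: float, target_fps: float = 1.0) -> float:
    frames = min(int(duration_s * target_fps), MAX_FRAMES)
    return frames / duration_s

for minutes in (1, 5, 10, 30):
    print(f"{minutes:>2} min video -> {effective_fps(minutes * 60):.2f} effective fps")
# 1 min -> 1.00 fps; 5 min -> 0.50 fps; 10 min -> 0.25 fps; 30 min -> ~0.08 fps
```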
7. Significance and Distinguishing Features
ARC-Hunyuan-Video-7B is the first compact (7B-parameter) MLLM explicitly designed for “structured video comprehension” of complex, real-world, short-form content (Ge et al., 28 Jul 2025). Its distinguishing features include:
- Fine-grained, explicit audio–visual synchronization and temporal structuring.
- Timestamp overlays that empirically boost temporal grounding accuracy.
- Automated, self-reinforcing annotation pipeline that enables rapid curation of high-quality, multimodal labels at scale.
- Multi-stage training, including reinforcement learning on verifiable tasks, yielding both strong supervised and RL-induced generalization.
- Demonstrated state-of-the-art performance on both proprietary and public benchmarks, with documented quantitative and qualitative improvements in real-world video product metrics.
The combination of efficiency, accuracy, and broad applicability positions ARC-Hunyuan-Video-7B as a benchmark for compact, production-scale multimodal comprehension on real-world user video content.