
E-commerce Hierarchical Video Captioning

Updated 19 January 2026
  • E-commerce Hierarchical Video Captioning (E-HVC) is a paradigm that generates structured, product-aware captions by leveraging hierarchical multimodal fusion of visual, linguistic, and metadata signals.
  • It employs granular graph-based modeling and cross-modal aggregation, integrating ASR, visual features, and metadata to capture fine-grained product actions and scene transitions.
  • Evaluation shows state-of-the-art caption quality and retrieval improvements, with techniques like SPA-Compressor achieving up to a 33× speedup and reducing token count by ≈82.6%.

E-commerce Hierarchical Video Captioning (E-HVC) is a research paradigm for generating structured, product-aware textual narrations from consumer-driven e-commerce videos. E-HVC addresses the challenges of dense, multimodal video content, requiring models to recognize fine-grained product actions and attributes, align them with spoken or written narrative evidence, and synthesize multi-level descriptions or titles adapted for retrieval and recommendation scenarios. E-HVC systems integrate visual, linguistic, and metadata signals through hierarchical architectures, and operate at both event-level granularity and broader narrative abstractions tailored to e-commerce domains (Zhang et al., 2020, Li et al., 12 Jan 2026).

1. Problem Formulation and Dataset Construction

E-HVC distinguishes itself from generic video captioning by targeting the e-commerce context, where consumer-generated videos are highly information-dense and product-centric. The central task involves generating structured textual outputs—ranging from single-sentence titles to temporally grounded narratives—by leveraging multimodal evidence inherent to e-commerce data.

The E-HVC dataset introduced in (Li et al., 12 Jan 2026) comprises $N = 146{,}000$ videos (mean length ≈ 60 s, total ≈ 2,433 hours), and a dedicated benchmark (E-HVC-Bench) of 1,852 videos spanning 13 product categories. The annotation schema provides dual levels of granularity:

  • Temporal Chain-of-Thought (TCoT): Ordered sequences of event annotations $E_i = (t_i^{\text{start}}, t_i^{\text{end}}, d_i)$ anchored to video intervals, capturing host actions, product demonstrations, or scene transitions.
  • Chapter Summaries: Partitions of events into contiguous chapters $C_j = ([c_j^{\text{start}}, c_j^{\text{end}}], \tau_j, s_j)$ with precise time boundaries, concise titles $\tau_j$, and thematic summaries $s_j$.
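The dual-granularity schema above can be sketched as plain data records. This is a minimal illustration only; the field names are assumptions, not the dataset's actual annotation keys:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    """One Temporal Chain-of-Thought entry E_i = (t_start, t_end, d_i)."""
    t_start: float      # event start time in seconds
    t_end: float        # event end time in seconds
    description: str    # free-text event description d_i

@dataclass
class Chapter:
    """One chapter C_j = ([c_start, c_end], tau_j, s_j) grouping contiguous events."""
    c_start: float       # chapter start time in seconds
    c_end: float         # chapter end time in seconds
    title: str           # concise chapter title tau_j
    summary: str         # thematic summary s_j
    events: List[Event]  # the contiguous events this chapter covers

# A toy annotation for a short product video
demo = Chapter(
    c_start=0.0, c_end=12.5,
    title="Unboxing",
    summary="Host opens the package and shows the product.",
    events=[Event(0.0, 4.0, "Host lifts the box into frame."),
            Event(4.0, 12.5, "Host removes the product and rotates it.")],
)
```

Event boundaries nest inside chapter boundaries, mirroring how TCoT events are partitioned into chapters.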

Annotation involves a three-stage process: (i) Automatic Speech Recognition (ASR) text enhancement and adaptation to product domains, (ii) frame-level visual description via advanced vision-LLMs, and (iii) hierarchical reasoning using large-scale multimodal transformers to generate and refine event and chapter annotations. Data curation applies quality filters and manual corrections for the benchmark validation (Li et al., 12 Jan 2026).

2. Hierarchical Modeling Approaches

E-HVC architectures universally adopt hierarchical processing:

  • Granular Interaction Modeling: At the base level, video content, comments, and attributes are processed as separate graphs. For instance, the approach in (Zhang et al., 2020) constructs three dedicated graphs: video landmark graphs $G_v$, narrative comment graphs $G_c$, and attribute graphs $G_a$, with nodes encoding frame landmarks, syntactic word tokens, and product attribute entries respectively. Edges are established to capture spatial-temporal or syntactic relations, with learnable edge weights.
  • Cross-Modal Aggregation: Global-local aggregation modules integrate information across graphs, leveraging node-level and graph-level attention to form holistic representations (e.g., fusing $G_a$ and $G_c$ to yield $G_{ac}$, then integrating with $G_v$ to produce $G_{vac}$).
  • Abstraction/Narrative Layer: At the upper level, frame features and graph-aggregated node embeddings are fused. Typical architectures employ gated recurrent units (GRUs) or transformer decoders, with attentive mechanisms that reference both frame-wise dynamics and cross-modal context. Generation modules produce titles or multi-paragraph summaries, with explicit mechanisms for temporal alignment and hierarchical structuring (Zhang et al., 2020, Li et al., 12 Jan 2026).
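The cross-modal aggregation step can be sketched with simple dot-product attention pooling. This is a minimal illustration under assumed mechanics; the papers' actual scoring functions, graph-level attention, and learnable edge weights are omitted:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(nodes: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Node-level attention: score every node against a (learnable) query
    vector, then pool node features into one graph-level vector."""
    weights = softmax(nodes @ query)   # (num_nodes,)
    return weights @ nodes             # (dim,)

def fuse_graphs(g_a: np.ndarray, g_c: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Global-local aggregation sketch: merge the node sets of two graphs
    (e.g., attribute graph G_a and comment graph G_c) and attention-pool
    them into a joint representation G_ac."""
    return attention_pool(np.vstack([g_a, g_c]), query)

rng = np.random.default_rng(0)
g_a = rng.normal(size=(5, 8))   # 5 attribute nodes, feature dim 8
g_c = rng.normal(size=(7, 8))   # 7 comment nodes, feature dim 8
g_ac = fuse_graphs(g_a, g_c, rng.normal(size=8))
```

The same pooled vector can then be fused with $G_v$ in a second round of aggregation, mirroring the $G_{ac} \to G_{vac}$ composition described above.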

3. Scene-Primed ASR-Anchored Compression (SPA-Compressor)

E-commerce videos tend to be fast-paced with long sequences of dense visual tokens, challenging conventional MLLM (Multimodal LLM) architectures due to self-attention bottlenecks. The SPA-Compressor (Li et al., 12 Jan 2026) introduces a hierarchical token compression strategy:

  • Vision–ASR Fusion: ASR outputs and visual features are cross-attended, integrating text and visual evidence.
  • SceneFusionAggregator: Learnable scene queries summarize global, scene-level representations for each sequence segment.
  • EventDetailExtractor: Event-level queries extract temporally grounded event details via cross-attention to ASR and scene tokens.

Mathematically, given input $\mathcal{X} = \{\mathbf{V}, \mathbf{A}, \mathbf{T}\}$, SPA-Compressor constructs hierarchical representations $\mathbf{H} \in \mathbb{R}^{B \times (S + N(1+E)) \times D}$ with configurable token budgets $(S, E)$ controlling compression. For the recommended setting $(S=64, E=32)$, the compression ratio is $\rho \approx 0.174$, resulting in an ≈ 82.6% reduction in token count and a theoretical $33\times$ speedup in self-attention cost.
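The quoted figures are mutually consistent: a compression ratio of $\rho \approx 0.174$ keeps 17.4% of the tokens (an ≈ 82.6% reduction), and since self-attention cost scales quadratically with sequence length, the theoretical speedup is $1/\rho^2 \approx 33$. A quick check:

```python
def compression_stats(rho: float) -> tuple[float, float]:
    """Token reduction and theoretical self-attention speedup for a
    compression ratio rho = compressed_tokens / original_tokens."""
    reduction = 1.0 - rho       # fraction of tokens removed
    speedup = 1.0 / rho ** 2    # self-attention cost is quadratic in length
    return reduction, speedup

reduction, speedup = compression_stats(0.174)
print(f"{reduction:.1%} fewer tokens, {speedup:.0f}x attention speedup")
# prints "82.6% fewer tokens, 33x attention speedup"
```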

4. Training Objectives and Optimizations

Both graph-based and transformer-based E-HVC models are optimized using multi-task objectives:

  • Token-Level Cross-Entropy: For each ground-truth event description $d_i$, chapter title $\tau_j$, and chapter summary $s_j$, the model minimizes the cross-entropy loss against the predicted outputs:

$$\mathcal{L} = \lambda_1 \sum_i \mathrm{CE}(d_i, \hat{d}_i) + \lambda_2 \sum_j \left[ \mathrm{CE}(\tau_j, \hat{\tau}_j) + \mathrm{CE}(s_j, \hat{s}_j) \right]$$

  • Coverage Loss and Regularization: Models may include coverage terms that penalize repeated probability mass on already-generated words, or apply $\ell_2$ regularization to stabilize training (Zhang et al., 2020, Li et al., 12 Jan 2026).
  • Optimization Schedules: State-of-the-art training employs AdamW, staged fine-tuning phases (task adaptation, compression-aware retraining), and large-scale distributed setups (e.g., batch size 64 across 16 A100 GPUs).

5. Evaluation Protocols and Results

Performance is evaluated using reference-based and retrieval-based metrics:

  • Reference-based: BLEU-4, METEOR, ROUGE-L, and CIDEr for captioning quality; BERTScore and SODA_c for narrative factuality and semantic alignment (Li et al., 12 Jan 2026).
  • Retrieval-based: Recall@1/5/10, Median Rank, and MRR are used to assess the match between generated and ground-truth narrations.
  • Human evaluation: Fluency, diversity, and grounding assessed on standardized sample sets.
  • Online A/B testing: CTR improvements observed on real e-commerce platforms (e.g., +9.9% in Mobile Taobao) (Zhang et al., 2020).
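The retrieval metrics reduce to simple functions of the 1-indexed rank at which each ground-truth narration is retrieved. A minimal sketch:

```python
def recall_at_k(ranks: list, k: int) -> float:
    """Fraction of queries whose ground-truth item appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks: list) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def median_rank(ranks: list) -> float:
    s = sorted(ranks)
    mid = len(s) // 2
    return float(s[mid]) if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

ranks = [1, 3, 12]                 # ground-truth ranks for three queries
r1 = recall_at_k(ranks, 1)         # 1/3
r10 = recall_at_k(ranks, 10)       # 2/3
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/3 + 1/12) / 3
```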

The HiVid-Narrator framework sets state-of-the-art results on E-HVC-Bench (e.g., SODA_c 14.48, CIDEr 1.45, METEOR 32.01, BERTScore 74.25), outperforming prior MLLMs such as InternVL3, Qwen2.5-VL, and Keye-VL. Token compression via SPA-Compressor achieves efficiency gains (≈ 82.6% reduction) with no loss in narrative quality (Li et al., 12 Jan 2026).

| Model | SODA_c | CIDEr | METEOR | BERTScore |
|---|---|---|---|---|
| InternVL3 (8B) | 11.98 | 1.04 | 27.38 | 67.75 |
| Qwen2.5-VL (7B) | 12.27 | 1.12 | 27.74 | 68.83 |
| Keye-VL (8B) | 12.83 | 1.34 | 28.29 | 69.23 |
| HiVid-Narrator (w/ SPA) | 14.48 | 1.45 | 32.01 | 74.25 |

Ablation indicates the importance of TCoT annotation and token compression: removing TCoT or reducing scene/event tokens both degrade all metrics.

6. Insights, Limitations, and Prospects

Ablation studies across both frameworks confirm that hierarchical modeling—explicitly capturing event temporal structure, chapter-level narrative, and cross-modal evidence—is integral to performance. For example, dropping abstraction-level summarization or global-local aggregation reduces CIDEr and overall retrieval accuracy (Zhang et al., 2020). SPA-Compressor provides a scalable solution to the quadratic token cost of MLLMs without sacrificing factuality or coherence.

Documented limitations include dependency on ASR quality (sensitivity to accents/noise), difficulty of handling ultra-rapid scene transitions, and time-consuming annotation procedures (Li et al., 12 Jan 2026). Future extensions are proposed: integrating on-screen OCR, differentiable chapter-boundary detectors, incorporating user feedback for iterative narrative refinement, and domain transfer to other dense video genres (e.g., sports, medical).

A plausible implication is that the E-HVC methodology can serve as a general foundation for structured video understanding in high-density environments, wherever fine-grained product or procedural semantics must be coupled with robust narrative abstraction. High-quality, temporally grounded annotation remains a key determinant of downstream performance.

E-HVC sits at the intersection of graph-based multimodal fusion, hierarchical sequence modeling, and multimodal LLM (MLLM) adaptation. Early work ("Comprehensive Information Integration Modeling Framework for Video Titling" (Zhang et al., 2020)) established graph neural network approaches for integrating structured modalities. Later frameworks (e.g., HiVid-Narrator (Li et al., 12 Jan 2026)) emphasize hierarchical transformer-based reasoning, cross-modal temporal alignment, and efficient token compression. Both lines motivate future research in scalable e-commerce video understanding, dataset curation with dual-level subjective/objective annotations, and systematic benchmarking for dense-multimodal sequence modeling.
