RICE: Recognizing Identities for Captioning Effectively
- The paper introduces RICE, a unified framework integrating multi-frame inputs and enhanced textual features to ensure consistent identity references in video captioning.
- RICE leverages LVLMs to boost performance, raising precision from 50% to 90% and recall from 15% to 80% relative to baseline methods.
- The framework addresses the ID-Matching problem by combining robust visual context with strong feature sets, benefiting video retrieval, editing, and text-to-video generation.
Recognizing Identities for Captioning Effectively (RICE) is a methodological framework in video captioning and image description that addresses the challenge of recognizing, tracking, and uniquely describing individual entities, especially people, across multiple frames in long or complex visual content. The core motivation stems from the ID-Matching problem: maintaining consistent references to individuals or objects that reappear throughout a video, a capability of substantial importance in video understanding, retrieval, and text-to-video generation.
1. Problem Definition and Motivations
The ID-Matching problem arises most acutely in long video captioning, where the same person or object appears in multiple, possibly non-contiguous segments of a video. The task is not only to generate an accurate, semantically rich caption per frame or segment, but also to keep references to an identity (e.g. a specific person) consistent throughout the entire video narrative. Misattribution, whether attributing the actions of one individual to another or failing to recognize a recurring individual, undermines semantic coherence and utility in downstream tasks such as retrieval or automated video editing. Prior approaches, primarily explicit person re-identification or face recognition modules, typically relied on point-wise matching across frames, auxiliary face datasets, or bespoke identity association pipelines, which limited their generalization and scalability (Yang et al., 8 Oct 2025).
RICE was developed to directly address these challenges by integrating identity consistency into the core of the captioning process, using architectures that leverage Large Vision-Language Models (LVLMs) and advanced prompt engineering techniques.
2. Core Methodology and Principles
RICE conceptualizes the identity-aware captioning task as a combination of enhanced visual feature utilization and robust textual descriptor generation. The framework comprises two principal methodological axes:
- Enhanced Usage of Image Information: RICE employs multi-frame inputs (multi-frame windowing), allowing LVLMs such as GPT-4o to process multiple temporally adjacent frames simultaneously. This method, analogous to Single-Turn (ST) dialogue in LVLM prompting, maximizes cross-frame visual attention, enabling the model to more effectively track identity features over time (a minimal prompt sketch follows this list).
- Quantity and Quality of Descriptive Features: By introducing Enhanced Textual Features (ETF) grounded in a Strong Feature Set (SFS)—a domain-specific, hand-curated list of salient visual and contextual features for individuals—RICE substantially increases the distinguishing information present in captions. These features help uniquely describe and thus track reoccurring entities even under appearance changes, motion, or occlusion.
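As an illustration of the multi-frame, single-turn input style, the sketch below assembles one request carrying several adjacent frames so the model can attend across them when assigning identities. It assumes the OpenAI Python SDK and base64-encoded JPEG frames; the frame paths, window size, and prompt wording are hypothetical, not the paper's exact configuration.

```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()

def encode_frame(path: str) -> str:
    """Read a frame from disk and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

def caption_window(frame_paths: list[str], instruction: str) -> str:
    """Send a window of temporally adjacent frames in a single turn,
    so the model can compare frames when assigning identity labels."""
    content = [{"type": "text", "text": instruction}]
    for path in frame_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": encode_frame(path)},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# Hypothetical usage: one window of four contiguous frames.
window = [f"frames/{i:05d}.jpg" for i in range(4)]
prompt = ("Describe each person in these consecutive frames. "
          "Assign a stable label (Person 1, Person 2, ...) and reuse "
          "the same label whenever the same person reappears.")
# print(caption_window(window, prompt))
```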
Unlike traditional two-stage approaches, which often involve segment-level captioning followed by global summarization (risking the loss or corruption of identity references), the RICE methodology unifies the processes to operate continuously across the full temporal extent of videos (Yang et al., 8 Oct 2025). It exploits the inherent priors and attentional capacity of foundation LVLMs, which offer broad generalization by virtue of pretraining on diverse multimodal datasets.
3. Techniques for ID-Matching and Feature Enhancement
At the implementation level, RICE introduces several concrete strategies:
- Multi-Frame Contextual Input (MF/ST): LVLMs are fed windows of contiguous frames, rather than single frames or independently processed segments. This allows for visual grounding and cross-frame comparison, anchoring identity recognition in the local temporal context.
- Prompt Engineering with Strong Feature Sets (ETF/SFS): The model is prompted to describe each individual using rich, invariant features (face, clothes, actions, spatial cues). This reduces ambiguity and supports more robust ID association over the video (a template sketch follows this list).
- Automated Extraction and Evaluation: The framework includes extraction tools that parse generated captions to recover predicted ID sequences, which are then matched against ground-truth annotations for benchmarking and optimization.
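The paper's exact SFS contents and extraction tooling are not reproduced here; the following sketch illustrates both ideas under stated assumptions: a hypothetical feature list, a "Person &lt;k&gt;:" labeling convention, and a simple regex parser that recovers predicted ID sequences from per-frame captions.

```python
import re
from collections import defaultdict

# Hypothetical Strong Feature Set: categories the prompt asks the model
# to cover for every person mentioned (the paper's actual SFS may differ).
STRONG_FEATURE_SET = [
    "facial features", "hairstyle", "clothing and colors",
    "accessories", "current action", "position in the frame",
]

ETF_PROMPT = (
    "For every person visible, output a line of the form "
    "'Person <k>: <description>', covering " + ", ".join(STRONG_FEATURE_SET) +
    ". Reuse the same Person <k> label whenever the same individual reappears."
)

# Matches lines such as "Person 2: a woman in a red coat ..."
ID_PATTERN = re.compile(r"Person\s+(\d+)\s*:", re.IGNORECASE)

def extract_id_sequence(captions: list[str]) -> dict[int, list[int]]:
    """Map each predicted person ID to the list of frame indices
    (one caption per frame) in which that ID is mentioned."""
    occurrences: dict[int, list[int]] = defaultdict(list)
    for frame_idx, caption in enumerate(captions):
        for match in ID_PATTERN.finditer(caption):
            occurrences[int(match.group(1))].append(frame_idx)
    return dict(occurrences)

# Example: three frame-level captions with one recurring identity.
captions = [
    "Person 1: a man in a blue jacket waves.",
    "Person 2: a woman in a red coat enters.",
    "Person 1: the man in the blue jacket sits down.",
]
print(extract_id_sequence(captions))  # {1: [0, 2], 2: [1]}
```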
These strategies collectively mitigate confusion between similar-looking individuals, help identity cues persist across extended video contexts, and improve the recall of correctly identified matches.
4. Benchmarking and Experimental Validation
A key advancement introduced by RICE is a dedicated benchmark for ID-Matching in the context of long video captioning (Yang et al., 8 Oct 2025). The benchmark comprises:
- Custom-Annotated Dataset: 374 long video segments with frame-level person annotations, ensuring coverage of diverse scenes and recurring identity scenarios.
- Evaluation Metrics: The main metrics include:
  - Sequence Similarity: assessed via optimal bipartite matching using the Hungarian algorithm, maximizing overlap between predicted and ground-truth identity sequences.
  - Precision and Recall: the proportions of correctly predicted frame-index pairs relative to all predicted pairs and to all ground-truth pairs, respectively (a minimal scoring sketch follows this list).
- Experimental Protocols: Comparative evaluations of Single-Turn (ST), Multi-Turn Single Context (MTSC), and Multi-Turn Different Contexts (MTDC) captioning schemes with and without ETF/SFS inputs, using LVLMs including GPT-4o.
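A minimal sketch of this scoring recipe, assuming each identity is represented by its set of frame indices: predicted IDs are aligned to ground-truth IDs by Hungarian matching on frame overlap (via scipy's linear_sum_assignment), and precision/recall are then computed over frame-index pairs under that alignment. The paper's exact definitions may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_score(pred: dict[int, set[int]], gt: dict[int, set[int]]):
    """Align predicted IDs to ground-truth IDs with the Hungarian
    algorithm (maximizing frame overlap), then compute precision and
    recall over (ID, frame) pairs under that one-to-one alignment."""
    pred_ids, gt_ids = list(pred), list(gt)
    # Overlap matrix: frames shared by each (predicted, true) identity pair.
    overlap = np.array([[len(pred[p] & gt[g]) for g in gt_ids]
                        for p in pred_ids])
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximize
    matched = sum(overlap[r, c] for r, c in zip(rows, cols))
    n_pred = sum(len(v) for v in pred.values())
    n_gt = sum(len(v) for v in gt.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gt if n_gt else 0.0
    return precision, recall

# Toy example: the prediction splits one true identity into two labels,
# so only one predicted ID can be matched to the ground-truth ID.
pred = {1: {0, 1}, 2: {2, 3}}
gt = {7: {0, 1, 2, 3}}
print(match_and_score(pred, gt))  # (0.5, 0.5)
```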
Empirical results demonstrate clear gains attributable to MF and ETF usage. Precision improved from 50% to 90% and recall from 15% to 80% under the RICE framework on GPT-4o, a substantial improvement over baseline approaches (Yang et al., 8 Oct 2025).
| Method | Precision (%) | Recall (%) |
|---|---|---|
| Baseline-LVLM | 50 | 15 |
| RICE (MF + ETF) | 90 | 80 |
These findings confirm the theoretical advantages of leveraging both temporal visual information and enhanced identity descriptors.
5. Role of Large Vision-Language Models (LVLMs)
LVLMs such as GPT-4o are pivotal in RICE. Their design supports:
- Cross-frame Reasoning: Through multi-frame tokenization and attention mechanisms.
- Natural Language Consistency: By generating coherent, contextually sensitive references to identities as individuals reappear.
- Leveraging Pretrained Priors: Pretraining on extensive multimodal datasets provides the models with implicit familiarity with generic appearance features and semantic cues necessary for distinguishing identities across temporal spans.
RICE utilizes these capabilities not by retraining the models on specialized datasets, but by strategically crafting inputs and prompts to “unlock” the ID-Matching priors already present in the LVLMs.
6. Broader Applications and Future Directions
The ability to consistently recognize and track entities across long-form media has broad implications in video understanding, retrieval, and automated video editing. In text-to-video generation, more accurate captioning enables synthesized content with better character and narrative coherence. In multi-modal research, such as video summarization or event detection, persistent identities are crucial for both semantic granularity and retrieval precision.
Potential future research directions include:
- Extending beyond person identification to other salient object categories by curating domain-specific feature sets.
- Integrating additional modalities (e.g. audio, action recognition) to further boost ID-Matching—particularly in the face of occlusion or significant appearance changes.
- Investigating optimal fusion strategies between visual and textual attention for even more robust identity tracking.
- Expanding annotated benchmarks and datasets to cover more diverse real-world scenarios, including surveillance, meeting understanding, and documentary analysis.
7. Concluding Remarks
RICE defines an effective solution for the challenge of entity-aware video captioning, yielding substantial improvements in both identity precision and recall. By leveraging multi-frame input strategies and enhanced descriptive prompting within LVLM frameworks, RICE demonstrates that reliable identity consistency can be achieved without bespoke identity tracking modules or heavy post-hoc association pipelines. This result is validated across rigorous new benchmarks and opens the way for further advances in identity-centric text–video synthesis, retrieval, and multimodal reasoning (Yang et al., 8 Oct 2025).