RynnEC: A Region-Centric Video MLLM

Updated 22 August 2025
  • RynnEC is a video multimodal LLM that advances embodied cognition through a novel region-centric paradigm, integrating a region encoder and mask decoder for fine-grained object perception.
  • It utilizes an egocentric video pipeline to generate richly annotated datasets, enabling detailed object property recognition and robust spatial reasoning.
  • RynnEC-Bench offers a comprehensive set of metrics to evaluate object segmentation, spatial cognition, and embodied AI tasks in real-world scenarios.

RynnEC is a video multimodal LLM (MLLM) designed to advance embodied cognition by introducing a region-centric approach to video understanding. Built on a general-purpose vision-language foundation (derived from VideoLLaMA3), RynnEC integrates a region encoder and a mask decoder to enable fine-grained, object-centric perception and interaction with video data. It demonstrates state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning while employing an egocentric video pipeline for large-scale, richly annotated data generation. The system is accompanied by RynnEC-Bench, a specialized benchmark for evaluating embodied cognitive capabilities. All code, model checkpoints, and benchmarks are openly available (Dang et al., 19 Aug 2025).

1. Model Architecture

RynnEC’s architecture is organized around a vision-language backbone with two critical modular enhancements for region-level video comprehension:

  • Region Encoder: This module processes object masks from egocentric videos, applying a “MaskPooling” procedure to extract features from segmented regions. The extracted features are projected into the LLM’s embedding space by a lightweight two-layer perceptual mapper. During training, features are aggregated across multiple frames via instance tracking to build robust object-centric representations. At inference, one or more arbitrary object masks can be supplied, enabling targeted querying and interaction.
  • Mask Decoder: The mask decoder is based on SAM2’s architecture, adapted for integration with the LLM via a linear mapping that bridges SAM2’s feature space to a special “[SEG]” token in the LLM. Fine-tuning is performed using Low-Rank Adaptation (LoRA) methods, ensuring that segmentation outputs can be generated in response to textual region queries without degrading the core multimodal reasoning abilities.

The synergy of these modules enables RynnEC to support explicit, precise grounding and segmentation of objects in video, providing embodied agents with actionable, object-centric world models.
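
The following PyTorch sketch illustrates the general shape of these two bridges: mask pooling with a two-layer projector on the region-encoder side, and a linear mapping tied to the [SEG] token on the decoder side. All dimensions, module names, and the direction of the [SEG] mapping are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Sketch of MaskPooling plus a lightweight two-layer projector (hypothetical sizes)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3584):
        super().__init__()
        # Two-layer perceptual mapper into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, P, vision_dim) patch features for T tracked frames.
        # masks:       (T, P) binary object masks aligned to the patch grid.
        w = masks.float()
        w = w / w.sum(dim=-1, keepdim=True).clamp(min=1.0)
        pooled = torch.einsum("tp,tpd->td", w, frame_feats)  # mask pooling per frame
        region = pooled.mean(dim=0)                          # aggregate across tracked frames
        return self.projector(region)                        # region token for the LLM

class SegBridge(nn.Module):
    """Sketch of the linear mapping associated with the special [SEG] token."""
    def __init__(self, llm_dim: int = 3584, sam_dim: int = 256):
        super().__init__()
        self.to_sam = nn.Linear(llm_dim, sam_dim)

    def forward(self, seg_hidden: torch.Tensor) -> torch.Tensor:
        # seg_hidden: hidden state of the [SEG] token produced by the LLM.
        return self.to_sam(seg_hidden)  # prompt embedding for the SAM2-style mask decoder
```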

2. Performance and Evaluation Protocols

RynnEC attains state-of-the-art performance across multiple embodied cognition tasks as quantified by the custom RynnEC-Bench:

  • Object Property Understanding: Includes detailed recognition of color, shape, surface properties, etc.
  • Object Segmentation: Supports both direct referring tasks (segmenting a specified object) and situational referring tasks (segmenting objects based on contextual cues).
  • Spatial Reasoning: Encompasses scale estimation, distance measurement, and relative positioning; RynnEC consistently outperforms both generalist and object-centric MLLMs.

The evaluation employs rigorous quantitative metrics:

  • Mean Relative Accuracy (MRA) for numerical estimates such as scale and distance:

\operatorname{MRA} = \frac{1}{|\mathcal{C}|} \sum_{\theta \in \mathcal{C}} \mathbb{I}\left(\frac{|\hat{y} - y|}{y} < 1 - \theta\right)

where \hat{y} is the prediction, y is the ground truth, \mathcal{C} is the collection of thresholds, and \mathbb{I} is the indicator function.

  • Rotational Accuracy (RoA) for angular judgments:

\operatorname{RoA} = 1 - \min\left(\min(|\hat{y} - y|,\ 360 - |\hat{y} - y|)/90,\ 1\right)
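
As a concrete reading of the two formulas, the snippet below evaluates both metrics for single predictions; the threshold collection \mathcal{C} used here is an illustrative choice and may differ from the one used in RynnEC-Bench.

```python
import numpy as np

# Illustrative threshold collection C; the benchmark's exact set may differ.
THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def mean_relative_accuracy(y_hat: float, y: float, thresholds=THRESHOLDS) -> float:
    """MRA: fraction of thresholds theta for which |y_hat - y| / y < 1 - theta."""
    rel_err = abs(y_hat - y) / y
    return float(np.mean(rel_err < (1.0 - thresholds)))

def rotational_accuracy(y_hat: float, y: float) -> float:
    """RoA: 1 - min(min(|d|, 360 - |d|) / 90, 1), with angles in degrees."""
    d = abs(y_hat - y)
    return 1.0 - min(min(d, 360.0 - d) / 90.0, 1.0)

print(mean_relative_accuracy(1.8, 2.0))  # relative error 0.10 -> 0.8 under these thresholds
print(rotational_accuracy(350.0, 10.0))  # wrap-around error of 20 degrees -> ~0.78
```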

The reported results show that even the 7B-parameter variant of RynnEC outperforms leading proprietary and open-source models on both aggregate and individual task scores, achieving high accuracy across the object and spatial cognition domains.

3. Region-Centric Video Paradigm

RynnEC operationalizes a “region-centric video paradigm” in embodied AI, shifting from holistic frame- or video-level representations to explicit region (object)-level modeling. This enables:

  • Fine-grained Object Understanding: Isolated encoding of physical properties at the object level, including states and alterations over time.
  • Precise Interaction: Direct grounding and localization for manipulation, allowing agents to engage with specific items in complex scenes.
  • Comprehensive Perception: Maintenance and retrieval of object-specific details such as surface attributes, which are crucial for manipulation and spatial navigation.

This paradigm provides a principled framework for aligning model perception with the physical interaction requirements faced by embodied agents.
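
Concretely, a region-centric query pairs natural-language text with one or more object masks and yields either a textual answer or, for referring segmentation, a mask triggered via the [SEG] token. The structure below is a hypothetical illustration of such a query; the field names and <regionN> placeholders are assumptions, not RynnEC's released interface.

```python
# Hypothetical region-centric query; keys and placeholder tokens are illustrative only.
query = {
    "video": "kitchen_walkthrough.mp4",        # egocentric clip (example path)
    "regions": {
        "<region1>": "masks/mug_0043.png",     # binary mask for the mug
        "<region2>": "masks/shelf_0043.png",   # binary mask for the shelf
    },
    "text": "Is <region1> small enough to fit on <region2>?",
}
# A property or spatial question like this returns text; a request such as
# "Segment the cup closest to the stove" would instead return a mask via [SEG].
```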

4. Egocentric Video Data Pipeline

To counter the scarcity of annotated 3D datasets for training embodied cognition models, RynnEC employs a large-scale egocentric video pipeline that leverages only RGB imagery:

  • Data Collection: Over 200 households are sampled using high-resolution, high-frame-rate egocentric cameras.
  • Instance Segmentation Pipeline:
    • Qwen2.5-VL for extracting object categories,
    • Grounding DINO 1.5 for generating key object proposals,
    • SAM2 for detailed segmentation and reliable tracking.
  • Reverse Instance Tracking: Four-second backward tracking ensures temporally consistent object identities, even when objects are not continuously visible.
  • Two Data Generation Branches:
    • Object Cognition: Aggregates captions, fine-grained properties, and referring expressions.
    • Spatial Cognition: Performs 3D reconstruction from RGB and formulates template-driven spatial QA.

This pipeline enables rapid, scalable generation of richly annotated video datasets without the need for laborious 3D annotation, fueling both object-centric and spatial cognition learning.
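
As a rough sketch of how these stages compose, the function below wires the pipeline together, with the three model calls injected as placeholder callables standing in for Qwen2.5-VL, Grounding DINO 1.5, and SAM2; their assumed signatures and the returned {object_id: {frame_index: mask}} layout are simplifications, not the actual tooling.

```python
def annotate_egocentric_video(frames, fps, extract_categories, detect_boxes, segment_and_track):
    """Sketch of the RGB-only annotation flow; all three callables are stand-ins."""
    # Stage 1: open-vocabulary object categories present in the clip (Qwen2.5-VL stand-in).
    categories = extract_categories(frames)
    # Stage 2: key-object box proposals for those categories (Grounding DINO 1.5 stand-in).
    proposals = detect_boxes(frames, categories)
    # Stage 3: per-object masks tracked forward through the clip (SAM2 stand-in),
    # assumed to return {object_id: {frame_index: mask}}.
    tracks = segment_and_track(frames, proposals)

    # Reverse instance tracking: re-track each object over the 4 seconds before its
    # first appearance so identities stay consistent when visibility is intermittent.
    window = int(4 * fps)
    for obj_id, frame_masks in tracks.items():
        first = min(frame_masks)
        start = max(0, first - window)
        clip = frames[start:first + 1][::-1]  # reversed, so tracking propagates backward
        back = segment_and_track(clip, {obj_id: frame_masks[first]})
        # Map reversed slice indices back to absolute frame indices.
        frame_masks.update({first - i: m for i, m in back.get(obj_id, {}).items()})

    # The object-cognition and spatial-cognition QA branches are built downstream from
    # these tracked masks (captions/properties/referring vs. 3D reconstruction + template QA).
    return tracks
```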

5. RynnEC-Bench: Benchmarking Embodied Cognition

RynnEC-Bench is a rigorous region-centric evaluation suite tailored to embodied cognition, comprising two task families:

  • Object Cognition Tasks: Object property recognition and segmentation (direct and situational referring).
  • Spatial Cognition Tasks: Split between egocentric tasks (tracking agent-object relationships over time) and world-centric tasks (absolute scales, spatial placements, positional relationships).

The benchmark encompasses 22 fine-grained tasks, employing MRA, RoA, and a Global IoU metric for multi-frame segmentation accuracy. This comprehensive structure enables systematic evaluation of both perception and spatial reasoning in region-centric contexts.
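
One common reading of a multi-frame Global IoU is to accumulate intersection and union over every evaluated frame before taking a single ratio; the sketch below follows that reading with binary NumPy masks, with the caveat that RynnEC-Bench's exact accumulation is defined in the paper.

```python
import numpy as np

def global_iou(pred_masks, gt_masks):
    """Global IoU over a clip: sum of intersections divided by sum of unions.

    pred_masks, gt_masks: sequences of binary (H, W) masks, one per evaluated frame.
    Whether RynnEC-Bench uses exactly this accumulation is an assumption here.
    """
    inter = sum(int(np.logical_and(p, g).sum()) for p, g in zip(pred_masks, gt_masks))
    union = sum(int(np.logical_or(p, g).sum()) for p, g in zip(pred_masks, gt_masks))
    return inter / union if union > 0 else 1.0  # empty prediction and ground truth count as perfect
```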

6. Applications and Broader Implications

RynnEC provides a technical foundation for enhanced “cognitive cores” in embodied agents:

  • Robotic Manipulation: Supports complex pick-and-place, household manipulation, and other tasks requiring precise identification and segmentation.
  • Navigation and Planning: Enables advanced spatial awareness, such as estimating distances, directions, and orchestrating trajectories.
  • Long-horizon Reasoning: As demonstrated in RoboTHOR experiment scenarios, RynnEC supports cross-temporal reasoning (tracking state, counting, spatial progression) necessary for robust agent autonomy.

With parameter counts as small as 2B, the model is suitable for deployment in constrained environments, supporting efficient, real-world integration without prohibitive computational costs.

7. Release and Resource Availability

All resources associated with RynnEC, including code, model checkpoints, and the RynnEC-Bench evaluation suite, are released through an open-source repository, supporting research, benchmarking, and practical applications across the wider embodied cognition community.
