
RynnEC-Bench: Region-Centric MLLM Evaluation

Updated 22 August 2025
  • RynnEC-Bench is a region-centric benchmark designed to evaluate embodied cognitive capabilities in egocentric video settings, emphasizing object property understanding, object segmentation, and spatial reasoning.
  • It comprises two main subsets—object cognition and spatial cognition—scored with task-appropriate metrics such as Mean Relative Accuracy, Rotational Accuracy, and Global IoU.
  • The benchmark relies on mask-based, instance-level segmentation and draws on more than 20,000 real-world egocentric videos, supporting robust evaluation in robotics and embodied-intelligence research.

RynnEC-Bench is a region-centric benchmark designed to evaluate the embodied cognitive capabilities of multimodal large language models (MLLMs) in real-world, egocentric video settings. It emphasizes fine-grained assessment of object properties, detailed object segmentation, and complex spatial reasoning, simulating the perceptual and cognitive demands faced by embodied agents such as indoor robots. RynnEC-Bench introduces a region-centric video paradigm that departs from global-scene or static-image benchmarks, foregrounding the instance-level understanding required for robust perception and interaction in dynamic physical environments.

1. Structure and Scope

RynnEC-Bench comprises two primary evaluation subsets: object cognition and spatial cognition. Object cognition tasks focus on the identification and segmentation of objects and their properties, including attributes such as color, shape, material, and quantity. This subset further distinguishes between direct referring expressions (e.g., “the blue mug”) and situational referring expressions that involve context-dependent identification (e.g., “the cup next to the red book”).

Spatial cognition tasks probe the model’s 3D awareness using egocentric video streams, with questions regarding absolute and relative scales, distances, position estimation, and directional relationships. These are further divided into egocentric (agent-relative) and world-centric (scene-layout) subtasks. By encompassing both perceptual and spatial reasoning, RynnEC-Bench aligns evaluation with the needs of embodied agents operating in continuously changing environments.
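
This two-subset design implies a simple per-item schema. The sketch below is a hypothetical data structure (field and enum names are ours, not the released format) illustrating how each question is grounded in a specific region of the video rather than a whole scene:

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    OBJECT_DIRECT = "direct_referring"      # e.g. "the blue mug"
    OBJECT_SITUATIONAL = "situational"      # e.g. "the cup next to the red book"
    SPATIAL_EGOCENTRIC = "egocentric"       # agent-relative scale/distance/direction
    SPATIAL_WORLD = "world_centric"         # scene-layout relationships

@dataclass
class BenchmarkItem:
    video_path: str           # egocentric video clip
    frame_indices: list[int]  # frames in which the target region is annotated
    region_masks: list        # per-frame binary masks grounding the question
    task: TaskType
    question: str
    answer: str               # numeric, angular, textual, or mask-valued target
```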

2. Methodological Innovations

A key methodological feature of RynnEC-Bench is the region-centric video paradigm, where the evaluation proceeds not on global image or video features but on object-level mask-based regions. Objects are first segmented from continuous video frames using a specialized region encoder and mask decoder. This mask-centric approach enables fine-grained, instance-specific analysis, ensuring accurate visual grounding even amid visually ambiguous or cluttered indoor environments.

The benchmark’s question-answering and segmentation protocols explicitly harness this regional structure to challenge models on object disambiguation, perceptual grounding, and spatial reasoning at a resolution matching human cognitive requirements for manipulation and navigation.
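
As a schematic of this protocol, the following hypothetical evaluation step (using the item schema sketched above) shows how a query is grounded in a mask-delimited region rather than the whole frame. Here `segment_regions`, `model.answer`, and `score_fn` are stand-ins for the region encoder/mask decoder, the MLLM under test, and the answer-type-specific metric; none of these names come from the benchmark's actual API:

```python
def evaluate_item(model, item, segment_regions, score_fn):
    # 1. Recover instance-level masks for the referred object across frames
    #    (the role played by the region encoder and mask decoder).
    masks = segment_regions(item.video_path, item.frame_indices)
    # 2. Query the model with the video, the question, and the region prompt,
    #    so the answer is grounded in a specific instance, not the global scene.
    prediction = model.answer(item.video_path, item.question, region=masks)
    # 3. Score with the metric matching the answer type (MRA, RoA, IoU, ...).
    return score_fn(prediction, item.answer)
```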

3. Performance Metrics

The evaluation framework in RynnEC-Bench is tailored to the answer type, with distinct metrics for numerical, angular, textual, and segmentation tasks (a minimal implementation sketch follows the list):

  • Numerical Questions: Mean Relative Accuracy (MRA) averages an indicator of relative error over a sweep of confidence thresholds,

$$\text{MRA} = \frac{1}{|\mathcal{C}|} \sum_{\theta \in \mathcal{C}} \mathbb{I}\left(\frac{|\hat{y} - y|}{y} < 1 - \theta\right)$$

where $\mathcal{C} = \{0.5, 0.55, \dotsc, 0.95\}$.

  • Rotation/Angle Questions: Rotational Accuracy (RoA) accounts for the periodicity of angles,

$$\text{RoA} = 1 - \min\left( \frac{\min\left(|\hat{y} - y|,\ 360 - |\hat{y} - y|\right)}{90},\ 1 \right)$$

which prioritizes accuracy within 90°: errors of 90° or more score zero.

  • Textual Answers: Close-ended questions are scored via binary outputs from GPT-4o, while open-ended responses are rated on a 0–1 scale in increments of 0.2, adapting the granularity of human scoring to expected answer uncertainties.
  • Segmentation Tasks: Standard region-overlap ($\mathcal{J}$) and boundary ($\mathcal{F}$) metrics are supplemented by the Global IoU,

$$\bar{\mathcal{J}} = \frac{\sum_{i=1}^{N} |\mathcal{S}_i \cap \mathcal{G}_i|}{\sum_{i=1}^{N} |\mathcal{S}_i \cup \mathcal{G}_i|}$$

offering a global, video-wide assessment of mask prediction quality—particularly important given the intermittent visibility of objects in egocentric video streams.
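
To make the scoring concrete, here is a minimal NumPy sketch of the three formulas above; the function names and the empty-union convention in `global_iou` are our own choices, and the official implementation may differ:

```python
import numpy as np

# Thresholds C = {0.5, 0.55, ..., 0.95} from the MRA definition above.
THRESHOLDS = np.linspace(0.50, 0.95, 10)

def mra(y_hat: float, y: float) -> float:
    """Mean Relative Accuracy: fraction of thresholds theta in C for
    which the relative error |y_hat - y| / y stays below 1 - theta."""
    rel_err = abs(y_hat - y) / y
    return float(np.mean(rel_err < (1.0 - THRESHOLDS)))

def roa(y_hat: float, y: float) -> float:
    """Rotational Accuracy: angular error wrapped to [0, 180] degrees,
    scaled linearly so that errors of 90 degrees or more score 0."""
    diff = abs(y_hat - y) % 360.0
    err = min(diff, 360.0 - diff)
    return 1.0 - min(err / 90.0, 1.0)

def global_iou(preds, gts) -> float:
    """Global IoU: per-frame intersections and unions are summed over all
    N frames before dividing, so frames in which the object is invisible
    in both prediction and ground truth do not distort the score."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union) if union > 0 else 1.0  # empty-union convention is ours
```

For instance, `mra(2.2, 2.0)` returns 0.8, since a 10% relative error passes eight of the ten thresholds, and `roa(350, 10)` scores roughly 0.78 because the wrapped angular error is only 20°.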

4. Dataset Construction and Technical Details

RynnEC-Bench leverages a large-scale dataset composed of over 20,000 egocentric videos recorded in more than 200 residences. Manual verification was employed to establish a balanced evaluation subset from ten houses, each distinct from training environments. The data pipeline uses Grounding DINO 1.5 and SAM2 instance segmentation models in tandem with human-in-the-loop refinement to generate more than 1.14 million instance masks.
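
A plausible shape for this pipeline is sketched below; `detect_objects`, `track_masks`, and `human_review` are hypothetical stand-ins for Grounding DINO 1.5, SAM2, and the annotator pass respectively, not the actual APIs of those tools:

```python
def build_instance_masks(video_frames, class_prompts,
                         detect_objects, track_masks, human_review):
    """Hypothetical sketch of the annotation pipeline: open-vocabulary
    detection seeds per-object boxes, a video segmenter propagates masks
    across frames, and annotators correct the result."""
    # Detect candidate objects in a keyframe from text prompts
    # (the role played by Grounding DINO 1.5).
    boxes = detect_objects(video_frames[0], prompts=class_prompts)
    # Propagate instance masks through the video (the role played by SAM2).
    masks = track_masks(video_frames, init_boxes=boxes)
    # Human-in-the-loop refinement before masks enter the benchmark.
    return human_review(masks)
```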

Spatial question–answer pairs are generated using 3D scene reconstructions via MASt3R-SLAM; RANSAC plane fitting is then applied to extract accurate planar and positional parameters, which are embedded into QA templates for model evaluation.
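
The plane-extraction step is standard RANSAC. A minimal NumPy sketch follows; the inlier threshold and iteration count are illustrative defaults, not the paper's settings:

```python
import numpy as np

def ransac_plane(points: np.ndarray, n_iters: int = 1000,
                 inlier_thresh: float = 0.02, seed: int = 0):
    """Fit a plane (n, d) with n.p + d = 0 to an (N, 3) point cloud:
    repeatedly fit a plane through 3 random points and keep the one
    with the most inliers within `inlier_thresh` (meters)."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:              # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ p0
        dists = np.abs(points @ normal + d)   # point-to-plane distances
        inliers = int((dists < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane, best_inliers
```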

The object taxonomy in RynnEC-Bench is carefully sampled: the test set mirrors real-world object frequency distributions, covering 12 coarse and 119 fine-grained object classes plus additional “other” types, which keeps the distribution of encountered objects ecologically valid.
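
As a toy illustration of frequency-matched sampling (the classes and frequencies below are invented, not the benchmark's), drawing test objects in proportion to real-world occurrence can be as simple as:

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["chair", "mug", "sofa", "plant"]       # toy subset of the 119 classes
freqs = np.array([0.40, 0.30, 0.20, 0.10])        # illustrative occurrence rates
sampled = rng.choice(classes, size=100, p=freqs)  # frequency-weighted test draw
```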

5. Benchmark Comparison and Distinguishing Characteristics

RynnEC-Bench is set apart from benchmarks such as OpenEQA, STI-Bench, and ECBench, which typically focus on scene-level reasoning or rely on static images and textual spatial questions. By contrast, RynnEC-Bench provides region-level questions grounded in continuous video, supporting mask-based, temporally coherent, and context-sensitive evaluation. This enables the benchmark to more rigorously evaluate the perceptual and spatial capabilities aligned with the demands of real-world embodied cognition.

A summary comparison is provided below:

| Benchmark | Data Modality | Region-Centric | Temporal Context |
|---|---|---|---|
| RynnEC-Bench | Egocentric video | Yes | Continuous video |
| OpenEQA | Static images | No | No |
| STI-Bench | Static/scene text | No | No |
| ECBench | Various | Partial | Limited |

6. Applications and Implications

RynnEC-Bench is employed as an evaluation protocol for embodied cognition models, particularly MLLMs serving as cognitive cores in robotics. By testing perceptual detail (object property understanding), segmentation, and spatial reasoning, the benchmark supports the development of agents capable of accurate object localization and interaction in indoor spaces.

Emphasis on region-centric video data brings model evaluation closer in structure to human perception, facilitating research in robust navigation, manipulation, and goal-directed planning. The standardized protocols and diversified object taxonomy enable quantitative comparison and assist in the iterative development of embodied cognitive architectures.

A plausible implication is that, by aligning evaluation more closely with task-centric robotic requirements, RynnEC-Bench may accelerate progress toward unified perception–planning frameworks and improve real-world agent generalization across diverse indoor environments.

7. Prospective Directions

Several directions for advancing RynnEC-Bench and its underlying methodology are emphasized:

  • Joint Reasoning Integration: Future research is expected to focus on combining object perception, spatial inference, and action planning within a single cognitive core for multi-step embodied tasks.
  • Unified Perception and Planning: There is an articulated plan to fuse perceptual modeling in RynnEC with vision-language action frameworks for holistic embodied intelligence capable of adaptive closed-loop decision-making.
  • Data Scale and Diversity: Expansion of dataset scope and environmental heterogeneity is anticipated to further challenge model robustness, especially for tasks demanding long-range navigation or mental imagery.
  • Real-Time Deployability: Given the efficiency observed at approximately 2B model parameters, ongoing work may prioritize further model compression and computational efficiency to enable on-device operation in resource-constrained robotic systems.

Conclusion

RynnEC-Bench constitutes a comprehensive region-based benchmark for evaluating the embodied cognitive skills of modern MLLMs, bridging the gap between global scene understanding and fine-grained, actionable perception in real-world environments. Through its innovative region-centric, temporally extended, and ecologically representative evaluation paradigm, it provides a rigorous foundation for advancing research in embodied cognitive systems and their practical deployment in robotics and related domains (Dang et al., 19 Aug 2025).
