
RoboCasa & CALVIN Benchmarks

Updated 26 October 2025
  • The RoboCasa benchmark is a large-scale kitchen simulation with 120 diverse environments, over 2,500 objects, and cross-embodiment support for robust manipulation testing.
  • The CALVIN benchmark is a tabletop manipulation setup focused on language-conditioned, long-horizon tasks, employing multimodal sensing and sequential instruction execution.
  • Both benchmarks advance imitation learning by integrating data scaling, LLM-guided task design, and sensor fusion to enhance real-world generalization in robotic control.

RoboCasa and CALVIN are two influential benchmarks in the evaluation and development of robotic manipulation agents, each emphasizing large-scale diversity, simulation fidelity, and systematic assessment of general-purpose sensorimotor skills. Both serve as cornerstone testbeds for imitation learning, vision-language-action (VLA) models, and multimodal policy architectures, supporting research that targets robust generalization and language-grounded long-horizon planning.

1. Simulation Environments and Scene Complexity

RoboCasa is built on an extended version of RoboSuite using the MuJoCo physics engine, offering 120 distinct kitchen environments generated from combinations of 10 floor plans and 12 kitchen styles, along with comprehensive domain randomization through AI-generated textures. The environment includes over 2,500 interactable 3D objects from 153 object categories, with many assets generated via text-to-3D systems and filtered for quality. The kitchen scenes feature articulated furniture and appliances, photorealistic rendering through NVIDIA Omniverse integration, and physically plausible interaction dynamics. Cross-embodiment is natively supported, permitting various robot platforms (e.g., manipulators, mobile bases, quadrupeds) to operate within the room-scale scenes (Nasiriany et al., 4 Jun 2024).
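To illustrate how such a combinatorial scene space can be enumerated, the following minimal Python sketch builds 120 environment configurations from 10 floor plans and 12 styles with per-scene texture randomization. The identifiers and randomization fields are hypothetical, not RoboCasa's actual API:

```python
import itertools
import random

# Hypothetical identifiers; RoboCasa's actual asset names differ.
FLOOR_PLANS = [f"floor_plan_{i}" for i in range(10)]    # 10 layouts
KITCHEN_STYLES = [f"style_{j}" for j in range(12)]      # 12 visual styles

def generate_environments(texture_pool, seed=0):
    """Enumerate the 10 x 12 = 120 layout/style combinations and
    attach randomized textures for domain randomization."""
    rng = random.Random(seed)
    envs = []
    for plan, style in itertools.product(FLOOR_PLANS, KITCHEN_STYLES):
        envs.append({
            "floor_plan": plan,
            "style": style,
            # RoboCasa samples AI-generated textures per scene;
            # here we just draw from a placeholder pool.
            "textures": rng.sample(texture_pool, k=3),
        })
    return envs

envs = generate_environments(texture_pool=[f"tex_{k}" for k in range(50)])
assert len(envs) == 120
```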

CALVIN is a simulated tabletop manipulation benchmark where a 7-DOF Franka Emika Panda manipulator interacts within four visually distinct yet structurally similar indoor scenes. Each environment varies object and background textures and the placement of static elements (e.g., drawers, switches) to encourage domain generalization. CALVIN provides rich multimodal sensing, including RGB-D images from fixed and gripper cameras, proprioceptive state, and tactile signals—mirroring real-world sensory conditions (Mees et al., 2021).
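A rough sketch of the kind of multimodal observation an agent receives per step is shown below; the field names and shapes are illustrative assumptions, not CALVIN's exact interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CalvinObservation:
    """Illustrative per-step observation; field names and shapes are assumptions."""
    rgb_static: np.ndarray      # fixed-camera RGB, e.g. (200, 200, 3)
    depth_static: np.ndarray    # fixed-camera depth, e.g. (200, 200)
    rgb_gripper: np.ndarray     # wrist-camera RGB
    depth_gripper: np.ndarray   # wrist-camera depth
    tactile: np.ndarray         # tactile sensor reading
    proprio: np.ndarray         # joint positions/velocities, gripper state

def select_modalities(obs: CalvinObservation, use_tactile: bool = True):
    """CALVIN experiments configure which sensor streams a policy
    consumes; this helper mimics that choice."""
    streams = [obs.rgb_static, obs.rgb_gripper, obs.proprio]
    if use_tactile:
        streams.append(obs.tactile)
    return streams
```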

| Benchmark | Main Domain | #Scenes/Envs | Object Diversity | Cross-Embodiment | Physics Backend |
|---|---|---|---|---|---|
| RoboCasa | Room-scale kitchen | 120 | 2,500+ objects (153 categories) | Yes | MuJoCo + Omniverse rendering |
| CALVIN | Tabletop | 4 | Dozens | No | PyBullet |

The combination of diverse layouts, realistic visual and physical properties, and flexible embodiment enables both RoboCasa and CALVIN to act as high-fidelity proxies for household and industrial manipulation environments.

2. Task Design, Language Grounding, and Long-Horizon Evaluation

RoboCasa offers 100 tasks: 25 atomic skills (pick-place, open/close, turn/press, insertion, navigation) and 75 composite tasks composed via LLM-guided sequencing, mimicking naturalistic kitchen workflows. Composite tasks are built through a dual-prompt LLM pipeline, ensuring diversity and ecological validity in task structure. The atomic tasks feature multiple variants (e.g., by object and location), supporting fine language disambiguation and perceptual grounding at scale. These tasks are systematically instrumented for benchmarking and can be recombined for new task curricula (Nasiriany et al., 4 Jun 2024).
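The dual-prompt pipeline can be pictured as one prompt proposing a high-level activity and a second prompt decomposing it into atomic skills. The sketch below is a schematic reconstruction with a placeholder `llm()` call; the prompt wording, skill names, and function signatures are assumptions:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

# Hypothetical skill vocabulary standing in for RoboCasa's atomic skills.
ATOMIC_SKILLS = ["pick_place", "open_door", "close_door",
                 "turn_knob", "press_button", "insert", "navigate"]

def propose_composite_task():
    # Prompt 1: propose a naturalistic kitchen activity.
    activity = llm("Suggest a common household kitchen activity.")
    # Prompt 2: decompose the activity into an ordered skill sequence,
    # restricted to the benchmark's atomic skill vocabulary.
    plan = llm(
        f"Decompose '{activity}' into an ordered list of steps, "
        f"using only these skills: {', '.join(ATOMIC_SKILLS)}."
    )
    return activity, plan
```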

CALVIN defines 34 manipulation skills (e.g., press button, open drawer), each associated with approximately 11 crowd-sourced language instructions. Unlike RoboCasa, CALVIN emphasizes language-conditioned, long-horizon control, requiring agents to execute chains of up to 5 instructions sequentially. Each instruction may have multiple synonymous expressions, and instruction generalization is explicitly measured via zero-shot evaluation with unseen language commands (Mees et al., 2021).

Both benchmarks use automatic task success detectors based on physical world state changes to standardize rollout evaluation. Task sequencing in RoboCasa is guided by LLMs, while in CALVIN, language instructions are used directly as input to the agent, with the challenge lying in robust semantic grounding and skill composition.
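Success detection in both benchmarks reduces to a predicate over simulator state before and after a rollout. A minimal sketch, where the state fields, threshold, and environment interface are hypothetical:

```python
def drawer_open_success(state_before: dict, state_after: dict,
                        threshold: float = 0.12) -> bool:
    """Example detector for an 'open drawer' task: success iff the drawer
    joint travels past a displacement threshold. Field names and the
    threshold value are illustrative assumptions."""
    delta = state_after["drawer_joint_pos"] - state_before["drawer_joint_pos"]
    return delta > threshold

def evaluate_rollout(env, policy, detector, max_steps=300):
    """Roll out a policy and score it with a state-based detector.
    The env interface (reset/step/get_state/get_obs) is assumed."""
    env.reset()
    state_before = env.get_state()
    obs = env.get_obs()
    for _ in range(max_steps):
        obs = env.step(policy(obs))
    return detector(state_before, env.get_state())
```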

3. Data Generation, Imitation Learning, and Multimodal Architectures

RoboCasa utilizes a multi-tiered data generation strategy: human operators collect 50 high-quality teleoperated demonstrations per atomic task, which serve as seeds for automated trajectory expansion via the MimicGen system. MimicGen decomposes demos into object-centric segments, adapts them to new scene configurations, and applies rejection sampling to ensure only successful task executions are included. This pipeline yields methodologically sound datasets exceeding 100,000 trajectories with minimal human labor. Policy learning is executed through Transformer-based behavioral cloning, with perceptual input from multiple RGB cameras, proprioception, and (optionally) language conditioning using CLIP sentence embeddings. Training datasets are systematically scaled up to thousands of demos per task to isolate scaling effects on policy performance (Nasiriany et al., 4 Jun 2024).
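In spirit, MimicGen's expansion is a transform-and-filter loop: adapt each object-centric demo segment to a newly randomized scene, execute it, and keep only successful rollouts. The following is a hedged sketch of that loop, not MimicGen's actual code; the demo structure and environment methods are assumptions:

```python
import random

def expand_demos(seed_demos, env, adapt_segment, success_fn,
                 target_count=100_000, max_attempts=1_000_000):
    """Rejection-sampling expansion of human demos (schematic).
    `adapt_segment` retargets an object-centric segment to the current
    object poses; all interfaces here are assumptions."""
    dataset = []
    attempts = 0
    while len(dataset) < target_count and attempts < max_attempts:
        attempts += 1
        demo = random.choice(seed_demos)
        env.reset()  # new randomized scene configuration
        trajectory = []
        for segment in demo.segments:
            actions = adapt_segment(segment, env.get_object_poses())
            for a in actions:
                trajectory.append((env.get_obs(), a))
                env.step(a)
        if success_fn(env.get_state()):       # rejection step:
            dataset.append(trajectory)        # keep only successes
    return dataset
```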

CALVIN’s data consists of roughly 24 hours of teleoperated “play” data, mostly unstructured interaction with only about 1% annotated via crowd-sourced natural language. Agents are primarily trained using multi-context imitation learning (MCIL), which frames demonstration relabeling as goal-conditioned learning. MCIL uses a CVAE structure to infer latent plans for reaching goal states, optimizing the learning-from-play objective

$$L_{LfP} = \mathbb{E}_{(\tau, x_g) \sim D_{play}} \left[ \sum_{t} \log \pi_{\theta}(a_t \mid x_t, x_g) \right],$$

and extends to learning long-horizon behavior through sequence modeling and language-visual fusion. Sensor combinations (e.g., static/gripper camera, tactile, proprioception) are configurable (Mees et al., 2021).
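To make the objective concrete, the PyTorch sketch below implements the goal-conditioned behavioral-cloning term with hindsight goal relabeling. The policy interface and tensor shapes are assumptions, and the CVAE plan inference that MCIL layers on top is omitted for brevity:

```python
import torch

def lfp_loss(policy, batch):
    """Negative log-likelihood form of the LfP objective:
    maximize sum_t log pi(a_t | x_t, x_g) over play windows relabeled
    with their final state as the goal (hindsight relabeling)."""
    states = batch["states"]             # (B, T, state_dim)
    actions = batch["actions"]           # (B, T, action_dim)
    goals = states[:, -1]                # x_g = final state of the window
    goals_per_step = goals.unsqueeze(1).expand(-1, states.size(1), -1)
    # `policy.log_prob` is an assumed interface returning (B, T)
    # log-probabilities log pi_theta(a_t | x_t, x_g).
    log_probs = policy.log_prob(actions, states, goals_per_step)
    return -log_probs.sum(dim=1).mean()  # minimize NLL over the batch
```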

| Benchmark | Demo Collection | Human/Machine Mix | Language Annotations | Core Policy Learning |
|---|---|---|---|---|
| RoboCasa | Human teleop + MimicGen auto-expansion | Hybrid | Via CLIP embeddings and LLM-guided task prompts | Behavioral cloning (Transformer, CLIP) |
| CALVIN | Human teleoperated “play” | Human | Sparse (~1% of data), ~11 phrasings per task | MCIL (CVAE), sequence modeling |

These protocol details facilitate evaluation of algorithms across sample-efficiency, sensor integration, language grounding, and robustness criteria.

4. Benchmark Comparisons, Metrics, and Empirical Findings

Both RoboCasa and CALVIN benchmarks systematically evaluate success rate—the proportion of trials where the agent completes the instructed task. RoboCasa investigates scaling effects by tracking performance across increasing dataset sizes and demonstrates a monotonic increase in mean task success, e.g., up to 47.6% with full synthetic data (from 28.8% using human demos only). Task generalization is assessed across environments with unseen stylistic configurations and object instances. Real-robot transfer experiments support the simulation-to-reality generalization claim: for pick-and-place between counter and sink, training with simulation data raises success from 13.6% (real-only) to 24.4% (sim+real) (Nasiriany et al., 4 Jun 2024).

CALVIN’s evaluation decomposes into Multi-Task Language Control (MTLC), Long-Horizon MTLC (LH-MTLC), and zero-shot transfer. Single-task MCIL achieves up to ~53.9% success; however, performance collapses for long task chains: 49% for the initial subtask in a 5-step sequence, but only 0.08% for the entire sequence. The stringent zero-shot split—train on three environments, test on a fourth, with novel language—exposes the brittleness of current policy representations and feature fusion of language and vision cues (Mees et al., 2021).
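The long-horizon metric commonly reported on CALVIN is the average number of consecutive subtasks completed per 5-instruction chain, which is the basis of scores such as the 4.01 cited in the next paragraph. A schematic Python sketch of that metric, with an assumed environment interface:

```python
def chain_rollout_length(env, policy, chain, subtask_success, max_steps=360):
    """Execute an instruction chain; return how many consecutive
    subtasks succeed before the first failure (0..len(chain))."""
    completed = 0
    obs = env.get_obs()
    for instruction in chain:
        solved = False
        for _ in range(max_steps):
            obs = env.step(policy(obs, instruction))
            if subtask_success(env.get_state(), instruction):
                solved = True
                break
        if not solved:
            break            # the chain stops at the first failed subtask
        completed += 1
    return completed

def lh_mtlc_score(env, policy, chains, subtask_success):
    """Average chain rollout length over all evaluation chains; a score
    of 4.01 means ~4 of 5 subtasks completed on average."""
    lengths = [chain_rollout_length(env, policy, c, subtask_success)
               for c in chains]
    return sum(lengths) / len(lengths)
```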

More recent architectures (e.g., MoDE, PointMapPolicy) have been deployed and extensively benchmarked on both RoboCasa and CALVIN. These models employ advanced fusion and mixture-of-experts mechanisms; for example, PointMapPolicy achieved an average 49.1% success rate on RoboCasa atomic tasks, and substantially outperformed other models on long-horizon CALVIN evaluation, with a score of 4.01 (average instruction chain rollout length) (Reuss et al., 17 Dec 2024, Jia et al., 23 Oct 2025). Notably, ablation studies indicate the necessity of fusing geometric (point maps) and visual (RGB) features for robust generalization across both platforms.
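A minimal version of the geometric-plus-visual fusion that such ablations probe is per-pixel concatenation of an xyz point map with RGB before encoding. This is a schematic assumption about the general technique, not PointMapPolicy's actual architecture:

```python
import torch
import torch.nn as nn

class PointMapRGBFusion(nn.Module):
    """Fuse a per-pixel xyz point map with RGB by channel concatenation,
    then encode with a small CNN. Purely illustrative; the real model's
    fusion mechanism and encoder are more sophisticated."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1),  # rgb(3)+xyz(3)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, rgb: torch.Tensor, point_map: torch.Tensor):
        # rgb: (B, 3, H, W); point_map: (B, 3, H, W) camera-frame xyz.
        return self.encoder(torch.cat([rgb, point_map], dim=1))
```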

5. Baseline Limitations, Open Challenges, and Directions for Advancement

Empirical evidence from both RoboCasa and CALVIN indicates that, despite solid performance on isolated or short-horizon tasks, agent success rates drop sharply when executing multi-stage skills or transferring to unseen environments and linguistic constructions. In RoboCasa, composite tasks remain non-trivial; fine-tuning models pretrained on atomic tasks yields gains, but general mastery remains elusive (Nasiriany et al., 4 Jun 2024).

In CALVIN, baseline MCIL policies struggle particularly with skill chaining, confounding similar language instructions (e.g., “red” vs. “blue”) and exhibiting initial state sensitivity (“causal confusion” between proprioceptive cues and language input). These findings highlight the need for more effective sensor fusion, data augmentation, domain adaptation, and feedback-driven planning methods. Future advances may arise from improved depth input exploitation (texture invariance), enhanced alignment losses bridging language and vision pathways, and RL hybrids that address demonstration-following limitations in offline regimes (Mees et al., 2021).

Emerging architectures like MoDE leverage mixture-of-expert denoisers and noise-conditioned routing to improve both computational efficiency (using up to 90% fewer FLOPs with competitive performance) and generalization through diverse multi-dataset pretraining (Reuss et al., 17 Dec 2024). PointMapPolicy demonstrates that integrating structured point cloud representations yields particular benefit for geometry-heavy RoboCasa tasks, reinforcing the importance of multi-modal learning architectures and pretraining strategies (Jia et al., 23 Oct 2025).
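The routing idea can be sketched as selecting a sparse subset of expert denoisers from the diffusion noise level alone. The gating and expert definitions below are illustrative assumptions about the mechanism, not MoDE's implementation:

```python
import torch
import torch.nn as nn

class NoiseConditionedMoE(nn.Module):
    """Route each denoising step to a sparse subset of expert MLPs
    based on a noise-level embedding (schematic)."""
    def __init__(self, dim=256, num_experts=4, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # gate from noise embedding
        self.top_k = top_k

    def forward(self, x, noise_embed):
        # Gating depends only on the noise embedding, so experts can
        # specialize by noise level and routing can be precomputed per
        # denoising step, which is one source of FLOP savings.
        gates = torch.softmax(self.router(noise_embed), dim=-1)  # (B, E)
        topv, topi = gates.topk(self.top_k, dim=-1)              # sparse routing
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k, None] * self.experts[e](x[mask])
        return out
```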

6. Benchmarking Methodologies, Real-World Validation, and Community Impact

Both benchmarks have established protocols for dataset release, documentation, and code availability, supporting reproducibility and cross-lab comparison. RoboCasa’s integration with high-fidelity, photorealistic renderers and cross-platform support positions it as a next-generation platform for academic and industrial research targeting generalist robots in realistic home environments. CALVIN’s systematic language generalization splits and emphasis on compositional task execution have made it a standard for evaluating language-conditioned policy robustness in the embodied AI literature.

Subsequent benchmarks such as RoboCAS (Zheng et al., 9 Jul 2024) and RoboChallenge (Yakefu et al., 20 Oct 2025) draw on the design philosophy exemplified by RoboCasa and CALVIN but extend evaluation toward real-robot deployments, dynamic object arrangements, and fine-grained metrics (e.g., stepwise “progress score”). This suggests that RoboCasa and CALVIN continue to shape the evolution of embodied AI benchmarking, providing foundational environments against which new agents and architectures are compared.
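In its simplest hedged reading, a stepwise progress score credits partial completion rather than scoring whole episodes pass/fail; the sketch below is a generic reconstruction of that idea, not RoboChallenge's precise formula:

```python
def progress_score(completed_milestones: int, total_milestones: int) -> float:
    """Fraction of task milestones reached in an episode (0.0 to 1.0).
    Milestone definitions are benchmark-specific; this is an
    illustrative reconstruction."""
    if total_milestones <= 0:
        raise ValueError("total_milestones must be positive")
    return completed_milestones / total_milestones

# e.g. reaching 3 of 4 milestones yields a progress score of 0.75
assert progress_score(3, 4) == 0.75
```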

7. Synthesis and Conclusion

RoboCasa and CALVIN define the modern landscape of simulated robotic manipulation testing, emphasizing not only breadth in environmental and task diversity but also depth in multi-modal perception, language grounding, and long-horizon planning. RoboCasa excels as a large-scale, richly textured, LLM-guided simulation with robust task generation and data scaling methodologies; CALVIN leads in language-instructed manipulation and compositional skill sequencing under continuous control and variable sensor streams.

Ongoing research in imitation learning, policy architecture optimization, and multi-modal fusion is driven by the empirical challenges captured by these platforms. Progress on these benchmarks correlates with advancements in generalizable, sample-efficient, and robust robotic control policies suitable for real-world application in unstructured household domains. Both benchmarks continue to serve as critical proving grounds for evaluating the efficacy and scalability of generalist robotic systems.
