DevCV Toolbox Benchmark Suite
- DevCV Toolbox is a ten-task benchmark suite that adapts established NIH Baby Toolbox measures into computational tasks for assessing vision-language models.
- It evaluates key cognitive domains—receptive language, executive function/memory, and mathematical cognition—using static, video, and interactive image-text inputs.
- Experimental results highlight significant disparities across models, emphasizing challenges in domain adaptation, sample efficiency, and developmental plausibility.
DevCV Toolbox is a ten-task benchmark suite designed for developmentally grounded cognitive evaluation of vision-LLMs. It systematically adapts all vision-related measures of the NIH Baby Toolbox—a comprehensive, psychometrically validated neurodevelopmental assessment battery for infants and toddlers—into scalable, multimodal computational tasks. DevCV Toolbox spans receptive language, executive function/memory, and mathematical cognition, targeting model capabilities aligned with early childhood milestones and requiring processing across static images, video sequences, and interactive image-text exchanges. Its tasks standardize age ranges, difficulty, and input-output modalities to benchmark developmental plausibility and sample efficiency in vision-LLMs (Wang et al., 11 Dec 2025).
1. Benchmark Structure and Task Inventory
DevCV Toolbox consists of ten tasks, grouped into three cognitive domains corresponding to NIH Baby Toolbox subdomains: Language, Executive Function/Memory, and Math. Each task is a direct computational rendering of an established assessment measure—preserving input structure, prompts, age span, and cognitive demand.
Summary of Tasks, Domains, and NIH Mapping
| DevCV Task | Cognitive Domain | NIH Baby Toolbox Measure |
|---|---|---|
| Looking While Listening | Language | Looking While Listening |
| Picture Vocabulary | Language | Picture Vocabulary |
| Localization | Language | Mullen Receptive Language #19 |
| Left/Right | Exec. Function/Memory | Mullen Visual Reception #29 |
| Spatial Details | Exec. Function/Memory | Mullen Visual Reception #20 |
| Visual Delayed Response | Exec. Function/Memory | Visual Delayed Response |
| Memory | Exec. Function/Memory | Delayed Memory |
| Who Has More (synthetic/naturalistic) | Math | Who Has More |
| Subitizing (synthetic/naturalistic) | Math | Subitizing |
| Object Counting | Math | Object Counting |
Input modalities range from single/multiple static images, short video clips (5–8 frames), to multi-turn visual-linguistic sequences, reflecting the multimodal richness of infant experiences (Wang et al., 11 Dec 2025).
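For concreteness, a single benchmark item spanning these modalities can be sketched as a small record; the schema and field names below are illustrative assumptions, not the benchmark's released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical record for one DevCV Toolbox item; field names are illustrative,
# not the benchmark's released format.
@dataclass
class DevCVItem:
    task: str                                              # e.g. "picture_vocabulary", "visual_delayed_response"
    domain: str                                             # "language" | "exec_function_memory" | "math"
    prompt: str                                             # utterance or instruction presented with the stimuli
    images: List[str] = field(default_factory=list)         # paths to static image(s)
    video_frames: List[str] = field(default_factory=list)   # 5-8 frames for the video-based tasks
    turns: List[dict] = field(default_factory=list)         # prior image-text turns for multi-turn tasks (e.g. Memory)
    options: List[str] = field(default_factory=list)        # candidate answers (indices, quadrants, counts)
    answer: Optional[str] = None                            # gold label, e.g. "2", "top-left", "old"
```

Under this sketch, a Picture Vocabulary item would carry four paths in `images`, the target utterance in `prompt`, and the gold index in `answer`, while a Memory item would populate `turns` with its image presentation sequence.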
2. Task-Specific Design and Evaluation Protocols
Each task operationalizes a developmental skill via precisely defined interface, data source, and evaluation metric:
- Looking While Listening: Two images + audio prompt; binary choice (left/right). Test set: 1,200 SAYCam examples; 2-way categorical accuracy.
- Picture Vocabulary: Four candidate images + utterance; select target image index. Dataset: 63,900 instruction / 1,200 test (SAYCam); accuracy over four classes.
- Localization: Single image, textual object label; quadrant (top-left/right, bottom-left/right) prediction. Dataset: 12,300 instruction / 2,100 test; 4-way accuracy.
- Left/Right: Target image, two mirrored distractors; select index of untransformed image. Dataset: 12,300 instruction / 2,300 test; 3-way accuracy.
- Spatial Details: Like Left/Right but with same-category, non-identical distractors in context. Dataset: 11,800 instruction / 1,200 test; 3-way accuracy.
- Visual Delayed Response: Short video or image sequence; predict object exit region (multi-class or binary). Dataset: 5,200 instruction / 900 test. Metrics: exact region accuracy, adjacent region accuracy, binary correctness.
- Memory: Multi-turn, 29-image sequence with “old vs new” recognition. Dataset: 10,000 instruction / 500 test; an item is scored correct only when both of its recall trials are correct, and accuracy is averaged over items.
- Who Has More: Two images of same object type, differing counts; binary choice. Synthetic (black background) and naturalistic (SAYCam/Ego4D) datasets.
- Subitizing: Three sequential “flash” images (1–4 objects); integer output (1–4). Synthetic (1,900) and naturalistic (200) test samples; top-1 accuracy.
- Object Counting: One image (1–12 objects); integer count output. Dataset: 13,700 instruction / 3,000 test.
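As an illustration of how one such multiple-choice task might be posed and scored against a generic vision-language model, the snippet below assumes a placeholder `query_vlm` callable and the hypothetical `DevCVItem` record sketched earlier; it is not the paper's evaluation harness.

```python
import re
from typing import Callable, List

def score_picture_vocabulary(item, query_vlm: Callable[[List[str], str], str]) -> bool:
    """Score one hypothetical Picture Vocabulary item.

    `query_vlm` stands in for whatever VLM interface is used: it takes image
    paths plus a text prompt and returns the model's raw text reply.
    """
    prompt = (
        f"You are shown four images (1-4). Which image shows: '{item.prompt}'? "
        "Answer with a single digit."
    )
    reply = query_vlm(item.images, prompt)
    match = re.search(r"[1-4]", reply)           # take the first valid option index
    predicted = match.group(0) if match else ""  # unparsable replies count as wrong
    return predicted == item.answer
```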
All accuracy-based metrics share the canonical form

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

with task-specific adaptations for region prediction (exact-region vs. adjacent-region credit), binary/multi-class choice, and memory (an item counts only if both recall trials are correct), maintaining the parallels to the developmental assessments.
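A minimal sketch of these accuracy variants, assuming predictions and gold labels have already been parsed from model outputs; the 2×2 quadrant adjacency map is an illustrative assumption rather than the benchmark's exact region layout.

```python
from typing import Sequence, Tuple

def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Canonical top-1 accuracy shared by the choice and counting tasks."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Illustrative adjacency for quadrant-style regions (Localization,
# Visual Delayed Response); diagonal quadrants are not counted as adjacent.
ADJACENT = {
    "top-left": {"top-right", "bottom-left"},
    "top-right": {"top-left", "bottom-right"},
    "bottom-left": {"top-left", "bottom-right"},
    "bottom-right": {"top-right", "bottom-left"},
}

def adjacent_region_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Credit predictions that hit the gold region or one of its neighbours."""
    hits = sum(p == g or p in ADJACENT.get(g, set()) for p, g in zip(preds, golds))
    return hits / len(golds)

def memory_accuracy(trial_pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Memory task: an item counts only if both of its recall trials are correct."""
    return sum(a and b for a, b in trial_pairs) / len(trial_pairs)
```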
3. Developmental Mapping and Rationale
Each DevCV task faithfully mirrors the corresponding NIH Baby Toolbox measure, maintaining subdomain alignment, age-appropriate stimulus complexity, and response demands. Tasks such as Looking While Listening (6–24 months) and Picture Vocabulary target early receptive language acquisition, while the Executive Function/Memory tasks (Left/Right, Spatial Details, Memory, Visual Delayed Response) are drawn from measures in the Mullen Scales of Early Learning, the Bayley Scales, and related neurodevelopmental batteries. Math tasks (Who Has More, Subitizing, Object Counting) target numerosity perception and emergent counting skills (25–42 months).
This schema ensures that computational benchmarks capture the psychological constructs of working memory, object permanence, category discrimination, rapid small-number enumeration, and explicit counting, enabling principled assessments of developmental plausibility in vision-LLMs (Wang et al., 11 Dec 2025).
4. Dataset Construction and Benchmark Setup
Datasets leverage naturalistic corpora such as SAYCam and Ego4D, with object crops, images, and video segments automatically mined and manually screened to match the original Baby Toolbox paradigms. Distractor selection in the vocabulary and discrimination tasks follows prescribed semantic, phonological, and categorical similarity rules. For Visual Delayed Response, egocentric video segments with head or panoramic camera motion paired with static objects are used to guard against spatiotemporal confounds. All benchmarks minimize synthetic curation except where naturalistic data is unavailable or less psychometrically valid (e.g., the synthetic Who Has More variant).
Test sets for each task are of sufficient scale (typically 1–3k examples) to support robust generalization measurement, with instruction-tuning datasets orders of magnitude larger for a subset of tasks.
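A minimal sketch of same-category distractor sampling for the discrimination tasks, assuming each candidate crop carries category and identity labels; the benchmark's actual semantic and phonological similarity rules are richer than this.

```python
import random
from typing import Dict, List

def sample_distractors(target: Dict[str, str], pool: List[Dict[str, str]],
                       k: int = 2, seed: int = 0) -> List[Dict[str, str]]:
    """Pick k distractors sharing the target's category but not its identity.

    This captures only the categorical rule; semantic and phonological
    similarity constraints would further filter the candidate set.
    """
    rng = random.Random(seed)
    candidates = [
        c for c in pool
        if c["category"] == target["category"]    # same coarse category (e.g. "cup")
        and c["identity"] != target["identity"]   # but a different object instance
    ]
    return rng.sample(candidates, k)
```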
5. Experimental Results and Comparative Performance
The paper's Table 2 (in-domain, SAYCam) and Table 3 (out-of-domain, Ego4D) report head-to-head evaluations of state-of-the-art models and open-source baselines on the DevCV Toolbox (Wang et al., 11 Dec 2025); the headline in-domain results are:
| Model | Overall (SAYCam) | Notable Task Results |
|---|---|---|
| Human Adults | 93.0% | |
| GPT-5 | 87.6% | Strong across language/memory |
| Gemini 2.5 Pro | 82.5% | |
| GPT-4o | 74.6% | |
| BabyLLaVA-V2 | 55.2% | Spatial Details 91.3%, Memory 75.3%, Who Has More (synthetic) 98.4%, Count 44.6% |
| Open-source, <3B params | 33–47% | |
Key findings:
- BabyLLaVA-V2 trained from scratch on ≈1M multimodal SAYCam samples achieves 55.2% overall, with 91.3% in Spatial Details, 75.3% in Memory, 98.4% in Who Has More (synthetic), and outperforms GPT-4o in counting tasks (44.6% vs 39.0% for Count, 98.4% vs 87.9% for Who Has More).
- Domain shift to Ego4D markedly reduces performance for BabyLLaVA-V2 (41.1%) but not for GPT-5 or Gemini 2.5 Pro (87–88%), indicating limited generalization without in-domain adaptation.
- Ablation studies: instruction tuning confers substantial gains (30–40% absolute improvement for open-source VLMs), and the use of GPT-4o synthetic captions yields slightly better results than human transcript pretraining. Skipping pretraining substantially compromises few-shot capabilities.
- Random-guess baselines are provided per task (e.g., 8.3% for Object Counting, 50% for binary-choice tasks); these follow directly from the number of response options, as worked out below.
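The chance level for a $K$-way choice is simply

$$\text{chance accuracy} = \frac{1}{K}: \quad \tfrac{1}{12} \approx 8.3\% \ \text{(Object Counting)}, \quad \tfrac{1}{4} = 25\% \ \text{(Picture Vocabulary, Localization)}, \quad \tfrac{1}{3} \approx 33.3\% \ \text{(Left/Right, Spatial Details)}, \quad \tfrac{1}{2} = 50\% \ \text{(binary tasks)}.$$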
6. Significance and Implications
DevCV Toolbox constitutes a rigorous, developmentally motivated benchmark for vision-language learning, directly bridging the gap between computational models and established psychological assessment in early child development. By leveraging the format, stimulus diversity, and age calibration of NIH Baby Toolbox tasks, it enables systematic comparison of model sample efficiency, transfer, and cognitive skill compositionality.
A plausible implication is that models assessed with DevCV can be directly compared to human developmental progression stages, thus enabling both more interpretable model evaluation and principled guidance for pretraining strategy and dataset construction. The limitations observed in domain transfer—and the extreme variability of subdomain performance—highlight challenges for architectural generalization and training data diversity.
DevCV Toolbox provides a unified reference for tracking progress toward developmentally plausible and sample-efficient vision-language foundation models (Wang et al., 11 Dec 2025).