Visual Iconicity Challenge
- The Visual Iconicity Challenge is a framework for assessing how visual sign forms map onto meaning in sign language, with emphasis on non-arbitrary and dynamic features.
- It evaluates models through three tasks: phonological form prediction, transparency (sign-to-gloss inference), and graded iconicity rating, scored with metrics such as per-parameter accuracy and Spearman’s ρ.
- Results show that current VLMs handle static, visually salient features reasonably well but struggle with dynamic, embodied cues, highlighting the need for improved multimodal model design.
Visual iconicity refers to the mapping between the perceptual form of a visual sign and its intended meaning, with particular emphasis on how visual resemblance can ground and support meaning inference. In the domain of signed languages, iconicity is not an incidental phenomenon but a systematically exploited feature of the lexicon, making sign language an ideal modality for probing the capabilities of multimodal vision-language models (VLMs). The “Visual Iconicity Challenge” evaluates VLMs by their capacity to recover and exploit these non-arbitrary form–meaning mappings in dynamic sign language data. Central to this approach is the development and application of psycholinguistically motivated benchmarks that assess not only low-level phonological detail but also higher-order transparency and graded iconicity judgments. This framework provides both a diagnostic and a comparative basis for evaluating the visual grounding of current and future multimodal AI systems (Keleş et al., 9 Oct 2025).
1. Foundations: Iconicity and Signed Languages
Iconicity, defined as the resemblance or direct non-arbitrary mapping between visual form and semantic content, is pervasive in signed languages. Lexical items in sign languages often exploit spatial, manual, or dynamic features to depict or mimic objects, actions, or properties—making the mapping between sign form and meaning more transparent than in most spoken languages. For instance, a sign might use a handshape and trajectory that imitates a physical object’s outline or enacts an associated action.
The structural richness of sign language—encompassing handshapes, locations on the body, trajectory shapes, repetition patterns, and handedness—serves as a fertile ground for analyzing how AI models recover visually grounded meaning from dynamic input. Human iconicity judgments in this context are well-established through psycholinguistic protocols that quantify the degree to which a sign’s form suggests its meaning.
2. Benchmark Tasks and Evaluation Metrics
The Visual Iconicity Challenge is composed of three tightly controlled evaluation tasks, each designed to probe a different dimension of the form–meaning continuum in sign language:
- Phonological Sign-Form Prediction: Models are tasked with predicting five discrete articulatory parameters (handshape, location, path shape, path repetition, handedness) from video data. Per-parameter and overall average accuracy are reported.
- Transparency (Form-to-Meaning Inference): Given a sign video, models must select the correct gloss (lexical translation) purely from visual cues. Both open-set (96-option) and reduced (10-option) identification settings are used to measure transparency, which psycholinguistically relates to how readily a sign’s form communicates its referent.
- Graded Iconicity Rating: Here, models output a numerical iconicity score (on a 1–7 scale) reflecting the degree to which the form and meaning are perceived as similar. Correlation with average human ratings is measured using Spearman’s rank correlation coefficient (ρ). Effect size between iconic and arbitrary signs is reported as Cohen’s d.
All three tasks are evaluated against human baselines, including both experts (deaf signers) and non-experts, as well as against random chance; a minimal sketch of the scoring metrics is given below.
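These metrics reduce to standard statistics. The following is a minimal sketch, assuming model outputs and reference annotations have already been collected into Python lists; the function names and data layout are illustrative, not the benchmark’s released code.

```python
import numpy as np
from scipy.stats import spearmanr

def parameter_accuracy(pred, gold):
    """Fraction of signs for which one phonological parameter (e.g., handshape) is correct."""
    return float(np.mean([p == g for p, g in zip(pred, gold)]))

def transparency_accuracy(chosen_glosses, correct_glosses):
    """Proportion of sign videos whose gloss is identified correctly in an N-choice setting."""
    return float(np.mean([c == g for c, g in zip(chosen_glosses, correct_glosses)]))

def iconicity_alignment(model_ratings, human_mean_ratings):
    """Spearman's rho between model iconicity ratings (1-7 scale) and mean human ratings."""
    rho, p_value = spearmanr(model_ratings, human_mean_ratings)
    return rho, p_value

def cohens_d(iconic_ratings, arbitrary_ratings):
    """Effect size separating ratings given to iconic vs. arbitrary signs (pooled SD)."""
    x = np.asarray(iconic_ratings, dtype=float)
    y = np.asarray(arbitrary_ratings, dtype=float)
    pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1))
                        / (len(x) + len(y) - 2))
    return float((x.mean() - y.mean()) / pooled_sd)
```

Overall phonological accuracy is then the mean of `parameter_accuracy` over the five parameters, and chance level follows directly from the number of options in each task.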
3. Model Evaluation: Findings and Comparative Outcomes
Thirteen state-of-the-art VLMs were evaluated on the Sign Language of the Netherlands (NGT) dataset, which is annotated with detailed phonological and iconicity metadata.
- Phonological Form: For visually explicit parameters such as location and handedness, models reach up to ~70% accuracy, but for more complex parameters such as handshape and path shape, only the top proprietary models (e.g., GPT-5, Gemini 2.5 Pro) exceed 50%. Human baselines remain higher, with mean accuracy around 0.79 across parameters.
- Transparency: In open-set identification, even the strongest models (GPT-5, Gemini 2.5 Pro) achieve only ~17–18% accuracy, a large gap relative to deaf signers, who identify roughly 57 of the 96 signs. In the reduced 10-option setting, VLMs improve (the best identify roughly 42 of the 96 signs) but still lag substantially.
- Iconicity Ratings: The best VLMs achieve moderate correlation with human iconicity ratings: GPT-5 reaches ρ ≈ 0.61, while leading open-source models attain ρ ≈ 0.50. Most models overly compress the rating range and overestimate the iconicity of arbitrary signs, failing to replicate the discriminatory spread of human judgments. Cohen’s d is used to quantify the distinction, with top models approaching d ≈ 1.4.
A noteworthy empirical finding is that models achieving higher phonological form prediction accuracy also correlate more closely with human iconicity judgments, indicating a shared underlying sensitivity to visual structure.
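This cross-model relationship can be probed with a simple rank correlation: pair each model’s mean phonological accuracy with the Spearman ρ of its iconicity ratings against human judgments, and correlate the two lists across models. A minimal sketch with placeholder numbers (not the reported per-model scores):

```python
from scipy.stats import spearmanr

# One entry per evaluated model (placeholder values, purely illustrative).
mean_phonological_accuracy = [0.42, 0.48, 0.51, 0.55, 0.60, 0.63]
iconicity_rho_vs_humans    = [0.18, 0.27, 0.33, 0.41, 0.50, 0.61]

rho, p = spearmanr(mean_phonological_accuracy, iconicity_rho_vs_humans)
print(f"Cross-model rank correlation: rho = {rho:.2f} (p = {p:.3f})")
```

A high positive value of this cross-model ρ supports the claim that sensitivity to phonological form and alignment with human iconicity judgments share an underlying visual competence.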
4. Diagnostic Insights and Analysis
The evaluation reveals clear asymmetries:
- Models tend to excel at static, visually salient features but underperform on dynamic, action-based iconicity that requires analyzing motion over time.
- There is a documented “static bias”: VLMs focus on object-based visual resemblance (e.g., locations, static handshapes) and neglect dynamic/embodied cues (e.g., path shape, repetition, mimetic actions) that are prominent in human iconicity judgments.
- Human baselines—especially deaf signers—consistently outperform models, underscoring the continuing gap in embodied visual grounding.
These patterns mirror well-known acquisitional asymmetries in human learners, where static features are easier to acquire, but also expose fundamental limitations in current model architectures and pretraining regimes.
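One way to make this asymmetry concrete is to split the five phonological parameters into broadly static versus dynamic groups and compare mean accuracy per group. A minimal sketch; the grouping and the accuracy values below are illustrative assumptions, not figures from the evaluation:

```python
import numpy as np

# Per-parameter accuracies for a single model (illustrative placeholder values).
accuracy = {
    "handshape": 0.52,        # static / configurational
    "location": 0.70,         # static / visually salient
    "handedness": 0.68,       # static / visually salient
    "path_shape": 0.41,       # dynamic / motion-based
    "path_repetition": 0.45,  # dynamic / motion-based
}

static_params = ["handshape", "location", "handedness"]
dynamic_params = ["path_shape", "path_repetition"]

static_mean = np.mean([accuracy[p] for p in static_params])
dynamic_mean = np.mean([accuracy[p] for p in dynamic_params])
print(f"static mean: {static_mean:.2f}  dynamic mean: {dynamic_mean:.2f}  "
      f"static bias: {static_mean - dynamic_mean:+.2f}")
```

A consistently positive gap across models would quantify the static bias described above.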
5. Implications for Visual Grounding and Model Design
The challenge’s results demonstrate that while some current VLMs capture visually grounded regularities, particularly at the sub-lexical level, complete recovery of non-arbitrary form–meaning mappings—especially at the level of transparency and nuanced iconicity judgments—remains elusive.
For future advancement:
- Human-Centric and Embodied Cues: The integration of human-derived features and body-anchored representations (e.g., pose keypoints, structured descriptors such as “fist moves upward near head”) is advocated. Such representations could be obtained from computer vision frameworks (e.g., MediaPipe, VideoPrism) or by incorporating auto-generated symbolic phonological descriptors as auxiliary targets; a pose-extraction sketch follows this list.
- Improved Dynamic Encoding: Augmenting training with instruction tuning or fine-tuning on gloss-annotated motion sequences may help models overcome their static biases.
- Data Collection and Task Expansion: Richer multimodal datasets and benchmarking with a wider array of signed languages and iconic phenomena will allow more robust generalization and comparative analysis.
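As an illustration of the body-anchored representations mentioned above, the sketch below uses MediaPipe’s Holistic solution (the legacy `mp.solutions` Python API, assumed available) to extract per-frame pose and hand landmarks from a sign video. Converting these landmarks into symbolic descriptors such as “fist moves upward near head” would require additional rules or a learned classifier, which are not shown; the file name and helper function are hypothetical.

```python
import cv2
import mediapipe as mp

def extract_landmarks(video_path: str):
    """Return per-frame MediaPipe Holistic landmarks (body pose + both hands) for a sign video."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    per_frame = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        per_frame.append({
            "pose": results.pose_landmarks,
            "left_hand": results.left_hand_landmarks,
            "right_hand": results.right_hand_landmarks,
        })
    cap.release()
    holistic.close()
    return per_frame

# landmarks = extract_landmarks("ngt_sign_clip.mp4")  # hypothetical input clip
```

Such keypoint sequences could then serve as auxiliary model inputs or as targets for auto-generated phonological descriptors, complementing raw video frames.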
6. Summary Table of Core Tasks and Evaluation
| Task | Evaluation Metric | Human Baseline | Top Model (example) |
|---|---|---|---|
| Phonological Form Prediction | Accuracy per parameter / mean | ~0.79 (overall) | ~0.60 (Qwen2.5-VL-72B); ~0.67 (Gemini-2.5-Pro, handshape) |
| Transparency (Open/Reduced) | % correct in N-choice gloss selection | Deaf signers: ~57/96 (open-set) | ~17–18% open-set (GPT-5, Gemini 2.5 Pro); ~42/96 in 10-choice |
| Graded Iconicity Rating | Spearman’s ρ vs. human ratings; Cohen’s d | Human ratings serve as reference | ρ ≈ 0.61, d ≈ 1.4 (GPT-5, Gemini 2.5 Pro) |
7. Future Directions
Current results validate the diagnostic power of adapted psycholinguistic tasks. The observed correlation between phonological form prediction and iconicity-rating alignment suggests that further progress requires better dynamic and embodied visual modeling. Incorporating pose-based, gesture-specific, and human-centered signals, and instructing models to attend to both static and dynamic features, is expected to advance VLMs’ capacity for genuine visual grounding in sign language and related domains.
This approach, by unifying human psycholinguistic practice with modern deep learning evaluation, establishes a new foundation for research at the intersection of multimodal AI, visual iconicity, and embodied language understanding (Keleş et al., 9 Oct 2025).