Graded Iconicity Ratings
- Graded iconicity ratings are continuous measures that assess the degree of resemblance between perceptible forms and their meanings using human judgments or model predictions.
- They leverage both class-independent and class-dependent indicators—such as object size, occlusion, and feature distances—to predict iconicity across images, sketches, sign languages, and phonological forms.
- Evaluation through inter-annotator consistency and model correlation metrics demonstrates that these ratings offer reproducible insights for enhancing both human cognition studies and AI modeling.
 
Graded iconicity ratings quantify the extent to which an entity’s observable form (visual, gestural, or phonological) resembles or suggests its meaning, yielding a continuous measure rather than a binary classification. While iconicity is pervasive across modalities—ranging from images and sketches to sign languages and phonological forms—graded ratings provide a principled, reproducible method for evaluating and modeling this property in both human and machine perception systems. The following sections synthesize key findings, methodologies, and implications from the literature on graded iconicity ratings in visual, linguistic, and multimodal contexts.
1. Conceptualization and Consistency of Graded Iconicity
Graded iconicity ratings systematically index the resemblance between form and meaning on a continuous or ordinal scale, typically through human judgment or model inference. In "What makes an Image Iconic?" (Zhang et al., 2014), an image’s iconicity is rated by non-experts using a three-level ordinal scale (0 = "bad", 1 = "fair", 2 = "good"), applied within contextually relevant sets (e.g., images of the same bird species). Notably, inter-annotator consistency, assessed by Spearman’s rank correlation (SRC), reached ρ ≈ 0.485–0.497 with p < 0.05, establishing that iconicity ratings are robust and not purely subjective. Comparable reliability is observed in other modalities, such as sign language, where crowdsourced continuous iconicity ratings (scale 1–7) show moderate agreement and inform the benchmarking of model predictions (Keleş et al., 9 Oct 2025).
These ratings support nuanced distinctions between degrees of iconicity and enable comparisons across entities, contexts, and experimental paradigms.
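The inter-annotator consistency check described above can be reproduced on toy data. The sketch below uses scipy's `spearmanr`; the ratings are invented for illustration and are not drawn from any of the cited studies:

```python
from scipy.stats import spearmanr

# Hypothetical 3-level ordinal iconicity ratings (0 = "bad", 1 = "fair",
# 2 = "good") from two annotators over the same ten images.
annotator_a = [2, 1, 0, 2, 1, 0, 2, 2, 1, 0]
annotator_b = [2, 1, 0, 1, 2, 0, 2, 1, 1, 0]

# Spearman's rank correlation handles ordinal data and resolves ties
# via average ranks, so it suits coarse rating scales.
rho, p_value = spearmanr(annotator_a, annotator_b)
print(f"inter-annotator SRC: rho = {rho:.3f}, p = {p_value:.4f}")
```

A significant positive ρ on held-out annotator pairs is what licenses treating the averaged ratings as a stable target for model prediction.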
2. Indicator Properties and Predictive Modeling
A central aim in graded iconicity research is to identify features that predict or correlate with iconicity. Two principal classes of features emerge:
A. Class-Independent Indicators:
- Object Size and Position: Larger relative object area (e.g., "BB-size": bounding box percentage) and central positioning ("BB-dist2center") are strong predictors of image iconicity (Zhang et al., 2014).
- Occlusion: The number of visible key parts directly correlates with higher iconicity scores.
- Aesthetics and Memorability: Predictors trained on datasets like AVA (for aesthetics) and SUN (for memorability) contribute quantitative signals, with domain-specific models yielding the highest correlations.
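The two geometric indicators can be computed directly from a bounding box. The function names below mirror the feature labels in Zhang et al. (2014), but the exact normalizations are assumptions for illustration:

```python
def bb_size(bbox, img_w, img_h):
    """Fraction of image area covered by the bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return ((x2 - x1) * (y2 - y1)) / (img_w * img_h)

def bb_dist2center(bbox, img_w, img_h):
    """Distance from the box centre to the image centre, normalised by
    half the image diagonal (0 = perfectly centred, ~1 = in a corner)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx, dy = cx - img_w / 2, cy - img_h / 2
    half_diag = ((img_w ** 2 + img_h ** 2) ** 0.5) / 2
    return (dx ** 2 + dy ** 2) ** 0.5 / half_diag
```

A large, centred object then yields high `bb_size` and low `bb_dist2center`, both of which the paper finds predictive of higher iconicity.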
 
B. Class-Dependent Indicators:
- Cluster Center Distance: Negative squared Euclidean distance to the class mean in feature space (e.g., GIST, Fisher Vectors) discriminates more iconic exemplars.
- Classifier and Attribute-based Scores: Linear SVM outputs and attribute agreement (e.g., 312 binary CUB dataset attributes) measure how representative an instance is of its category and thus its iconic status.
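The cluster-center indicator reduces to a few lines of numpy. The toy 2-D features below stand in for the GIST or Fisher Vector descriptors used in the paper:

```python
import numpy as np

def cluster_center_scores(features, class_features):
    """Score each instance by the negative squared Euclidean distance to
    its class mean in feature space; higher scores = more iconic."""
    mu = class_features.mean(axis=0)
    return -np.sum((features - mu) ** 2, axis=1)

# Toy example: three instances of one class in a 2-D feature space.
feats = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores = cluster_center_scores(feats, feats)  # class mean is (2, 2)
```

Here the middle instance sits closest to the class mean and receives the highest (least negative) score, i.e., it is the most "representative" exemplar.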
 
These indicators are linearly combined or weighted using learned SVMs; both direct and indicator-based prediction schemes approach inter-human reliability in predicting graded ratings (SRC ≈ 0.415–0.459), especially when using robust features such as Fisher Vectors and RBF kernels (Zhang et al., 2014).
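A minimal sketch of the combination step, substituting ordinary least squares for the paper's learned SVM and using synthetic indicator scores (all data below is invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic indicator matrix: columns stand in for, e.g., BB-size,
# an occlusion score, and a cluster-center score (50 images, 3 indicators).
indicators = rng.random((50, 3))
true_w = np.array([0.6, 0.3, 0.1])
ratings = indicators @ true_w  # noiseless synthetic "human" ratings

# Learn a linear weighting of the indicators (least squares here,
# in place of the SVM-based weighting described in the text).
w, *_ = np.linalg.lstsq(indicators, ratings, rcond=None)
predicted = indicators @ w
```

In practice the learned predictions would be compared to held-out human ratings via Spearman's rank correlation, as in the SRC figures quoted above.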
3. Modality-Specific Realizations
Graded iconicity ratings have been generalized across multiple modalities:
- Sketches and Graphical Communication: Iconicity is defined as high-level visual similarity to a referent (in embedding space), with symbolicity capturing abstraction and category consistency (Qiu et al., 2021). Emergent communication games using neural agents demonstrate a graded continuum between iconic (resembling real referents) and symbolic (conventionalized) sketches, with metrics including communication accuracy, cosine similarity, and category separability.
- Sign Language and Gesture: In sign language, ratings (typically 1–7) are anchored in how well manual parameters (handshape, location, movement) visually suggest the gloss meaning (Hossain et al., 2023; Keleş et al., 9 Oct 2025). Automated systems like EdGCon extract sub-lexical properties and use neighbor-based similarity search, further modulated by semantic congruence (GloVe similarity), to assign ratings. For signs, VLM predictions are benchmarked against human ratings via rank correlation and effect size metrics.
- Phonological and Sound Symbolism: In spoken/written language, continuous iconicity ratings reflect the degree to which word forms carry sound–meaning correspondences. LLMs exhibit moderate to strong correlations with human-rated iconicity, contingent on model size, prompting, and exposure to task information (Loakman et al., 23 Sep 2024; Marklová et al., 10 Jan 2025).
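Embedding-based graded iconicity, as used for sketches, amounts to a cosine similarity in a shared feature space; any embedding vectors can be substituted for the VGG16/CLIP features mentioned above:

```python
import numpy as np

def cosine_iconicity(sketch_emb, referent_emb):
    """Graded iconicity as cosine similarity between a sketch embedding
    and its referent's embedding; bounded in [-1, 1], higher = more iconic."""
    a = np.asarray(sketch_emb, dtype=float)
    b = np.asarray(referent_emb, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

An identical pair scores 1.0, orthogonal embeddings score 0.0, and the continuum in between gives the graded measure that distinguishes iconic from increasingly symbolic (conventionalized) forms.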
 
Table: Example Indicator Properties and Their Modal Contexts
| Indicator Type | Modality | Example Implementation | 
|---|---|---|
| Object size, occlusion | Visual images | BB-size, part visibility | 
| Embedding similarity | Sketches, images | Cosine in VGG16/CLIP space | 
| Handshape/location | Sign/Gestures | Pose estimation/keypoint models | 
| Phonological similarity | Language | Edit/vector-based measures | 
4. Evaluation Methodologies
Quantitative evaluation of graded iconicity leverages both psychophysical protocols and model–human comparisons:
- Human Judgments: Ratings are collected in context (e.g., images in a set, videos of signs) and refined using inter-annotator agreement metrics (e.g., Spearman’s ρ, Cohen’s d, Fleiss’ κ).
- Model Prediction: Direct predictions (e.g., numeric ratings from VLMs given a meaning and a video of the sign) are compared to averaged human ratings (Keleş et al., 9 Oct 2025). Performance is ranked by correlational statistics (Spearman's ρ up to ≈ 0.61 for GPT-5 on sign language ratings).
- Feature Consistency: For images and sketches, resemblance is measured in high-dimensional embedding spaces, with cosine similarity offering graded alignment measures (e.g., icon2group, ingroup for generated imagery (Noord et al., 19 Sep 2025)).
- User Studies and Grounded Tasks: Recognition, contextual inference, and qualitative Likert ratings further anchor the assessment pipeline (e.g., participant recognition of AI-generated recreations of iconic photographs (Noord et al., 19 Sep 2025)).
 
Effective evaluation must mitigate bias (e.g., rating-scale compression and a tendency to overrate arbitrary items), control for context, and, in modeling work, verify generalization beyond the training datasets.
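One simple diagnostic for the scale-compression bias noted above is the ratio of model to human rating spread; the ratings below are hypothetical, on a 1–7 scale:

```python
import statistics

def compression_ratio(model_ratings, human_ratings):
    """Ratio of model to human rating spread (sample standard deviations);
    values well below 1.0 suggest the model is compressing the scale."""
    return statistics.stdev(model_ratings) / statistics.stdev(human_ratings)

# Hypothetical 1-7 iconicity ratings: the model hugs the scale midpoint
# even when humans use the full range.
human = [1.2, 6.8, 2.0, 5.5, 3.1, 6.0]
model = [3.5, 4.8, 3.8, 4.5, 4.0, 4.6]
ratio = compression_ratio(model, human)
```

A compressed model can still achieve high rank correlation, which is why spread-based checks complement the correlational statistics above rather than replace them.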
5. Applications and Generalization
Predictive and automated graded iconicity ratings are increasingly deployed in:
- Educational Interfaces and Annotation: Automatic selection of "iconic" exemplars enhances learning and annotation quality in fine-grained categories (Zhang et al., 2014).
- Sign Language and Gesture Standardization: Tools like EdGCon facilitate scalable, community-acceptable technical gesture creation, grounded in measurable similarity to established lexicons (Hossain et al., 2023).
- Multimodal Benchmarking: The Visual Iconicity Challenge provides diagnostic tasks for multimodal model grounding using iconicity as a lens for form–meaning mapping robustness (Keleş et al., 9 Oct 2025).
- AI-Generated Imagery: Semantic alignment metrics quantify the resemblance between generated and iconic images, revealing biases in generative models and guiding prompt engineering and data curation (Noord et al., 19 Sep 2025).
- Psycholinguistic Modeling: Sound symbolism experiments with LLMs clarify the models' degree of iconicity awareness and suggest roles for iconicity-aware training objectives (Loakman et al., 23 Sep 2024; Marklová et al., 10 Jan 2025).
 
These applications underscore the broad utility of graded iconicity for both practical system design and theoretical understanding.
6. Implications and Challenges
Across domains, several key implications and open questions emerge:
- Human-Machine Alignment: Models that are better at recovering structured form–meaning mappings (e.g., phonological features in sign language, visual features in sketches) are also those with higher alignment to human iconicity judgments (Keleş et al., 9 Oct 2025). However, systematic deviations persist, including the tendency of models to compress rating scales and overrate arbitrary instances.
- Generalizability: Evidence from cross-linguistic pseudoword tasks and multi-modal iconicity assignments indicates that the cues underpinning graded iconicity (e.g., length, phonological or visual similarity) are, to a degree, universal and can be leveraged by both humans and LLMs (Marklová et al., 10 Jan 2025).
- Data and Bias: Generative models do not consistently reproduce iconic imagery, highlighting biases toward frequency and stereotypical patterns in training data (Noord et al., 19 Sep 2025). This presents challenges for applications relying on faithfulness to historically or culturally meaningful icons.
- Future Directions: The refinement of cue extraction, the diversification of training data, end-to-end learning of drawing or signing policies, and explicit representation of graded iconicity in training objectives are active research directions.
 
Graded iconicity ratings constitute an operationalizable, empirically grounded bridge between observed form and interpreted meaning, supporting fine-grained evaluation and model improvement across visual, gestural, and linguistic domains.