Words of Warmth Lexicon
- The Words of Warmth Lexicon is a comprehensive suite capturing association norms for trust, sociability, and warmth across over 26,000 English words.
- It employs rigorous annotation methods and robust reliability metrics, including split-half correlations, to ensure accurate social perception ratings.
- The lexicon supports quantitative research on linguistic bias, stereotype analysis, and language development with actionable insights into social cognition.
The Words of Warmth Lexicon is a large-scale suite of association norms capturing perceived trust, sociability, and warmth for over 26,000 common English words. Grounded in social psychological theory, the lexicon facilitates quantitative analysis of the dimensions of interpersonal perception, enables developmental and applied investigations, and supports nuanced studies of linguistic bias and stereotypes. Trust (T) and Sociability (S) ratings are derived directly from human annotators; Warmth (W) is defined, for each word, as whichever of the two scores has the greater absolute value.
1. Theoretical Foundations and Dimensions
Competence (C) and Warmth (W) constitute the primary dimensions for social cognition, as formulated by the Stereotype Content Model (Fiske et al. 2002). Warmth—a measure of perceived intent, encompassing friendliness and hostility—is further decomposed by recent research (Abele et al. 2016; Koch et al. 2024) into two components:
- Trust (T): Morality, honesty, integrity, sincerity, fairness.
- Sociability (S): Friendliness, gregariousness, conviviality.
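Formally, each word in the lexicon is indexed with three real-valued scores $T$, $S$, and $W$ on $[-3, 3]$. Trust and Sociability are empirically established through annotation; Warmth is operationalized as the component with the greater absolute value:

$$W = \underset{x \in \{T,\, S\}}{\arg\max}\; |x|$$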
The evolutionary and developmental literature indicates that warmth-based judgments emerge early in childhood, and that sociability precedes trust in early language acquisition.
2. Lexicon Construction and Reliability
2.1 Term Selection
The source vocabulary comprises approximately 44,000 unigrams from the NRC VAD Lexicon v2, filtered to exclude terms with near-neutral valence, yielding 26,188 emotionally salient unigrams.
2.2 Annotation Procedure
Ratings were crowdsourced via Amazon Mechanical Turk, with participation restricted to native English speakers (69% from the USA; the rest from the UK, Canada, and India). Annotator demographics: mean age 39.2 years; 48% female and 52% male (self-reported). Each target was rated on 7-point bipolar scales for Trust and Sociability ($-3$ = "very untrustworthy/unsociable", $+3$ = "very trustworthy/sociable", $0$ = "neither"). Task instructions detailed the meaning of each dimension, gave examples, and prompted annotators to consult dictionaries for ambiguous items.
2.3 Quality Control and Aggregation
“Gold” control items (~2%) were used for real-time and silent accuracy feedback; annotators with sub-80% gold accuracy had their contributions excluded. Per-word Trust and Sociability scores are aggregated from the retained annotations, and Warmth is then assigned the value of whichever component has the greater absolute value.
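The quality filter and the Warmth rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and data layout are hypothetical, averaging the retained ratings is an assumption (the text does not state the aggregation formula), and so is the tie-breaking rule when the two components have equal magnitude.

```python
from collections import defaultdict
from statistics import mean

GOLD_ACCURACY_MIN = 0.80  # from the text: sub-80% annotators are excluded


def aggregate_scores(ratings, gold_accuracy):
    """Average the retained ratings per word (averaging is an assumption).

    ratings: iterable of (annotator_id, word, rating), rating in -3..3
    gold_accuracy: dict annotator_id -> fraction of gold items answered correctly
    """
    by_word = defaultdict(list)
    for annotator, word, rating in ratings:
        # Drop every contribution from annotators below the gold threshold.
        if gold_accuracy.get(annotator, 0.0) >= GOLD_ACCURACY_MIN:
            by_word[word].append(rating)
    return {word: mean(values) for word, values in by_word.items()}


def warmth(trust, sociability):
    """Return the component with the greater absolute value.

    Ties in magnitude go to trust here; that choice is an assumption.
    """
    return trust if abs(trust) >= abs(sociability) else sociability


# "consoler" in the illustrative examples of Section 3.4: T = 2.00, S = 3.00
print(warmth(2.00, 3.00))  # 3.0
```

With the values listed for "consoler" in Section 3.4, the rule selects the Sociability score, which agrees with the Warmth score shown there.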
2.4 Reliability Metrics
Split-half reliability (SHR) was assessed over 1,000 random splits with the following results:
| Dimension | Mean Annots/Word | Spearman | Pearson |
|---|---|---|---|
| Sociability (S) | 7.9 | 0.965 | 0.969 |
| Trust (T) | 11.4 | 0.943 | 0.957 |
| Warmth (W) | 8.8 | 0.965 | 0.974 |
All correlations are reported as .
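For readers unfamiliar with split-half reliability, the sketch below shows one common way to compute it: the ratings for each word are randomly divided into two halves, each half is averaged, and the two resulting score lists are correlated; the procedure is repeated and the correlations are averaged. The function names, the use of per-half means, and averaging the correlations across splits are assumptions for illustration, and no Spearman-Brown correction is applied; the source does not specify these details.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def split_half_reliability(ratings_by_word, n_splits=1000, seed=0):
    """Mean (Spearman, Pearson) correlation between random half-split scores.

    ratings_by_word: dict word -> list of individual ratings (at least 2 each)
    """
    rng = random.Random(seed)
    spearmans, pearsons = [], []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for ratings in ratings_by_word.values():
            shuffled = ratings[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a.append(mean(shuffled[:mid]))
            half_b.append(mean(shuffled[mid:]))
        spearmans.append(spearman(half_a, half_b))
        pearsons.append(pearson(half_a, half_b))
    return mean(spearmans), mean(pearsons)
```

Values close to 1, such as those in the table, indicate that two independent halves of the annotator pool would rank and score the words almost identically.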
3. Lexicon Statistics and Distributions
3.1 Categorical Labeling
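Each word is assigned a categorical label on a 7-class scale. For Warmth, the classes are Very Warm, Moderately Warm, Slightly Warm, Neutral, Slightly Cold, Moderately Cold, and Very Cold; the table below uses the corresponding High/Low labels for all three dimensions. The class proportion breakdown is: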
| Dimension | Very High | Moderately High | Slightly High | Neutral | Slightly Low | Moderately Low | Very Low |
|---|---|---|---|---|---|---|---|
| Trust (T) | 2.8 % | 13.8 % | 12.3 % | 38.6 % | 13.3 % | 14.6 % | 4.5 % |
| Sociability (S) | 11.2 % | 12.0 % | 12.7 % | 16.4 % | 13.4 % | 27.0 % | 7.4 % |
| Warmth (W) | 12.3 % | 17.0 % | 12.7 % | 10.5 % | 11.7 % | 26.9 % | 8.8 % |
3.2 Empirical Distributions
The distributions for T, S, and W are approximately zero-centered, with standard deviations of 1.2–1.3 on each scale.
3.3 Inter-Dimension Correlations
Empirical inter-correlations among the real-valued T, S, and W scores across the 26k words are moderate to strong.
3.4 Illustrative Word Examples
| Dimension | Top 3 (score) | Bottom 3 (score) |
|---|---|---|
| Trust (T) | consoler (2.00), cohesiveness (2.18), ethicist (2.50) | narcism (–3.00), horrible (–2.78), denigration (–2.44) |
| Sociability (S) | consoler (3.00), cohesiveness (3.00), wedding (2.88) | stalker (–3.00), gentrify (–1.75), outcast (–1.80) |
| Warmth (W) | consoler (3.00), cohesiveness (3.00), wedding (2.88) | stalker (–3.00), narcism (–3.00), horrible (–2.78) |
4. Developmental and Applied Insights
4.1 Age-of-Acquisition
Integrating the W/T/S norms with age-of-acquisition ratings (Kuperman et al. 2012) and binning words into High/Neutral/Low at thresholds of ±1.5, developmental analyses show:
- Children disproportionately acquire high-W and high-S words at early ages; the proportion of low-W/S words rises from age 3 to 17, with 50% of acquired W/S words being polar at each age.
- High-T word acquisition remains stable until age 10, declining thereafter as low-T acquisition rises.
- High-C word acquisition peaks near age 10, later decreasing; low-C word acquisition is highest in early years.
These patterns empirically support the primacy of valence and indicate that sociability is acquired before trust during language development.
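The binning step behind these developmental analyses can be sketched as below. It assumes the 1.5 cut-off applies symmetrically (High at or above +1.5, Low at or below -1.5) and that the boundaries are inclusive; the function names, the integer age bucketing, and the toy scores in the usage lines are hypothetical and not taken from the lexicon.

```python
from collections import Counter, defaultdict

THRESHOLD = 1.5  # assumed symmetric cut-off for High / Low


def polarity_bin(score, threshold=THRESHOLD):
    """Map a real-valued score to High, Neutral, or Low."""
    if score >= threshold:
        return "High"
    if score <= -threshold:
        return "Low"
    return "Neutral"


def bin_shares_by_age(scores, age_of_acquisition):
    """Share of High/Neutral/Low words among those acquired at each age.

    scores: dict word -> lexicon score (T, S, or W)
    age_of_acquisition: dict word -> age in years (truncated to an integer)
    """
    counts = defaultdict(Counter)
    for word, score in scores.items():
        if word in age_of_acquisition:
            age = int(age_of_acquisition[word])
            counts[age][polarity_bin(score)] += 1
    shares = {}
    for age, counter in counts.items():
        total = sum(counter.values())
        shares[age] = {
            label: counter[label] / total
            for label in ("High", "Neutral", "Low")
        }
    return shares


# Toy, invented values purely to show the call shape.
toy_scores = {"hug": 2.5, "friend": 2.8, "liar": -2.0, "table": 0.1}
toy_aoa = {"hug": 3.2, "friend": 3.9, "liar": 7.0, "table": 7.5}
print(bin_shares_by_age(toy_scores, toy_aoa))
```

Plotting the High and Low shares against age for each dimension would give curves of the kind summarized in the bullets above.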
4.2 Bias and Stereotype Research
Utilizing both direct lookup and co-occurrence ("co-term") methodologies with large Twitter datasets (Vishnubhotla & Mohammad 2022; Wahle et al. 2025), lexicon analysis reveals established stereotype and bias patterns:
- Social Groups: muslim, jew, immigrant exhibit low direct W; elderly score high on W but low on C; criminal scores very low on W.
- Gender Terms: direct scores show high W for all gender terms; father/mother have high C, grandmother low C. Co-term analysis of tweets: references to "you" use more high-C language, "we" more high-W.
- In-group / Out-group: bilateral analysis of Canadians and Americans finds self-references use higher W/C co-terms, consistent with in-group favoritism.
- Professions: direct scores—engineers, doctors, teachers high C; nurses and teachers high W; jobless very low. Co-term results: "CEO" higher C context than "engineer"; "doctor" co-terms display more low-C language than "nurse," evidencing context sensitivity.
A plausible implication is that the lexicon, when paired with co-term methods, provides a robust foundation for quantitative bias and stereotype investigations in digital discourse.
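A minimal sketch of the co-term idea: for every post that mentions a target term, look up the other words in the lexicon and report what share of them are high or low on the chosen dimension. The function name, the post-level co-occurrence window, and the 1.5 cut-off are assumptions for illustration; the cited studies may define co-terms and thresholds differently. The toy lexicon reuses the Warmth scores listed in Section 3.4.

```python
def co_term_profile(posts, target, scores, threshold=1.5):
    """Share of high- and low-scoring co-terms in posts mentioning `target`.

    posts: iterable of token lists (already lowercased and tokenized)
    scores: dict word -> lexicon score (for example Warmth)
    Returns None if no scored co-term is found.
    """
    high = low = total = 0
    for tokens in posts:
        if target not in tokens:
            continue
        for token in tokens:
            # Skip the target itself and words missing from the lexicon.
            if token == target or token not in scores:
                continue
            total += 1
            if scores[token] >= threshold:
                high += 1
            elif scores[token] <= -threshold:
                low += 1
    if total == 0:
        return None
    return {"high": high / total, "low": low / total, "n": total}


# Warmth scores from the illustrative examples; the posts are invented.
toy_warmth = {"wedding": 2.88, "horrible": -2.78, "stalker": -3.00}
toy_posts = [
    ["the", "wedding", "doctor"],
    ["horrible", "doctor", "visit"],
    ["wedding", "nurse"],
]
print(co_term_profile(toy_posts, "doctor", toy_warmth))
```

Comparing the resulting profiles for two targets (for example two professions) over a large corpus gives the kind of relative contrast reported above; profiles built from a handful of posts are not meaningful.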
5. Practical Integration, Limitations, and Ethical Considerations
5.1 Usage Guidance
- Text scoring: For any text, scores can be assigned to each token for T, S, W, enabling calculation of mean, sum, or "polar" differential aggregates.
- Comparative analysis: Researchers may examine relative differences (e.g., percent increases in high-W words) across temporal or group splits.
- Bias/stereotype investigation: Co-term pairing (Turney 2002; Teodorescu & Mohammad 2023) facilitates measurement of W/C usage around target entities.
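The text-scoring and comparative steps listed above might look like the following sketch. The function names are hypothetical, and the "polar" differential is interpreted here as the share of high-scoring tokens minus the share of low-scoring tokens at an assumed cut-off of 1.5; the source does not define it precisely.

```python
def score_text(tokens, scores, threshold=1.5):
    """Aggregate lexicon scores over the tokens of a text.

    tokens: list of lowercased tokens
    scores: dict word -> lexicon score (T, S, or W)
    Returns None when no token is covered by the lexicon.
    """
    matched = [scores[t] for t in tokens if t in scores]
    if not matched:
        return None
    n_high = sum(s >= threshold for s in matched)
    n_low = sum(s <= -threshold for s in matched)
    return {
        "n": len(matched),
        "sum": sum(matched),
        "mean": sum(matched) / len(matched),
        # Assumed definition: share of high tokens minus share of low tokens.
        "polar_diff": (n_high - n_low) / len(matched),
    }


def percent_change(before, after):
    """Relative change between two aggregate values, in percent."""
    return 100.0 * (after - before) / before


# Warmth scores from the illustrative examples in Section 3.4.
toy_warmth = {"wedding": 2.88, "horrible": -2.78}
print(score_text(["the", "wedding", "was", "horrible"], toy_warmth))
```

In practice the aggregates would be computed over many texts per group or time period and then compared, in line with the caution in Section 5.2 against interpreting single utterances.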
5.2 Limitations and Considerations
- The lexicon covers 26k unigrams, favoring U.S.-centric corpora.
- Scores reflect predominant word senses; specialized or ambiguous terms may require re-annotation.
- Annotator pool is skewed toward U.S., Canada, UK, India—demographic biases are possible.
- Lexicon scores reflect common perceptions (association norms), not objective reality.
- Not suitable for assessing single utterances; reliability requires aggregate analysis over multiple items.
- Scores are context-sensitive; comparative framing is recommended.
- Essentializing speakers should be avoided; focus should be on the use of warmth-related language in context.
All resources are released under terms prohibiting direct redistribution in large training corpora. The lexicon supports interdisciplinary research spanning social cognition, computational bias analysis, digital humanities, and sentiment modeling, and is intended to enrich the quantitative study of linguistic social perception.