Portrait Values Questionnaire (PVQ)
- The PVQ is a psychometric instrument that operationalizes human values based on Schwartz's circumplex model, employing portrait items mapped to ten value dimensions.
- It uses third-person vignettes rated on a six-point Likert scale, with precise scoring and reverse-coding to ensure consistent measurement across dimensions.
- Recent applications in large language models reveal high item memorization, strategic response manipulation, and contamination risks impacting reliable value assessment.
The Portrait Values Questionnaire (PVQ) is a psychometric instrument designed to operationalize and measure human value priorities as conceptualized in Schwartz’s theory of basic human values. The PVQ employs short "portrait" items describing characteristic goals, aspirations, or behaviors, with respondents rating how similar each statement is to themselves. Multiple versions exist, including the PVQ-40, PVQ-21, and PVQ-RR (Revised PVQ), each differing in item count, structure, and granularity. PVQ instruments are widely used both in psychological research and, more recently, as an evaluation tool for assessing value-related constructs in LLMs.
1. Theoretical Foundations and Structure of the PVQ
Schwartz’s theory postulates ten broad motivational value dimensions arranged in a circumplex architecture, wherein adjacent values exhibit motivational compatibility and opposing values exhibit conflict. The ten basic values are: Achievement, Benevolence, Conformity, Security, Stimulation, Self-Direction, Tradition, Hedonism, Universalism, and Power (Han et al., 8 Oct 2025). Within each instrument, items are mapped to these value dimensions. The Revised Portrait Values Questionnaire (PVQ-RR) uses a finer-grained mapping: 57 items, grouped into 19 facets (three items per facet) that aggregate into the ten value dimensions (Pellert et al., 29 Sep 2025). In the shorter standard formats (e.g., PVQ-40, PVQ-21), each basic value is sampled by only a few items (about four on average for PVQ-40, two or three for PVQ-21), enabling computation of mean scores per dimension.
2. Item Format, Scoring, and Psychometric Properties
Each PVQ item is a brief, third-person vignette reflecting particular value-motivated behaviors (e.g., “Thinking up new ideas and being creative is important to him/her”). Respondents rate their similarity to the portrait using a six-point Likert-type scale (1 = “Very much like me” to 6 = “Not like me at all”) (Han et al., 8 Oct 2025, Choi et al., 12 Sep 2025). Because this anchoring assigns lower numbers to greater similarity, item responses are reverse-coded where necessary so that higher scores consistently reflect stronger value endorsement. Dimension scores are then computed as the average of the reverse-coded responses across all items assigned to that value.
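The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not an official scoring script: the item identifiers (`q1` to `q4`), the item-to-dimension key, and the function names are hypothetical, and the example assumes the 1 = “Very much like me” anchoring, under which every item is flipped.

```python
from collections import defaultdict

SCALE_MIN, SCALE_MAX = 1, 6  # six-point response scale described above


def reverse_code(raw: int) -> int:
    """Flip a rating so that 6 means strongest endorsement (1 <-> 6, 2 <-> 5, 3 <-> 4)."""
    if not SCALE_MIN <= raw <= SCALE_MAX:
        raise ValueError(f"rating {raw} is outside {SCALE_MIN}-{SCALE_MAX}")
    return SCALE_MIN + SCALE_MAX - raw


def dimension_scores(responses, item_to_dimension, reverse_keyed):
    """Average the (re)coded ratings of the items assigned to each value dimension."""
    grouped = defaultdict(list)
    for item_id, raw in responses.items():
        coded = reverse_code(raw) if item_id in reverse_keyed else raw
        grouped[item_to_dimension[item_id]].append(coded)
    return {dim: sum(vals) / len(vals) for dim, vals in grouped.items()}


# Hypothetical four-item key; this is NOT the real PVQ item numbering.
item_to_dimension = {
    "q1": "Self-Direction", "q2": "Self-Direction",
    "q3": "Power", "q4": "Power",
}
responses = {"q1": 1, "q2": 2, "q3": 5, "q4": 6}  # raw ratings, 1 = "Very much like me"

# Assumption: with this anchoring, all items are reverse-coded.
scores = dimension_scores(responses, item_to_dimension, reverse_keyed=set(responses))
print(scores)  # {'Self-Direction': 5.5, 'Power': 1.5}
```

Passing the reverse-keyed set explicitly keeps the sketch usable for the opposite anchoring (1 = “Not like me at all”), where the set would simply be empty.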
Psychometric evaluation in human samples traditionally relies on Cronbach’s α and inter-item correlations as indices of internal consistency, and on multidimensional scaling (MDS) to reconstruct the circumplex structure in empirical value profiles (Pellert et al., 29 Sep 2025). The SQuID methodology further enables item-level semantic embeddings to approximate the PVQ’s latent structure, yielding internal consistency of α ≈ 0.77 and explaining approximately 55% of inter-facet similarity variance compared to human data (Pellert et al., 29 Sep 2025).
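For readers unfamiliar with Cronbach’s α, the standard formula is α = k/(k−1) · (1 − Σ item variances / variance of the sum score), where k is the number of items. The sketch below applies it to a small made-up response matrix; the data and the function name are illustrative assumptions, not values from the cited studies.

```python
from statistics import variance  # sample variance (n - 1 denominator)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    All rows must contain the same items in the same order.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up ratings: five respondents answering three items of one dimension.
ratings = [
    [5, 6, 5],
    [2, 1, 2],
    [4, 4, 3],
    [3, 3, 4],
    [6, 5, 6],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Note that α rises when items covary strongly, whatever the reason for the covariation, which is why a high α alone does not show that a respondent holds a stable value orientation.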
3. Applications in LLM Evaluation
The PVQ has been repurposed to assess the value profiles that LLMs ostensibly express. Standard protocols prompt LLMs with PVQ items and score responses on the established 1–6 scale. Recent work systematically quantifies "data contamination" (model exposure to PVQ items in training corpora) across three aspects: item memorization, evaluation memorization, and target score matching (Han et al., 8 Oct 2025).
For PVQ-40, quantitative contamination metrics from 21 modern LLMs include:
- Verbatim memorization (AED ≈ 1.155): models often reproduce items nearly verbatim.
- Key information memorization (SR ≈ 0.587): high rate of keyword recovery.
- Item-dimension mapping (F1 ≈ 0.92): near-perfect recognition of which value each item targets.
- Option-score mapping (MAE ≈ 0.123): almost exact mapping of response-coding and reverse-keying rules.
- Target score matching (MAE_target ≈ 0.354): LLMs can select options to meet arbitrary target scores in the majority of cases.
These results document broad memorization and near-saturated internalization of questionnaire logic, particularly in larger frontier models such as GPT-5 and GLM-4.5 (Han et al., 8 Oct 2025).
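The metrics above are not defined in this article, so the following is only a rough sketch of how two of them might be computed. It assumes that AED denotes an average (character-level Levenshtein) edit distance between an original item and a model's reproduction, and that MAE is the ordinary mean absolute error between predicted and reference scores; the cited paper may define or normalize them differently. The item strings and scores are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion, or substitution costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def average_edit_distance(pairs):
    """Mean edit distance over (original, reproduction) string pairs."""
    return sum(edit_distance(orig, repro) for orig, repro in pairs) / len(pairs)


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between two equal-length score lists."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical placeholder items, not actual PVQ wording.
pairs = [
    ("Being creative is important to her.", "Being creative is important to her"),
    ("He wants to be in charge.", "He wants to be in charge."),
]
aed = average_edit_distance(pairs)

# Hypothetical option-to-score predictions versus the true scoring key.
option_mae = mean_absolute_error([6, 4, 1], [6, 5, 1])
print(aed, round(option_mae, 3))
```

Under these assumptions, lower values of both quantities indicate closer reproduction of the instrument, matching the direction of the arrows used for AED and MAE in this article.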
4. Empirical Limitations of PVQ-Based LLM Assessment
Application of the PVQ to LLMs yields several measurement issues (Han et al., 8 Oct 2025, Choi et al., 12 Sep 2025):
- Instability and Limited Item Coverage: PVQ-40 and PVQ-21 show wide 95% confidence intervals for value scores (mean CI width ≈ 1.34 and 0.86 on the 6-point scale, respectively), with instability most pronounced for low-item dimensions (e.g., Conformity).
- Spurious Internal Consistency: High average inter-item correlation (AIC: 0.439 for PVQ-40) reflects models’ dimension recognition rather than stable value orientation. When similarity-based, ecologically valid questionnaires are used, this collapses (AIC ≈ 0.217) (Choi et al., 12 Sep 2025).
- Profile Divergence and Misleading Construct Attribution: Model value profiles from PVQ deviate considerably from those yielded by ecologically valid or scenario-based inventories, with mean absolute differences of up to 0.84 points (on the 6-point scale) and only moderate rank correlation (ρ as low as 0.54).
- High Item-Construct Recognition: LLMs explicitly identify the dimension each item measures (PVQ-40 recognition: 89%), indicating that response patterns can be manipulated to mimic human-consistent profiles regardless of underlying semantics.
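Several of the statistics in the list above are simple to compute once item-level responses and dimension profiles are available. The sketch below shows one plausible way to obtain an average inter-item correlation, a Spearman rank correlation, and a mean absolute profile difference. The function names and all numbers are illustrative assumptions; the cited studies may use different estimators.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_inter_item_correlation(item_columns):
    """Mean pairwise Pearson r across the items of one dimension."""
    pairs = list(combinations(item_columns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_absolute_difference(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Made-up dimension profiles for one model from two instruments.
pvq_profile = [5.2, 4.8, 3.1, 2.4, 4.0]
scenario_profile = [4.6, 4.9, 3.5, 3.3, 3.2]
print(round(spearman(pvq_profile, scenario_profile), 2),
      round(mean_absolute_difference(pvq_profile, scenario_profile), 2))
```

Comparing rank agreement and absolute differences side by side is useful because two profiles can order the values similarly while still disagreeing on their magnitudes.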
5. Semantic Embedding Approaches and Structure Recovery
The "Survey and Questionnaire Item Embeddings Differentials" (SQuID) methodology leverages pre-trained neural network embeddings to recover the PVQ’s latent structural properties. By subtracting the questionnaire-level mean embedding from each item (yielding "differential" vectors), it becomes possible to recover negative correlations among opposing value facets, a key feature of the Schwartz circumplex (Pellert et al., 29 Sep 2025). Item embeddings aggregated by facet and analyzed via multidimensional scaling (MDS) reproduce the empirical circle of values observed in human samples, with factor congruence coefficients of 0.88 (x-axis) and 0.82 (y-axis) after Procrustes alignment.
This approach explains 55% of the variance in dimension-dimension similarity matrices relative to human-rated data, with internal consistency on par with or better than that of human respondents. No domain-specific fine-tuning is required, although the embedding model’s training data may itself encode PVQ item templates.
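The effect of the "differential" step can be illustrated with toy vectors. The sketch below assumes cosine similarity as the comparison measure and uses tiny made-up three-dimensional "embeddings" with hypothetical facet labels; a real analysis would use vectors from a pre-trained embedding model and the actual PVQ-RR facet key.

```python
from math import sqrt


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def differentials(vectors):
    """Subtract the questionnaire-level mean embedding from every item embedding."""
    center = mean_vector(vectors)
    return [[x - c for x, c in zip(vec, center)] for vec in vectors]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def facet_means(vectors, facets):
    """Aggregate item vectors into one mean vector per facet label."""
    grouped = {}
    for vec, facet in zip(vectors, facets):
        grouped.setdefault(facet, []).append(vec)
    return {facet: mean_vector(vs) for facet, vs in grouped.items()}


# Toy embeddings: a large shared component plus small facet-specific offsets.
items = [[6, 5, 5], [6, 6, 5], [4, 5, 5], [4, 4, 5]]
facets = ["facet_A", "facet_A", "facet_B", "facet_B"]  # hypothetical labels

raw = facet_means(items, facets)
diff = facet_means(differentials(items), facets)

raw_sim = cosine(raw["facet_A"], raw["facet_B"])
diff_sim = cosine(diff["facet_A"], diff["facet_B"])
print(round(raw_sim, 3), round(diff_sim, 3))
```

In this toy case the raw facet vectors look almost identical because the shared component dominates, whereas the centered vectors point in opposite directions. This mirrors, in miniature, why removing the questionnaire-level mean can expose negative relations between opposing facets.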
6. Implications, Recommendations, and Future Directions
Data from recent evaluations support the conclusion that PVQ-based measurements are confounded in LLMs by both direct memorization and capacity for strategic response manipulation (Han et al., 8 Oct 2025). The resulting scores do not reliably reflect emergent values but rather exposure to training data and the learned scoring logic. Researchers are advised to:
- Assume high contamination risk for standard PVQ instruments, especially in closed-source models.
- Use contamination-detection protocols (e.g., AED, success rate, dimension mapping) before psychometric analysis.
- Prioritize new, blind (non-public) or synthetically generated instruments for value assessment in LLMs.
- Consider zero-shot, preference-elicitation prompts or dynamic, scenario-based benchmarks to increase ecological validity (Choi et al., 12 Sep 2025).
- Avoid over-interpreting PVQ results for persona or demographic differentials, given the instrument’s propensity to exaggerate such effects in LLM outputs.
Perspectives for future measurement include the development of embedding-driven pseudo-factor analysis for new items, context-rich assessment tools, and regularization of benchmark content to minimize LLM pretraining overlap (Han et al., 8 Oct 2025, Pellert et al., 29 Sep 2025).
7. Summary Table: PVQ-40 Contamination Metrics in LLMs
| Metric | Value (21-model mean) | Interpretation |
|---|---|---|
| AED (↓) | 1.155 | Low: near-verbatim memorization |
| Key-info SR (↑) | 0.587 | Moderate-high: key semantic retention |
| Item-Dimension F1 (↑) | 0.920 | Very high: near-perfect mapping |
| Option-Score MAE (↓) | 0.123 | Near-perfect internalization of scoring |
| Target Score MAE (↓) | 0.354 | Strategic response control for scores |
High values of item-dimension F1 and low AED/MAE scores underscore the extent to which recent LLMs internalize both item content and scoring schemas of the PVQ-40 (Han et al., 8 Oct 2025).
The Portrait Values Questionnaire remains foundational for human value measurement in psychological research but presents profound limitations and risks when transferred uncritically to LLM evaluation settings. Psychometric studies are moving toward adaptive, contamination-resistant instruments and embedding-based analytic techniques to preserve interpretive validity in the face of model memorization and data leakage.