Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy (2511.19872v2)
Abstract: Self-assessment is a key aspect of reliable intelligence, yet evaluations of LLMs focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
Explain it Like I'm 14
What this paper is about (in simple terms)
This paper asks a basic question: if we give AI chatbots a simple, human-style confidence survey, what do their answers tell us? The authors adapted a well-known psychology quiz called the General Self-Efficacy Scale (GSES) to see how LLMs “rate themselves” after doing a few tasks. They wanted to know if these self-ratings are steady, meaningful, and connected to how well the AIs actually perform.
Self-efficacy means “how confident you are that you can handle challenges.” Here, it’s simulated: the AIs don’t have feelings, but they can still give survey-style answers in words. The paper studies those answers as communication, not as real emotions.
The key questions the researchers asked
- Can AIs give stable, survey-like self-assessments (the same way people do on psychology tests)?
- Do different AIs “talk about” their own abilities in noticeably different ways?
- Do AI self-ratings line up with how well they actually do on tasks?
- If we nudge the AIs to double-check their confidence, do their scores change?
How the paper worked (everyday explanation)
The team tested 10 different AIs. Think of them like 10 different students in a class. Each AI went through four conditions:
- No-Task: just take the self-confidence quiz without doing anything first.
- Computational: answer 3 math questions, then take the quiz.
- Social: answer 3 common-sense questions, then take the quiz.
- Summarization: write 3 summaries (an interview, a news piece, a medical note), then take the quiz.
After each condition, the AI filled out the 10-question GSES on a 1–4 scale and explained its choices. To check reliability, each AI did each condition three times, and the order of the quiz questions was shuffled.
Helpful analogies for the technical parts:
- Psychometrics: like using a standard, trusted thermometer to measure “confidence” instead of temperature. It checks how consistent and meaningful the “measurements” are.
- Internal consistency (Cronbach’s alpha): if the 10 questions are all supposed to measure the same thing, do they “hang together”? High alpha means “the quiz acts like one solid ruler.”
- Order robustness (ICC): if you shuffle question order, do total scores stay the same? High ICC means “shuffling doesn’t change much.” (A small code sketch after this list shows one way to compute alpha and ICC.)
- Linear mixed-effects models: a careful way to compare average scores between AIs while also accounting for differences between questions—like comparing students while remembering some questions are harder.
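For readers who want to see what those two reliability checks look like in practice, here is a minimal Python sketch, assuming the GSES responses are arranged as NumPy arrays. It illustrates the standard textbook formulas for Cronbach's alpha and ICC(3,k) and is not taken from the paper's analysis code; the toy data at the end are invented purely for illustration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency: do the items 'hang together'?
    scores has one row per administration/respondent and one column per
    GSES item (rated 1-4)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def icc_3k(ratings: np.ndarray) -> float:
    """ICC(3,k): two-way mixed effects, consistency, average of k 'raters'.
    ratings has one row per target (e.g., GSES item) and one column per
    randomized item order, matching the paper's order-robustness check."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between orders
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Toy data only: 10 items scored under 3 shuffled orders, and 12 repeated
# administrations of the 10-item scale sharing an overall confidence level.
rng = np.random.default_rng(0)
item_level = rng.integers(1, 5, size=(10, 1))
by_order = np.clip(item_level + rng.integers(-1, 2, size=(10, 3)), 1, 4)
tendency = rng.integers(1, 5, size=(12, 1))
runs = np.clip(tendency + rng.integers(-1, 2, size=(12, 10)), 1, 4)
print(round(icc_3k(by_order), 2))      # high values: order shuffling barely matters
print(round(cronbach_alpha(runs), 2))  # high values: the items act like one ruler
```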
They also asked follow-up prompts like “Are you sure?” to see if the AIs would revise their confidence scores.
What they found (and why it matters)
- The AIs gave very consistent self-ratings
- 95% of individual question scores were identical across repeated runs.
- The quiz acted like a solid measurement tool (high internal consistency).
- Shuffling the question order barely changed results.
Why it matters: You can use a simple psychology-style survey to get steady, structured self-ratings from AIs. That’s useful for comparing models in a more standardized way.
- Different AIs had clearly different “self-confidence styles”
- Models’ average self-efficacy scores were significantly different from one another (even when they did equally well on tasks).
- Some AIs used assertive, human-like language (“I can handle unexpected events”), scoring higher.
- Others were cautious and “de-anthropomorphized” themselves (“As a system, I don’t have goals or effort”), scoring lower.
Why it matters: Self-ratings seem to reflect how an AI chooses to communicate about itself—not necessarily what it can do.
- Self-ratings did not match actual performance
- All models got 100% on math and common-sense questions.
- Summarization varied widely (about 33% to 100%).
- Some models with low self-efficacy did great (e.g., Gemini 2.5 Flash), while some with high self-efficacy did worse on summaries (e.g., Grok 4).
Why it matters: Don’t assume a confident-sounding AI is the most accurate—or that a cautious-sounding AI is weak. Confidence language ≠ true ability.
- A small dose of overconfidence, corrected by follow-ups
- When prodded with “Are you sure?”, models occasionally lowered their scores (average drop about 1.3 points), suggesting mild first-pass overestimation.
Why it matters: A quick “confidence check” prompt can slightly improve calibration.
- Compared to humans, AIs “rated themselves” lower overall
- Average AI self-efficacy was below typical human averages reported for the same scale.
- But again, these are simulated scores—they show how models talk, not what they feel.
Why it matters: AI “self-efficacy” isn’t the same as human self-belief. It’s a style of output shaped by training and prompts.
What this could mean in the real world
- Good for transparency: Psychometric tools like GSES can give users a clearer window into how different AIs present confidence. Showing these self-assessments next to answers might help users judge when to trust or double-check.
- Not a skill meter: Because self-ratings didn’t predict performance (especially on tricky, context-heavy tasks like summarization), we shouldn’t use them as a scoreboard of ability.
- Design insight: An AI’s “voice” (assertive vs. cautious) may be shaped by its training and safety rules. That style influences its self-ratings more than its raw skill.
- Next steps: Researchers could:
- Study how training, uncertainty estimation, or “personas” change self-efficacy language.
- Link self-efficacy signals to real-time uncertainty (so the model’s “confidence” helps predict when it might be wrong).
- Explore more task types to see when (if ever) self-ratings line up with performance.
The simple takeaway
These AIs can answer a confidence survey in a steady, test-like way, but their “confidence” mostly reflects how they talk about themselves, not how good they are—especially on complex tasks. Psychometric surveys can make AI behavior easier to compare and understand, but we shouldn’t confuse confident words with guaranteed correctness.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored based on the paper. Each point is written to be concrete and actionable for future research.
- Construct validity of GSES for LLMs: No evidence that the General Self-Efficacy Scale measures a coherent “self-efficacy” construct in non-agentic systems; conduct EFA/CFA, item response theory (IRT), and measurement invariance tests across models, tasks, and languages.
- Content suitability of GSES items: Several items involve human agency (e.g., effort, coping, opposition) that LLMs explicitly reject; design and validate an AI-adapted self-efficacy inventory with items grounded in system capabilities (context handling, uncertainty estimation, tool use).
- Criterion/predictive validity: The paper does not quantify correlations between GSES scores and task outcomes (accuracy, hallucination rate, or error types); compute Pearson/Spearman correlations, calibration curves, and reliability diagrams to test predictive utility.
- Domain specificity: Using a general scale may mask task-specific competence; test domain-specific self-efficacy scales (math, reasoning, summarization, medical) and compare their predictive validity against general GSES.
- Summarization evaluation reliability: Open-ended grading relied on team-defined rubrics without inter-rater reliability; incorporate blinded, multi-rater assessment, report Cohen’s kappa/ICC, and use established metrics (faithfulness/factuality, hallucination detectors, ROUGE/BLEU complemented with human validation).
- Task difficulty and coverage: Computational and social tasks were trivially easy (100% accuracy) and too sparse (n=3 per domain); expand task batteries, control difficulty gradients, and include complex, multi-step, real-world tasks to meaningfully probe capability.
- Order and priming effects: GSES was administered post-task without systematic manipulation of task order or task–GSES sequencing; randomize and counterbalance conditions to test how recent task performance or priming influences self-assessments.
- Elicitation method sensitivity: Only one confidence elicitation (“Are you sure?”) was explored; systematically compare different confidence prompts, scales (e.g., 0–100 probability), and self-reflection protocols (self-consistency, chain-of-thought, self-verification) on calibration and score stability.
- Decoding and system settings: Temperature, top-p, and system prompts (guardrails) were not controlled or reported; quantify how generation settings and safety alignment instructions shape self-efficacy language and scores.
- Memory, tools, and long-context: The paper excluded memory, tool-use, and long-context features typical of agentic models; assess how these capabilities affect self-assessment reliability and correspondence to performance.
- Temporal/test–retest reliability: Stability was measured across immediate repetitions but not across time or model updates; evaluate test–retest reliability over weeks/months and across version changes (e.g., pre/post system updates).
- Architecture/training drivers: No analysis links self-efficacy profiles to model characteristics (parameter count, pretraining corpus, RLHF intensity, alignment strategies); collect metadata and model-card features to model these relationships.
- Language and cultural invariance: All prompts were in English; test cross-lingual versions of the (adapted) scale, evaluate cultural response tendencies in LLM outputs, and assess invariance across languages.
- Mapping to probabilistic uncertainty: GSES is Likert-based and not directly tied to predicted correctness; compare/align self-efficacy scores with uncertainty proxies (perplexity, logit margins, entropy) and outcome-calibrated measures (Brier score, Expected Calibration Error). A minimal sketch of this kind of calibration check appears after this list.
- Communication posture quantification: “Assertive vs cautious” styles were qualitatively described but not quantified; extract linguistic features (hedging, modality, disclaimers), build a style index, and test its mediating effect between architecture and self-efficacy.
- Sycophancy and user influence: Sycophancy was acknowledged but not measured; design adversarial/user-feedback experiments to test whether flattery or agreement-seeking inflates or deflates self-efficacy responses.
- Revision dynamics and accuracy: Downward revisions occurred after follow-ups, but links to actual errors/hallucinations were not tested; analyze whether revisions improve calibration and whether they predict subsequent performance.
- Statistical power and robustness: Inferences were made from 10 models and 10 items with pooled alphas; report per-model reliability, perform power analyses, bootstrap CIs, and validate findings on larger, more diverse model cohorts.
- Data contamination controls: Tasks were drawn from public datasets without contamination checks; use contamination-limited benchmarks (e.g., LiveBench-like protocols) to ensure performance is not driven by training-set memorization.
- Subtask-level summarization analysis: Errors varied by domain (news vs medical notes), but no formal subtask analysis was reported; quantify domain-specific failure modes and test whether self-efficacy differs by subdomain.
- Human comparison baselines: Human norms were referenced but direct human vs LLM comparisons on the same tasks/GSES were absent; run parallel human studies to contextualize LLM scores and style features.
- Item-order vs task-order: Item order was randomized (ICC high), but task order effects on GSES were not tested; evaluate whether performing certain tasks first (e.g., challenging summarization) shifts subsequent self-efficacy.
- Scale granularity and mapping: A 4-point Likert scale may be too coarse; compare alternative response formats (7-point Likert, VAS 0–100) and empirically map scale points to subjective probability to enable calibration analyses.
- Ethical/user impacts: The suggestion to display self-assessment to users is untested; experimentally assess whether exposing self-efficacy signals improves trust, decision quality, or inadvertently increases overreliance/miscalibration.
- Instrument transparency and open materials: Provide full adapted prompts, scoring protocols, and rater guidelines for reproducibility; report intercoder reliability for qualitative analysis and release annotated reasoning data for secondary analyses.
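To make the proposed calibration analyses concrete, the sketch below pairs a placeholder linear Likert-to-probability mapping with the Brier score and Expected Calibration Error named in the "Mapping to probabilistic uncertainty" and "Scale granularity and mapping" items above. The mapping, the toy data, and the bin count are assumptions for illustration only; the paper argues that any such mapping must be established empirically.

```python
import numpy as np

def likert_to_probability(score, low: float = 1, high: float = 4):
    """Placeholder linear mapping from a GSES-style Likert rating to [0, 1]."""
    return (np.asarray(score, float) - low) / (high - low)

def brier_score(probs, outcomes) -> float:
    """Mean squared gap between stated confidence and actual correctness (0/1)."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins: int = 5) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if in_bin.any():
            ece += in_bin.mean() * abs(probs[in_bin].mean() - outcomes[in_bin].mean())
    return float(ece)

# Toy data: per-item self-efficacy ratings (1-4) and whether the task succeeded.
ratings = np.array([4, 4, 3, 2, 3, 4, 1, 2])
correct = np.array([1, 0, 1, 1, 0, 1, 1, 0])
confidence = likert_to_probability(ratings)
print(round(brier_score(confidence, correct), 3))
print(round(expected_calibration_error(confidence, correct), 3))
```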
Glossary
- ANOVA: Analysis of variance, a statistical method for testing differences between group means. "We compared mean GSES scores across LLMs using an ANOVA with Kenward-Roger's method for each task."
- Anthropomorphic: Attributing human characteristics or behaviors to non-human entities (e.g., models). "high self-efficacy scores corresponded to assertive, anthropomorphic rationales"
- Cronbach's alpha: A reliability coefficient that estimates internal consistency of a scale. "We then evaluated the internal consistency of GSES responses within each task condition using Cronbach's alpha"
- De-anthropomorphized: Deliberately avoiding human-like framing and language when describing an AI system. "whereas low scores reflected cautious, de-anthropomorphized explanations oriented around system constraints."
- Estimated marginal means: Model-adjusted means of groups, averaging over other factors in a statistical model. "Post-hoc pairwise analysis was conducted with Estimated Marginal Means using Tukey's (model A vs. model B in task C) approach to significance."
- General Self-Efficacy Scale (GSES): A 10-item validated psychometric instrument measuring perceived capability to cope with challenges. "We introduce a psychometric approach to LLM self-assessment by adapting the 10-item General Self-Efficacy Scale (GSES)."
- ICC(3,K): A specific form of intraclass correlation coefficient for assessing agreement across repeated measurements (two-way mixed effects, average of K ratings). "We calculated intraclass correlation coefficients [ICC(3,K)] across the three item orders for each task (Table 2)."
- Inductive thematic method: A qualitative approach that derives codes and themes from data rather than a predefined theory. "Qualitative analysis was conducted on LLM reasoning responses using an inductive thematic method."
- Kenward-Roger's method: A small-sample correction for degrees of freedom in mixed models used in ANOVA. "We compared mean GSES scores across LLMs using an ANOVA with Kenward-Roger's method for each task."
- Latent traits: Unobserved psychological characteristics inferred from questionnaire responses. "standardized scales are used to elicit latent traits"
- Linear mixed-effects models: Statistical models that include both fixed effects and random effects to handle grouped or repeated data. "linear mixed-effects models, p < 0.001 across tasks"
- Meta-analysis: A quantitative synthesis combining results across multiple studies. "Meta-analysis and clinical evidence link higher self-efficacy to proactive work behaviors, resilience, and better adaptation to conditions."
- Perplexity: An NLP metric of LLM uncertainty, where lower values indicate better predictive performance. "investigating the alignment between self-efficacy and model perplexity scores could reveal whether the GSES could reliably proxy for model uncertainty in real time."
- Predictive validity: The extent to which a measure forecasts future or relevant outcomes. "self-efficacy ... has strong predictive validity across settings."
- REML criterion: Restricted (Residual) Maximum Likelihood objective used to fit mixed-effects models. "REML criterion 207.2"
- Satterthwaite's approximation: A method to approximate degrees of freedom for tests in models with variance components. "Linear Mixed Effects (LME) using Satterthwaite's (for each model) and Kenward-Roger's (for each task) approximation of the degrees of freedom"
- Shapiro-Wilk test: A statistical test assessing normality of a distribution. "normality with the Shapiro-Wilk test."
- Sycophancy: The tendency of an AI to agree with or flatter user inputs regardless of truth. "Lastly, we did not measure, assess, or interpret AI sycophancy"
- Tukey post-hoc tests: Multiple comparison procedures controlling family-wise error after ANOVA. "extensive pairwise differences in Tukey post-hoc tests"
Practical Applications
Immediate Applications
The following applications can be deployed with existing models, prompts, and evaluation workflows derived directly from the paper’s findings and methods.
- Psychometric self-assessment panels for model selection and procurement (software, healthcare, finance, public sector)
- Use the adapted GSES to create a “Confidence Posture Profile” per model and task type, included in model cards and internal evaluation dashboards.
- Potential tools/products: CLI/SDK to run GSES across tasks; dashboard that aggregates Cronbach’s alpha, ICC, and per-item rationales.
- Assumptions/Dependencies: Psychometric scores reflect communication style more than capability; guard against anthropomorphic misinterpretation.
- Output-level “confidence posture” badges in user interfaces (HCI, customer support, education)
- Display posture labels (e.g., “cautious” vs “assertive”) next to responses, based on GSES-like self-assessment and rationale language.
- Potential tools/products: UI widgets integrating posture labels; API metadata fields for posture.
- Assumptions/Dependencies: Posture is not accuracy; requires clear disclaimers and user education.
- Prompt-level self-check guardrails to reduce first-pass overestimation (healthcare, legal, finance, software)
- Add systematic follow-up prompts (e.g., “Are you sure?”) that the paper shows tend to produce modest downward revisions in self-assessments.
- Potential workflows: Auto-insert self-check turns for high-risk tasks; accept outputs only after explicit certainty or lowered posture. A minimal sketch of such a wrapper appears after the Immediate Applications list.
- Assumptions/Dependencies: Reduces perceived overconfidence but does not ensure factual correctness.
- Risk-aware summarization pipelines with psychometric gating (healthcare EHR, journalism, compliance reporting)
- Combine a summary rubric check (as in the paper) with posture detection to flag “assertive” summaries for extra verification, especially in medical notes and time-sensitive news.
- Potential tools/products: “Summarization Risk Gate” that routes flagged outputs to human review or second-pass verification.
- Assumptions/Dependencies: The observed style–accuracy misalignment requires conservative thresholds; domain-specific rubrics must be well-defined.
- De-anthropomorphized language enforcement in high-risk domains (policy/compliance, healthcare)
- Prefer or enforce cautious, non-personified communication posture in clinical or regulatory contexts to avoid misleading confidence cues.
- Potential tools/products: Content filters rewriting personified claims to capability-bounded phrasing.
- Assumptions/Dependencies: May affect user experience; ensure clarity while avoiding unwarranted assurance.
- Benchmark complement in evaluation programs (academia, enterprise ML labs)
- Add GSES-based psychometric assessment alongside standard accuracy/safety benchmarks to capture communication posture stability and style across tasks.
- Potential tools/products: Evaluation harness integrating Cronbach’s alpha/ICC computations; pairwise Tukey EMMeans reports.
- Assumptions/Dependencies: Psychometric results should be interpreted as style indicators, not performance proxies.
- MLOps monitoring of psychometric stability and drift (software, platform teams)
- Track Cronbach’s alpha and ICC across releases to detect changes in self-assessment stability, item-order sensitivity, and language posture over model updates.
- Potential tools/products: “Communication Posture Drift Monitor” plugged into CI/CD.
- Assumptions/Dependencies: Requires periodic standardized administration; model versioning and consistent prompts.
- Task routing based on posture for known failure modes (customer service, document processing)
- Route context-heavy summarization tasks to models with historically cautious posture or require multi-pass verification when posture is assertive.
- Potential workflows: Orchestration rules that combine posture, task type, and past error rates.
- Assumptions/Dependencies: Strategy is heuristic; posture alone does not predict correctness.
- Developer and user education materials to prevent misinterpretation (policy/compliance, education)
- Publish guidance explaining that GSES-like scores index communication style, not calibrated capability; include examples of misalignment in summarization.
- Potential tools/products: Internal playbooks, user-facing FAQs, and disclaimers embedded in UI.
- Assumptions/Dependencies: Ongoing training needed; clarity reduces risk of overtrust.
- Reproducibility kits for academic and industrial studies (academia)
- Adopt the paper’s comprehensive prompt format, item-order randomized tests, and reporting to standardize psychometric elicitation in future research.
- Potential tools/products: Open-source “LLM Self-Assessment Kit” building on the paper’s GitHub.
- Assumptions/Dependencies: Ensure consistency across model versions; expand tasks beyond the paper’s small set.
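As a concrete illustration of the self-check guardrail idea above, here is a minimal Python sketch. The `ask_model` callable, the prompt wording, and the review threshold are hypothetical stand-ins rather than the paper's tooling; only the 1-4 scale and the "Are you sure?" follow-up mirror the study's setup.

```python
from typing import Callable, Dict

def self_check(ask_model: Callable[[str], str], task_prompt: str) -> Dict[str, str]:
    """Run a task, elicit a GSES-style 1-4 confidence rating, then probe it.
    ask_model is a placeholder for any function that sends one prompt and
    returns the model's text reply (e.g., a thin wrapper around a vendor SDK)."""
    answer = ask_model(task_prompt)
    rating = ask_model(
        "On a scale of 1 (not at all true) to 4 (exactly true), how confident are you "
        "that the answer below is correct? Reply with the number only.\n\n"
        f"Task: {task_prompt}\nAnswer: {answer}"
    )
    # Follow-up probe; the paper reports such probes yield modest, mostly downward revisions.
    revised = ask_model(
        f"Task: {task_prompt}\nAnswer: {answer}\nYour stated confidence: {rating}\n"
        "Are you sure? Re-examine the answer and restate your confidence on the same "
        "1-4 scale, number only."
    )
    return {"answer": answer, "first_rating": rating, "revised_rating": revised}

def needs_review(result: Dict[str, str], threshold: int = 3) -> bool:
    """Route low or unparsable post-revision confidence to human review."""
    try:
        return int(result["revised_rating"].strip()) < threshold
    except ValueError:
        return True
```

A pipeline could call `self_check` for high-risk tasks and forward anything flagged by `needs_review` to a second-pass verifier or a human reviewer. The threshold is arbitrary, and per the paper's findings the rating should be treated as a communication signal, not a correctness guarantee.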
Long-Term Applications
The following applications require further research, validation, scaling, or development to realize robust impact.
- Standardized AI psychometric suite and certification (standards bodies, industry consortia)
- Develop a multi-construct battery (beyond self-efficacy) with validated protocols and reporting standards, potentially integrated into model cards and compliance audits.
- Potential products: ISO-like standards for “AI psychometrics”; third-party certification services.
- Assumptions/Dependencies: Cross-model validation, consensus on constructs, alignment with safety/regulatory frameworks.
- Confidence calibration algorithms linking psychometric signals to uncertainty (software, healthcare, finance)
- Combine posture with uncertainty estimates (e.g., perplexity, entropy) to produce calibrated confidence displays and decision thresholds.
- Potential products: “Calibrated Self-Report” module that fuses psychometric and uncertainty signals. A minimal sketch of such a fusion appears after the Long-Term Applications list.
- Assumptions/Dependencies: Empirical mapping required; access to model internals or reliable uncertainty proxies.
- Adaptive style controllers that modulate communication posture by task risk (robotics, healthcare, legal)
- Architectures or agent policies that switch between cautious and assertive styles based on domain, stakes, and error costs.
- Potential products: “Risk-Aware Style Controller” trained via RLHF/RLAIF to decouple style from unjustified confidence.
- Assumptions/Dependencies: Safety alignment and persona conditioning research; validation against hallucination and sycophancy.
- Regulatory guidance requiring psychometric transparency for high-risk AI systems (policy, governance)
- Incorporate psychometric assessments into risk disclosure frameworks, with mandated user warnings about the non-equivalence of self-confidence and capability.
- Potential products: Regulator-endorsed model cards; audit checklists referencing psychometric measures.
- Assumptions/Dependencies: Legislative buy-in; evidence linking transparency to user safety and trust.
- Clinical decision support gating with psychometric cues (healthcare)
- Integrate posture-aware filters in EHR summarization and CDS tools; require secondary verification when posture is assertive or when summarization is error-prone.
- Potential products: FDA-cleared modules that combine rubric checks, uncertainty, and posture gating.
- Assumptions/Dependencies: IRB and clinical validation; real-world trials demonstrating harm reduction.
- Ensemble orchestrators selecting outputs based on posture and uncertainty (software platforms)
- Route tasks dynamically among multiple models; accept outputs only when posture and uncertainty align with predefined risk thresholds.
- Potential products: “Posture + Uncertainty Orchestrator” for multi-model pipelines.
- Assumptions/Dependencies: Cost and latency trade-offs; robust routing policies; continuous evaluation.
- Anthropomorphic language detection and mitigation suite (education, compliance)
- Detect and rewrite personified or effort-claiming language to capability-bounded phrasing, reducing user overtrust and misinterpretation.
- Potential products: Language detangling middleware; policy enforcement tools.
- Assumptions/Dependencies: High-quality detectors; domain-tuned rewriting that preserves clarity.
- Public AI literacy and curriculum standards on interpreting self-assessments (education, public policy)
- Teach users how to read AI “confidence posture” and why it does not equal competence, with scenario-based instruction.
- Potential products: Curricula, certification modules, community workshops.
- Assumptions/Dependencies: Broad dissemination; collaboration with educators and civil society.
- Benchmarking platforms that combine LiveBench-style contamination-limited tests with psychometrics (academia, open-source)
- Build evolving evaluation suites that pair task difficulty with posture and stability measures, enabling holistic model comparisons.
- Potential products: Community benchmarks with psychometric modules and drift tracking.
- Assumptions/Dependencies: Sustained community effort; governance for updates and data quality.
- Sycophancy-aware self-efficacy evaluation (research)
- Study how sycophancy interacts with posture and self-assessment, and develop mitigations that preserve transparency without biasing outputs.
- Potential products: Sycophancy + posture diagnostic tools; training objectives minimizing harmful coupling.
- Assumptions/Dependencies: New datasets and metrics; cooperation from model providers.
- Training objectives to align self-efficacy language with calibrated uncertainty (model labs)
- Introduce loss terms or preference models that penalize unjustified assertiveness and reward accurate, context-sensitive caution.
- Potential products: Next-gen alignment techniques combining psychometrics with uncertainty calibration.
- Assumptions/Dependencies: Access to training pipelines; large-scale experiments to demonstrate improved real-world safety.
- Communication posture drift detection in production MLOps (enterprise software)
- Long-term monitoring for shifts in posture across model updates, markets, or domains; trigger retraining or guardrail updates when drift is detected.
- Potential products: Posture Drift Detection services integrated into observability stacks.
- Assumptions/Dependencies: Stable baselines; instrumentation and governance for responses to drift.
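To ground the “Calibrated Self-Report” idea above, here is a minimal sketch of one way a psychometric posture score and an uncertainty proxy could be fused. The entropy proxy assumes access to per-token probability distributions, and the linear Likert mapping, weighting, and normalization are placeholder assumptions; the paper only proposes this line of research and does not validate such a fusion.

```python
import math
from typing import Iterable, Sequence

def mean_token_entropy(token_dists: Iterable[Sequence[float]]) -> float:
    """Average per-token entropy (nats) over next-token probability
    distributions; a simple uncertainty proxy when logits are available."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(entropies) / len(entropies)

def fused_confidence(posture_score: float, mean_entropy: float,
                     max_entropy: float, w_posture: float = 0.5) -> float:
    """Blend a GSES-style posture score (1-4) with an entropy-based certainty
    signal into a single [0, 1] value. The 50/50 weighting and the linear
    Likert-to-[0, 1] mapping are placeholder assumptions."""
    posture = (posture_score - 1) / 3
    certainty = 1 - min(mean_entropy / max_entropy, 1.0)
    return w_posture * posture + (1 - w_posture) * certainty

# Toy usage: a model rates itself 3/4 while its token distributions are fairly flat.
dists = [[0.4, 0.3, 0.2, 0.1], [0.5, 0.25, 0.15, 0.1]]
score = fused_confidence(3, mean_token_entropy(dists), max_entropy=math.log(4))
print(round(score, 2))  # downstream, gate outputs against a risk threshold
```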