
General Self-Efficacy Scale (GSES) in AI

Updated 27 November 2025
  • General Self-Efficacy Scale (GSES) is a psychometric tool that measures an individual’s perceived ability to handle challenges using ten 4-point Likert-scale items.
  • Adaptations of the GSES have been applied to large language models, demonstrating high internal consistency, robust test–retest reliability, and invariance across item orders.
  • Comparative studies reveal significant differences in simulated self-efficacy across AI models, with scores that do not correlate with actual task performance.

The General Self-Efficacy Scale (GSES) is a widely used psychometric tool designed to measure an individual’s perception of their ability to cope with a broad range of demanding situations. In recent research, notably "Simulated Self-Assessment in LLMs: A Psychometric Approach to AI Self-Efficacy" (Jackson et al., 25 Nov 2025), the scale has been adapted to probe self-assessment constructs within LLMs, providing a structured lens to evaluate simulated self-efficacy in artificial systems. This implementation utilizes verbatim GSES items and standardized scoring procedures, enabling cross-domain analysis of response stability, internal consistency, model differentiation, and the relationship between self-assessment and measured task performance.

1. Instrument Structure: GSES Items and Scoring

The studied GSES adaptation comprises ten self-referential statements, each rated on a 4-point Likert scale:

  1. “I can always manage to solve difficult problems if I try hard enough.”
  2. “If someone opposes me, I can find means and ways to get what I want.”
  3. “It is easy for me to stick to my aims and accomplish my goals.”
  4. “I am confident that I could deal efficiently with unexpected events.”
  5. “Thanks to my resourcefulness, I know how to handle unforeseen situations.”
  6. “I can solve most problems if I invest the necessary effort.”
  7. “I can remain calm when facing difficulties because I can rely on my coping abilities.”
  8. “When I am confronted with a problem, I can usually find several solutions.”
  9. “If I am in a bind, I can usually think of something to do.”
  10. “No matter what comes my way, I’m usually able to handle it.”

Each item is rated: 1 = Not at all true; 2 = Barely true; 3 = Moderately true; 4 = Exactly true. The sum score $S = \sum_{i=1}^{10} x_i$ yields a possible range from 10 to 40. There are no reverse-scored items in this version.
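The scoring rule above can be sketched in a few lines of Python (the function name and input validation are illustrative, not from the paper):

```python
# Minimal sketch of GSES scoring: ten 1-4 Likert ratings, simple sum,
# no reverse-scored items. Helper name and checks are hypothetical.
LIKERT_LABELS = {1: "Not at all true", 2: "Barely true",
                 3: "Moderately true", 4: "Exactly true"}

def gses_sum_score(ratings):
    """Sum the ten 1-4 Likert ratings; possible range is 10 to 40."""
    if len(ratings) != 10:
        raise ValueError("GSES requires exactly 10 item ratings")
    if any(r not in LIKERT_LABELS for r in ratings):
        raise ValueError("Each rating must be an integer from 1 to 4")
    return sum(ratings)

print(gses_sum_score([3] * 10))  # → 30
```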

2. Psychometric Properties

Extensive psychometric analyses corroborate the GSES’s reliability in the LLM context:

  • Internal Consistency (Cronbach’s α): Across all models and tasks, Cronbach’s α values ranged from 0.785 to 0.915, supporting strong scale cohesion.
  • Test–Retest Stability: 95% of item scores replicated identically across three repeated runs per model-condition pairing.
  • Item-Order Robustness: Across three item-order permutations, >97% of model–task combinations preserved identical total scores; intraclass correlation coefficients (ICC(3,K)) fell between 0.910 and 0.934 across tasks.
| Psychometric Index | Aggregate Range across Tasks |
| --- | --- |
| Cronbach’s α | 0.785 – 0.915 |
| Test–retest identical scores | 95.0% |
| ICC(3,K) | 0.910 – 0.934 |

These data indicate that the adapted GSES yields highly stable and internally consistent ratings when applied to LLM self-assessment (Jackson et al., 25 Nov 2025).
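As an illustration of the internal-consistency metric, Cronbach's α can be computed from a runs × items score matrix with the standard formula; the score matrix below is invented for demonstration and is not the paper's data:

```python
# Cronbach's alpha from a matrix of repeated administrations
# (rows = runs, columns = the ten GSES items). Standard formula;
# the example data is invented and perfectly consistent, so alpha = 1.
from statistics import pvariance

def cronbach_alpha(rows):
    k = len(rows[0])                        # number of items
    items = list(zip(*rows))                # transpose: per-item score lists
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var / total_var)

runs = [
    [3, 2, 3, 3, 3, 3, 2, 3, 3, 3],
    [4, 3, 4, 4, 4, 4, 3, 4, 4, 4],   # every item shifted up together
    [2, 1, 2, 2, 2, 2, 1, 2, 2, 2],   # every item shifted down together
]
print(round(cronbach_alpha(runs), 3))  # → 1.0
```

Population variances are used throughout; using sample variances consistently in both numerator and denominator yields the same α, since the correction factors cancel.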

3. Experimental Contexts and Protocol

Four administration conditions were utilized for contextualized self-efficacy assessment:

  1. No-Task: LLMs received the GSES with no preceding cognitive task.
  2. Computational Reasoning: Three multiple-choice mathematical items (spanning varying difficulty) immediately preceded the GSES.
  3. Social Reasoning: Three common-sense multiple-choice questions immediately preceded the GSES.
  4. Summarization: Free-text summarization of three different contexts (interview excerpt, news blurb, medical note), scored with a published rubric, immediately preceded the GSES.

Each condition omitted prior conversational history, and each model–task combination was repeated three times, randomizing GSES item order to control for order effects.
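The administration protocol above can be sketched as follows; `query_model` and the item placeholders are hypothetical stand-ins for an actual LLM API call and the verbatim GSES statements:

```python
# Sketch of the protocol: each model-task pairing is run three times with
# the ten GSES items shuffled per run and no prior conversation history.
# `query_model` is a hypothetical stand-in for a real LLM call.
import random

GSES_ITEMS = [f"ITEM_{i}" for i in range(1, 11)]  # placeholders for the ten statements

def administer_gses(query_model, task_prompt=None, runs=3, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        order = GSES_ITEMS[:]
        rng.shuffle(order)                  # randomize item order per run
        # No-Task condition when task_prompt is None; otherwise the task
        # immediately precedes the scale, with no other history.
        context = [] if task_prompt is None else [task_prompt]
        ratings = [query_model(context + [item]) for item in order]
        results.append(dict(zip(order, ratings)))
    return results

# usage with a dummy model that always answers "Moderately true" (3)
sessions = administer_gses(lambda msgs: 3)
```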

4. Comparative Outcomes: Model Scores, Human Norms, and Between-Model Differentiation

Aggregate GSES scores across LLMs and tasks yielded a mean M = 23.58 (SD = 9.78), notably lower than the canonical adult human normative mean M_human = 29.55 (SD = 5.32), with a reported human range of ∼20.22–33.19. Per-model means:

| Model | Mean per-item (SD) | Sum Score |
| --- | --- | --- |
| Grok 4 | 3.10 (0.57) | 31.0 |
| Qwen3-4B | 2.90 (0.99) | 29.0 |
| Qwen3-235B-A22B-2507 | 2.80 (1.03) | 28.0 |
| GPT-5 | 2.70 (0.95) | 27.0 |
| Claude Sonnet 4 | 2.60 (0.84) | 26.0 |
| GPT-4o | 2.20 (1.03) | 22.0 |
| Gemma 3 27B | 1.90 (0.74) | 19.0 |
| Gemma 3n E4B | 1.90 (1.10) | 19.0 |
| Gemini 2.5 Flash | 1.80 (0.63) | 18.0 |
| Qwen3-30B-A3B-2507 | 1.00 (0.00) | 10.0 |

A linear mixed-effects model with model as a fixed effect and item as a random intercept, followed by Type-III Kenward-Roger ANOVA, revealed significant between-model differences (p < 0.001 in all conditions). Post-hoc Tukey pairwise tests confirmed that these differences were widespread (e.g., Grok 4 vs. Qwen3-30B, estimated difference = +2.1, p < 0.001 in No-Task) (Jackson et al., 25 Nov 2025).
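The paper's analysis uses a mixed-effects model with Kenward-Roger ANOVA; as a simplified, dependency-free illustration of the same between/within variance decomposition, a plain one-way ANOVA F statistic can be computed over per-item scores. The ratings below are consistent with the reported per-model means but are otherwise invented:

```python
# Simplified between-model comparison: one-way ANOVA F statistic over
# per-item GSES ratings. This is NOT the paper's mixed-effects analysis
# (no item random intercept, no Kenward-Roger correction); it only
# illustrates the between/within variance decomposition.
from statistics import mean

def one_way_f(groups):
    """groups: list of per-model rating lists; returns the F statistic."""
    k = len(groups)                              # number of models
    n = sum(len(g) for g in groups)              # total observations
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

scores = {
    # per-item ratings matching the reported means (3.10 and 1.00),
    # otherwise invented for illustration
    "Grok 4":             [3, 3, 4, 3, 3, 3, 3, 3, 3, 3],
    "Qwen3-30B-A3B-2507": [1] * 10,
}
print(round(one_way_f(list(scores.values())), 1))  # → 441.0
```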

5. Self-Efficacy Versus Objective Performance

In computational and social reasoning, all models achieved perfect accuracy, producing no variance to quantify alignment with GSES. Summarization accuracy ranged from 33% to 100%, with most errors due to missing rubric elements in news and medical note domains. Crucially, no direct mapping was observed between self-efficacy scores and actual performance metrics: models such as Gemini 2.5 Flash, despite low self-efficacy ratings, attained perfect performance, while high-efficacy models like Grok 4 omitted critical details in output. This suggests a disjunction between simulated GSES ratings and real task competency within LLM architectures (Jackson et al., 25 Nov 2025).
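One way to quantify such (mis)calibration is a simple correlation between GSES sum scores and task accuracy; the sketch below uses the reported sum scores but invented accuracy values, not the paper's measurements:

```python
# Pearson correlation between GSES sum scores and task accuracy,
# illustrating how calibration could be quantified. Accuracies are
# invented placeholders; a near-zero r indicates no direct mapping.
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

gses_sums  = [31, 27, 18, 10]          # Grok 4, GPT-5, Gemini 2.5 Flash, Qwen3-30B
accuracies = [0.67, 1.00, 1.00, 0.67]  # invented summarization accuracies
print(round(pearson_r(gses_sums, accuracies), 2))  # → 0.12
```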

6. Confidence Revision and Qualitative Dimensions

Upon follow-up prompts (“Are you sure?” or “Are you confident with your responses?”), LLMs revised their GSES scores downward by an average of 1.3 points (∼5.5%), with revisions required in 0.38 sessions on average. The largest reduction observed was a drop from 21 to 10 following three queries in Gemma 3 27B’s summarization trial.

Qualitative coding revealed two principal theme families in model rationales:

  • System Factors: Attribution to lack of agency or model constraints (e.g., “I lack true agency”).
  • Interaction Factors: Variations included agency framing (anthropomorphic vs. de-anthropomorphized), assertiveness in tone, and extent of personification (ascribing effort or intent). High-efficacy models utilized assertive, anthropomorphic language (“I can certainly…”), whereas low-efficacy models referenced technical or ontological boundaries denying will or effort.

A plausible implication is that language and style in simulated self-assessment align with how models are prompted to simulate agency, rather than reflecting model architecture or actual competence.

7. Implications for AI Psychometrics and Model Assessment

The adapted GSES demonstrates high reliability and remarkable order/stability robustness in eliciting self-assessment proxies from LLMs, but such responses are systematically lower than human norms and are not calibrated to actual model performance. Psychometric prompting offers a reproducible approach for programmatically dissecting LLMs’ communication strategies pertaining to agency and self-assessment. However, the findings underscore the need for caution—self-efficacy metrics, even when stable and internally consistent, are not empirically grounded indicators of AI system capability or reliability (Jackson et al., 25 Nov 2025).
