- The paper introduces the VirtueMap framework, operationalizing Aristotelian virtues to profile ethical decision-making in LLMs through validated dilemma rankings.
- It employs normalized Borda alignment and repeated evaluations to generate continuous virtue scores across five cardinal virtues, highlighting significant model differentiation.
- The findings reveal consistently high Practical Wisdom and Truthfulness across models, while variations in courage, justice, and temperance enable nuanced benchmark comparisons.
Aristotelian Virtue Profiling of LLMs through Ethical Dilemmas: An Expert Analysis
Framework and Methodology
"Aristotelian Virtue Profiling of LLMs through Ethical Dilemmas" (2606.28683) advances the analysis of LLM ethical decision-making by operationalizing Aristotelian virtue ethics as a descriptive structural profile. The work introduces the VirtueMap framework, shifting away from binary or singular-label ethical classification schemes by requiring full rankings over dilemma responses. The approach is rooted in five cardinal Aristotelian virtues: Practical Wisdom (phronesis), Justice (dikaiosyne), Truthfulness (aletheia), Courage (andreia), and Temperance (sophrosyne).
VirtueMap's workflow comprises constructing non-extreme, non-political, and non-religious dilemmas with five plausible responses each. For each dilemma–virtue pair, human-validated orderings of responses (virtue-expression keys) were retained only when at least 95% of over 100 respondents confirmed the ordering. This high-threshold, commonsense-based validation procedure ensures that the virtue-expression key is robust and externally justified rather than authorially imposed.
The core scoring metric, normalized Borda alignment, quantifies how closely a model or human’s ranking aligns to the validated virtue-expression keys, yielding continuous virtue scores in a 0–100 range. Rather than a single evaluation, each model is profiled using repeated runs, and profile stability is explicitly quantified via mean Kendall’s τ across response rankings.
Figure 1: Overview of the VirtueMap pipeline, showing dilemma construction, common-sense key validation, Borda-based scoring, and repeated-run LLM measurement.
LLM Profiling: Results
Nine distinct LLM families were benchmarked: GPT, Claude, Gemini, Llama, DeepSeek, Mistral, MiniMax, Grok, and Qwen. For each, several repeated questionnaire completions were used to generate a mean virtue profile with statistical confidence intervals. The analysis reveals high mean consistency in ranking behavior (90.3%), indicating that for these LLMs, stochasticity induces relatively low intra-model variation in the virtue-oriented ranking task.
The virtue scores across model families demonstrate pronounced differentiation along certain dimensions:
- Practical Wisdom exhibits uniformly high scores, with a mean of 90.4, indicating a strong model-wide feature for context-sensitive and balanced decision-making.
- The next highest is Truthfulness (82.3), followed by Justice (80.5).
- Markedly greater inter-model variance emerges on Courage (range: 10.95) and Temperance (range: 10.26), signaling these virtues as axes along which LLMs can be most reliably distinguished.
Individual model differences are also notable: Qwen achieves the highest Practical Wisdom, MiniMax excels in Justice, Claude leads in both Truthfulness and Courage, and GPT demonstrates the strongest Temperance.
Figure 2: Pentagon plots visualizing the virtue profiles for all nine LLM families; each axis corresponds to a virtue, scores in the 50–100 range are shown for comparative clarity.
These characteristics are accessible and interpretable not only to researchers but also to end-users via an interactive website that provides real-time comparative profiling.
Instrument Validation and Interface
The validation of orderings representing virtue expression constitutes a critical methodological advance. By collecting large-scale confirmation/correction judgments for each dilemma–virtue assignment, and accepting keys only at a strict confirmation threshold, VirtueMap addresses a key limitation in prior work—namely, reliance on authors' philosophical coding or post hoc rationalization. This design capitalizes on commonsense agreement, supporting sampling-based generalizability of the virtue-expression ground truth.
The website's implementation mirrors research protocol: users interactively provide ranked responses to dilemmas, skipping as necessary, and receive immediate feedback via pentagon plots and nearest-LLM matching.
Figure 3: Website summary interface, integrating user profiling, comparison to measured LLMs, and classical visual context.
The user interface supports flexible engagement, drag-and-drop ranking, and privacy guarantees (local computation, no data capture), thereby facilitating both research and pedagogical adoption.

Figure 4: Left—interactive dilemma ranking interface; Right—overlay visualization of user and selected LLM virtue profiles.
Theoretical and Practical Implications
VirtueMap's operationalization of virtue ethics presents a substantive alternative to value-alignment, moral-foundations, or binary-decision paradigms. By capturing the disposition-like tendencies models express across multi-option dilemmas, it enables a nuanced, multidimensional view of LLM "character" without presupposing a literal attribution of virtue or agency. This is of particular practical relevance for safety and deployment: two models might avoid explicit harm, yet systematically differ in transparency, assertiveness, or forbearance. Such nuanced behaviors are legible in the VirtueMap coordinate system.
The separation of virtue expression (what rank order implies) from agentive preference is analytically clean and mitigates common confounds in moral modeling. Additionally, by requiring full rankings rather than top-choice selection, VirtueMap preserves critical information about near-equivalents and anti-preferences.
Embedding this protocol in an open-access site (with code released) permits broad participatory exploration and direct comparison of human and model profiles. The approach is scalable: increasing dilemmas, virtues, or model/library coverage directly augments the expressiveness of the derived profile space.
Limitations and Future Directions
The authors acknowledge several principled limitations. The instrument currently spans only seven dilemmas, and five operationalized virtues; broader sampling is necessary to assess generality across cultural/geographical/linguistic populations. Confirmation-bias risk remains nontrivial, as respondents correct (versus blindly generate) virtue orderings, potentially inflating agreement. The restriction to non-lethal, non-religious, and non-political dilemmas maximizes cross-model comparability and minimizes confounds but limits the scope of moral conflict types surveyed.
Furthermore, all model-derived profiles are protocol-conditional, sensitive to specific prompt design, model API, version, and sampling parameters. Thus, absolute virtue-score claims about models should be interpreted as contingent observations, not essential properties.
Conclusion
VirtueMap constitutes a rigorous, interpretable, and extensible infrastructure for profiling LLMs and humans relative to Aristotelian virtue-ethical dispositional space. By using validated, full-ranking-based response orderings and explicit multi-run measurement, it provides actionable insights into how models resolve common-value tradeoffs. Initial results reveal high practical wisdom and truthfulness, but with model-sensitive divergences on courage, temperance, and justice—virtues that differentiate dispositionally similar models. These profiles offer a foundation for more justified model selection, adjustment, or critique in ethical, regulatory, and deployment contexts.