Mutual Wanting Alignment Framework (M-WAF)

Updated 9 February 2026

M-WAF is a formal framework that quantifies and aligns bidirectional human-AI desire dynamics using empirical metrics and game theory.
It enables real-time monitoring and proactive expectation management through measurement of expectation gaps, trust ratios, and anthropomorphic cues.
Empirical evaluations show improvements in reasoning efficiency, answer quality, and mutual welfare, supporting advanced AI system design.

The Mutual Wanting Alignment Framework (M-WAF) formalizes and quantifies bidirectional desire dynamics between human users and AI systems, providing a rigorous apparatus for analyzing, clustering, and proactively managing human-AI interactions. By combining empirical metrics with formal game-theoretic underpinnings, M-WAF enables measurement, monitoring, and optimization of mutual expectations, desires, and welfare. It delivers actionable tools for AI system design, user experience management, and cooperative training objectives (Shang et al., 27 Oct 2025; Zhu et al., 10 Oct 2025).

1. Theoretical and Empirical Foundations

Mutual wanting encapsulates the network of explicit and implicit desires, expectations, and preferences exchanged between human users ($u$) and AI agents ($a$) during interaction. Each participant is modeled as possessing a “desire vector” ($w_u, w_a \in \mathbb{R}^d$) spanning a $d$-dimensional space of relational, epistemic, and agentic affordances. For users, typical dimensions include reliability, warmth, and creativity; for AI, clarity of input and need for structured feedback are salient (Shang et al., 27 Oct 2025). Empirical data show strong anthropomorphism (48.65% of comments describe AIs with humanlike traits) and a trust-to-betrayal ratio of approximately 11.9:1, indicating a parasocial, relationally charged context that is highly sensitive to expectation violations.

2. Formal Framework Specification

2.1 Desire Vectors and Alignment

For $d \approx 47$ (incorporating user-wants, system-wants, tension indicators, and structural features), each party’s vector is:

User: $w_u = (u_1, \ldots, u_d) \in \mathbb{R}^d$
AI: $w_a = (a_1, \ldots, a_d) \in \mathbb{R}^d$

Alignment is quantified via cosine similarity: $A(u, a) = \frac{\langle w_u, w_a \rangle}{\|w_u\| \|w_a\|} \in [-1, 1].$
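
The alignment score can be computed directly from two desire vectors. The sketch below is a minimal plain-Python illustration, assuming both vectors are already populated; the function name `alignment` and the three toy dimensions (reliability, warmth, creativity) are hypothetical stand-ins for the roughly 47 features the framework uses.

```python
import math
from typing import Sequence


def alignment(w_u: Sequence[float], w_a: Sequence[float]) -> float:
    """Cosine similarity A(u, a) between a user and an AI desire vector."""
    if len(w_u) != len(w_a):
        raise ValueError("desire vectors must share the same dimension d")
    dot = sum(x * y for x, y in zip(w_u, w_a))
    norm_u = math.sqrt(sum(x * x for x in w_u))
    norm_a = math.sqrt(sum(y * y for y in w_a))
    if norm_u == 0.0 or norm_a == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_a)


# Hypothetical toy vectors over (reliability, warmth, creativity).
w_user = [0.9, 0.4, 0.7]
w_ai = [0.8, 0.1, 0.6]
print(f"A(u, a) = {alignment(w_user, w_ai):.3f}")
```

Values near 1 indicate that both parties weight the same dimensions; values near 0 or below signal diverging wants.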

2.2 Expectation Gap

Expectation–reality gaps are measured for each user $i$ as $\Delta_{e_i} = \text{sentiment}(r_i) - \text{sentiment}(e_i)$, where $e_i$ is the user's pre-release expectation and $r_i$ the post-release observation, both scored with VADER compound sentiment in $[-1,1]$. A negative gap means the observed experience fell short of expectations. The aggregate gap over $n$ users is the mean of the per-user gaps: $\Delta_e = \frac{1}{n} \sum_{i=1}^{n} \left( \text{sentiment}(r_i) - \text{sentiment}(e_i) \right)$
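
A minimal sketch of the per-user and aggregate gaps follows, assuming sentiment scores have already been computed. With the vaderSentiment package those would typically be the `"compound"` values returned by `SentimentIntensityAnalyzer().polarity_scores(text)`; the function names and example scores here are hypothetical.

```python
from statistics import mean
from typing import Sequence


def per_user_gaps(expected: Sequence[float], observed: Sequence[float]) -> list[float]:
    """Delta_{e_i} = sentiment(r_i) - sentiment(e_i) for every user i.

    Inputs are sentiment scores already mapped to [-1, 1] (for example,
    VADER compound scores); producing those scores is outside this sketch.
    """
    if len(expected) != len(observed):
        raise ValueError("need exactly one expectation and one observation per user")
    for score in (*expected, *observed):
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"sentiment score {score} lies outside [-1, 1]")
    return [r - e for e, r in zip(expected, observed)]


def aggregate_gap(expected: Sequence[float], observed: Sequence[float]) -> float:
    """Delta_e: the mean of the per-user gaps."""
    return mean(per_user_gaps(expected, observed))


# Hypothetical scores for three users (pre-release expectation, post-release observation).
expected = [0.8, 0.6, 0.4]
observed = [0.2, 0.5, 0.5]
print(per_user_gaps(expected, observed), aggregate_gap(expected, observed))
```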

2.3 Lexicon-Based Metrics

Anthropomorphism Score: $A(c) = \frac{1}{|c|} \sum_{w \in c} \mathbf{1}[w \in L_{\mathrm{anthro}}]$
Trust/Betrayal Ratio: $T(u) = \frac{\sum_{c \in C_u} \text{trust\_words}(c)}{\sum_{c \in C_u} \text{betrayal\_words}(c) + \epsilon}$

These allow fine-grained, interpretable measurement of the linguistic signals underpinning mutual wanting.
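
The two lexicon metrics reduce to token counting. The sketch below assumes comments are already tokenized and lowercased; the lexicons `L_ANTHRO`, `TRUST_WORDS`, and `BETRAYAL_WORDS` are tiny hypothetical placeholders, not the framework's actual word lists, and `eps` plays the role of $\epsilon$ in $T(u)$.

```python
from typing import Iterable, Sequence

# Hypothetical toy lexicons; the framework's actual lexicons are not reproduced here.
L_ANTHRO = {"he", "she", "friend", "feels", "personality", "thinks"}
TRUST_WORDS = {"trust", "reliable", "depend", "helpful"}
BETRAYAL_WORDS = {"betrayed", "lied", "nerfed", "broken"}


def anthropomorphism_score(comment: Sequence[str]) -> float:
    """A(c): share of tokens in comment c that belong to L_anthro."""
    if not comment:
        return 0.0
    return sum(1 for w in comment if w in L_ANTHRO) / len(comment)


def lexicon_hits(comment: Sequence[str], lexicon: set[str]) -> int:
    """Number of tokens in the comment that appear in the lexicon."""
    return sum(1 for w in comment if w in lexicon)


def trust_betrayal_ratio(comments: Iterable[Sequence[str]], eps: float = 1e-6) -> float:
    """T(u): trust-word count over betrayal-word count across a user's comments C_u."""
    comments = list(comments)
    trust = sum(lexicon_hits(c, TRUST_WORDS) for c in comments)
    betrayal = sum(lexicon_hits(c, BETRAYAL_WORDS) for c in comments)
    return trust / (betrayal + eps)  # eps avoids division by zero


c1 = ["i", "trust", "it", "like", "a", "friend"]
c2 = ["now", "it", "feels", "broken"]
print(anthropomorphism_score(c1), trust_betrayal_ratio([c1, c2]))
```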

3. Algorithmic Components and Clustering

3.1 Multi-Stage Preprocessing

Data sources include 22,411 Reddit posts associated with GPT-5 and 729 API responses collected through controlled probes. Text is stripped of HTML and Markdown, tokenized and lemmatized with spaCy, and scored with bespoke lexicons for user wanting, system wanting, and tension; stylistic features are extracted alongside.
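
A rough sketch of the stripping and tokenization stages follows. It deliberately swaps spaCy for regular expressions so it runs without downloading a model; a spaCy pipeline (for example `spacy.load("en_core_web_sm")` with `token.lemma_`) would replace `tokenize` and add lemmatization. The regex patterns and function names are assumptions, not the framework's preprocessing code.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                     # HTML tags
MD_LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")   # [text](url) -> text
MD_MARKUP_RE = re.compile(r"[*_`#>~]+")             # emphasis, code ticks, headings, quotes
TOKEN_RE = re.compile(r"[a-z0-9']+")


def strip_markup(text: str) -> str:
    """Remove HTML tags, decode entities, and drop common Markdown syntax."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = MD_LINK_RE.sub(r"\1", text)
    text = MD_MARKUP_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Lowercased word tokens: a crude stand-in for spaCy tokenization plus lemmatization."""
    return TOKEN_RE.findall(strip_markup(text).lower())


post = "<p>**GPT-5** feels [colder](https://example.com) now &amp; I miss it</p>"
print(tokenize(post))
```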

3.2 Dual-Algorithm Topic Modeling

Latent Dirichlet Allocation (LDA; $K=10$) and Non-Negative Matrix Factorization (NMF; $K=10$) run in parallel to stabilize thematic extraction, and their topic weightings are averaged. Under LDA, the marginal likelihood of a document $w = (w_1, \ldots, w_N)$ given the Dirichlet prior $\alpha$ and topic-word distributions $\beta$ is: $p(w|\alpha, \beta) = \int p(\theta|\alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta) p(w_n|z_n, \beta) d\theta$
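
One way to realize the averaging of topic weightings is sketched below, assuming both models emit an $n_{\text{docs}} \times K$ document-topic matrix (as the `fit_transform` methods of scikit-learn's `LatentDirichletAllocation` and `NMF` do) and that topic $k$ has already been matched to the same theme in both models. The matching step and the row normalization of NMF weights are assumptions; the source does not specify either.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def normalize_rows(matrix: Matrix) -> list[list[float]]:
    """Rescale each document's topic weights to sum to 1 (uniform if all zero)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else [1.0 / len(row)] * len(row))
    return out


def average_topic_weights(lda_doc_topic: Matrix, nmf_doc_topic: Matrix) -> list[list[float]]:
    """Average per-document topic weights from LDA and NMF.

    Assumes both inputs are n_docs x K and that column k denotes the same topic
    in both models, i.e. topics have already been matched across models.
    """
    if len(lda_doc_topic) != len(nmf_doc_topic):
        raise ValueError("both models must score the same documents")
    if any(len(a) != len(b) for a, b in zip(lda_doc_topic, nmf_doc_topic)):
        raise ValueError("both models must use the same number of topics K")
    lda = normalize_rows(lda_doc_topic)
    nmf = normalize_rows(nmf_doc_topic)  # raw NMF weights are not distributions
    return [[(a + b) / 2 for a, b in zip(r_lda, r_nmf)] for r_lda, r_nmf in zip(lda, nmf)]


# One document, K = 3 topics (the framework uses K = 10).
lda = [[0.7, 0.2, 0.1]]
nmf = [[2.0, 1.0, 1.0]]
print(average_topic_weights(lda, nmf))
```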

3.3 Clustering into Wanting Types

Each comment or probe is transformed into a 47-dimensional feature vector. K-means clustering ( $K^* = 10$ , optimized via silhouette score) yields 11 distinct “mutual wanting” types ( $C_0, \ldots, C_{10}$ ). Clusters capture axes such as “Creativity-seeking” (43.1% of users), “Expectation-violation” (9.4%), and others, supporting downstream personalization and system adaptation.

Cluster Output Table Example

Cluster ID	Description	Proportion
C5	Creativity-seeking	43.1%
C3	Warmth-seeking	—
C7	Expectation-violation	9.4%
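
The clustering step can be sketched without external libraries: Lloyd's k-means with random restarts, then selection of $K^*$ by mean silhouette score. This is a minimal illustration on hypothetical 2-D points rather than the framework's 47-dimensional features; in practice `sklearn.cluster.KMeans` and `sklearn.metrics.silhouette_score` would be the usual choice. Function names, restart count, and seed are assumptions.

```python
import math
import random

Point = tuple[float, ...]


def kmeans(points: list[Point], k: int, n_init: int = 10, max_iter: int = 100, seed: int = 0) -> list[int]:
    """Lloyd's k-means with random restarts; returns labels from the lowest-inertia run."""
    rng = random.Random(seed)
    best_labels, best_inertia = [], math.inf
    for _ in range(n_init):
        centers = [list(p) for p in rng.sample(points, k)]
        for _ in range(max_iter):
            # Assignment step: each point joins its nearest center.
            labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
            # Update step: move each center to the mean of its members (keep it if empty).
            new_centers = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                new_centers.append([sum(xs) / len(members) for xs in zip(*members)] if members else centers[j])
            if new_centers == centers:
                break
            centers = new_centers
        inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def silhouette(points: list[Point], labels: list[int]) -> float:
    """Mean silhouette coefficient; s(i) = (b - a) / max(a, b), 0 for singletons."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def select_k(points: list[Point], candidates: list[int]) -> tuple[int, dict[int, float]]:
    """Pick the K whose clustering has the highest mean silhouette score."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical 2-D toy data with three well-separated groups.
pts = [(0, 0), (0.5, 0.2), (0.2, 0.5), (10, 0), (10.4, 0.3), (9.7, 0.2), (0, 10), (0.3, 10.4), (0.1, 9.6)]
best_k, scores = select_k(pts, [2, 3, 4])
print(best_k, scores)
```

On this toy data the silhouette criterion picks $K = 3$, matching the three planted groups.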

4. Game-Theoretic Mutual Welfare Modeling

M-WAF is foundational to “GTAlign” (Zhu et al., 10 Oct 2025), treating each user–LLM exchange as a two-player normal-form game with defined strategy sets:

$S_u = \{\text{VQ, DQ}\}$ (Vague/Detailed Question)
$S_\ell = \{\text{DA, CQ, AQ}\}$ (Direct Answer, Clarifying Question, Answer+Question)

Payoff functions: $U(s_u, s_\ell) = (U_u(s_u, s_\ell), U_\ell(s_u, s_\ell)) \in \mathbb{R}^2$ where utilities combine answer quality, user/model cost, and (optionally) reasoning complexity via convex coefficients $\boldsymbol{\theta}_U, \boldsymbol{\theta}_\ell$ .
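
The payoff description leaves the exact form of $U_u$ and $U_\ell$ open. The sketch below is one hypothetical instantiation: each utility is a convex combination of answer quality and cost terms, with costs entered as $1 - \text{cost}$ so utilities stay in $[0, 1]$. The `Outcome` fields, the default $\boldsymbol{\theta}$ values, and that sign convention are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    """Hypothetical per-exchange measurements, each normalized to [0, 1]."""
    quality: float      # answer quality
    user_cost: float    # e.g. effort the user spends reading or clarifying
    model_cost: float   # e.g. tokens the model generates
    complexity: float   # optional reasoning-complexity term


def check_convex(theta: Sequence[float]) -> Sequence[float]:
    """Convex coefficients must be non-negative and sum to 1."""
    if any(t < 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    return theta


def user_utility(o: Outcome, theta_u: Sequence[float] = (0.7, 0.3)) -> float:
    w_quality, w_cost = check_convex(theta_u)
    return w_quality * o.quality + w_cost * (1.0 - o.user_cost)


def llm_utility(o: Outcome, theta_l: Sequence[float] = (0.6, 0.3, 0.1)) -> float:
    w_quality, w_cost, w_complexity = check_convex(theta_l)
    return w_quality * o.quality + w_cost * (1.0 - o.model_cost) + w_complexity * (1.0 - o.complexity)


o = Outcome(quality=0.8, user_cost=0.2, model_cost=0.5, complexity=0.4)
print(user_utility(o), llm_utility(o))
```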

M-WAF targets Pareto-optimal joint actions, eschewing the Nash equilibrium when it produces “Prisoner’s Dilemma” suboptimalities. Pareto-optimality is defined as: $(s_u^*, s_\ell^*) \text{ is Pareto-optimal} \iff \nexists (s_u, s_\ell) \; \text{with} \; U_u(s_u, s_\ell) \ge U_u(s_u^*, s_\ell^*), U_\ell(s_u, s_\ell) \ge U_\ell(s_u^*, s_\ell^*),$ with at least one inequality strict.
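
To make the Nash-versus-Pareto distinction concrete, the sketch below enumerates pure Nash equilibria and the Pareto-optimal set of a $2 \times 3$ game over $S_u$ and $S_\ell$. The payoff numbers are invented for illustration and do not come from GTAlign.

```python
from itertools import product

S_U = ("VQ", "DQ")         # vague / detailed question
S_L = ("DA", "CQ", "AQ")   # direct answer / clarifying question / answer + question

# Hypothetical payoffs (U_u, U_l), chosen to give the game a prisoner's-dilemma shape.
PAYOFF = {
    ("VQ", "DA"): (0.4, 0.4),
    ("VQ", "CQ"): (0.6, 0.2),
    ("VQ", "AQ"): (0.8, 0.3),
    ("DQ", "DA"): (0.3, 0.9),
    ("DQ", "CQ"): (0.2, 0.5),
    ("DQ", "AQ"): (0.7, 0.7),
}


def dominates(x: tuple[float, float], y: tuple[float, float]) -> bool:
    """x Pareto-dominates y: no worse for either player and strictly better for one."""
    return x[0] >= y[0] and x[1] >= y[1] and x != y


def pareto_optimal(payoff):
    """Joint actions whose payoff vector no other joint action dominates."""
    return {s for s, v in payoff.items() if not any(dominates(w, v) for w in payoff.values())}


def pure_nash(payoff):
    """Joint actions where neither player gains by deviating unilaterally."""
    equilibria = set()
    for su, sl in product(S_U, S_L):
        u_user, u_llm = payoff[(su, sl)]
        user_best = all(payoff[(alt, sl)][0] <= u_user for alt in S_U)
        llm_best = all(payoff[(su, alt)][1] <= u_llm for alt in S_L)
        if user_best and llm_best:
            equilibria.add((su, sl))
    return equilibria


print("Nash:", pure_nash(PAYOFF))
print("Pareto-optimal:", pareto_optimal(PAYOFF))
```

With these numbers VQ and DA are dominant strategies, so the unique pure equilibrium (VQ, DA) pays (0.4, 0.4), while (DQ, AQ) pays both players 0.7: the prisoner's-dilemma pattern that motivates targeting the Pareto frontier instead.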

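The mutual welfare selection criterion is the geometric mean of the two utilities, which presupposes that both are non-negative: $W_{\text{mutual}}(s_u, s_\ell) = \sqrt{U_u(s_u, s_\ell) \times U_\ell(s_u, s_\ell)}$. Unlike an arithmetic mean, it falls to zero whenever either party's utility does, so it favors balanced outcomes over lopsided ones.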
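
The design choice of a geometric rather than arithmetic mean shows up in a two-candidate comparison. The payoffs below are hypothetical; the sketch only illustrates how the two aggregators can disagree.

```python
import math


def mutual_welfare(u_user: float, u_llm: float) -> float:
    """W_mutual = sqrt(U_u * U_l); defined here only for non-negative utilities."""
    if u_user < 0 or u_llm < 0:
        raise ValueError("the geometric mean assumes non-negative utilities")
    return math.sqrt(u_user * u_llm)


# Hypothetical payoffs (U_u, U_l) for two candidate joint actions.
candidates = {
    ("VQ", "AQ"): (0.95, 0.20),  # great for the user, costly for the model
    ("DQ", "DA"): (0.55, 0.50),  # balanced
}

best_geometric = max(candidates, key=lambda s: mutual_welfare(*candidates[s]))
best_arithmetic = max(candidates, key=lambda s: sum(candidates[s]) / 2)
print(best_geometric, best_arithmetic)
```

The arithmetic mean prefers the lopsided (0.95, 0.20) outcome (0.575 vs. 0.525), whereas the geometric mean prefers the balanced one (about 0.524 vs. 0.436).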

Training uses reinforcement learning with PPO: the instantaneous reward at turn $t$ is the mutual welfare of that turn's joint action, $r_t = W_{\text{mutual}}(s_u^{(t)}, s_\ell^{(t)})$, and the returns built from these rewards feed the standard PPO loss.
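
The reward wiring can be sketched in a few lines: per-turn mutual-welfare rewards, discounted returns, and PPO's clipped surrogate loss. This is a simplified illustration, assuming advantages are returns minus a given value baseline (rather than, say, generalized advantage estimation) and using made-up log-probabilities; it is not GTAlign's training code.

```python
import math
from typing import Sequence


def mutual_welfare_rewards(utilities: Sequence[tuple[float, float]]) -> list[float]:
    """r_t = sqrt(U_u^(t) * U_l^(t)) for each turn t (utilities assumed non-negative)."""
    return [math.sqrt(u_user * u_llm) for u_user, u_llm in utilities]


def discounted_returns(rewards: Sequence[float], gamma: float = 1.0) -> list[float]:
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2) -> float:
    """Negative of PPO's clipped surrogate objective, averaged over the batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                      # pi_new / pi_old
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Hypothetical 3-turn episode: per-turn (U_u, U_l), value baselines, log-probs.
rewards = mutual_welfare_rewards([(0.64, 0.81), (0.49, 0.36), (0.81, 0.64)])
returns = discounted_returns(rewards, gamma=0.99)
baseline = [1.5, 1.0, 0.6]
advantages = [g - b for g, b in zip(returns, baseline)]
loss = ppo_clip_loss([-0.9, -1.1, -0.7], [-1.0, -1.0, -1.0], advantages)
print(rewards, returns, loss)
```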

5. Quantitative Evaluation and Empirical Findings

Comprehensive evaluation reveals:

Anthropomorphism rate: 48.65% of comments.
Trust-to-betrayal ratio: $T \approx 11.9$ .
Expectation-reality gap: $\Delta_e \approx -0.269$ post-GPT-5.
Sentiment decrease post-model update: Compound sentiment $-0.0441$ (statistically significant, $p=0.0312$ ).
Cluster prevalence: e.g., C5 (“Creativity-seeking”) 43.1%.
Distinct user types: 11 clusters, including “Expectation-violation” (9.4%).

Key empirical improvements for M-WAF-trained GTAlign systems:

Reasoning efficiency: +21.5% (in-distribution tasks)
Answer quality: +4.9%
Mutual welfare: +7.2% (in-distribution), +10.5% (out-of-distribution)
Human satisfaction: +11.3% vs. SFT baseline

Detailed Pareto-efficiency metrics (coverage, hypervolume, regret) show the geometric mean reward induces a superior frontier compared to alternatives.

6. Practical Applications and Implementation Guidelines

M-WAF enables proactive user experience management and system design innovations:

Real-time Monitoring: Track per-instance expectation-reality gaps ( $\Delta_{e_i}$ ) and linguistic triggers (“not what I expected,” “used to work better”) for early detection of misalignment.
Cluster-Aware Personalization: Interfaces can detect user type (e.g., C3: “Warmth-seeking”) and dynamically adjust system style, increasing use of empathy markers and personalization.
Persona Boundaries: Systematic boundaries can be imposed to safely accommodate anthropomorphism while minimizing risk of over-trust and expectation mismanagement.
Relational Continuity: Minimize abrupt persona/style changes during model upgrades to preserve user trust and continuity.
Transparency Features: For “Honesty-seeking” users, surface uncertainty quantification and explicit caveats.
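
A minimal monitoring hook might combine both signals named above, the per-instance gap and the quoted trigger phrases. The `Alert` structure, the function name, and the $-0.3$ threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Trigger phrases quoted in the framework description; matching is case-insensitive.
TRIGGER_PHRASES = ("not what i expected", "used to work better")


@dataclass
class Alert:
    user_id: str
    gap: float                                   # Delta_{e_i} = sentiment(r_i) - sentiment(e_i)
    triggers: list[str] = field(default_factory=list)


def check_instance(
    user_id: str,
    text: str,
    expected_sentiment: float,
    observed_sentiment: float,
    gap_threshold: float = -0.3,                 # hypothetical alerting threshold
) -> Optional[Alert]:
    """Flag an interaction whose gap is strongly negative or whose text contains a trigger phrase."""
    gap = observed_sentiment - expected_sentiment
    lowered = text.lower()
    hits = [phrase for phrase in TRIGGER_PHRASES if phrase in lowered]
    if gap <= gap_threshold or hits:
        return Alert(user_id, gap, hits)
    return None


print(check_instance("u1", "Honestly, this is not what I expected.", 0.6, 0.5))
print(check_instance("u2", "Works fine for me.", 0.2, 0.3))
```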

Best practices include maintaining a calibration dataset around each major model transition for ongoing lexicon re-weighting, using M-WAF metrics as KPIs in A/B testing, and supporting multi-agent orchestration for framework maintenance.

7. Dynamic Incentive Alignment and Adaptation

M-WAF’s explicit payoff-reasoning enables dynamic adaptation to evolving service conditions, such as pricing changes. Inference-time “payoff matrix” remodeling allows, for example, penalizing user utility for answer length when shifting to per-token pricing, or penalizing LLM utility under a subscription regime. This mechanism requires no parameter updates—only structured input modification—enabling transparent and incentive-aligned control over system behavior (Zhu et al., 10 Oct 2025).
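
The remodeling idea can be illustrated by editing payoff values and re-running the welfare-maximizing selection. The sketch below assumes a hypothetical payoff table with a normalized answer-length column and a linear length penalty `lam`; it models only the effect of the structured-input change on the payoffs, not how GTAlign encodes it in the prompt.

```python
import math

# Hypothetical base payoffs: joint action -> (U_u, U_l, normalized answer length in [0, 1]).
BASE = {
    ("DQ", "DA"): (0.60, 0.70, 0.3),
    ("DQ", "AQ"): (0.80, 0.60, 0.9),
    ("VQ", "CQ"): (0.50, 0.55, 0.1),
}


def remodel(payoff, regime: str, lam: float = 0.4):
    """Rewrite the payoff matrix for a pricing regime; no model parameters change."""
    out = {}
    for action, (u_user, u_llm, length) in payoff.items():
        if regime == "per_token":        # the user pays per token, so length costs the user
            u_user -= lam * length
        elif regime == "subscription":   # flat fee, so length costs the provider
            u_llm -= lam * length
        elif regime != "base":
            raise ValueError(f"unknown pricing regime: {regime}")
        out[action] = (max(u_user, 0.0), max(u_llm, 0.0))  # keep utilities non-negative
    return out


def select(payoff):
    """Joint action with the highest geometric-mean mutual welfare."""
    return max(payoff, key=lambda a: math.sqrt(payoff[a][0] * payoff[a][1]))


for regime in ("base", "per_token", "subscription"):
    print(regime, select(remodel(BASE, regime)))
```

Under the base payoffs the long (DQ, AQ) response wins; either pricing penalty shifts the choice to the shorter (DQ, DA) response, with no change to any model weights.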

M-WAF establishes a robust paradigm for quantifying and aligning the complex, bidirectional space of human and AI wants, supporting both empirical analysis and principled system design. Its formalization and demonstrated empirical impact make it a foundational approach for building trustworthy, relationally-attuned, and mutually beneficial human-AI systems (Shang et al., 27 Oct 2025; Zhu et al., 10 Oct 2025).

References

Shang et al. (2025). Mutual Wanting in Human–AI Interaction: Empirical Evidence from Large-Scale Analysis of GPT Model Transitions.

Zhu et al. (2025). GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare.
