
Mutual Wanting Alignment Framework (M-WAF)

Updated 9 February 2026
  • M-WAF is a formal framework that quantifies and aligns bidirectional human-AI desire dynamics using empirical metrics and game theory.
  • It enables real-time monitoring and proactive expertise management through measurement of expectation gaps, trust ratios, and anthropomorphic cues.
  • Empirical evaluations show improvements in reasoning efficiency, answer quality, and mutual welfare, supporting advanced AI system design.

The Mutual Wanting Alignment Framework (M-WAF) formalizes and quantifies bidirectional desire dynamics between human users and AI systems, providing a rigorous apparatus for analyzing, clustering, and proactively managing human-AI interactions. By combining empirical metrics with formal game-theoretic underpinnings, M-WAF enables measurement, monitoring, and optimization of mutual expectations, desires, and welfare. It delivers actionable tools for AI system design, user experience management, and cooperative training objectives (Shang et al., 27 Oct 2025, Zhu et al., 10 Oct 2025).

1. Theoretical and Empirical Foundations

Mutual wanting encapsulates the network of explicit and implicit desires, expectations, and preferences exchanged between human users ($u$) and AI agents ($a$) during interaction. Each participant is modeled as possessing a “desire vector” ($w_u, w_a \in \mathbb{R}^d$) spanning a $d$-dimensional space of relational, epistemic, and agentic affordances. For users, typical dimensions include reliability, warmth, and creativity; for AI, clarity of input and need for structured feedback are salient (Shang et al., 27 Oct 2025). Empirical data show strong anthropomorphism (48.65% of users describe AIs with humanlike traits) and a trust–betrayal ratio of approximately 11.9:1, indicating a parasocial, relationally charged context that is highly sensitive to expectation violations.

2. Formal Framework Specification

2.1 Desire Vectors and Alignment

For $d \approx 47$ (incorporating user-wants, system-wants, tension indicators, and structural features), each party’s vector is:

  • User: $w_u = (u_1, \ldots, u_d) \in \mathbb{R}^d$
  • AI: $w_a = (a_1, \ldots, a_d) \in \mathbb{R}^d$

Alignment is quantified via cosine similarity:

$$A(u, a) = \frac{\langle w_u, w_a \rangle}{\|w_u\| \, \|w_a\|} \in [-1, 1].$$
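The cosine-similarity alignment score can be sketched in a few lines. This is a minimal illustration and not the authors' implementation; the function name `alignment` and the toy 3-dimensional vectors are assumptions made for the example (the framework itself uses about 47 dimensions).

```python
import math
from typing import Sequence


def alignment(w_u: Sequence[float], w_a: Sequence[float]) -> float:
    """Cosine similarity A(u, a) between a user and an AI desire vector."""
    if len(w_u) != len(w_a):
        raise ValueError("desire vectors must share the same dimension d")
    dot = sum(u * a for u, a in zip(w_u, w_a))
    norm_u = math.sqrt(sum(u * u for u in w_u))
    norm_a = math.sqrt(sum(a * a for a in w_a))
    if norm_u == 0.0 or norm_a == 0.0:
        # Cosine similarity is undefined when either vector is all zeros.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_a)


# Toy vectors with illustrative values only (hypothetical dimensions).
user_vector = [0.9, 0.7, 0.2]
ai_vector = [0.8, 0.1, 0.6]
score = alignment(user_vector, ai_vector)
```

A score near 1 means the two parties weight the dimensions similarly, 0 means the vectors are orthogonal, and negative values indicate opposed wants.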

2.2 Expectation Gap

Expectation–reality gaps are measured for each user $i$ as:

$$\Delta_{e_i} = \text{sentiment}(r_i) - \text{sentiment}(e_i)$$

where $e_i$ is the pre-release expectation and $r_i$ the post-release observation, using sentiment scores in $[-1, 1]$ (VADER). The aggregate gap is:

$$\Delta_e = \frac{1}{n} \sum_{i=1}^{n} \left( \text{sentiment}(r_i) - \text{sentiment}(e_i) \right)$$
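As a small worked sketch of the gap computation, the code below assumes the sentiment scores have already been produced by an external scorer (for instance a VADER compound score) and are passed in as plain floats in $[-1, 1]$. The function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def per_user_gaps(expected: Sequence[float], observed: Sequence[float]) -> List[float]:
    """Per-user gap: sentiment(r_i) - sentiment(e_i)."""
    if len(expected) != len(observed):
        raise ValueError("need one expectation and one observation per user")
    return [r - e for e, r in zip(expected, observed)]


def aggregate_gap(expected: Sequence[float], observed: Sequence[float]) -> float:
    """Aggregate gap: the mean of the per-user gaps."""
    return mean(per_user_gaps(expected, observed))


# Illustrative scores only: pre-release expectations vs. post-release observations.
expected_scores = [0.5, 0.0]
observed_scores = [0.0, -0.5]
gaps = per_user_gaps(expected_scores, observed_scores)
overall = aggregate_gap(expected_scores, observed_scores)
```

A negative value means the observed sentiment fell short of the expectation.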

2.3 Lexicon-Based Metrics

  • Anthropomorphism Score: $A(c) = \frac{1}{|c|} \sum_{w \in c} \mathbf{1}[w \in L_{\mathrm{anthro}}]$
  • Trust/Betrayal Ratio: $T(u) = \frac{\sum_{c \in C_u} \text{trust\_words}(c)}{\sum_{c \in C_u} \text{betrayal\_words}(c) + \epsilon}$

These allow fine-grained, interpretable measurement of the linguistic signals underpinning mutual wanting.
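Both lexicon metrics reduce to token counting. The sketch below is illustrative: the three mini-lexicons are hypothetical stand-ins for the framework's actual lexicons, and a simple regex tokenizer replaces the lemmatization step, so counts will differ from a lemmatized pipeline.

```python
import re
from typing import Iterable, List, Set

# Hypothetical mini-lexicons, for illustration only.
L_ANTHRO: Set[str] = {"friend", "personality", "feels", "cares", "understands"}
L_TRUST: Set[str] = {"trust", "reliable", "depend"}
L_BETRAYAL: Set[str] = {"betrayed", "lied", "abandoned"}


def tokenize(comment: str) -> List[str]:
    """Lowercase word tokenizer (no lemmatization in this sketch)."""
    return re.findall(r"[a-z']+", comment.lower())


def anthropomorphism_score(comment: str, lexicon: Set[str] = L_ANTHRO) -> float:
    """A(c): fraction of tokens in comment c that belong to the lexicon."""
    tokens = tokenize(comment)
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def trust_betrayal_ratio(comments: Iterable[str], eps: float = 1e-6) -> float:
    """T(u): trust-word count over betrayal-word count (plus eps) for one user."""
    trust = betrayal = 0
    for comment in comments:
        tokens = tokenize(comment)
        trust += sum(t in L_TRUST for t in tokens)
        betrayal += sum(t in L_BETRAYAL for t in tokens)
    return trust / (betrayal + eps)
```

The $\epsilon$ term keeps the ratio finite for users who never use a betrayal word; its value here (`1e-6`) is an assumption.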

3. Algorithmic Components and Clustering

3.1 Multi-Stage Preprocessing

Data sources comprise 22,411 Reddit posts associated with GPT-5 and 729 API responses obtained through controlled probing. Text is stripped of HTML and Markdown, then tokenized and lemmatized with spaCy. Bespoke lexicons for user wanting, system wanting, and tension are applied, and stylistic features are extracted.

3.2 Dual-Algorithm Topic Modeling

Latent Dirichlet Allocation (LDA; $K = 10$) and Non-Negative Matrix Factorization (NMF; $K = 10$) run in parallel to stabilize thematic extraction. Topic weightings are averaged. The LDA topic model is formally:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \, d\theta$$

3.3 Clustering into Wanting Types

Each comment or probe is transformed into a 47-dimensional feature vector. K-means clustering ($K^* = 10$, optimized via silhouette score) yields 11 distinct “mutual wanting” types ($C_0, \ldots, C_{10}$). Clusters capture axes such as “Creativity-seeking” (43.1% of users), “Expectation-violation” (9.4%), and others, supporting downstream personalization and system adaptation.

Cluster Output Table Example

| Cluster ID | Description | Proportion |
|---|---|---|
| C5 | Creativity-seeking | 43.1% |
| C3 | Warmth-seeking | |
| C7 | Expectation-violation | 9.4% |

4. Game-Theoretic Mutual Welfare Modeling

M-WAF is foundational to “GTAlign” (Zhu et al., 10 Oct 2025), treating each user–LLM exchange as a two-player normal-form game with defined strategy sets:

  • $S_u = \{\text{VQ}, \text{DQ}\}$ (Vague Question, Detailed Question)
  • $S_\ell = \{\text{DA}, \text{CQ}, \text{AQ}\}$ (Direct Answer, Clarifying Question, Answer+Question)

Payoff functions:

$$U(s_u, s_\ell) = \left( U_u(s_u, s_\ell), \, U_\ell(s_u, s_\ell) \right) \in \mathbb{R}^2$$

where utilities combine answer quality, user/model cost, and (optionally) reasoning complexity via convex coefficients $\boldsymbol{\theta}_U, \boldsymbol{\theta}_\ell$.

M-WAF targets Pareto-optimal joint actions, eschewing the Nash equilibrium when it produces “Prisoner’s Dilemma” suboptimalities. Pareto-optimality is defined as:

$$(s_u^*, s_\ell^*) \text{ is Pareto-optimal} \iff \nexists \, (s_u, s_\ell) \; \text{with} \; U_u(s_u, s_\ell) \ge U_u(s_u^*, s_\ell^*), \; U_\ell(s_u, s_\ell) \ge U_\ell(s_u^*, s_\ell^*),$$

with at least one inequality strict.

The mutual welfare selection criterion employs the geometric mean:

$$W_{\text{mutual}}(s_u, s_\ell) = \sqrt{U_u(s_u, s_\ell) \times U_\ell(s_u, s_\ell)}$$
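To make the Pareto filter and the geometric-mean criterion concrete, the sketch below works through a toy 2×3 game. The payoff numbers are invented for illustration, utilities are assumed to be non-negative (the square root requires this), and applying the welfare criterion to the Pareto frontier is one plausible reading of how the two definitions combine, not a confirmed description of GTAlign.

```python
from math import sqrt
from typing import Dict, List, Tuple

Action = Tuple[str, str]      # (user strategy, LLM strategy)
Payoff = Tuple[float, float]  # (U_u, U_l)

# Hypothetical payoff matrix, for illustration only.
PAYOFFS: Dict[Action, Payoff] = {
    ("VQ", "DA"): (0.4, 0.9),
    ("VQ", "CQ"): (0.6, 0.6),
    ("VQ", "AQ"): (0.7, 0.5),
    ("DQ", "DA"): (0.8, 0.7),
    ("DQ", "CQ"): (0.5, 0.4),
    ("DQ", "AQ"): (0.9, 0.3),
}


def dominates(p: Payoff, q: Payoff) -> bool:
    """True if p is at least as good as q for both players and strictly better for one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])


def pareto_optimal(payoffs: Dict[Action, Payoff]) -> List[Action]:
    """Joint actions whose payoff is not dominated by any other joint action."""
    return [
        action
        for action, payoff in payoffs.items()
        if not any(dominates(other, payoff) for other in payoffs.values())
    ]


def mutual_welfare(payoff: Payoff) -> float:
    """Geometric mean of the two utilities (assumes both are non-negative)."""
    u_user, u_model = payoff
    if u_user < 0 or u_model < 0:
        raise ValueError("geometric mean requires non-negative utilities")
    return sqrt(u_user * u_model)


def select_joint_action(payoffs: Dict[Action, Payoff]) -> Action:
    """Pick the Pareto-optimal joint action with the highest mutual welfare."""
    frontier = pareto_optimal(payoffs)
    return max(frontier, key=lambda action: mutual_welfare(payoffs[action]))


frontier = pareto_optimal(PAYOFFS)
best = select_joint_action(PAYOFFS)
```

In this toy matrix the frontier contains three joint actions, and the geometric mean favors the balanced one over outcomes that are lopsided toward either player.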

Training employs reinforcement learning (PPO) with instantaneous rewards $r_t = W_{\text{mutual}}(s_u^{(t)}, s_\ell^{(t)})$, using the standard PPO loss with mutual-welfare-rewarded returns.

5. Quantitative Evaluation and Empirical Findings

Comprehensive evaluation reveals:

  • Anthropomorphism rate: 48.65% of comments.
  • Trust-to-betrayal ratio: $T \approx 11.9$.
  • Expectation-reality gap: $\Delta_e \approx -0.269$ post-GPT-5.
  • Sentiment decrease post-model update: compound sentiment $-0.0441$ (statistically significant, $p = 0.0312$).
  • Cluster prevalence: e.g., C5 (“Creativity-seeking”) 43.1%.
  • Distinct user types: 11 clusters, including “Expectation-violation” (9.4%).

Key empirical improvements for M-WAF-trained GTAlign systems:

  • Reasoning efficiency: +21.5% (in-distribution tasks)
  • Answer quality: +4.9%
  • Mutual welfare: +7.2% (in-distribution), +10.5% (out-of-distribution)
  • Human satisfaction: +11.3% vs. SFT baseline

Detailed Pareto-efficiency metrics (coverage, hypervolume, regret) show the geometric mean reward induces a superior frontier compared to alternatives.

6. Practical Applications and Implementation Guidelines

M-WAF enables proactive user experience management and system design innovations:

  • Real-time Monitoring: Track per-instance expectation-reality gaps ($\Delta_{e_i}$) and linguistic triggers (“not what I expected,” “used to work better”) for early detection of misalignment.
  • Cluster-Aware Personalization: Interfaces can detect user type (e.g., C3: “Warmth-seeking”) and dynamically adjust system style, increasing use of empathy markers and personalization.
  • Persona Boundaries: Systematic boundaries can be imposed to safely accommodate anthropomorphism while minimizing risk of over-trust and expectation mismanagement.
  • Relational Continuity: Minimize abrupt persona/style changes during model upgrades to preserve user trust and continuity.
  • Transparency Features: For “Honesty-seeking” users, surface uncertainty quantification and explicit caveats.
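The real-time monitoring item above can be sketched as a simple per-comment check. The two trigger phrases come from the text; the gap threshold, the `Alert` structure, and the function name are hypothetical, and sentiment scores are assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional

# Trigger phrases quoted in the text, lowercased for matching.
TRIGGER_PHRASES = ("not what i expected", "used to work better")
# Hypothetical alert threshold on the per-instance gap; tune on real data.
GAP_THRESHOLD = -0.25


@dataclass
class Alert:
    comment: str
    gap: float
    matched_phrases: List[str]


def check_comment(
    comment: str, expected_sentiment: float, observed_sentiment: float
) -> Optional[Alert]:
    """Return an Alert if the comment shows a large negative gap or a trigger phrase."""
    gap = observed_sentiment - expected_sentiment
    lowered = comment.lower()
    matched = [phrase for phrase in TRIGGER_PHRASES if phrase in lowered]
    if matched or gap <= GAP_THRESHOLD:
        return Alert(comment=comment, gap=gap, matched_phrases=matched)
    return None


flagged = check_comment("This is not what I expected.", 0.1, 0.0)
quiet = check_comment("Great update", 0.2, 0.3)
```

Substring matching is deliberately crude; a production monitor would likely need lemmatized or fuzzy matching to catch paraphrases.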

Best practices include maintaining a calibration dataset around each major model transition for ongoing lexicon re-weighting, using M-WAF metrics as KPIs in A/B testing, and supporting multi-agent orchestration for framework maintenance.

7. Dynamic Incentive Alignment and Adaptation

M-WAF’s explicit payoff-reasoning enables dynamic adaptation to evolving service conditions, such as pricing changes. Inference-time “payoff matrix” remodeling allows, for example, penalizing user utility for answer length when shifting to per-token pricing, or penalizing LLM utility under a subscription regime. This mechanism requires no parameter updates—only structured input modification—enabling transparent and incentive-aligned control over system behavior (Zhu et al., 10 Oct 2025).
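A rough sketch of payoff remodeling under two pricing regimes follows. Everything numeric here is an assumption: the expected token counts per LLM strategy, the per-token cost, and the linear form of the penalty are all invented for illustration, and the source does not specify how the adjusted matrix is serialized into the model input.

```python
from typing import Dict, Tuple

Action = Tuple[str, str]      # (user strategy, LLM strategy)
Payoff = Tuple[float, float]  # (U_u, U_l)

# Hypothetical expected answer lengths (in tokens) for each LLM strategy.
EXPECTED_TOKENS: Dict[str, int] = {"DA": 400, "CQ": 40, "AQ": 450}


def remodel_payoffs(
    payoffs: Dict[Action, Payoff],
    regime: str,
    cost_per_token: float = 0.0005,  # hypothetical utility cost per token
) -> Dict[Action, Payoff]:
    """Return a new payoff matrix with a length penalty applied to one player."""
    adjusted: Dict[Action, Payoff] = {}
    for (s_u, s_l), (u_user, u_model) in payoffs.items():
        length_cost = cost_per_token * EXPECTED_TOKENS[s_l]
        if regime == "per_token":
            # Under per-token pricing the user pays for longer answers.
            u_user -= length_cost
        elif regime == "subscription":
            # Under a subscription the provider absorbs the generation cost.
            u_model -= length_cost
        else:
            raise ValueError(f"unknown pricing regime: {regime}")
        adjusted[(s_u, s_l)] = (u_user, u_model)
    return adjusted


base = {("DQ", "DA"): (0.8, 0.7), ("DQ", "CQ"): (0.5, 0.4)}
per_token = remodel_payoffs(base, "per_token")
subscription = remodel_payoffs(base, "subscription")
```

Because the function returns a new matrix and leaves the base payoffs untouched, the same trained policy could be steered by swapping which matrix is supplied at inference time.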


M-WAF establishes a robust paradigm for quantifying and aligning the complex, bidirectional space of human and AI wants, supporting both empirical analysis and principled system design. Its formalization and demonstrated empirical impact make it a foundational approach for building trustworthy, relationally-attuned, and mutually beneficial human-AI systems (Shang et al., 27 Oct 2025, Zhu et al., 10 Oct 2025).
