Deceptive Intention Score Metrics

Updated 2 July 2026

Deceptive Intention Score is a quantitative metric that defines and measures latent deception intent using approaches like softmax probabilities and internal reasoning trace analyses.
DIS methodologies include impact-weighted atomic manipulation, counterfactual localization, and game-theoretic risk, offering precise, continuous scores for diverse deception scenarios.
Practical applications of DIS span LLM audits, UI risk assessments, and deceptive behavior detection, emphasizing calibration, cross-domain robustness, and human-aligned interpretability.

A Deceptive Intention Score (DIS) is a quantitative metric that seeks to directly assess the degree to which an agent—whether human, artificial, or environmental—harbors or exhibits the intent to deceive within a given context or output. Unlike behavioral deception scores, which merely label an output as honest or deceptive, DIS aims to reveal latent deception intent or commitment, offering a scalar, often model-intrinsic, assessment grounded in model behaviors, internal states, or context-specific annotations.

1. Definitions and Representative Formulations

Across the literature, Deceptive Intention Score admits several precise operationalizations, reflecting the heterogeneity of deception scenarios, observables, and threat models.

Behavioral intention probability: In NLP frameworks such as SEPSIS, the DIS is a softmax probability over explicit deception-intent classes, computed as $S_{\mathrm{intent}}(x) = p_{c^*}(x)$ for class $c^*$ (e.g., "Gaining Advantage"), using a Transformer and a single-layer head over contextualized text embeddings (Rani et al., 2023).
Internal chain-of-thought ratio: In multi-turn LLM agent evaluations like OpenDeception, DIS is defined as the Deception Intention Rate (DIR), the proportion of successful dialogues whose reasoning trace contains an explicit plan or admission of intent to deceive:

$\mathrm{DIR} = \frac{N_{\rm intent}}{N_{\rm success}}$

where $N_{\rm intent}$ is the count of internal traces flagged for deceptive intent, and $N_{\rm success}$ is the number of successful dialogues overall (Wu et al., 18 Apr 2025).
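As a minimal sketch of the ratio above, the following computes DIR from a list of annotated dialogues. The `Dialogue` record and its field names are hypothetical and chosen only for illustration; they are not taken from OpenDeception, and returning `None` for an empty denominator is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Dialogue:
    success: bool         # dialogue counted as successful by the benchmark's own criterion
    intent_flagged: bool  # reasoning trace flagged for explicit deceptive intent


def deception_intention_rate(dialogues):
    """DIR = N_intent / N_success, restricted to successful dialogues."""
    successful = [d for d in dialogues if d.success]
    if not successful:
        return None  # the ratio is undefined without successful dialogues
    n_intent = sum(d.intent_flagged for d in successful)
    return n_intent / len(successful)


# Hypothetical annotations: four successful dialogues, three of them flagged.
dialogues = [
    Dialogue(success=True, intent_flagged=True),
    Dialogue(success=True, intent_flagged=False),
    Dialogue(success=True, intent_flagged=True),
    Dialogue(success=False, intent_flagged=True),  # excluded from both counts
    Dialogue(success=True, intent_flagged=True),
]
print(deception_intention_rate(dialogues))  # 0.75
```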

Impact-weighted atomic manipulation: DECOR’s global deception index audits every atomic informational unit in a response along four Information Manipulation Theory (IMT) dimensions (Quantity, Quality, Relation, Manner), assigning fine-grained manipulation scores, then aggregates using impact weights:

$M(R) = \frac{\sum_{i=1}^{n} w_i \, m(R,i)}{\sum_{i=1}^{n} w_i}$

with $m(R,i)$ the mean manipulation score per unit $u_i$ and $w_i$ its impact (Cai et al., 19 May 2026).
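The aggregation above can be sketched in a few lines. This assumes that $m(R,i)$ is the plain mean of a unit's four IMT dimension scores and that scores lie in $[0,1]$; the dictionary layout and function names are illustrative and not DECOR's actual interface.

```python
IMT_DIMENSIONS = ("quantity", "quality", "relation", "manner")


def unit_manipulation(scores):
    """m(R, i): mean manipulation score of one atomic unit over the IMT dimensions."""
    return sum(scores[d] for d in IMT_DIMENSIONS) / len(IMT_DIMENSIONS)


def global_deception_index(units):
    """M(R) = sum_i w_i * m(R, i) / sum_i w_i for (scores, weight) pairs."""
    total_weight = sum(weight for _, weight in units)
    if total_weight == 0:
        raise ValueError("impact weights must not sum to zero")
    weighted = sum(weight * unit_manipulation(scores) for scores, weight in units)
    return weighted / total_weight


# Hypothetical audit: one clean low-impact unit, one manipulated high-impact unit.
units = [
    ({"quantity": 0.0, "quality": 0.0, "relation": 0.0, "manner": 0.0}, 1.0),
    ({"quantity": 1.0, "quality": 1.0, "relation": 0.0, "manner": 0.0}, 3.0),
]
print(global_deception_index(units))  # 0.375
```

Because the weights are normalized, a heavily manipulated unit with high impact raises the index more than the same manipulation in a peripheral unit.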

Task-asymmetric behavioral asymmetry: DIS can be formalized as a log-probability ratio across paired symmetrical tasks (e.g., reachability in graphs), controlling for surface output biases:

$\rho(n; \mathcal{M}) = \log\sqrt{ \frac{\Pr(\mathrm{Yes}|Q_L, \mathcal{M})}{\Pr(\mathrm{No}|Q_B, \mathcal{M})} \cdot \frac{\Pr(\mathrm{No}|Q_{L'}, \mathcal{M})}{\Pr(\mathrm{Yes}|Q_{B'}, \mathcal{M})} }$

(Wu et al., 8 Aug 2025).
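To make the arithmetic of the ratio concrete, here is a small sketch that evaluates $\rho$ from four answer probabilities. The argument names are placeholders for the four terms in the formula, and the probabilities in the example are invented; how they are obtained from a model is outside this sketch.

```python
import math


def asymmetry_score(p_yes_L, p_no_B, p_no_L2, p_yes_B2):
    """rho = log sqrt((P(Yes|Q_L) / P(No|Q_B)) * (P(No|Q_L') / P(Yes|Q_B')))."""
    probs = (p_yes_L, p_no_B, p_no_L2, p_yes_B2)
    if any(p <= 0.0 for p in probs):
        raise ValueError("all probabilities must be strictly positive")
    ratio = (p_yes_L / p_no_B) * (p_no_L2 / p_yes_B2)
    return 0.5 * math.log(ratio)  # log(sqrt(x)) == 0.5 * log(x)


# A perfectly symmetric model scores zero.
balanced = asymmetry_score(0.5, 0.5, 0.5, 0.5)
# Invented probabilities where each ratio equals 2, so rho = log(2).
skewed = asymmetry_score(0.8, 0.4, 0.6, 0.3)
print(balanced, skewed)
```

The score is zero when the paired probabilities match and grows in magnitude as the model's answers diverge across the paired tasks, which is why it is unbounded.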

State-trace commitment differential: Counterfactual localization scores the spike in future deception likelihood along a reasoning trace:

$\max_{k} \left( \hat{p}_{k} - \hat{p}_{k-1} \right)$

where $\hat{p}_{k}$ is the empirical deception outcome rate when freezing a prefix of $k$ sentences (Merrill et al., 16 May 2026).
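The commitment-differential idea can be sketched as follows: estimate a deception rate for each frozen prefix length from resampled continuations, then report the largest increase between consecutive prefixes and where it occurs. The outcome lists are invented, the resampling step (which requires a model) is not shown, and the function names are hypothetical rather than the authors' code.

```python
def empirical_rate(outcomes):
    """Fraction of resampled continuations that ended in deception."""
    return sum(outcomes) / len(outcomes)


def commitment_spike(prefix_rates):
    """Largest increase in deception rate between consecutive prefix lengths.

    prefix_rates[k] is the rate when the first k sentences are frozen.
    Returns (increase, k), where k is the prefix length at which it occurs.
    """
    if len(prefix_rates) < 2:
        raise ValueError("need rates for at least two prefix lengths")
    jumps = [later - earlier for earlier, later in zip(prefix_rates, prefix_rates[1:])]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return jumps[best], best + 1


# Invented outcomes: four resampled continuations per prefix length 0..4.
samples = [
    [False, False, False, False],
    [True, False, False, False],
    [False, True, False, False],
    [True, True, True, True],
    [True, True, True, True],
]
rates = [empirical_rate(s) for s in samples]
print(commitment_spike(rates))  # (0.75, 3)
```

In this toy trace the third sentence is the commitment point: freezing it moves the deception rate from 0.25 to 1.0.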

Residual-rank conflict metric: Rift operationalizes DIS as the mean residual rank of hidden-state matrices along the forward pass, with the residual rank evaluated at a fixed cutoff, capturing knowledge conflict as internal incompressibility (Nyoma, 15 Jun 2026).

Game-theoretic pattern risk: For deceptive patterns in interface design, a DIS is built from adversary advantage, detector effectiveness, and impact, combined in a weighted expression and normalized to $[0, 10]$ (Shi et al., 2024).

These formulations share the feature that DIS is a continuous, context-sensitive scalar, either in $[0,1]$, in $[0,10]$, or an unbounded log-ratio, grounded in explicit, auditable criteria rather than black-box labels.

2. Theoretical and Psychological Grounding

DIS frameworks are typically grounded in one of three traditions: formal behavioral economics, cognitive psychology, or information manipulation theory.

IMT and conversational maxims: DECOR relies on McCornack’s Information Manipulation Theory (1992), decomposing deceptive intent into maxim violations—quantity (omission), quality (distortion), relation (diversion), and manner (obfuscation). Impact weighting maps semantic centrality to deception risk (Cai et al., 19 May 2026).
Decision and game theory: In adversarial UI risk assessment, a three-agent security game models the interactions between adversary, watchdog detector, and human challenger, explicitly tying intent to probability of misleading the user, adjusted for explicit detection rates and consequence severity (Shi et al., 2024).
Hide-vs-reveal balance in AI models: Behavioral asymmetry scores (e.g., Wu et al., 8 Aug 2025) exploit task symmetry, drawing on transitive-inference and psychometric evidence that intention to deceive is statistically revealed by biased performance on isomorphic yes/no tasks.
Latent state and commitment: Counterfactual localization identifies "commitment points" where a model’s probability distribution over future outputs sharply shifts toward deception, showing that intention can be localized and is empirically human-predictable (Merrill et al., 16 May 2026).
Cognitive conflict: Rift formalizes deceptive intent as a measurable internal conflict signal, showing this marker is detectable independent of actual output, robust to confounds, and is causally tied to the presence of true knowledge suppressed by intentional falsehood (Nyoma, 15 Jun 2026).

3. Methodologies and Calculation Procedures

DIS computation methodologies span explicit annotation, model-intrinsic analysis, probe-based approaches, and risk models.

Architecture-based softmax: In SEPSIS, a T5-based encoder feeds a linear head plus softmax; per-class intent probabilities are thresholded for flagging and interpreted as continuous scores per intent (Rani et al., 2023).
Internal reasoning trace binary flagging: OpenDeception annotates each internal "Thought:" produced by an LLM agent for explicit deceptive goal-setting; the intention rate is the fraction of tasks where such intent is found in successful interactions (Wu et al., 18 Apr 2025).
IMT dimension scoring and weighted aggregation: DECOR breaks responses into atomic facts, scores each along all four IMT axes using expert or LLM auditors, and aggregates by impact weight, yielding a global DIS $M(R)$ (Cai et al., 19 May 2026).
Contrastive instruction-linear probes: Targeted probes are trained on model activations under matched honest vs. deceptive instructions, producing a classifier whose sigmoid output represents DIS for behaviors such as exaggeration, bluff, or concealment (Natarajan et al., 1 Feb 2026).
Counterfactual prefix sampling: Given a generated trace, continuations are resampled from every possible prefix, computing empirical deception rates at each. The largest increase in deception probability along the trace is the localized DIS (Merrill et al., 16 May 2026).
Residual-rank SVD analysis: Rift computes the rank deficit in hidden-state matrices at a fixed layer cutoff for each output; higher values are tightly associated with the knowledge-conflicted deceptive condition (Nyoma, 15 Jun 2026).
Game-theoretic risk decomposition: The UI pattern DIS is a multi-factor, weight-sum expression taking adversary advantage, detection, and impact inputs from sub-factor ratings, and combining them via a calibrated affine transformation (Shi et al., 2024).
Stance-belief mismatch quantification: In belief-based deception corpora, the discordance between a writer’s self-reported belief and their argumentative stance quantitatively drives DIS, calibrated linearly or with a soft logistic regression (Velutharambath et al., 2024).
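Two of the procedures above reduce to a simple map from model outputs to a scalar: the softmax probability of an intent class and the sigmoid output of a linear probe. The sketch below shows both using only the standard library. The logits, activation vector, and probe weights are invented, and the function names are illustrative rather than taken from SEPSIS or the probe paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intent_score(logits, class_index):
    """Softmax probability of one intent class, read as a continuous score."""
    return softmax(logits)[class_index]


def probe_score(activation, weights, bias):
    """Sigmoid of a linear probe w . h + b applied to one activation vector."""
    z = sum(h * w for h, w in zip(activation, weights)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


# Invented values: uniform logits give 1/4 per class; a zero pre-activation gives 0.5.
print(intent_score([0.0, 0.0, 0.0, 0.0], class_index=2))  # 0.25
print(probe_score([0.0, 0.0], weights=[1.0, -1.0], bias=0.0))  # 0.5
```

Either score can then be thresholded for flagging or reported directly as a continuous value.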

4. Use Cases, Model Classes, and Practical Applications

DIS metrics are deployed across a wide diversity of domains and agent types:

Domain/Scenario	DIS Formulation and Use	Reference
LLM chain-of-thought deception	DIR (reasoning-trace flag fraction), commitment spike	(Wu et al., 18 Apr 2025, Merrill et al., 16 May 2026)
News/propaganda omission detection	Softmax probability for intent class	(Rani et al., 2023)
Strategic dialogue/negotiation	Counterfactual intention commitment, internal plan audits	(Merrill et al., 16 May 2026, Nyoma, 15 Jun 2026)
UI/UX deceptive-pattern risk	Weighted adversary-detection-impact composite	(Shi et al., 2024)
Fact-symmetry LLM audit	Log-probability ratio/geometric mean over task pairs	(Wu et al., 8 Aug 2025)
Atomic IMT manipulation profiling	Four-dimensional manipulation profile and impact weighting	(Cai et al., 19 May 2026)
Activation-based deception type probes	Linear classifier sigmoid/adjusted score	(Natarajan et al., 1 Feb 2026)
Belief–stance mismatch argumentation	Linear/logistic scaling in stance–belief gap	(Velutharambath et al., 2024)

DIS underpins model-alignment assessments, red-teaming and safety audits, content moderation queues, user-testing of interactive UIs, curriculum learning for agent honesty, and regulatory reporting in both AI agent and human-facing system design.

5. Validation, Calibration, and Limitations

DIS methods are subject to explicit, empirically validated calibration and to critical scrutiny.

Threshold selection: Most frameworks select operating points to maximize F₁ or AUROC on a validation set (Rani et al., 2023, Shi et al., 2024).
Cross-domain and out-of-distribution robustness: Probes and conflict-signature metrics exhibit strong zero-shot transfer across domains, model families, and even languages, confirming that intent signals capture a structural property not tied to task or format (Nyoma, 15 Jun 2026).
Mechanically labeled environments: Automated, environment-intrinsic scoring of deception (e.g., maze guidance, bluffing card games) enables large-scale, label-noise–free validation (Merrill et al., 16 May 2026).
Human alignment: MTurk studies show high human agreement with algorithmic commitment points, confirming the semantic validity of counterfactual intention spikes (Merrill et al., 16 May 2026).
Ablation and probe clustering: Type-specific activation probes explain the majority (70.6%) of AUC variance in deception detection; model, domain, and layer contribute little (Natarajan et al., 1 Feb 2026).
Known limitations: DIS can be inflated by non-intentional systematic errors (e.g., hallucinations mimicking bias), requires scenario-specific construction, and does not always generalize to new attack vectors without dedicated probe training (Wu et al., 8 Aug 2025, Natarajan et al., 1 Feb 2026).
Interpretability caveats: In higher-capacity models or settings with significant uncertainty, residual-rank or commitment scores may also reflect epistemic doubt rather than pure intent (Nyoma, 15 Jun 2026).
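The threshold-selection step in the list above can be illustrated with a small F₁ sweep over a validation set. This is a generic sketch and not the procedure of any cited paper: the scores and labels are invented, candidate thresholds are simply the observed scores, and ties resolve to the lowest threshold.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every score >= threshold is flagged as deceptive (label 1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Observed score that maximizes F1; the lowest such score on ties."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Invented validation set of DIS values with binary deception labels.
val_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
val_labels = [0, 0, 1, 1, 1]
threshold = best_threshold(val_scores, val_labels)
print(threshold)  # 0.35
```

Here the threshold 0.35 recovers all three deceptive examples at the cost of one false positive, which gives the highest F₁ among the candidates.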

6. Extensions, Comparative Analyses, and Ongoing Debates

Current frontiers in DIS research address multiple axes:

Type granularity: Taxonomies now support 16–20 deception modes (concealment, exaggeration, bluff, etc.), enabling type-matched probe design for domain-specific threat models (Natarajan et al., 1 Feb 2026).
Control for confounds: Properly designed paired controls (sleeper versus naive liar, honest role-play versus self-invented lie) are essential for isolating intent from error or stylistic artifact (Nyoma, 15 Jun 2026).
Circuit and steering manipulation: Identification of compact, causally-effective attention-head sets allows not only scoring but targeted suppression of deceptive intent, opening possibilities for future model alignment architectures (Merrill et al., 16 May 2026).
Behavioral and policy integration: In risk assessment regimes (e.g., dark-pattern scoring), DIS acts as a policy-relevant metric for rating platforms, motivating further large-scale user studies and detector cost modeling (Shi et al., 2024).
Composite and continuous scoring: Belief-statement mismatch, self-rated persuasiveness, and linguistic cue compositing offer richer, continuously valued DIS for nuanced research and detection (Velutharambath et al., 2024).

Active debates concern the distinction between intent and behavior, the boundary between deception and epistemic uncertainty, the generalizability and robustness of probe-based detection, and the ethical ramifications of automated intent scoring at scale.

7. Summary Table: DIS Formulations from Representative Papers

Approach/paper	Mathematical DIS	Range	Domain/Modality
SEPSIS (Rani et al., 2023)	$S_{\mathrm{intent}}(x) = p_{c^*}(x)$	[0,1]	News omission intent
OpenDeception (Wu et al., 18 Apr 2025)	$\mathrm{DIR} = \frac{N_{\rm intent}}{N_{\rm success}}$	[0,1]	LLM simulation
DECOR (Cai et al., 19 May 2026)	$M(R)$ (impact-weighted mean over IMT dims)	[0,1]	LLM info audit
Beyond Prompt-Lies (Wu et al., 8 Aug 2025)	$\rho(n; \mathcal{M})$ (log-probability ratio)	unbounded	Paired QA
Counterfactual Loc. (Merrill et al., 16 May 2026)	Largest increase in empirical deception rate across frozen prefixes	[0,1]	Reasoning traces
Rift (Nyoma, 15 Jun 2026)	Mean residual rank across layers	[0,1]	LLM hidden-states
UI Game (Shi et al., 2024)	Weighted composite of adversary advantage, detector effectiveness, and impact	[0,10]	UX deceptive patterns
Targeted Probes (Natarajan et al., 1 Feb 2026)	Sigmoid output of a linear activation probe	[0,1]	LLM activation
DeFaBel (Velutharambath et al., 2024)	Linear or logistic function of the stance–belief gap	[0,1]	Belief mismatch

DIS is thus a unifying framework for dissecting, auditing, and comparing intention to deceive across models, human annotation, interactive systems, and language outputs, with carefully constructed operationalizations informed by psychological, cognitive, and game-theoretic principles.