Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

Published 31 Mar 2026 in cs.AI and cs.HC | (2603.29953v1)

Abstract: How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.

Abstract PDF Upgrade to Chat

Authors (1)

Peng Gang

Summary

The paper demonstrates that explicit structured encoding, such as PPS/5W3H, significantly improves goal alignment across diverse models and languages.
Cross-model and cross-language variances are reduced dramatically, with up to a 24-fold decrease in score deviation and enhanced compensation for weaker models.
User studies reveal that AI-assisted structured prompting cuts interaction rounds by 60% and boosts satisfaction, although excessive complexity may hinder performance.

Structured Intent as a Protocol-Like Communication Layer: Empirical Analysis of Cross-Model Robustness and Compensatory Effects

Introduction and Motivation

The paper advances the study of structured intent representation, specifically Prompt Protocol Specification (PPS) grounded in the 5W3H framework, as a paradigm for mitigating intent transmission loss in Human-AI communication. The persistent variability of LLM outputs—across model architectures, natural languages, and usage contexts—stems from ambiguous user intent encoding. While the prompt engineering literature abounds with execution-layer heuristics (e.g., chain-of-thought, role assignment), this work targets the intent layer, positing that systematic dimensional decomposition can function as a protocol-like abstraction, standardizing goal encoding irrespective of downstream model or language.

This study addresses unresolved questions regarding the generalizability and specificity of structured intent methods: does the intent alignment benefit arise from explicit structuring per se, or from some idiosyncrasy of the 5W3H decomposition? Additionally, how does model capability mediate the impact of explicit structuring?

Experimental Design and Comparative Frameworks

A robust experimental matrix—3,240 model outputs spanning three state-of-the-art LLMs (Claude, GPT-4o, Gemini 2.5 Pro), three typologically distinct languages (Chinese, English, Japanese), six prompting conditions, and three diverse domains (travel, business, technical)—enables fine-grained dissection of cross-model, cross-language, and cross-framework effects. The selected structured frameworks for comparison are 5W3H (PPS), CO-STAR, and RISEN, covering a spectrum of decomposition granularities (eight, six, and five dimensions, respectively).

The evaluation regime is GA (goal alignment) scoring, assigned by DeepSeek-V3—a model independent of all generating models and prompt expansion methods, precluding shared architectural or training data artifacts.

A substantial user study (N=50), performed in situ using participants' preferred tasks and AI platforms, assesses real-world interaction efficiency and user satisfaction gains from AI-assisted prompt expansion and manual refinement.

Main Findings

Structured Intent Encoding: Consistent Gains and Mechanistic Insights

All structured frameworks (5W3H, CO-STAR, RISEN) statistically saturate intent alignment at current metric resolution. The mean goal alignment scores (4.930–4.983/5) are consistently superior to unstructured (simple prompt) and structure-only (raw JSON) controls, confirming that explicit dimension encoding is the critical factor. Notably, the marginal advantage of CO-STAR and RISEN over 5W3H reflects lower input complexity rather than informational superiority, as 5W3H encodes a strictly larger (superset) dimensional space.
Cross-language and cross-model variance is dramatically suppressed. Structured conditions produce up to 24× reduction in cross-language score standard deviation versus unstructured baselines (e.g., average sigma drops from 0.470 to 0.019), and models with divergent baseline alignment converge to near-identical performance under explicit structuring. This empirically demonstrates that structured intent encoding acts as a language-agnostic interface.

Weak-Model Compensation Effect

The structured intent benefit is disproportionately large for weaker models. Gemini (the weakest model in baseline conditions) shows a +1.006 point improvement from structured prompting, compared to +0.217 for Claude, representing a 4.6-fold greater benefit. This effect is robust across domains and languages, and cannot be solely attributed to scale ceiling artifacts. The finding substantiates the information-theoretic hypothesis that explicit protocol-like encoding externalizes otherwise model-dependent inference.

User Study: Practical Utility

AI-assisted expansion of 5W3H prompts reduces real interaction cost. Rounds to user satisfaction decrease by 60% (4.05 → 1.62), and satisfaction increases by +0.88 on a 5-point scale. 82% of users require at most two dimensions of adjustment, indicating high initial quality of AI-generated structuring and validating the accessibility of such pipelines even for non-experts. Qualitative evidence highlights improved task specification and lower intent ambiguity for complex instructions.

Boundary Conditions: Encoding Overhead

Excessive structural complexity induces negative returns in some model-language-task regimes. For GPT-4o in Japanese, Condition D (AI-expanded 5W3H) underperforms the simple baseline on complex technical/business tasks. This encoding overhead effect implies a non-monotonic relationship between structure and benefit, supporting a formalization of optimal intent encoding as a function of task complexity and model executional capacity.

Theoretical Implications

The experimental convergence of multiple structured frameworks, pronounced cross-linguistic robustness, and the quantification of the weak-model compensation effect support an abstracted view of intent encoding as a communication protocol. While the present PPS implementation does not instantiate full protocol features (negotiation, error correction), it provides versioned, fingerprinted, and elastic intent serialization, empirically validated across a heterogeneous receiver set. This reconceptualization aligns with an information-theoretic reading—explicit structuring reduces the conditional entropy of intent given the prompt, making rarefied inferential gaps less consequential for weaker models.

The findings expose the limitations of surface-level template engineering and suggest that future research should focus on formalizing the interaction between intent structure granularity, receiver/model capacity, and the topology of the task space. Furthermore, there is a strong case for external gold-intent benchmarks and continuous, multi-aspect evaluation metrics to probe for nuanced framework-specific differences beyond current integer ratings.

Limitations

Several limitations should temper the interpreted scope of these results:

The goal alignment scale (1–5) saturates under structured conditions, masking potential fine-grained framework differences.
The absence of an independent gold-intent target means improvements may partially reflect increased specification rather than strictly fidelity gain.
Judge model bias remains a factor, and results would be fortified by multi-judge or blinded human annotation.
The order effect in the user study (A → D_raw → D_mod) and its skew toward technically literate participants moderate generalizability.
Ecological mismatch between models tested experimentally (frontier international LLMs) and those used in the user study (primarily Chinese domestic models) means cross-ecosystem invariance is suggestive but not conclusively established.

Conclusion

The evidence establishes that dimensional decomposition of user intent—embodied in PPS/5W3H, CO-STAR, and RISEN—consistently enhances instruction alignment, model- and language-agnostic robustness, and interaction efficiency. The magnitude of the benefit is modulated by model inferential strength, enabling protocol-like compensation for weaker receivers. These findings validate the intentional design of protocol-like intent representations as a foundation for scalable, reliable human-AI communication.

There is substantial opportunity for further theoretical development, including formalization of information-theoretic bounds on intent alignment, adaptive encoding strategies, and richer, user-centered benchmarks. The results motivate a reframing of prompt engineering: not as ad hoc string manipulation, but as principled protocol design.

Reference:

"Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect" (2603.29953).

Markdown Report Issue