DEMOS' Position in Prompt (DPP) Bias

Updated 1 August 2025
  • DEMOS' Position in Prompt (DPP) Bias is a systematic phenomenon where demo placement in a prompt alters LLM predictions due to sequential processing and induction biases.
  • Empirical studies show that positioning demos at the start can yield up to a 6-point accuracy gain and reduce prediction flips compared to demos placed at the end.
  • Mitigation strategies such as position-agnostic fine-tuning and test-time calibration are proposed to enhance robustness and reproducibility in LLM applications.

DEMOS' POSITION IN PROMPT (DPP) Bias refers to the systematic and quantifiable effect that the location of demonstration examples (or "demos") within a prompt exerts on the predictions and accuracy of in-context learning (ICL) in LLMs. DPP bias is a manifestation of positional sensitivity in autoregressive and instruction-following models, such that shifting otherwise-identical demos to different structural “slots” within the prompt (e.g., start vs. end) can lead to nontrivial changes in task performance or complete reversals in the model’s outputs—even when all content is held constant (Cobbina et al., 30 Jul 2025). This bias is intrinsic to the sequential processing and inductive biases of contemporary LLM architectures and has meaningful implications for both robustness and reproducibility in ICL applications.

1. Definition and Architectural Origins

DPP bias specifically denotes the phenomenon where the precise location of ICL demos in an LLM prompt (e.g., system-level, user-segment, or after the user message) induces significant, sometimes drastic, shifts in prediction accuracy and output volatility (Cobbina et al., 30 Jul 2025). In the context of prompt-based learning, such positional bias is distinguished from biases due to demo content, ordering, or selection: for any fixed set of demos, spatial rearrangement alone is sufficient to trigger DPP effects.

This form of bias is rooted in autoregressive token processing and the architectural features of LLMs:

  • Primacy bias: Early prompt tokens (including demos) are preferentially encoded and exert greater influence owing to left-to-right context accumulation and the functioning of induction heads.
  • Induction head effects: Specialized attention patterns that correlate and propagate the structure of early demos to later candidate examples.
  • Sequential context assignment: Autoregressive decoders cannot generally re-weight distant context uniformly, so spatial proximity dictates how much information from demos is integrated.

DPP bias relates to, but is distinct from, other forms of position bias, such as the bias in extractive QA toward predicting answer span start/end at early positions (Ko et al., 2020), or prompt-only bias, where fixed prompt templates favor specific labels (Xu et al., 15 Mar 2024). In DPP, the focus is specifically on demonstration positioning within ICL prompts.

2. Empirical Characterization and Quantitative Metrics

A systematic evaluation pipeline isolates DPP bias by holding demo content and task instructions fixed while varying only the block position of demos within the prompt (Cobbina et al., 30 Jul 2025). Four canonical positions are evaluated:

  • Start of system prompt (ssp)
  • End of system prompt (esp)
  • Start of user message (sum), the default in most ICL work
  • End of user message (eum)
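
To make the four slots concrete, the sketch below shows one way the same demo block could be placed into a chat-style prompt at each position. The message schema ({"role": ..., "content": ...}) and helper names are illustrative assumptions, not an implementation from the paper.

```python
# Minimal sketch of assembling the four demo positions in a chat-style prompt.

def format_demos(demos):
    """Render (input, label) demo pairs as a single text block."""
    return "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)

def build_prompt(position, system_text, user_text, demos):
    """Return chat messages with the demo block placed at one of four slots."""
    demo_block = format_demos(demos)
    if position == "ssp":     # start of system prompt
        system, user = demo_block + "\n\n" + system_text, user_text
    elif position == "esp":   # end of system prompt
        system, user = system_text + "\n\n" + demo_block, user_text
    elif position == "sum":   # start of user message (common ICL default)
        system, user = system_text, demo_block + "\n\n" + user_text
    elif position == "eum":   # end of user message
        system, user = system_text, user_text + "\n\n" + demo_block
    else:
        raise ValueError(f"unknown position: {position}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Example: identical content, different structural slot.
demos = [("The match ended 3-1.", "Sports")]
messages = build_prompt("ssp", "Classify the news topic.", "Stocks rallied today.", demos)
```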

Two primary metrics operationalize DPP bias:

  • ACCURACY-CHANGE ($A_\text{metric}$): For a prompt position $p$, $A_\text{metric} = \text{Metric}_{\text{position}=p} - \text{Metric}_{\text{zero-shot}}$. This net change quantifies the accuracy gain or drop induced by moving demos to position $p$.
  • PREDICTION-CHANGE ($A_\text{pred}$): The proportion of examples whose predicted answer flips when the demo position changes, typically reported as a percentage relative to a default setup.
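
Under these definitions, both metrics are straightforward to compute from per-example predictions. The following sketch assumes lists of predictions and gold labels and uses exact-match accuracy purely for illustration.

```python
# Illustrative computation of the two DPP-bias metrics defined above.

def accuracy(preds, golds):
    """Exact-match accuracy over paired predictions and gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_change(preds_at_p, preds_zero_shot, golds):
    """A_metric: accuracy with demos at position p minus zero-shot accuracy."""
    return accuracy(preds_at_p, golds) - accuracy(preds_zero_shot, golds)

def prediction_change(preds_at_p, preds_default):
    """A_pred: percentage of examples whose predicted answer flips
    relative to the default demo position (e.g., sum)."""
    flips = sum(a != b for a, b in zip(preds_at_p, preds_default))
    return 100.0 * flips / len(preds_default)
```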

Empirical results consistently show:

  • Placing demos at the start of prompts (ssp/esp/sum) yields the most stable and highest accuracy outputs, sometimes with gains up to +6 accuracy points (e.g., on AG News).
  • Demos at the end of user messages (eum) can flip more than 30% of predictions on QA tasks, frequently degrading rather than improving performance.
  • These effects are robust across task types (classification, QA, summarization, reasoning) and LLM families (Qwen, LLaMA3, Mistral, Cohere).

3. Model Scale, Sensitivity Profiles, and Task Variability

DPP bias is modulated by both model size and task complexity (Cobbina et al., 30 Jul 2025). Key findings include:

  • Smaller models (e.g., Qwen 1.5B, LLaMA3 3B) exhibit extreme prediction volatility with respect to demo location—sometimes doubling the prediction-change rate relative to larger models.
  • Larger models (e.g., LLaMA3 70B) are marginally more robust but still display non-negligible instability for complex tasks (e.g., generative summarization on XSUM/CNN/DM).
  • Task dependence: The magnitude of position sensitivity varies—classification tasks (e.g., AG News) are especially susceptible, while more generative or arithmetic tasks are slightly less so, but still affected.

Summary table:

| Model Size | Max Accuracy Gain | Max Prediction Change | Most Stable Demo Position |
|---|---|---|---|
| Small (3B/7B) | +6 points (AG News) | >30% (QA tasks) | Start (ssp/esp/sum) |
| Large (70B) | Marginal (1–2 pts) | Up to 10–15% | Start preferred; some immunity |

Smaller models stand to benefit most from careful prompt design, though even top-tier LLMs exhibit measurable DPP-induced variance on challenging benchmarks.

4. Implications and Mitigation Strategies

The existence of DPP bias has meaningful consequences for both LLM research and real-world deployment:

  • Prompt engineering and evaluation: Designers must recognize that demo position is not an ancillary detail but a critical control variable. Evaluation pipelines should report performance under multiple demo configurations, or incorporate A_metric and A_pred as standard metrics.
  • Calibration and stability: In safety-critical settings (e.g., medical QA, legal summarization), the discovery that >30% of predictions can flip with demo relocation mandates robust calibration or position-invariant mechanisms.
  • Potential mitigation approaches:
    • Test-time calibration: Retrieval-based selection of optimal demo positions for each instance or class.
    • Position-agnostic fine-tuning: Post-training on datasets with randomly assigned demo slots could encourage position-invariant representations (a data-augmentation sketch follows this list).
    • Automated prompt composition: Integrating demo content and position optimization (potentially via gradient-based or RL approaches).
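
As a concrete illustration of the second mitigation, the following sketch shows one way to build position-randomized training prompts: each training example has its demo block assigned to a randomly chosen slot, so post-training exposes the model to all four placements. The helper names and data layout are assumptions for this sketch, not a prescribed recipe.

```python
import random

# Sketch of position-agnostic fine-tuning data augmentation: each training
# example gets its demo block assigned to a random slot. `build_prompt` is
# assumed to behave like the earlier prompt-assembly sketch.

POSITIONS = ["ssp", "esp", "sum", "eum"]

def randomize_demo_slots(examples, system_text, demos, build_prompt, seed=0):
    """Yield (messages, target) pairs with demos placed in a random slot."""
    rng = random.Random(seed)
    for user_text, target in examples:
        position = rng.choice(POSITIONS)
        yield build_prompt(position, system_text, user_text, demos), target
```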

Mitigation remains an open problem—no universal demo configuration outperforms all others, and both prompt structure and model scale modulate DPP effects.

5. Relationship to Theoretical and Architectural Biases

DPP bias is underpinned by the sequential, autoregressive nature of LLMs (Cobbina et al., 30 Jul 2025). The primacy bias, inductive representations of early context, and the learned conventions of training corpora (where system/user formatting often puts instructive content at the top) all contribute to the heightened stability of early demos.

No evidence was found that placing demos at the end (eum) yields any generalization benefit; on the contrary, this configuration systematically increases prediction volatility without boosting correctness.

Viewed more formally, the DPP effect is a product of the transformer’s context accumulation: induction heads and internal attention layers overweight the earliest tokens, producing a model-wide encoding pattern with positional inertia. This aligns with findings on position bias in extractive QA and other prompt-tuned settings (Ko et al., 2020, Yang et al., 2023, Mao et al., 2023).

6. Future Directions and Research Recommendations

Several directions are proposed for further research:

  • Mechanistic interpretability: Disentangling the contributions of induction heads, positional encodings, and transformer block depth to DPP bias.
  • Automated cross-position validation: Development of test-time routines or pipelines for systematically evaluating positional robustness (see the harness sketch after this list).
  • Post-training or instruction tuning: Random permutation of demo blocks during fine-tuning stages to facilitate inherent model invariance to demo placement.
  • Joint optimization: Searching over content and structure to minimize A_pred and maximize A_metric, potentially leveraging differentiable search or meta-learning methods.
  • Task and corpus diversity: Expanding evaluation to new architectures, languages, and complex multi-turn dialogues to further map the DPP bias landscape.
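
As an example of the cross-position validation idea, the sketch below sweeps all four demo slots and reports the two DPP metrics per slot. Here run_model is a hypothetical callable that maps a demo position (or None for zero-shot) to a list of predictions; all names are illustrative assumptions.

```python
# Sketch of an automated cross-position validation routine: compare every
# demo slot against the zero-shot and default-position runs.

def cross_position_report(run_model, golds,
                          positions=("ssp", "esp", "sum", "eum"),
                          default="sum"):
    """Return per-position accuracy-change and prediction-change estimates."""
    preds = {p: run_model(p) for p in positions}
    zero_shot = run_model(None)

    def acc(xs):
        return sum(x == g for x, g in zip(xs, golds)) / len(golds)

    report = {}
    for p in positions:
        flips = sum(a != b for a, b in zip(preds[p], preds[default]))
        report[p] = {
            "accuracy_change": acc(preds[p]) - acc(zero_shot),
            "prediction_change_pct": 100.0 * flips / len(golds),
        }
    return report
```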

7. Conclusion

DEMOS' POSITION IN PROMPT (DPP) Bias constitutes a demonstrable and quantifiable positional sensitivity in LLMs, with the spatial arrangement of ICL demonstrations exerting a substantial effect on both accuracy and prediction volatility. The bias is especially pronounced in smaller models and classification tasks, but no current open-source model is entirely immune. Awareness of DPP bias is essential for prompt engineering, benchmarking, and stable LLM deployment in both research and production. Comprehensive evaluation pipelines and future mitigation strategies should treat prompt position as a first-order factor in ICL, not a superficial formatting choice (Cobbina et al., 30 Jul 2025).