Explicit–Implicit Bias Divergence in Smaller or Older LLMs

Determine whether the explicit–implicit divergence in bias expression across evaluation tasks manifests similarly in smaller-scale or older large language models.

Background

The paper demonstrates that bias expression in LLMs is task-dependent: models tend to counter-stereotype on explicit probes while reproducing stereotypes on implicit ones. This effect is interpreted as a consequence of safety alignment suppressing overt bias without altering underlying associations.

All evaluated systems are frontier-scale 2026 models spanning commercial and open-weight families. The authors note a gap in understanding whether the same divergence between explicit and implicit tasks appears in models that are smaller or from earlier generations, which limits the generalizability of their findings.

References

Our model set, while spanning commercial and open-weight families, consists entirely of frontier-scale 2026 models. Whether the explicit-implicit divergence manifests similarly in smaller or older models remains an open question.

— Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments (2604.02669 - Kumar et al., 3 Apr 2026) in Discussion, Limitations

Explicit–Implicit Bias Divergence in Smaller or Older LLMs

Background

References

Related Problems