Consistency of questionnaire-based vs. roll-call-based LLM ideology evaluations over time

Determine whether evaluations of large language model ideological tendencies based on questionnaire-style instruments (e.g., political compass tests and voting advice applications) and evaluations based on alignment with parliamentary roll-call voting records continue to yield the same ideological patterns as model architectures and training pipelines evolve.

Background

The paper introduces a cross-national framework (PoliBiasNL/NO/ES) that evaluates political bias in LLMs by aligning model-generated votes on real parliamentary motions with recorded party votes, and by projecting models and parties into a shared CHES ideological space. This roll-call-based approach contrasts with questionnaire-based evaluations that rely on small sets of expert-selected statements, such as political compass tests and voting advice applications.
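The core of the roll-call approach is an agreement measure between model-generated votes and recorded party votes. The following is a minimal sketch of that idea; all names and data here are hypothetical illustrations, not the PoliBiasNL/NO/ES implementation:

```python
# Hypothetical sketch of roll-call alignment. The function and the toy data
# are illustrative only; the paper's actual pipeline (including CHES-space
# projection) is not reproduced here.

def agreement(model_votes, party_votes):
    """Fraction of motions on which the model's vote matches the party's.

    Both arguments map motion IDs to "for"/"against"; only motions present
    in both records are compared.
    """
    shared = set(model_votes) & set(party_votes)
    if not shared:
        return 0.0
    matches = sum(model_votes[m] == party_votes[m] for m in shared)
    return matches / len(shared)

# Toy roll-call data: three motions, two parties (entirely made up).
model = {"m1": "for", "m2": "against", "m3": "for"}
parties = {
    "PartyA": {"m1": "for", "m2": "against", "m3": "against"},
    "PartyB": {"m1": "against", "m2": "for", "m3": "for"},
}

scores = {name: agreement(model, votes) for name, votes in parties.items()}
closest_party = max(scores, key=scores.get)  # party the model agrees with most
```

In the paper's framework, such per-party agreement scores (computed over real parliamentary motions rather than toy data) are what get projected into the shared CHES ideological space alongside the parties themselves.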

While both approaches have found broadly left-leaning tendencies in current models, questionnaire instruments rely on small curated item sets and are potentially brittle under paraphrase, whereas roll-call evaluations reflect comprehensive legislative behavior. The authors explicitly question whether these two evaluation paradigms will continue to agree as LLM architectures and training pipelines change, underscoring the need to track ideological tendencies across successive model generations.

References

As LLM architectures and training pipelines evolve, it remains an open question whether questionnaire-based and roll-call-based analyses will continue to yield the same patterns. This underlines the need for empirically grounded benchmarks and systematic evaluation frameworks that allow the field to track how ideological tendencies emerge, persist, or diverge in subsequent generations of LLMs.

Uncovering Political Bias in Large Language Models using Parliamentary Voting Records (2601.08785, Chen et al., 13 Jan 2026), Discussion