- The paper demonstrates that a Bayesian trust-based model predicts inevitable lock-in when the spectral radius of the trust matrix exceeds one.
- Agent-based GPT simulations show that repeated human-LLM interactions lead to a significant drop in conceptual diversity over time.
- Empirical analysis using a regression kink design indicates that deployments of newly trained GPT versions coincide with abrupt decreases in conceptual diversity.
The paper "The Lock-in Hypothesis: Stagnation by Algorithm" (2506.06166) investigates the potential negative consequences of feedback loops between LLMs and human users. The authors propose that this dynamic interaction, where models learn from human data, influence human opinions through their output, and then reabsorb those influenced beliefs, can lead to a state of "lock-in."
The Lock-in Hypothesis, as formalized in the paper, states that this feedback loop will eventually cause a population to converge on specific beliefs, potentially false ones. Once formed, these beliefs become resistant to change, reinforced by the feedback loop itself and by human trust in the AI.
To formalize this, the authors develop a Bayesian model involving a group of N agents (which can represent humans or AI) estimating an unknown quantity. Each agent maintains a private belief and an aggregate belief informed by their own observations and the beliefs of other agents they trust. The interactions are modeled by a trust matrix W, where wᵢⱼ indicates the degree to which agent i trusts agent j's belief. The paper shows that if the spectral radius of the trust matrix satisfies ρ(W) > 1, collective lock-in to a false belief is inevitable. This condition implies that the feedback loop is self-amplifying due to sufficient mutual trust. They apply this to a specific human-LLM dynamic with one AI agent and N−1 human agents, showing lock-in occurs if (N−1)·λ₁·λ₂ > 1, where λ₁ is the AI's trust in humans (preference learning strength) and λ₂ is humans' trust in the AI. This condition suggests that even moderate mutual trust can lead to lock-in in a sufficiently large group.
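The lock-in condition can be checked numerically. Below is a minimal sketch, assuming a simple star-shaped trust matrix in which the AI (index 0) trusts each of the N−1 humans with weight λ₁ and each human trusts the AI with weight λ₂; for this structure the spectral radius equals √((N−1)·λ₁·λ₂), so the two criteria coincide. The matrix shape and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def lock_in_check(n_humans: int, lam1: float, lam2: float):
    """Compare the spectral-radius criterion rho(W) > 1 with the closed-form
    condition (N-1) * lam1 * lam2 > 1 for a star-shaped trust matrix with
    one AI agent (index 0) and n_humans human agents."""
    N = n_humans + 1
    W = np.zeros((N, N))
    W[0, 1:] = lam1                        # AI's trust in each human (preference-learning strength)
    W[1:, 0] = lam2                        # each human's trust in the AI
    rho = max(abs(np.linalg.eigvals(W)))   # spectral radius of W
    return rho > 1, n_humans * lam1 * lam2 > 1

# Even modest mutual trust satisfies the condition once the group is large enough.
print(lock_in_check(n_humans=99, lam1=0.15, lam2=0.15))  # (True, True)
```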
The paper complements this theoretical model with agent-based LLM simulations. Using GPT-4.1-Nano (for agents and authority), Setup C simulates 100 agents holding natural language beliefs on a given topic (from r/ChangeMyView) and interacting with a centralized LLM authority. The authority aggregates group beliefs and provides a summarized belief, which agents then use to update their own, based on a pre-assigned trust level. The simulations demonstrate that as interactions progress, agents' beliefs converge, leading to a significant drop in conceptual diversity. This "belief shift" can result in convergence on extreme views or, in some cases, hedged stances, depending on the topic and LLM behavior. The simulations support the idea that diversity loss is an observable metric of lock-in.
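The simulation loop can be sketched schematically. The snippet below only illustrates the centralized-authority dynamic described above: the `llm()` helper, the prompts, and the trust-weighted update are hypothetical stand-ins, not the authors' implementation or the actual GPT-4.1-Nano prompting setup.

```python
import random

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the underlying model
    (e.g. GPT-4.1-Nano); replace with a real API client."""
    raise NotImplementedError

def simulate(topic: str, initial_beliefs: list[str], n_rounds: int = 10) -> list[str]:
    """Schematic centralized-authority loop: the authority summarizes the
    group's beliefs, then each agent revises its own belief toward the
    summary in proportion to a pre-assigned trust level."""
    beliefs = list(initial_beliefs)
    trust = [random.uniform(0.2, 0.9) for _ in beliefs]  # per-agent trust in the authority
    for _ in range(n_rounds):
        summary = llm(f"Summarize the group's stance on '{topic}':\n" + "\n".join(beliefs))
        beliefs = [
            llm(f"You believe: '{b}'. An authority you trust at level {t:.2f} says: "
                f"'{summary}'. State your updated belief, weighting the authority "
                f"in proportion to your trust.")
            for b, t in zip(beliefs, trust)
        ]
    return beliefs
```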
Empirical evidence is drawn from the WildChat-1M dataset (Zhao et al., 2 May 2024), which contains logs of human interactions with a ChatGPT mirror site. The authors analyze the conceptual diversity of human messages over time, using a constructed concept hierarchy and a novel "lineage diversity" metric that accounts for hierarchical structure (a toy sketch of such a measure follows the two hypotheses below). They test two main hypotheses:
- Collective Diversity Loss: Conceptual diversity in the corpus of human messages decreases over time.
- Iterative Training Leads to Loss: Diversity decreases discontinuously when new GPT iterations (trained on new human data) are deployed.
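To make the idea of a hierarchy-aware diversity measure concrete, here is a toy sketch, not the authors' definition, assuming the concept hierarchy is given as a child-to-parent map: diversity is the number of distinct hierarchy edges touched in a window, so concepts from nearby branches contribute less than concepts from distant ones.

```python
def lineage(concept: str, parent: dict[str, str]) -> tuple[str, ...]:
    """Root-to-concept path in the hierarchy (parent maps child -> parent)."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return tuple(reversed(path))

def lineage_diversity(concepts: list[str], parent: dict[str, str]) -> int:
    """Toy hierarchy-aware diversity: the number of distinct hierarchy edges
    covered by the concepts in a window. Closely related concepts share
    ancestors and add little; concepts from distant branches add more."""
    edges = set()
    for c in concepts:
        path = lineage(c, parent)
        edges.update(zip(path, path[1:]))
    return len(edges)

# Toy hierarchy: ethics -> {fairness -> algorithmic_bias, privacy}
parent = {"fairness": "ethics", "privacy": "ethics", "algorithmic_bias": "fairness"}
print(lineage_diversity(["algorithmic_bias", "fairness"], parent))  # 2 (same branch)
print(lineage_diversity(["algorithmic_bias", "privacy"], parent))   # 3 (distant branches)
```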
Results show ambiguous support for Hypothesis 1, with a downward trend in diversity for GPT-4 interactions but an upward trend for GPT-3.5-turbo interactions on value-laden concepts. However, Hypothesis 2 receives stronger support. Using a regression kink design (RKD), the authors detect significant discontinuous downward shifts in conceptual diversity following the release dates of new GPT versions (GPT-4-0125-preview, GPT-3.5-turbo-0613, GPT-3.5-turbo-0125). Per-user regression analysis on high-engagement users also tentatively supports Hypothesis 2, suggesting the impact is sustained and not merely a temporary phenomenon at release dates.
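A regression kink design of this kind fits a piecewise-linear trend with a slope change at a known threshold (here, a model release date) and tests whether the post-release slope is significantly more negative. Below is a minimal sketch, assuming a daily diversity series and using statsmodels; it is not the authors' exact specification or set of controls.

```python
import numpy as np
import statsmodels.api as sm

def regression_kink(days: np.ndarray, diversity: np.ndarray, release_day: float):
    """Fit diversity ~ b0 + b1*(t - c) + b2*max(t - c, 0), where c is a model
    release date. A significantly negative b2 indicates a downward kink
    (steeper decline in diversity) after the release."""
    centered = days - release_day
    kink = np.maximum(centered, 0.0)                  # slope-change term, active only post-release
    X = sm.add_constant(np.column_stack([centered, kink]))
    fit = sm.OLS(diversity, X).fit()
    return fit.params[2], fit.pvalues[2]              # kink coefficient and its p-value
```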
The authors acknowledge limitations, including potential confounding factors in the observational WildChat data and the simplified nature of their simulations compared to real-world interactions. They propose future work, including randomized controlled trials (RCTs) with human subjects, more realistic simulations incorporating external evidence and diverse sources, and the development of systematic evaluation and mitigation strategies for lock-in effects.
The paper concludes that the findings provide early-stage evidence supporting the lock-in hypothesis, particularly concerning the impact of iterative model training on collective conceptual diversity. It highlights the importance of further research into the dynamics of human-LLM interaction and the potential need for technical, algorithmic, or policy interventions to mitigate adverse consequences.