An Analysis of Temporal Behavioral Changes in ChatGPT
The paper "How Is ChatGPT's Behavior Changing over Time?" provides a detailed examination of the performance fluctuations observed in different versions of two prominent LLMs, GPT-3.5 and GPT-4, over a brief period from March 2023 to June 2023. The research articulates the need for consistent monitoring of LLM behavior due to significant variations observed across diverse tasks, implicating both practical application challenges and theoretical considerations.
Study Motivation and Scope
The authors motivate the paper with a crucial observation: the opaqueness surrounding the update mechanisms and schedules of LLM services such as GPT-3.5 and GPT-4. This opaqueness complicates their stable integration into workflows where consistent model behavior is paramount. The paper investigates the issue by evaluating the March and June versions of GPT-3.5 and GPT-4 on an array of tasks covering mathematics, code generation, sensitive-question handling, opinion surveys, multi-hop question answering, medical exams, and visual reasoning.
Key Findings
Performance Variability
The paper finds that the performance of both GPT-3.5 and GPT-4 fluctuated notably between March and June 2023 across different tasks:
- Mathematical Reasoning: GPT-4's accuracy at identifying prime versus composite numbers deteriorated drastically, from 84% to 51%. Interestingly, GPT-3.5's performance on the same task improved (a scoring sketch for this task follows the list).
- Sensitive Questions and Opinion Surveys: GPT-4 became considerably less willing to respond to sensitive questions and opinion surveys; its response rate to opinion surveys, for instance, dropped from 97.6% to 22.1%.
- Programming: The models' ability to generate directly executable code deteriorated, with the share of GPT-4 outputs that ran without modification falling from 52% to 10%.
- Medical Exams and Visual Reasoning: Slight improvements were observed on visual reasoning tasks, while GPT-4 showed a minor drop on the USMLE medical exams.
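As referenced above, here is a minimal sketch of how the primality task can be scored. The bracketed [Yes]/[No] answer format follows the paper's prompt; the helper names and scoring logic are illustrative assumptions, and the model responses would come from an API call not shown here.

```python
import re

def is_prime(n: int) -> bool:
    """Ground truth via trial division -- adequate for integers of the
    sizes queried in the paper's math task."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def extract_verdict(response: str):
    """Pull the last bracketed [Yes]/[No] from a response; None means the
    model ignored the requested answer format."""
    matches = re.findall(r"\[(yes|no)\]", response, flags=re.IGNORECASE)
    return matches[-1].lower() == "yes" if matches else None

def accuracy(numbers, responses):
    """Fraction of queries whose bracketed verdict matches ground truth.
    A missing verdict counts as wrong, so format drift lowers the score
    even when the underlying reasoning is sound."""
    hits = sum(extract_verdict(r) == is_prime(n)
               for n, r in zip(numbers, responses))
    return hits / len(numbers)
```

Because a response without the bracketed verdict is scored as wrong, this kind of harness conflates reasoning quality with format compliance, which is exactly the entanglement the paper goes on to discuss.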
The paper distinguished between task-oriented performance and compliance with user instructions, hypothesizing that a common denominator behind many of the behavior drifts was a decline in the LLMs' ability to follow user instructions. This hypothesis was substantiated by results showing a significant decrease in GPT-4's instruction adherence over time.
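The programming regression is a concrete example: the paper reports that the June version of GPT-4 began wrapping its code in triple-backtick markdown fences, which alone renders the raw output non-executable. Below is a simplified sketch of a "directly executable" check in that spirit; the paper's own evaluation used LeetCode problems, so treat this as an illustration rather than the authors' harness.

```python
import subprocess
import sys
import tempfile

def is_directly_executable(generated: str, timeout: float = 10.0) -> bool:
    """Run the raw model output as a Python script and report whether it
    exits cleanly. Markdown code fences trigger a SyntaxError, so fenced
    output fails this check even when the code inside is correct."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Illustrative drift: plain code passes, fenced output fails verbatim.
assert is_directly_executable("print('hi')")
assert not is_directly_executable("```python\nprint('hi')\n```")
```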
Instruction Fidelity
Focusing specifically on instruction adherence, the paper quantitatively analyzed GPT-4's fidelity to both individual and composite instructions:
- Individual Instructions: Instruction adherence by GPT-4 dropped dramatically. For example, compliance with the instruction to respond with a 'yes' or 'no' enclosed in brackets fell from 99.5% in March to nearly zero in June.
- Composite Instructions: Accuracy on tasks involving composite instructions also decreased markedly. For instance, tasks combining the instructions 'add comma' and 'capitalize each letter' saw accuracy drop from March to June.
This reduction in instruction fidelity can cause unexpected behavior in downstream applications that rely on LLM outputs following specific formats or content guidelines, illustrating the critical need for robust monitoring; a minimal format check of this kind is sketched below.
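For illustration, a downstream format guard for the bracketed yes/no instruction discussed above might look like the following. This is a hedged sketch, not code from the paper.

```python
import re

# Exactly '[Yes]' or '[No]' and nothing else, per the instruction.
BRACKET_ANSWER = re.compile(r"^\[(yes|no)\]$", re.IGNORECASE)

def follows_bracket_instruction(response: str) -> bool:
    """True only when the whole stripped response is the demanded format."""
    return bool(BRACKET_ANSWER.match(response.strip()))

# Illustrative drift: a March-style reply passes, a June-style one fails.
assert follows_bracket_instruction("[Yes]")
assert not follows_bracket_instruction("Yes, it is a prime number.")
```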
Practical and Theoretical Implications
The research underscores multiple practical and theoretical considerations:
- Continuous Monitoring: The varying behavior highlights a pressing need for ongoing evaluation of these models. Businesses and developers building on LLMs should incorporate continuous monitoring to ensure consistency and reliability in application outputs (a minimal drift check is sketched after this list).
- Rethinking Update Mechanisms: The paper prompts a reassessment of how LLM updates are managed, suggesting that updates be documented and their behavioral effects monitored incrementally to mitigate unexpected performance shifts.
- Research Directions: This work encourages further research into understanding how specific training or fine-tuning methodologies may inadvertently alter model behavior across tasks. Studies could explore safeguard approaches to maintain consistency without sacrificing the overall performance improvements that updates aim to achieve.
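As one illustration of what such continuous monitoring could look like in practice, the sketch below compares a model snapshot's accuracy on a fixed probe set against a stored baseline and flags drift beyond a tolerance. Every name and threshold here is a hypothetical choice, not an API or procedure from the paper.

```python
def detect_drift(baseline_accuracy: float,
                 current_accuracy: float,
                 tolerance: float = 0.05) -> bool:
    """Flag a regression when accuracy on a fixed probe set falls more
    than `tolerance` below the recorded baseline."""
    return (baseline_accuracy - current_accuracy) > tolerance

# With the paper's prime-task numbers (84% in March, 51% in June),
# the June snapshot would be flagged immediately.
if detect_drift(baseline_accuracy=0.84, current_accuracy=0.51):
    print("Behavior drift detected: re-validate downstream prompts.")
```

Re-running a check like this on a fixed probe set after every suspected service update turns the paper's one-off comparison into an ongoing regression test.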
Future Work
Future work should expand the set of tasks and metrics monitored as part of an exhaustive, long-term evaluation process. Identifying strategies that reconcile performance improvements with behavioral consistency will be critical. The paper's release of its evaluation data and code provides a foundation for further research in this domain, facilitating broader community engagement with these challenges.
Conclusion
In essence, this paper provides a critical analysis of the temporal variations in LLM behavior, illustrating substantial performance drifts and their implications. It brings to light the importance of understanding how LLM updates influence diverse application scenarios and underscores the need for systematic monitoring to maintain the reliability of these AI-driven systems.