Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization (2405.08460v2)

Published 14 May 2024 in cs.CL and cs.AI

Abstract: The rapid advancement of LLMs highlights the urgent need for evolving evaluation methodologies that keep pace with improvements in language comprehension and information processing. However, traditional benchmarks, which are often static, fail to capture the continually changing information landscape, leading to a disparity between the perceived and actual effectiveness of LLMs in ever-changing real-world scenarios. Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts, revealing significant temporal biases in LLMs. We propose an evaluation framework for dynamically generating benchmarks from recent real-world predictions. Experiments demonstrate that LLMs struggle with temporal generalization, showing performance decline over time. These findings highlight the necessity for improved training and updating processes to enhance adaptability and reduce biases. Our code, dataset, and benchmark are available at https://github.com/FreedomIntelligence/FreshBench.
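The abstract does not spell out the scoring pipeline, but the core idea — comparing a model's accuracy on items dated before versus after its training cutoff — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data structure, the exact-match scoring, and all names are assumptions made for clarity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass
class BenchmarkItem:
    question: str     # prompt built from a real-world event or prediction
    answer: str       # ground-truth resolution of that event
    event_date: date  # when the underlying event occurred or resolved

def accuracy_by_period(
    items: list[BenchmarkItem],
    model: Callable[[str], str],  # hypothetical wrapper: prompt -> model answer
    cutoff: date,                 # the model's training-data cutoff
) -> dict[str, float]:
    """Score items dated before vs. after the training cutoff separately."""
    buckets: dict[str, list[bool]] = {"pre_cutoff": [], "post_cutoff": []}
    for item in items:
        key = "pre_cutoff" if item.event_date <= cutoff else "post_cutoff"
        prediction = model(item.question)
        # Exact-match scoring is a simplifying assumption; the benchmark
        # itself may use a more robust metric.
        buckets[key].append(prediction.strip().lower() == item.answer.strip().lower())
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}

# Hypothetical usage (numbers illustrative only):
# scores = accuracy_by_period(items, model=my_llm.ask, cutoff=date(2023, 9, 1))
# scores -> {"pre_cutoff": 0.81, "post_cutoff": 0.54}  # decline past the cutoff
```

A gap between the two buckets would indicate the temporal performance decline the paper reports; regenerating the post-cutoff items from fresh real-world predictions is what keeps such a benchmark dynamic rather than static.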

Authors (6)
  1. Chenghao Zhu (9 papers)
  2. Nuo Chen (100 papers)
  3. Yufei Gao (10 papers)
  4. Benyou Wang (109 papers)
  5. Yunyi Zhang (39 papers)
  6. Prayag Tiwari (41 papers)