S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models (2310.15147v2)

Published 23 Oct 2023 in cs.CL

Abstract: The rapid development of large language models (LLMs) has led to great strides in capabilities such as long-context understanding and reasoning. However, as LLMs become able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text they can process (e.g., 200K tokens) far exceeds what humans can reliably assess in a reasonable amount of time. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. The synthetic nature of S3Eval gives users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval and real-world benchmarks demonstrates the soundness of using S3Eval to evaluate LLMs. S3Eval provides a flexible method for generating unlimited long-context data. We have generated a comprehensive dataset called S3Eval-Standard, and experimental results show that it poses significant challenges for all existing LLMs.
