Can LLMs Ask Good Questions? (2501.03491v2)

Published 7 Jan 2025 in cs.CL and cs.AI

Abstract: We evaluate questions generated by LLMs from context, comparing them to human-authored questions across six dimensions: question type, question length, context coverage, answerability, uncommonness, and required answer length. Our study spans two open-source and two proprietary state-of-the-art models. Results reveal that LLM-generated questions tend to demand longer descriptive answers and exhibit more evenly distributed context focus, in contrast to the positional bias often seen in QA tasks. These findings provide insights into the distinctive characteristics of LLM-generated questions and inform future work on question quality and downstream applications.

Summary

  • The paper demonstrates that LLM-generated questions emphasize descriptive inquiry and require longer, more detailed answers compared to human-generated questions.
  • It employs a six-dimension evaluation using a filtered WikiText dataset to assess context coverage, answerability, and other metrics.
  • Findings inform future LLM refinements and prompt design, highlighting potential improvements for retrieval-augmented systems and educational applications.

An Evaluation of Question Generation Using LLMs

The paper provides a comprehensive study of question generation (QG) with LLMs, evaluating automatically generated questions against their human-authored counterparts along six dimensions: question type, question length, context coverage, answerability, uncommonness, and required answer length. Throughout, it draws on a diverse Wikipedia corpus to highlight the distinctive characteristics of LLM-generated questions.

Methodological Overview

The research employs LLMs such as GPT-4o and Llama-3.1-70B-Instruct to generate questions from a given context, then evaluates those questions across the six dimensions. Generation is guided by a detailed instruction prompt designed to elicit self-contained questions that remain intelligible without the source passage. The source material is the WikiText dataset, filtered and segmented into 860,000 paragraphs. An LLM-based classification scheme sorts the generated questions into ten predefined types, revealing a predilection in LLMs toward questions that solicit descriptive, longer answers; human-authored questions in datasets like TriviaQA and HotpotQA, by contrast, skew toward narrower fact-seeking inquiries.
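
To make the two LLM stages concrete, here is a minimal sketch of generation followed by type classification. It assumes an OpenAI-compatible Python client; the prompt wording is illustrative rather than the paper's actual instructions, and the type labels are hypothetical stand-ins for the paper's ten-type taxonomy (only Identity/Attribution is named in the paper).

```python
# Minimal sketch of the pipeline described above: generate one
# self-contained question from a paragraph, then classify it into a
# fixed set of question types. Prompts and most type labels are
# illustrative assumptions, not the paper's exact wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION_TYPES = [  # hypothetical labels, except Identity/Attribution
    "Identity/Attribution", "Description", "Quantity", "Location",
    "Temporal", "Causal", "Procedural", "Comparison", "Verification",
    "Other",
]

def generate_question(context: str, model: str = "gpt-4o") -> str:
    """Ask the model for one self-contained question about the context."""
    prompt = (
        "Read the paragraph below and write one question about it. "
        "The question must be fully self-contained: a reader who has "
        "not seen the paragraph should understand what is being asked.\n\n"
        f"Paragraph:\n{context}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def classify_question(question: str, model: str = "gpt-4o") -> str:
    """LLM-based classification into one of the predefined types."""
    prompt = (
        "Classify the question into exactly one of these types: "
        + ", ".join(QUESTION_TYPES)
        + ". Reply with the type name only.\n\nQuestion: " + question
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()
```

Running both functions over sampled WikiText paragraphs would yield the kind of (question, type) pairs the paper aggregates when comparing type distributions against TriviaQA and HotpotQA.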

Key Findings and Numerical Insights

The quantitative evaluation presented in the paper yields significant insights:

  1. Question Types and Context Coverage: The LLMs tended to ask descriptive questions (e.g., thematic or character-based inquiries) more often than the factual or confirmatory ones typical of human-authored datasets. For example, the Identity/Attribution category accounts for 34.2% of TriviaQA questions but only 15.7% of those generated by GPT-4o.
  2. Question and Answer Lengths: The generated questions averaged around 20 words, comparable to human-annotated datasets, but the answers they demanded were substantially longer: often roughly twice the length of their human equivalents in QG benchmarks.
  3. Contextual Engagement: Human-generated questions engaged more deeply with particular portions of the context, whereas LLM-generated questions spread their focus evenly across it, reducing the positional bias often observed in QA tasks (a rough way to measure this is sketched after this list).
  4. Answerability: The paper also distinguishes questions that require context-specific details from those answerable with general internet knowledge. It finds that 25% of the generated questions are unanswerable without the context, which makes them well suited to testing retrieval-based AI systems.
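
The context-coverage finding in item 3 can be approximated without any model. The sketch below is not the paper's metric; it is a simple lexical-overlap proxy (content-word intersection between question and context sentences, binned by position) that illustrates how positional bias could be surfaced.

```python
# Rough, model-free proxy for "context coverage": find which context
# sentences share content words with a question, then histogram those
# sentence positions. This is an illustrative approximation, not the
# paper's actual measurement.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is",
             "was", "it", "for", "that", "this", "with", "as", "by", "at"}

def content_words(text: str) -> set:
    """Lowercased alphabetic tokens, minus stopwords and very short words."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS and len(w) > 2}

def covered_sentence_indices(question: str, context: str) -> list:
    """Indices of context sentences sharing content words with the question."""
    q_words = content_words(question)
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return [i for i, s in enumerate(sentences) if content_words(s) & q_words]

def positional_histogram(pairs, bins: int = 4) -> Counter:
    """Bucket covered-sentence positions into quarters of their context.

    Counts concentrated in bin 0 indicate bias toward the start of the
    context; a flat distribution indicates evenly spread focus.
    """
    hist = Counter()
    for question, context in pairs:
        n = len(re.split(r"(?<=[.!?])\s+", context))
        for i in covered_sentence_indices(question, context):
            hist[min(bins * i // max(n, 1), bins - 1)] += 1
    return hist
```

Applied to a corpus of (question, context) pairs, a histogram skewed toward bin 0 would reflect the positional bias reported for human QA datasets, while a flat histogram matches the balanced focus the paper observes for LLM-generated questions.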

Practical and Theoretical Implications

The implications of these findings are multifaceted:

  • The paper not only advances our understanding of LLM behavior in QG tasks but also suggests pathways for tailoring LLMs to specific applications, emphasizing careful prompt design and the utility of generated questions in educational settings and dialogue systems.
  • Automating QG with LLMs offers practical efficiency, yet the paper indicates that further refinement is needed to match the depth and balance of human-generated questions.
  • On the theoretical side, the paper opens avenues for optimizing LLM prompt engineering, potentially bringing automated systems into closer alignment with human evaluative benchmarks.

Future Directions

The paper sets the stage for several avenues of exploration. Future research could extend LLM question generation to specialized domains such as healthcare or technical support, where contextual understanding and question diversity are paramount. Additionally, advancing methods to integrate LLM-generated questions into retrieval-augmented generation (RAG) systems could benefit scenarios that demand precision and rich informational synthesis.
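
One concrete way such integration might build on the answerability finding above: use an LLM judge to keep only questions that cannot be answered from general knowledge alone, yielding a probe set that actually exercises a RAG system's retrieval. A minimal sketch, again assuming an OpenAI-compatible client; the judge prompt and helper names are illustrative, not from the paper.

```python
# Sketch: filter generated questions down to the context-dependent ones,
# which are the useful probes for a retrieval-augmented system. The
# judge prompt is an assumption, not the paper's protocol.
from openai import OpenAI

client = OpenAI()

def needs_context(question: str, model: str = "gpt-4o") -> bool:
    """LLM judge: does answering this require the source passage?"""
    prompt = (
        "Could a well-read person answer the following question from "
        "general knowledge alone, without any specific source document? "
        "Reply with exactly YES or NO.\n\nQuestion: " + question
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip().upper().startswith("NO")

def build_rag_probe_set(questions: list) -> list:
    """Keep only context-dependent questions (about 25% in the paper's data)."""
    return [q for q in questions if needs_context(q)]
```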

In conclusion, this paper enriches the dialogue on automated question generation, inviting ongoing analysis and evolution of LLM-based QG methodologies. Its findings support a nuanced comparison of machine- and human-generated questions, contributing to a more sophisticated deployment of LLMs in natural language processing tasks.
