- The paper demonstrates that combining LLMs with Bayesian models achieves near-human performance in generating informative Battleship questions.
- The LIPS framework translates natural-language questions into symbolic programs so that candidates can be scored by expected information gain (EIG) and the most informative ones selected.
- Results indicate that textual board state inputs modestly boost efficiency, while visual representations do not significantly enhance performance.
Modeling Question Generation in a Grounded Task Using Language-Informed Program Sampling
Introduction
Understanding how humans generate informative questions in a constrained environment, such as a game, presents a unique challenge at the intersection of cognitive science and artificial intelligence. This paper studies question generation in the classic board game Battleship using a novel framework named Language-Informed Program Sampling (LIPS). The framework's central move is to translate natural-language questions into symbolic programs, so that each question's informativeness can be scored by its expected information gain (EIG).
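For orientation, EIG can be written in a standard form (the notation here is illustrative and may differ in detail from the paper's). For a question q over a hypothesis space of hidden boards h with current belief p(h) and deterministic answers d = q(h):

```latex
\mathrm{EIG}(q) = H\big[p(h)\big] - \mathbb{E}_{d \sim p(d \mid q)}\Big[H\big[p(h \mid q, d)\big]\Big],
\qquad
p(d \mid q) = \sum_{h} p(h)\,\mathbb{1}\big[q(h) = d\big].
```

Because the answer is a deterministic function of the hidden board, this quantity reduces to the entropy of the answer distribution under the current belief, which is what makes Monte Carlo estimation over sampled boards straightforward.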
Models and Experimentation
The paper introduces a dual-role design in which LLMs serve both as a distribution over candidate questions and as a mechanism for translating questions from natural language into a "language of thought" (LoT). This leverages the statistical structure of language to model human-like question generation. Concretely, candidate questions are drawn from one of several priors, a probabilistic context-free grammar (PCFG) or one of two LLMs (GPT-4 and CodeLlama-7b), and the candidates are then translated and filtered to retain those that maximize EIG.
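A minimal sketch of these two roles, assuming a generic `llm.sample(prompt)` text-completion interface; the prompts, helper names, and example LoT expression are illustrative, not the paper's exact setup:

```python
# Illustrative only: `llm` is any text-completion client exposing .sample(prompt).

def propose_questions(llm, board_text: str, k: int) -> list[str]:
    """Role 1: sample k candidate natural-language questions, conditioned on the board."""
    prompt = (
        "Current Battleship board:\n"
        f"{board_text}\n"
        "Ask one question that would help you locate the remaining ships."
    )
    return [llm.sample(prompt) for _ in range(k)]

def translate_to_program(llm, question: str) -> str:
    """Role 2: translate a natural-language question into a LoT expression,
    e.g. 'How long is the red ship?' -> '(size Red)'."""
    prompt = (
        "Translate the question into the Battleship question language.\n"
        f"Q: {question}\n"
        "Program:"
    )
    return llm.sample(prompt)
```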
The LIPS model operates by sampling a subset of k questions, translating them into symbolic programs, and estimating each one's informativeness by simulation. The sample size k acts as a knob on computational effort, letting the authors investigate how resource constraints affect the efficiency and quality of the questions produced.
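The scoring step can be sketched as a small Monte Carlo routine (a sketch under stated assumptions, not the paper's implementation): boards consistent with the current observations are sampled from the posterior, each translated program is executed on each board, and, since answers are deterministic given a board, the EIG reduces to the entropy of the resulting answer distribution. The `execute` function and the board samples are assumed to come from the Battleship environment and the Bayesian belief model.

```python
import math
from collections import Counter

def estimate_eig(program, sampled_boards, execute) -> float:
    """Monte Carlo EIG estimate for one translated question.

    `sampled_boards` are hypothetical boards drawn from the current posterior;
    `execute(program, board)` returns the (deterministic) answer on that board.
    With deterministic answers, EIG equals the entropy of the answer
    distribution induced by the sampled boards.
    """
    counts = Counter(execute(program, board) for board in sampled_boards)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_question(questions, programs, sampled_boards, execute) -> str:
    """Score each of the k candidates and return the most informative one."""
    scores = [estimate_eig(p, sampled_boards, execute) for p in programs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return questions[best_index]
```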
Results
The paper's findings highlight the capability of LLMs, combined with Bayesian models, to approximate human performance at generating informative questions in the Battleship setting. Notably, even with modest values of k, the models reached near-human performance, suggesting that a relatively simple sampling-based approach captures much of human question-asking behavior. The models nonetheless showed limitations in grounding, at times producing redundant or uninformative questions.
Experimental results indicated that textual representations of the board state modestly improved the efficiency of question generation, whereas visual representations yielded no significant gains, pointing to difficulties LLMs have in extracting and using structured visual information. Comparisons between models with PCFG-based and LLM-provided priors also revealed differences in the kinds of questions generated, underscoring how the choice of prior shapes the model's inductive biases.
Theoretical and Practical Implications
Analyzing human-like question generation through the lens of LIPS offers several theoretical insights into how language and computational reasoning intertwine in information-seeking behaviors. Practically, this research could inform the development of AI systems capable of engaging in more human-like dialogues, particularly in educational, gaming, and interactive information retrieval systems.
Future Directions
The paper points towards several potential paths for future research, including exploring more sophisticated inference techniques for question selection and investigating the integration of multimodal data sources to enhance grounding capabilities. Additionally, applying the LIPS framework to more complex, multi-turn interaction contexts could further unveil the nuances of human question-asking strategies.
Conclusion
This paper contributes to our understanding of question generation in grounded tasks by introducing and evaluating the LIPS framework. The results underscore the potential of combining Bayesian models with LLMs to closely model human-like question generation, while also highlighting areas for future improvement and exploration.