The paper introduces the Sequential Question Answering Reasoning Engine (SQuARE), a novel prompting technique designed to enhance reasoning in LLMs. SQuARE builds upon Chain-of-Thought (CoT) frameworks, instructing models to generate and resolve multiple auxiliary questions before addressing the main query. The approach aims to promote a more thorough exploration of various facets of a topic.
The SQuARE technique alters system instructions to prompt the model to generate a set of N question-and-answer pairs. The rationale is to guide the model into an iterative cycle of inquiry and response, encouraging it to explore various facets of a topic before forming a conclusion. Unlike standard CoT prompts, which often present a single stream of reasoning, SQuARE nudges the model toward self-interrogation pathways. The value of N can be tuned to balance the thoroughness of exploration with computational cost and response length.
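As a rough illustration, a SQuARE-style request might be assembled along the following lines (a minimal Python sketch; the message structure, the wording, and the `build_square_messages` helper are assumptions, not the paper's verbatim prompt):

```python
# Minimal sketch of a SQuARE-style prompt. The instruction wording and the
# helper name are illustrative assumptions, not the paper's exact prompt.

def build_square_messages(question: str, context: str, n: int = 3) -> list[dict]:
    system = (
        f"Before answering, write {n} auxiliary question-and-answer pairs "
        "that explore different facets of the user's question, using the "
        "provided context. Then give the final answer on a line starting "
        "with 'Final answer:'."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The parameter `n` here corresponds to the tunable N described above.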
The authors conducted experiments using Llama-3.2 3B, Llama-3.1 8B, and GPT-4o across the TriviaQA, HotpotQA, and ASQA datasets. All three are knowledge-intensive question-answering benchmarks that benefit from external context, and context retrieval was performed over a Wikipedia corpus. For TriviaQA and HotpotQA, sub-string exact match (subEM) is reported; for ASQA, recall-EM is reported.
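For reference, subEM and recall-EM can be computed roughly as follows (a sketch under common normalization assumptions; the authors' exact implementation may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def sub_em(prediction: str, gold_answers: list[str]) -> bool:
    """Substring exact match: any gold answer appears inside the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)

def recall_em(prediction: str, gold_short_answers: list[str]) -> float:
    """ASQA-style recall: fraction of gold short answers found in the prediction."""
    pred = normalize(prediction)
    hits = sum(normalize(g) in pred for g in gold_short_answers)
    return hits / len(gold_short_answers) if gold_short_answers else 0.0
```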
The experiments compare the following configurations (a sketch of how they might be expressed as prompt instructions follows the list):
- Baseline: Standard generation without any additional prompting techniques.
- CoT: Chain-of-Thought prompting, which elicits intermediate reasoning steps before the final answer.
- RaR: Rephrase-and-Respond, which prompts the model to rephrase the original request before answering it.
- SQuARE: Employs the SQuARE prompt, run with a default of N=3 question-answer pairs.
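A minimal sketch of how these configurations might be expressed as prompt instructions appears below; the wording is an illustrative assumption, not the paper's verbatim prompts:

```python
# Hypothetical reasoning instructions per configuration; the wording is assumed.
# The baseline adds no extra instruction beyond the question and retrieved context.
REASONING_INSTRUCTIONS = {
    "baseline": "",
    "cot": "Think step by step, then give the final answer on a line "
           "starting with 'Final answer:'.",
    "rar": "First rephrase the question in your own words, then answer it. "
           "Give the final answer on a line starting with 'Final answer:'.",
    "square": "Write 3 auxiliary question-and-answer pairs about the topic, "
              "then give the final answer on a line starting with 'Final answer:'.",
}
```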
In configurations that include reasoning instructions, a regular expression is used to extract the final answer, which prevents outputs from being scored incorrectly when the correct phrase appears somewhere in the reasoning chain but not in the final answer.
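One way such an extraction step could look, assuming the prompts ask for a "Final answer:" marker (the marker and the fallback behaviour are assumptions):

```python
import re

FINAL_ANSWER_RE = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)

def extract_final_answer(output: str) -> str:
    """Return the text after the last 'Final answer:' marker, or the last
    non-empty line if no marker is found (fallback behaviour is assumed)."""
    matches = FINAL_ANSWER_RE.findall(output)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```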
The results indicate that across the smaller Llama 3.2 3B and Llama 3.1 8B models, SQuARE consistently matches or outperforms the strongest baseline on each dataset. For example, with Llama 3.2 3B on TriviaQA, SQuARE improves performance by 6.5% and 2.5% over RAG and RaR, respectively, reaching an overall score of 88.5%. On HotpotQA, Llama 3.2 3B also sees a notable boost, from 26.5% (CoT) to 31.5% with SQuARE. These gains become even more pronounced with Llama 3.1 8B, where improvements of up to 3% (TriviaQA) and 7% (HotpotQA) are observed over alternative methods. Notable gains are also reported on ASQA: for Llama 3.2 3B, SQuARE lifts performance from 21.5% (RAG) and 23.5% (RaR) to 26.6%, nearly doubling the baseline of 14.2%.

With GPT-4o, SQuARE remains highly competitive. On TriviaQA it reaches 96.7%, outperforming the other settings by at least 2.0%. On HotpotQA, RaR and SQuARE are close, with RaR holding a slight edge (47.3% versus 46.7%). On ASQA, CoT and SQuARE yield nearly identical performance (31.9% versus 31.7%), indicating that GPT-4o is already adept at leveraging additional reasoning steps and retrieved facts on these tasks.
To highlight the contribution of each component in SQuARE, the authors performed an ablation study analyzing (1) the number of generated questions (N), (2) the role of few-shot examples, and (3) an optional aggregation step.
The evaluation with N ∈ {3, 5, 10} shows that for TriviaQA, increasing N from 3 to 5 or 10 boosts performance from 92.5% to 94.0%. On HotpotQA, N=5 (31.5%) dips slightly below N=3 but recovers to 33.5% at N=10. On ASQA, performance drops from 28.8% at N=3 to 27.8% at N=10, suggesting that while additional questions can add useful context, they can also introduce redundancy or noise.

Incorporating few-shot examples substantially boosts accuracy. Both CoT and SQuARE benefit strongly from these examples, indicating that exposure to task-relevant demonstrations helps the model produce correct and properly formatted final answers.

Two aggregation strategies were also explored before producing the final answer: Summarize and Vote. In Summarize, the model summarizes the information learned from the generated questions and answers; in Vote, a majority vote determines the final answer. Summarize generally outperforms Vote on TriviaQA and HotpotQA. However, using no aggregation step outperforms both in nearly all instances, suggesting that additional post-processing can sometimes hurt answer quality.
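In the spirit of the Vote strategy, majority voting over candidate final answers might be implemented as follows (a sketch; the normalization and tie-breaking rules are assumptions, as the summary does not specify them):

```python
from collections import Counter

def vote_aggregate(candidate_answers: list[str]) -> str:
    """Majority vote over normalized candidate answers; ties are broken by
    first occurrence (tie-breaking rule is an assumption)."""
    normalized = [a.strip().lower() for a in candidate_answers if a.strip()]
    if not normalized:
        return ""
    counts = Counter(normalized)
    best, _ = counts.most_common(1)[0]
    return best
```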