
SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models (2502.09390v1)

Published 13 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: In the rapidly evolving field of Natural Language Processing, LLMs are tasked with increasingly complex reasoning challenges. Traditional methods like chain-of-thought prompting have shown promise but often fall short in fully leveraging a model's reasoning capabilities. This paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query, promoting a more thorough exploration of various aspects of a topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods. By systematically decomposing queries, SQuARE advances LLM capabilities in reasoning tasks. The code is publicly available at https://github.com/IntelLabs/RAG-FiT/tree/square.

Authors (4)
  1. Daniel Fleischer (9 papers)
  2. Moshe Berchansky (8 papers)
  3. Gad Markovits (1 paper)
  4. Moshe Wasserblat (22 papers)

Summary

The paper introduces the Sequential Question Answering Reasoning Engine (SQuARE), a novel prompting technique designed to enhance reasoning in LLMs. SQuARE builds upon Chain-of-Thought (CoT) frameworks, instructing models to generate and resolve multiple auxiliary questions before addressing the main query. The approach aims to promote a more thorough exploration of various facets of a topic.

The SQuARE technique alters the system instructions to prompt the model to generate a set of N question-and-answer pairs. The rationale is to guide the model into an iterative cycle of inquiry and response, encouraging it to explore various facets of a topic before forming a conclusion. Unlike standard CoT prompts, which often present a single stream of reasoning, SQuARE nudges the model toward self-interrogation pathways. The value of N can be tuned to balance the thoroughness of exploration against computational cost and response length.
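
The idea can be illustrated with a minimal sketch of such a system instruction. The wording, the `build_square_prompt` helper, and the 'Final answer:' marker below are illustrative assumptions rather than the authors' exact template; the released repository contains the actual prompts.

```python
def build_square_prompt(question: str, n_pairs: int = 3) -> list[dict]:
    """Assemble a chat-style prompt that asks the model to self-interrogate (illustrative only)."""
    system = (
        f"Before answering, write {n_pairs} auxiliary question-and-answer pairs "
        "that explore different aspects of the user's question, then use them to "
        "produce a single final answer prefixed with 'Final answer:'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


messages = build_square_prompt("Who wrote the novel that the film Blade Runner is based on?")
```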

The authors conducted experiments using Llama-3.2 3B, Llama-3.1 8B, and GPT-4o models across the TriviaQA, HotpotQA, and ASQA datasets. These are knowledge-intensive question-answering datasets that benefit from external context; context retrieval was performed over a Wikipedia corpus. For TriviaQA and HotpotQA, sub-string exact match (subEM) is reported, while for ASQA, recall-EM is reported.
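
For reference, sub-string exact match checks whether any gold answer appears verbatim in the prediction after light normalization. The sketch below assumes a particular normalization (lowercasing, stripping punctuation and English articles); it is not the exact scoring script used in the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction: str, gold_answers: list[str]) -> float:
    """Sub-string exact match: 1.0 if any normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(gold) in pred for gold in gold_answers))
```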

The experimental setup comprises the following configuration settings (an illustrative prompt sketch follows the list):

  • Baseline: Standard application without any augmentative techniques.
  • CoT: Methodology that leverages intermediate reasoning steps leading to a final answer.
  • RaR: A rephrase-and-respond strategy that prompts the model to rephrase the original request before answering it.
  • SQuARE: Employs the SQuARE prompt and is run with a default of N = 3 question-answer pairs.
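
Illustratively, the four settings differ mainly in the system instruction placed ahead of the retrieved context and question. The wording below is assumed for the sake of the sketch and does not reproduce the paper's exact prompts.

```python
# Assumed system instructions for the four configurations; wording is illustrative only.
SYSTEM_PROMPTS = {
    "baseline": "Answer the question using the provided context.",
    "cot": (
        "Think step by step about the question using the provided context, "
        "then give the final answer prefixed with 'Final answer:'."
    ),
    "rar": (
        "First rephrase the question in your own words, then answer the rephrased "
        "question using the provided context, prefixing the result with 'Final answer:'."
    ),
    "square": (
        "Write 3 auxiliary question-and-answer pairs that explore the topic, "
        "then give the final answer prefixed with 'Final answer:'."
    ),
}
```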

In configurations containing reasoning instructions, a regular expression is used to extract the final answer, mitigating cases where the correct phrase appears somewhere in the reasoning chain but not in the final answer.
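
A minimal sketch of that extraction step might look as follows; the 'Final answer:' marker and the last-line fallback are assumptions for illustration, not the exact pattern used in the paper.

```python
import re


def extract_final_answer(generation: str) -> str:
    """Return the text after a 'Final answer:' marker, falling back to the last non-empty line."""
    match = re.search(r"final answer:\s*(.+)", generation, flags=re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    lines = [line.strip() for line in generation.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""
```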

The results indicate that across the smaller Llama 3.2 3B and Llama 3.1 8B models, SQuARE consistently outperforms or matches the strongest baselines on each dataset. For example, with Llama 3.2 3B on TriviaQA, SQuARE improves performance by 6.5% and 2.5% over RAG and RaR, respectively, achieving an overall score of 88.5%. On HotpotQA, Llama 3.2 3B also sees a notable boost, from 26.5% (CoT) to 31.5% with SQuARE. These gains become even more pronounced with Llama 3.1 8B, where improvements of up to 3% (TriviaQA) and 7% (HotpotQA) are observed compared to alternative methods. The authors also observe notable gains on ASQA: for Llama 3.2 3B, SQuARE lifts performance from 21.5% (RAG) and 23.5% (RaR) to 26.6%, nearly doubling the baseline of 14.2%.

When using GPT-4o, SQuARE remains highly competitive. On TriviaQA, SQuARE reaches 96.7%, outperforming other settings by at least 2.0%. On HotpotQA, RaR and SQuARE are close, with RaR exhibiting a slight edge (47.3% versus 46.7%). For ASQA, CoT and SQuARE yield nearly identical performance (31.9% versus 31.7%), indicating that GPT-4o is already adept at leveraging additional reasoning steps or retrieved facts in these tasks.

To highlight the contribution of each component in SQuARE, the authors performed an ablation study analyzing (1) the number of generated questions (N), (2) the role of few-shot examples, and (3) an optional aggregation step.

The evaluation using N ∈ {3, 5, 10} shows that for TriviaQA, increasing N from 3 to 5 or 10 boosts performance from 92.5% to 94.0%. On HotpotQA, N = 5 (31.5%) dips slightly below N = 3, but performance recovers to 33.5% at N = 10. In ASQA, performance drops from 28.8% at N = 3 to 27.8% at N = 10, suggesting that while additional questions can add useful context, they can also introduce redundancy or noise.

Incorporating few-shot examples substantially boosts accuracy. Both CoT and SQuARE benefit strongly from these examples, indicating that exposure to task-relevant scenarios helps the model produce correct and properly formatted final answers.

Two aggregation strategies were also explored before producing the final answer: Summarize and Vote. The Summarize method has the model summarize the information learned from the generated questions and answers, whereas the Vote method relies on majority voting to determine the final answer. Summarize generally outperforms Vote on TriviaQA and HotpotQA. However, using no aggregation step outperforms both in nearly all instances, suggesting that additional post-processing can sometimes hurt answer quality.
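
As a rough illustration of the Vote strategy, majority voting over the answers produced for the auxiliary questions could be implemented as below; the case-insensitive normalization and first-occurrence tie-breaking are assumptions, not the paper's exact procedure.

```python
from collections import Counter


def vote_aggregate(candidate_answers: list[str]) -> str:
    """Pick the most frequent answer (case-insensitive); ties go to the earliest candidate."""
    normalized = [answer.strip().lower() for answer in candidate_answers]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the original surface form of the first candidate matching the winning answer.
    return next(a for a, n in zip(candidate_answers, normalized) if n == winner)


print(vote_aggregate(["Philip K. Dick", "philip k. dick", "Isaac Asimov"]))  # Philip K. Dick
```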