- The paper introduces Structured Chain-of-Thought (SC) Prompting, which decomposes the generation of document-grounded multi-turn QA conversations into distinct states, enhancing document grounding for LLMs.
- It employs a state machine architecture in which a pre-trained LLM (Falcon) handles utterance generation and an instruction-tuned LLM (Flan-UL2) handles reading subtasks, improving answer accuracy and reducing hallucinations by up to 16.8%.
- Experiments show that SC prompting outperforms unstructured prompting and that the generated conversations can augment human-labeled training data for document-grounded QA tasks.
Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
This paper presents a new approach called Structured Chain-of-Thought (SC) Prompting, designed to enhance the generation of content-grounded multi-turn Q&A conversations using pre-trained LLMs. The paper specifically addresses the challenge of closed-domain hallucinations, where models, even when explicitly instructed, fail to generate text grounded in a provided document.

Figure 1: A multi-turn QA conversation grounded in a document. If the document does not contain an answer to a user query, the agent acknowledges this in its response.
Structured Chain-of-Thought (SC) Prompting Approach
SC prompting involves breaking down the complex task of generating document-grounded multi-turn QA conversations into a series of states in a state machine. Each state corresponds to specific subtasks like content reading and utterance generation, executed with dedicated resources such as prompts and optional tools.
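The decomposition can be pictured as a table of states, each paired with its own prompt and model. The sketch below is a minimal Python illustration of that idea only; the `StateResources` container, the prompt wording, and the `falcon_generate`/`flan_classify` placeholders are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the SC-prompting idea: each subtask (state) gets its own
# few-shot prompt and model, instead of one monolithic prompt for the whole task.
from dataclasses import dataclass
from typing import Callable


@dataclass
class StateResources:
    prompt_template: str         # few-shot prompt dedicated to this subtask
    model: Callable[[str], str]  # LLM (or tool) that executes the subtask


def falcon_generate(prompt: str) -> str:
    """Placeholder for a completion call to a pre-trained LLM (hypothetical)."""
    raise NotImplementedError


def flan_classify(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned LLM (hypothetical)."""
    raise NotImplementedError


# One resource bundle per state of the machine (see Figure 2 below).
RESOURCES = {
    "UU": StateResources("User asks the next question about the document:\n{context}", falcon_generate),
    "AC": StateResources("Does the document answer the question? Reply yes or no.\n{context}", flan_classify),
    "SS": StateResources("Copy the document sentences relevant to the question.\n{context}", flan_classify),
    "AU": StateResources("Agent answers using only the selected sentences:\n{context}", falcon_generate),
}
```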
State Machine Architecture
Figure 2: State machine for generating a single user-agent utterance pair within a multi-turn conversation.
The state machine comprises four states:
- User Utterance Generation (UU): Generates the next user query within an ongoing conversation grounded in the document.
- Question Answerability Classification (AC): Determines whether the given document contains an answer to the user query, mitigating hallucinated answers.
- Answer Sentence Selection (SS): Identifies relevant sentences in the document that likely contain information pertinent to a user's query.
- Agent Utterance Generation (AU): Generates a response to the user's query, grounding it in the selected document sentences when the query is answerable.
Two LLMs, Falcon (pre-trained) and Flan-UL2 (instruction-tuned), are used across the states of the machine. Falcon handles user and agent utterance generation, while Flan-UL2 assists with question answerability classification and answer sentence selection where needed.
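Putting the pieces together, the control flow for one user-agent turn can be sketched as follows. The branching (answerable queries pass through sentence selection; unanswerable ones lead to an acknowledgment) follows the state descriptions above; the function names, prompt wording, and `pretrained_lm`/`instruction_lm` wrappers are hypothetical placeholders, not the paper's actual code.

```python
# Sketch of the four-state loop for one user-agent turn, assuming the
# transition order UU -> AC -> (SS -> AU, or AU directly if unanswerable).

def pretrained_lm(prompt: str) -> str:
    """Few-shot completion with a pre-trained LLM (e.g., Falcon); placeholder."""
    raise NotImplementedError


def instruction_lm(prompt: str) -> str:
    """Call to an instruction-tuned LLM (e.g., Flan-UL2); placeholder."""
    raise NotImplementedError


def generate_turn(document: str, history: list[str]) -> tuple[str, str]:
    dialog = "\n".join(history)

    # UU: generate the next user question grounded in the document.
    user = pretrained_lm(f"Document:\n{document}\n\nConversation:\n{dialog}\nUser:")

    # AC: decide whether the document actually answers the question.
    verdict = instruction_lm(
        f"Document:\n{document}\nQuestion: {user}\n"
        "Is the question answerable from the document? Answer yes or no."
    ).strip().lower()

    if verdict.startswith("yes"):
        # SS: pick the document sentences that support the answer.
        evidence = instruction_lm(
            f"Document:\n{document}\nQuestion: {user}\n"
            "Copy the sentences that answer the question."
        )
        # AU: generate an agent answer grounded in the selected sentences.
        agent = pretrained_lm(
            f"Evidence:\n{evidence}\nConversation:\n{dialog}\nUser: {user}\nAgent:"
        )
    else:
        # AU (unanswerable branch): the agent acknowledges the missing answer.
        agent = pretrained_lm(
            f"The document does not answer the question.\nUser: {user}\nAgent:"
        )

    return user, agent
```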
Figure 3: Prompts for states AU and SS. Left: Agent utterance generation (AU) with a pre-trained LLM. Right: Answer sentence selection (SS) with an instruction-following LLM. For brevity, only 1-shot prompts are shown; more demonstrations are used in practice.
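To make the caption concrete, here are illustrative 1-shot prompt skeletons in the spirit of Figure 3. They are not the paper's verbatim prompts; the `<...>` demonstration markers and `{document}`/`{question}`/`{history}` slots are assumptions.

```python
# Illustrative 1-shot prompt skeletons (not verbatim from the paper).
# AU_PROMPT: agent utterance generation with a pre-trained LLM.
# SS_PROMPT: answer sentence selection with an instruction-following LLM.

AU_PROMPT = """\
Document: <demonstration document>
Conversation:
User: <demonstration question>
Agent: <demonstration answer grounded in the document>

Document: {document}
Conversation:
{history}
User: {question}
Agent:"""

SS_PROMPT = """\
Instruction: Copy the sentences from the document that answer the question.

Document: <demonstration document>
Question: <demonstration question>
Sentences: <demonstration answer sentences>

Document: {document}
Question: {question}
Sentences:"""
```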
Methodology for Multi-Turn QA Generation
The paper investigates five algorithms, delineated by their state transition sequences and the resources used in the respective states. Two state transition sequences were implemented: a two-state sequence covering only the generation states (UU, AU) and the full four-state sequence (UU, AC, SS, AU).
A focal point is the distinction between reading tasks (AC, SS) and generation tasks (UU, AU), probing where pre-trained LLMs require additional support. For instance, question answerability classification (AC) was performed either with the same pre-trained LLM used for generation or with an instruction-tuned model.
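One way to picture these variants is as configurations that share the state machine but swap the model behind each state. The configuration names and groupings below are hypothetical illustrations of this design space, not the paper's exact five algorithms.

```python
# Hypothetical configurations: same state machine, different resources per state.
def pretrained_lm(prompt: str) -> str: ...    # pre-trained LLM placeholder
def instruction_lm(prompt: str) -> str: ...   # instruction-tuned LLM placeholder

VARIANTS = {
    # Two-state SC prompting: generation states only.
    "sc-two-state": {"states": ["UU", "AU"],
                     "models": {"UU": pretrained_lm, "AU": pretrained_lm}},
    # Full SC prompting, reading states handled by the same pre-trained LLM.
    "sc-pretrained": {"states": ["UU", "AC", "SS", "AU"],
                      "models": {s: pretrained_lm for s in ["UU", "AC", "SS", "AU"]}},
    # Full SC prompting, reading states delegated to an instruction-tuned LLM.
    "sc-instruction": {"states": ["UU", "AC", "SS", "AU"],
                       "models": {"UU": pretrained_lm, "AC": instruction_lm,
                                  "SS": instruction_lm, "AU": pretrained_lm}},
}
```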
Performance Evaluation
In the experiments, structured chain-of-thought (SC) prompted methods that deployed an instruction-tuned model in the AC and SS states considerably reduced hallucinations, by up to 16.8%, compared to an unstructured prompting approach. Furthermore, in a few-shot prompting evaluation with an LLM as the agent, data generated by the SC-prompted pipeline with Flan-UL2 showed the best balance of performance across answerable and unanswerable questions.
Conclusion
This paper extends structured chain-of-thought prompting to the synthesis of multi-turn, content-grounded QA conversations using few-shot LLM prompting. Experimental analysis demonstrates the efficacy of structured CoT prompting over baseline methods in both intrinsic and extrinsic evaluations, and even shows improvements over conventional human-labeled data when the generated conversations are used for data augmentation in training. This signifies strong potential for the approach to bolster LLM performance in document-specific QA tasks, ultimately enhancing faithfulness and accuracy. Future work could explore its extension to more complex multi-document and multi-lingual scenarios.