
CoDi: Conversational Distillation for Grounded Question Answering (2408.11219v1)

Published 20 Aug 2024 in cs.CL and cs.AI

Abstract: Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced "Cody"), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to "memorize" world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.

Summary

  • The paper introduces CoDi, a data distillation framework that significantly enhances small language models for grounded question answering.
  • It leverages conversational graph generation, turn-based prompt augmentation, and linguistic integration to synthesize diverse multi-turn datasets.
  • Experiments show CoDi narrows the performance gap with human annotations, achieving high recall scores in conversational reasoning benchmarks.

CoDi: Conversational Distillation for Grounded Question Answering

The paper "CoDi: Conversational Distillation for Grounded Question Answering" introduces an innovative data distillation framework termed CoDi (Conversational Distillation). The primary aim of CoDi is to enhance the performance of Small Language Models (SLMs), with around 1 billion parameters, on the task of conversational grounded reasoning for question answering. This addresses two main challenges: the limited capacity of SLMs to retain extensive knowledge and the scarcity of high-quality conversational datasets.

Core Contributions and Methodology

The CoDi framework is designed to synthesize large-scale, diverse, and steerable assistant-style datasets. While it is task-agnostic at its core, the paper specifically focuses on its application to conversational grounded reasoning. Unlike general LLMs, SLMs are positioned as task specialists. The CoDi framework leverages a distillation process from teacher LLMs to student SLMs, enriching their conversational abilities without requiring them to memorize extensive world knowledge. This ensures that SLMs can remain performant even with their limited capacity.
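The grounded setup described above means the supporting document travels with the prompt rather than living in the model's weights. As an illustration, a minimal grounded-QA prompt builder might look like the following; the template layout, role labels, and function name are assumptions for illustration, not the paper's actual format.

```python
def build_grounded_prompt(passage, history, question):
    """Assemble a grounded conversational QA prompt.

    The answer is expected to come from `passage`, so the student
    model does not need to memorize world knowledge in its weights.
    `history` is a list of (role, text) pairs from earlier turns.
    This layout is a hypothetical sketch, not the paper's template.
    """
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        f"Passage:\n{passage}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"User: {question}\nAssistant:"
    )

prompt = build_grounded_prompt(
    "CoDi is a data distillation framework for SLMs.",
    [("User", "Hi"), ("Assistant", "Hello! How can I help?")],
    "What is CoDi?",
)
print(prompt)
```

Because the passage is supplied at inference time, the same small model can answer open-domain questions about any document it is handed, which is the property the paper exploits for on-device specialists.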

CoDi includes several novel methodological components:

  1. Conversational Graph Generation: Inspired by Markov Chains, CoDi generates valid and diverse conversational blueprints through conversational graphs. These graphs represent conversation templates, ensuring a coherent flow in multi-turn conversations.
  2. Turn-Based Prompt Augmentation: This involves enhancing the data generation process via prompt templates and optional seed data inputs. The prompts are designed to generate new conversation turns step-by-step using a large 'DataGen LLM'.
  3. Linguistic Phenomena Integration: Explicit linguistic features are embedded into the synthesis process to produce more natural and diverse conversations.
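The Markov-chain-inspired graph generation in step 1 can be sketched as a weighted random walk over turn types that yields a conversation "blueprint". The state names, transition probabilities, and function below are illustrative assumptions, not the paper's actual graph.

```python
import random

# Hypothetical conversational graph: states are turn types, edges carry
# transition probabilities, and a random walk produces a blueprint for a
# multi-turn conversation. The states and weights here are made up for
# illustration; the paper's graphs encode its own turn taxonomy.
GRAPH = {
    "START": [("user_question", 1.0)],
    "user_question": [("assistant_answer", 1.0)],
    "assistant_answer": [
        ("user_followup", 0.6),
        ("user_clarification", 0.2),
        ("END", 0.2),
    ],
    "user_followup": [("assistant_answer", 1.0)],
    "user_clarification": [("assistant_answer", 1.0)],
}

def sample_blueprint(max_turns=8, seed=None):
    """Walk the graph from START, collecting turn types until END
    or the turn budget is exhausted."""
    rng = random.Random(seed)
    state, blueprint = "START", []
    while state != "END" and len(blueprint) < max_turns:
        nexts, weights = zip(*GRAPH[state])
        state = rng.choices(nexts, weights=weights, k=1)[0]
        if state != "END":
            blueprint.append(state)
    return blueprint

print(sample_blueprint(seed=0))
```

Each sampled blueprint would then drive the turn-based prompt augmentation of step 2: every turn type maps to a prompt template that the DataGen LLM fills in, with step 3's linguistic phenomena mixed into those templates.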

Experimental Setup

The paper extensively evaluates the CoDi framework using both intra-domain and zero-shot scenarios. The intra-domain experiments use existing datasets such as CoQA and QuAC to compare synthesized data performance against human-annotated datasets. For zero-shot evaluations, web data is used for synthesizing large-scale conversational data.

Key Model Configurations:

  • Teacher Model: A 70B parameter Llama3 instruction-tuned model.
  • Student Model: A pre-trained 1.4B Llama2-style model, with additional evaluations using a 500M parameter student model.
  • Baselines: Human-annotated datasets (both single and multi-turn), instruction-tuned models (ranging from 1.4B to 70B parameters), and selected models from literature (such as Phi-3).

Strong Numerical Results

The results indicate that the CoDi framework significantly reduces the performance gap between single-turn and multi-turn human annotations. In terms of recall on the CoQA dataset, CoDi achieves 84.5% in the predicted-conversation scenario and 91.0% for the model trained on zero-shot web-synthesized data. This surpasses the instruction-tuned baselines and aligns closely with models trained on human multi-turn datasets. On grounded reasoning tasks, CoDi consistently outperforms the other instruction-tuned baselines.
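For context on the recall figures above, a simplified token-level recall (the fraction of reference-answer tokens recovered by the prediction) can be computed as below; the paper's exact tokenization and normalization may differ, so this is a stand-in, not the official metric implementation.

```python
from collections import Counter

def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the prediction.

    A simplified stand-in for the recall metric reported on CoQA;
    lowercased whitespace tokenization is an assumption here.
    """
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 1.0  # empty reference is trivially covered
    overlap = sum((pred & ref).values())  # clipped token overlap
    return overlap / sum(ref.values())

print(token_recall("the cat sat on the mat", "the cat sat"))  # 1.0
```

Recall rewards answers that contain the reference content even when phrased more verbosely, which suits conversational responses that wrap the grounded answer in natural language.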

Analysis

The paper conducts several ablations to understand the impact of various factors:

  • Scale of Distillation Data: Increasing the number of synthesized conversations (up to 1 million samples) positively correlates with improved model performance.
  • Model Size: Comparing student model sizes shows that the larger 1.4B model performs better, yet even the smaller 500M model trained with CoDi data remains competitive.
  • Teacher Models: Using Llama3 as the teacher yields better student performance than using Llama2.
  • Per-Turn Analysis: CoDi demonstrates robust performance across multiple conversational turns, maintaining higher recall than other baselines in longer conversations.

Implications and Future Work

CoDi represents a significant contribution to the effective training of SLMs for specialized tasks like conversational grounded reasoning. By leveraging large-scale synthesized data, CoDi circumvents the limitations of human-annotated datasets, promoting scalability and cost-efficiency. This approach shows promise for other data-scarce applications where generating human-like interactions is crucial.

Future directions could explore further enhancements in synthesis techniques, integrating more sophisticated linguistic phenomena, and extending the framework to other conversation-based tasks. Additionally, investigating how CoDi impacts real-world performance on edge devices, which often operate under computational constraints, could provide deeper insights into its practical applicability.

CoDi sets a new benchmark in the efficient training of SLMs, ensuring they remain relevant and high-performing in specific conversational tasks. This research presents an alternative pathway to scaling up model size, showcasing the potential of intelligent data distillation and synthesis in advancing AI capabilities.
