Prompt-tuning Mistral-7B-Instruct to Reach REAPER-level Performance

Determine whether in-context prompt tuning of the Mistral-7B-Instruct-v0.2 language model, without any fine-tuning, can achieve high-accuracy retrieval plan generation for the REAPER task in a conversational shopping assistant. Specifically, the model must produce correct tool sequences and arguments across the six defined retrieval classes (Customer Support, Shipment Status, Product Search, Product QnA, Review Summary, and No-retrieval) at levels comparable to the fine-tuned REAPER model (approximately 96% tool selection accuracy and 92% argument accuracy).

Background

The paper evaluates whether strong in-context learning and prompt design alone can suffice for the specialized retrieval planning task required by REAPER, which generates tool sequences and parameters for multi-step evidence retrieval in a conversational shopping assistant. Despite extensive effort, the authors report that prompt tuning of Mistral-7B-Instruct-v0.2 did not meet their target performance and the model remained prone to hallucinations without fine-tuning.
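For concreteness, the sketch below shows what such a prompt-only setup could look like: a few-shot prompt that asks the model to emit tool calls covering the six retrieval classes. The tool names, argument schemas, exemplars, and the `build_prompt`/`FEW_SHOT_EXAMPLES` helpers are illustrative assumptions; the paper does not publish its actual prompt.

```python
# Minimal sketch of the prompt-only setup under investigation. The tool
# names, argument schemas, and few-shot exemplars below are illustrative
# assumptions; the paper does not publish its exact prompt.

FEW_SHOT_EXAMPLES = [
    ("Where is my order for the blue kettle?",
     'shipment_status(order_id="<from context>")'),
    ("Show me wireless earbuds under $50.",
     'product_search(query="wireless earbuds under $50")'),
    ("Is this blender dishwasher safe?",
     'product_qna(product="<current product>", question="dishwasher safe?")'),
    ("What do reviewers say about the battery life?",
     'review_summary(product="<current product>", aspect="battery life")'),
    ("How do I return an item?",
     'customer_support(topic="returns")'),
    ("Thanks, that's all!",
     "no_retrieval()"),
]

def build_prompt(conversation: str) -> str:
    """Assemble a few-shot plan-generation prompt for Mistral-7B-Instruct."""
    header = (
        "You are a retrieval planner for a shopping assistant. Given the "
        "conversation, emit the sequence of tool calls needed, one per line, "
        "choosing only from: customer_support, shipment_status, "
        "product_search, product_qna, review_summary, no_retrieval.\n\n"
    )
    shots = "\n\n".join(f"Conversation: {q}\nPlan: {p}" for q, p in FEW_SHOT_EXAMPLES)
    return f"{header}{shots}\n\nConversation: {conversation}\nPlan:"
```

The open question is whether any amount of exemplar selection and instruction wording in a prompt like this can keep a 7B instruction-tuned model from drifting off the tool schema or hallucinating arguments.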

The fine-tuned REAPER model achieves high tool selection and argument generation accuracy across six retrieval classes, suggesting that fine-tuning may be necessary for robust, instruction-following plan generation. This leaves open whether purely prompt-based approaches can reach similar accuracy and reliability without model fine-tuning.
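The two headline metrics can be made concrete with a small scoring sketch. The plan representation and the strict exact-match criterion below are assumptions made for illustration; the paper does not specify its matching rules.

```python
# Sketch of the two headline metrics, assuming each plan parses into a
# (tool_name, {argument: value}) pair; the paper's exact matching rules are
# not given, so strict exact match is assumed here.

from typing import Dict, List, Tuple

Plan = Tuple[str, Dict[str, str]]

def score(predictions: List[Plan], references: List[Plan]) -> Tuple[float, float]:
    """Return (tool selection accuracy, argument accuracy)."""
    tool_hits = arg_hits = 0
    for (pred_tool, pred_args), (ref_tool, ref_args) in zip(predictions, references):
        tool_hits += pred_tool == ref_tool
        # Arguments are only counted as correct when the tool also matches.
        arg_hits += pred_tool == ref_tool and pred_args == ref_args
    n = len(references)
    return tool_hits / n, arg_hits / n
```

Under scoring of this kind, the fine-tuned REAPER model's reported ~96% tool selection and ~92% argument accuracy are the targets a prompt-only system would need to match.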

References

Despite several weeks' worth of effort, we could not prompt-tune Mistral-7B-Instruct-v0.2 to reach the target performance.

Joshi et al. (26 Jul 2024). REAPER: Reasoning based Retrieval Planning for Complex RAG Systems. arXiv:2407.18553, Section 6.1 (Comparison with Open Models).