The False Promise of Imitating Proprietary LLMs (2305.15717v1)

Published 25 May 2023 in cs.CL

Abstract: An emerging method to cheaply improve a weaker LLM is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B--13B), data sources, and imitation data amounts (0.3M--150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models -- they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT's style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.

Citations (184)

Summary

  • The paper demonstrates that imitation via fine-tuning on proprietary outputs fails to bridge the inherent capability gap between open-source and advanced LLMs.
  • It reveals that while imitation improves surface-level instruction-following, it fails to improve, and can even undermine, the models' factual reliability.
  • The findings advocate for enhancing base models directly through larger architectures and better pre-training rather than relying solely on imitation.

An Analytical Overview of "The False Promise of Imitating Proprietary LLMs"

The paper "The False Promise of Imitating Proprietary LLMs" embarks on a critical analysis of a cost-effective approach aimed at augmenting less potent LLMs through fine-tuning on outputs derived from more robust, perhaps proprietary, systems such as ChatGPT. The authors meticulously evaluate the efficacy of this methodology, corroborating their findings through a series of methodical experiments.

At the crux of the investigation is the practice of using proprietary model outputs to fine-tune open-source models, ostensibly to match the former's capabilities. The authors explore this approach by finetuning base models of different sizes (1.5B to 13B parameters) on varied amounts of imitation data (0.3M to 150M tokens). The resulting models are then assessed by crowd raters and on standard NLP benchmarks.
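To make the imitation setup concrete, the following is a minimal sketch of this kind of fine-tuning: supervised training of an open-source base LM on (instruction, response) pairs collected from a proprietary model. It assumes a hypothetical `imitation_data.jsonl` file with "instruction" and "response" fields and uses GPT-2 XL as a stand-in 1.5B-parameter base model; the paper's exact prompt template, data, and hyperparameters differ.

```python
# Hedged sketch of imitation fine-tuning with Hugging Face transformers.
# Assumes a hypothetical imitation_data.jsonl of instruction/response pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2-xl"  # 1.5B-parameter stand-in for the paper's smallest base LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

def format_example(ex):
    # Concatenate prompt and target so the LM learns to reproduce the imitation response.
    text = (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['response']}{tok.eos_token}")
    return tok(text, truncation=True, max_length=1024)

ds = load_dataset("json", data_files="imitation_data.jsonl", split="train")
ds = ds.map(format_example, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imitation-model", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-5, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
)
trainer.train()
trainer.save_model("imitation-model")
tok.save_pretrained("imitation-model")
```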

Initial observations seemed promising: the imitation models appeared markedly better at following instructions, and crowd workers rated their outputs as competitive with ChatGPT. However, more targeted automatic evaluations tell a different story: the imitation models mimic the surface style of ChatGPT but fall short on factual content. The underlying problem is that the imitation models' knowledge and capabilities remain markedly inferior, a gap that cannot be closed without impractically large amounts of imitation data or inherently more capable base models.
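The contrast between human preference ratings and targeted automatic checks can be illustrated with a small closed-book QA probe. The sketch below assumes a hypothetical handful of QA pairs and the fine-tuned checkpoint saved above; the paper itself uses established benchmarks (e.g., Natural Questions) rather than this toy list. The point is that style-focused human judgments can look favorable even when this kind of accuracy barely moves.

```python
# Hedged sketch of a targeted automatic factuality check (closed-book QA exact match).
# The qa_items list is hypothetical; a real evaluation would load a benchmark split.
import re
from transformers import pipeline

qa_items = [
    {"question": "What year did the first moon landing take place?", "answers": ["1969"]},
    {"question": "Who wrote the novel 1984?", "answers": ["George Orwell", "Orwell"]},
]

generate = pipeline("text-generation", model="imitation-model", max_new_tokens=32)

def normalize(s):
    # Lowercase and strip punctuation for a lenient containment match.
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

correct = 0
for item in qa_items:
    prompt = f"### Instruction:\n{item['question']}\n\n### Response:\n"
    output = generate(prompt)[0]["generated_text"][len(prompt):]
    correct += any(normalize(ans) in normalize(output) for ans in item["answers"])

print(f"closed-book accuracy: {correct / len(qa_items):.2f}")
```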

The paper posits that the central issue is a substantial capability gap between open-source and proprietary models, one that cannot be conveniently bridged through imitation alone. The authors argue that a more effective strategy is to improve the inherent capacity of open-source models, for example by scaling up base models or improving the quality of pre-training data.

The contribution of this work lies in its critical examination of imitation as a strategy: while imitation can mask weaker capabilities behind confident, well-organized responses, it does not substantially improve deeper model capabilities. The research also opens an important discussion of safety and toxicity transfer, suggesting that imitation models effectively inherit the safety and toxicity-handling characteristics of their source model, such as ChatGPT.

From a research and application standpoint, this work has substantive implications. It underscores the futility of relying solely on imitation to compete with proprietary systems, and instead advocates a more resource-intensive yet potentially more rewarding focus on foundational model development. For future AI development, the emphasis should rest on fundamentally improving the base models themselves rather than merely replicating the outputs of more powerful systems; the paper calls for this shift in research tactics because imitation, under current practices, is insufficient for achieving capability parity with proprietary LLMs.
