- The paper formalizes model equality testing as a two-sample problem to detect modifications in API-served language models.
- It employs a string kernel-based maximum mean discrepancy test, achieving a median power of 77.4% with just ten samples per prompt.
- Empirical evaluation across nine APIs reveals significant deviations in 11 of 31 endpoints, underscoring the need for API transparency.
Model Equality Testing: Determining API Model Fidelity
The paper "Model Equality Testing: Which Model Is This API Serving?" addresses a significant challenge faced by users interacting with LLMs through black-box inference APIs. These users often have limited visibility into whether the models served have been altered through quantization, fine-tuning, or other modifications. This issue can result in discrepancies between the model intended and what is actually delivered, affecting user experience and downstream applications. The authors propose a systematic approach to detect such discrepancies through a novel concept termed Model Equality Testing, leveraging Maximum Mean Discrepancy (MMD) for statistical analysis.
Key Contributions
- Formalization of Model Equality Testing: The authors formalize the detection of changes in a served model as a two-sample testing problem. Outputs sampled from the API are compared against samples from a reference distribution, typically the original model weights, to statistically detect deviations.
- MMD-Based Testing: The paper identifies MMD as a powerful statistic within this framework and proposes a string kernel-based MMD to handle the high-dimensional output distributions typical of LLMs, allowing users to audit APIs efficiently on their own prompts (see the sketch after this list).
- Empirical Validation Across Commercial APIs: The method is applied to APIs from nine providers offering inference for Meta's Llama models. The test flags 11 of 31 endpoints as deviating significantly from the reference distribution, highlighting gaps between the publicly released weights and the models these APIs actually serve.
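To make the string-kernel MMD concrete, the following is a minimal Python sketch of the statistic over two sets of sampled completions. The `string_kernel` here is a simple Hamming-style character similarity chosen for illustration; the paper's exact kernel (which may operate on token IDs rather than raw characters) can differ.

```python
import numpy as np

def string_kernel(x: str, y: str, max_len: int = 1000) -> float:
    """Toy similarity between two completions: the fraction of positions
    at which the (truncated, padded) strings agree. Illustrative only."""
    x, y = x[:max_len], y[:max_len]
    length = max(len(x), len(y))
    if length == 0:
        return 1.0
    x, y = x.ljust(length), y.ljust(length)
    return sum(a == b for a, b in zip(x, y)) / length

def mmd_squared(X: list[str], Y: list[str]) -> float:
    """Unbiased estimate of MMD^2 between two samples of completions
    (each sample should contain at least two strings)."""
    k_xx = np.mean([string_kernel(a, b) for i, a in enumerate(X)
                    for j, b in enumerate(X) if i != j])
    k_yy = np.mean([string_kernel(a, b) for i, a in enumerate(Y)
                    for j, b in enumerate(Y) if i != j])
    k_xy = np.mean([string_kernel(a, b) for a in X for b in Y])
    return float(k_xx + k_yy - 2.0 * k_xy)
```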
Numerical Results and Analysis
The authors provide compelling numerical results demonstrating the efficacy of their approach. Specifically, the MMD test with a simple string kernel achieves a median power of 77.4% against several distortions with just ten samples per prompt, suggesting that even a modest sampling budget can reliably surface API discrepancies.
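How a reject/accept decision is made from the statistic is worth spelling out. The sketch below calibrates the test with a standard permutation procedure, reusing the `mmd_squared` estimator from the earlier sketch; this calibration is a generic choice for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def permutation_pvalue(X: list[str], Y: list[str],
                       n_perm: int = 1000, seed: int = 0) -> float:
    """Permutation p-value for the two-sample MMD test: pool the
    completions, repeatedly reshuffle them into two groups, and count
    how often the permuted statistic meets or exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = mmd_squared(X, Y)
    pooled = list(X) + list(Y)
    n = len(X)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        Xp = [pooled[i] for i in idx[:n]]
        Yp = [pooled[i] for i in idx[n:]]
        if mmd_squared(Xp, Yp) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # reject H0 when this falls below alpha
```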
Furthermore, the paper extends the utility of MMD beyond a binary accept/reject decision, demonstrating its effectiveness in estimating statistical distances between different black-box endpoints. This allows researchers to compare output distributions across LLM providers, revealing how far a served model drifts from a declared or reference model (a usage sketch follows below).
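As an illustration of this use, one could estimate a distance between two endpoints that claim to serve the same model by averaging the per-prompt MMD² between their sampled completions. The sampler callables and the per-prompt averaging below are assumptions made for this sketch, not the paper's exact aggregation.

```python
from typing import Callable

def endpoint_distance(sample_a: Callable[[str, int], list[str]],
                      sample_b: Callable[[str, int], list[str]],
                      prompts: list[str],
                      n_per_prompt: int = 10) -> float:
    """Average estimated MMD^2 between two endpoints over a prompt set.
    `sample_a` / `sample_b` are user-supplied functions that query each
    API n times per prompt (temperature > 0 so completions vary)."""
    distances = []
    for prompt in prompts:
        X = sample_a(prompt, n_per_prompt)
        Y = sample_b(prompt, n_per_prompt)
        distances.append(mmd_squared(X, Y))  # estimator from the earlier sketch
    return sum(distances) / len(distances)
```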
Implications and Future Outlook
The outlined testing framework has both practical and theoretical implications. Practically, it empowers users to independently verify the fidelity of APIs, leading to improved trust and transparency in commercial services. Theoretically, it opens avenues for further research into robust auditing mechanisms for LLMs, especially as these models become integrated into increasingly sensitive and complex applications.
Future research may develop more sophisticated string kernels or alternative statistical tests to improve the sensitivity of model equality testing, especially in low-sample regimes or against subtler model modifications. Closer collaboration between academia and industry could also yield standards and frameworks for model quality assurance that are vital for maintaining integrity and trust in the growing AI ecosystem.
In conclusion, "Model Equality Testing: Which Model Is This API Serving?" presents a robust approach to detecting deviations in API-served models, offering critical tools for researchers and practitioners concerned with model transparency and fidelity in an emerging AI landscape. The insights derived from their empirical work lay foundational steps toward more accountable and transparent AI deployment through commercial APIs.