
Revisiting Data Auditing in Large Vision-Language Models (2504.18349v1)

Published 25 Apr 2025 in cs.CV and cs.CR

Abstract: With the surge of LLMs, Large Vision-Language Models (VLMs), which integrate vision encoders with LLMs for accurate visual grounding, have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios, fine-tuning, access to ground-truth texts, and set-based inference, where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.

Summary

Revisiting Data Auditing in Large Vision-Language Models

The paper "Revisiting Data Auditing in Large Vision-Language Models" provides a comprehensive examination of the process and challenges associated with data auditing in Vision-Language Models (VLMs). These models integrate vision encoders with LLMs to achieve advanced visual grounding and have significant applications in areas such as generalist agents and robotic control. The paper addresses critical concerns regarding the legality of data used in training VLMs, such as copyright infringement and privacy breaches, and emphasizes the need for effective data auditing mechanisms.

Membership Inference Challenges and Biases

Membership inference (MI) is a primary technique used in data auditing to determine whether specific data samples were part of a model's training set. MI has demonstrated promising results on open-source VLMs such as LLaVA, achieving an AUC greater than 80%. However, the paper identifies a major flaw in existing MI benchmarks: distribution shifts between member and non-member datasets that artificially inflate MI performance by introducing bias. These shifts stem from temporal changes, discrepancies between real and synthetic image sources, or inherent differences in distribution between datasets. As a result, current MI methods can report strong performance by exploiting these distributional shortcuts rather than any genuine signal of overfitting or memorization.
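To make the setup concrete, the sketch below shows the classic loss-based membership score (lower loss on a sample suggests membership) evaluated with AUC. This is an illustration of the general MI recipe, not the paper's specific attack; the loss values here are synthetic stand-ins for, say, caption negative log-likelihoods obtained by querying a VLM.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mi_score(per_sample_losses):
    # Classic loss-based membership score: lower loss -> more likely a member.
    return -np.asarray(per_sample_losses)

# Hypothetical losses (placeholders for caption NLLs from a real VLM).
rng = np.random.default_rng(0)
member_losses = rng.normal(loc=2.0, scale=0.5, size=500)     # slightly lower
nonmember_losses = rng.normal(loc=2.1, scale=0.5, size=500)  # slightly higher

scores = np.concatenate([mi_score(member_losses), mi_score(nonmember_losses)])
labels = np.concatenate([np.ones(500), np.zeros(500)])       # 1 = member

# With weak membership signals, AUC sits only marginally above 0.5.
print(f"AUC: {roc_auc_score(labels, scores):.3f}")
```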

Distribution Discrepancy Analysis

The researchers propose an optimal-transport-based metric, named WiRED, to quantify distribution discrepancies between member and non-member sets. WiRED measures distributional bias in visual datasets via the Wasserstein distance between dataset samples, offering an efficient way to surface unintended biases. This analysis shows that existing MI benchmarks have consistently exploited such distribution shortcuts, calling the reliability of reported membership inference results into question.
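The paper's exact WiRED definition is not reproduced here, but the underlying idea can be sketched as follows: embed both image sets with a frozen encoder and compute an optimal-transport (Wasserstein) distance between the resulting point clouds. For equal-size sets with uniform weights, exact OT reduces to an assignment problem, which SciPy solves directly; the embeddings below are random placeholders for real encoder features.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def wasserstein_point_clouds(X, Y):
    """Exact 1-Wasserstein distance between two equal-size point clouds
    with uniform weights, solved as an assignment problem."""
    C = cdist(X, Y, metric="euclidean")   # pairwise ground costs
    row, col = linear_sum_assignment(C)   # optimal one-to-one transport plan
    return C[row, col].mean()

# Hypothetical image embeddings (e.g., from a frozen vision encoder).
rng = np.random.default_rng(0)
members = rng.normal(0.0, 1.0, size=(256, 64))
nonmembers = rng.normal(0.2, 1.0, size=(256, 64))  # shifted distribution

# A large value signals a distribution gap that MI could exploit as a shortcut.
print(f"OT distance: {wasserstein_point_clouds(members, nonmembers):.3f}")
```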

Evaluation of Membership Inference Techniques

Under unbiased conditions with i.i.d. member and non-member datasets, current MI approaches fail to deliver reliable performance, scoring only marginally better than random guessing. Probing the Bayes-optimal classifier within the VLM's embedding space points to the same conclusion: the irreducible error rate remains high, so even an ideal detector could not reliably separate members from non-members. This underscores how faint membership signals are in VLM outputs and casts a pessimistic light on MI viability with conventional methods.
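Constructing such an unbiased benchmark is conceptually simple: draw member and non-member sets from a single homogeneous pool so that no distributional cue separates them. A minimal sketch, where the pool indices stand in for a real image corpus:

```python
import numpy as np

def iid_member_split(indices, n_member, seed=0):
    """Split one homogeneous pool into member / non-member halves,
    so both sides are drawn i.i.d. from the same source distribution."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    return shuffled[:n_member], shuffled[n_member:]

pool = np.arange(10_000)  # indices into a single image corpus
members, nonmembers = iid_member_split(pool, n_member=5_000)
```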

Exploration of Practical Data Auditing Scenarios

Despite these challenges, the paper identifies several practical scenarios where membership inference remains feasible and useful. These include multi-epoch fine-tuning, access to the original ground-truth texts, and aggregation-based set inference (sketched below). Each scenario restores a usable signal: repeated exposure during fine-tuning strengthens memorization, ground-truth texts remove a key confound, and set-level aggregation accumulates weak per-sample evidence, making MI a workable tool for data auditing in these settings.
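As an illustration of the set-based scenario, the sketch below aggregates weak per-sample MI scores into a set-level statistic and assesses it with a one-sided permutation test against non-member references. This is a generic aggregation recipe under assumed inputs, not the paper's specific procedure.

```python
import numpy as np

def set_level_score(sample_scores):
    # Averaging shrinks per-sample noise by ~1/sqrt(n), so weak
    # individual signals can become detectable at the set level.
    return float(np.mean(sample_scores))

def set_inference(candidate_scores, reference_scores, n_perm=10_000, seed=0):
    """One-sided permutation test: is the candidate set's mean MI score
    higher than expected under the non-member reference distribution?"""
    rng = np.random.default_rng(seed)
    observed = set_level_score(candidate_scores)
    pooled = np.concatenate([candidate_scores, reference_scores])
    k = len(candidate_scores)
    null = np.array([
        set_level_score(rng.choice(pooled, size=k, replace=False))
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```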

Conclusion and Implications

This work ultimately provides a systematic view of the limits and opportunities of using membership inference for trustworthy data auditing of VLMs. It encourages future research to strengthen MI methods, build unbiased benchmarks, and extend the analysis to additional modalities, bringing comprehensive data-transparency practices into AI development. These insights could shape how the field addresses privacy concerns and data legality as large vision-language models become increasingly prevalent.
