
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation (2312.00645v2)

Published 1 Dec 2023 in cs.LG, cs.CR, and cs.SE

Abstract: There is a growing need to gain insight into LLM capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating LLMs in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.


Summary

  • The paper introduces the hashmarking protocol, which transforms correct answers using cryptographic hashing and salting to safeguard sensitive evaluation data.
  • It describes hardening measures, namely unique per-answer salting and deliberately slow hashing, that raise the cost of brute-force and rainbow table attacks to prohibitive levels.
  • The paper discusses limitations and emerging threats, emphasizing the importance of unambiguous questions and potential cryptographic enhancements.

Introduction to Privacy-Preserving Benchmarks

Privacy-preserving benchmarks for LLMs are critical in sensitive domains, such as bioterrorism or cyberwarfare, where disclosing the correct answers can have serious consequences. Traditional benchmarks, which publish correct answers in human-readable form, risk serving as public compendia on these very topics. A new approach is needed that allows AI systems to be evaluated on such subjects without revealing the correct answers.

The Hashmarking Protocol

Concept and Motivation

Hashmarking offers a way to evaluate models on sensitive domains through benchmarks whose reference answers are published only in cryptographically hashed form. A hash function transforms data of any length into a fixed-length digest that cannot feasibly be inverted to recover the original input. Models can therefore be scored against the benchmark while the sensitive answers themselves remain concealed.
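To make this concrete, here is a minimal sketch in Python (the paper does not prescribe a particular implementation, and the answer string is a placeholder):

```python
import hashlib

# Hashing maps an answer of any length to a fixed-length digest; inverting
# the digest to recover the plaintext is computationally infeasible.
answer = "example sensitive answer"  # placeholder, not a real benchmark item
digest = hashlib.sha256(answer.encode("utf-8")).hexdigest()
print(len(digest), digest)  # always 64 hex characters, whatever the input length
```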

Implementation and Security

Experts first contribute questions and answers within their domain. The correct answers are hashed and salted—adding a unique string to each answer before hashing—to prevent precomputed attack strategies like rainbow tables. These hashed answers are then collected and filtered for quality and consensus before being shared openly, allowing for model testing without exposing sensitive information.
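A hedged sketch of that flow follows; the function names and the normalization rule (stripping whitespace and lowercasing) are illustrative choices, not details fixed by the paper:

```python
import hashlib
import os

def make_hashmark_entry(answer: str) -> tuple[bytes, str]:
    """Hash a reference answer under a unique random salt.

    Only the question, the salt, and the digest are published; the
    plaintext answer never leaves the contributing expert.
    """
    normalized = answer.strip().lower()  # canonical form, since matching is exact
    salt = os.urandom(16)                # per-entry salt defeats precomputed tables
    digest = hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()
    return salt, digest

def check_model_answer(candidate: str, salt: bytes, digest: str) -> bool:
    """Score a model's answer by re-hashing it with the published salt."""
    normalized = candidate.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest() == digest

salt, digest = make_hashmark_entry("Example Answer")
assert check_model_answer("example answer", salt, digest)
```

Because verification is exact-match re-hashing, anyone holding the published salt and digest can score a model without ever seeing the reference answer.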

Security Measures

Hashmarks are hardened against traditional attack vectors, such as brute-force, dictionary, and rainbow table attacks, by combining unique per-answer salts with deliberately slow hash functions, which multiply the computation and memory needed to test each candidate answer. Recovering answers by exhaustive search would therefore require prohibitive resources, disincentivizing would-be attackers.
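As an illustration, a memory-hard key derivation function such as scrypt (available in Python's standard library) can stand in for the fast hash above; the cost parameters here are illustrative, not values fixed by the paper:

```python
import hashlib
import os

salt = os.urandom(16)
digest = hashlib.scrypt(
    b"reference answer",  # placeholder plaintext
    salt=salt,
    n=2**14,   # CPU/memory cost factor; raising it slows every guess
    r=8,       # block size
    p=1,       # parallelization factor
    dklen=32,  # length of the derived digest in bytes
)
```

Each verification now costs roughly tens of milliseconds and several megabytes of memory rather than microseconds, which is negligible for an honest evaluator testing one answer but ruinous for an attacker testing billions.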

Limitations and Emerging Threats

Certain precautions are necessary when creating hashmarks. Questions should have unambiguous answers, so that a correct response reproduces the reference string exactly, yet those answers must carry enough entropy that they cannot be recovered by enumerating plausible candidates. Moreover, as AI systems become increasingly capable, novel threats may emerge: a sufficiently capable generative model might progressively concentrate probability on candidate answers until it effectively learns the correct information without it ever being disclosed. Hashmarks also cannot guard against deception by generative models or false reporting of results by third parties. Finally, attention hazards and phenomena like the Streisand effect require careful management, since merely publishing the questions can draw attention to sensitive topics.
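The entropy requirement can be seen directly: if the answer space is small, even a salted hash falls to simple enumeration. A hypothetical sketch, where the entry and candidate list are invented for illustration:

```python
import hashlib

# A published hashmark entry: salt and digest (both invented for this example).
salt = bytes.fromhex("00112233445566778899aabbccddeeff")
published_digest = hashlib.sha256(salt + b"agent x").hexdigest()

# With only a handful of plausible answers, an attacker just tries them all.
candidates = ["agent x", "agent y", "agent z"]
for guess in candidates:
    if hashlib.sha256(salt + guess.encode("utf-8")).hexdigest() == published_digest:
        print("recovered:", guess)
        break
```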

Conclusion and Future Outlook

Hashmarking represents a promising direction for safely assessing the risks and capabilities of AI systems with respect to sensitive information. While not infallible, it provides a robust protocol that balances the need for open model evaluation against the imperative of information security. Future iterations may incorporate advanced cryptographic techniques to further strengthen the protocol against novel and sophisticated attacks, and the paper positions the approach as one of many tools needed to evaluate high-stakes AI capabilities responsibly.
