
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation (2312.00645v2)

Published 1 Dec 2023 in cs.LG, cs.CR, and cs.SE

Abstract: There is a growing need to gain insight into LLM capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating LLMs in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.


Summary

  • The paper introduces the hashmarking protocol, which transforms correct answers using cryptographic hashing and salting to safeguard sensitive evaluation data.
  • It describes hardening measures, namely unique per-answer salting and deliberately slow hashing, that raise the cost of brute-force and rainbow table attacks to prohibitive levels.
  • The paper discusses limitations and emerging threats, emphasizing the importance of unambiguous questions and potential cryptographic enhancements.

Introduction to Privacy-Preserving Benchmarks

Privacy-preserving benchmarks for LLMs are critical in sensitive domains, such as bioterrorism or cyberwarfare, where disclosing the correct answers can have serious consequences. Traditional benchmarks, which publish correct answers in human-readable form, risk serving as public compendia on these very topics. A new approach is needed that allows AI systems to be evaluated on such subjects without revealing the correct answers.

The Hashmarking Protocol

Concept and Motivation

Hashmarking offers a way to evaluate models on sensitive domains through benchmarks whose reference answers are published only in cryptographically hashed form. A hash function transforms data of any length into a fixed-length digest that cannot feasibly be inverted to recover the original input. Models can therefore be scored against the benchmark while the sensitive answers themselves remain concealed.
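To make this concrete, here is a minimal sketch in Python (the paper does not prescribe a particular implementation, and the answer string is a placeholder):

```python
import hashlib

# Hashing maps an answer of any length to a fixed-length digest; inverting
# the digest to recover the plaintext is computationally infeasible.
answer = "example sensitive answer"  # placeholder, not a real benchmark item
digest = hashlib.sha256(answer.encode("utf-8")).hexdigest()
print(len(digest), digest)  # always 64 hex characters, whatever the input length
```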

Implementation and Security

Experts first contribute questions and answers within their domain. The correct answers are hashed and salted—adding a unique string to each answer before hashing—to prevent precomputed attack strategies like rainbow tables. These hashed answers are then collected and filtered for quality and consensus before being shared openly, allowing for model testing without exposing sensitive information.
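A hedged sketch of that flow follows; the function names and the normalization rule (stripping whitespace and lowercasing) are illustrative choices, not details fixed by the paper:

```python
import hashlib
import os

def make_hashmark_entry(answer: str) -> tuple[bytes, str]:
    """Hash a reference answer under a unique random salt.

    Only the question, the salt, and the digest are published; the
    plaintext answer never leaves the contributing expert.
    """
    normalized = answer.strip().lower()  # canonical form, since matching is exact
    salt = os.urandom(16)                # per-entry salt defeats precomputed tables
    digest = hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()
    return salt, digest

def check_model_answer(candidate: str, salt: bytes, digest: str) -> bool:
    """Score a model's answer by re-hashing it with the published salt."""
    normalized = candidate.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest() == digest

salt, digest = make_hashmark_entry("Example Answer")
assert check_model_answer("example answer", salt, digest)
```

Because verification is exact-match re-hashing, anyone holding the published salt and digest can score a model without ever seeing the reference answer.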

Security Measures

Hashmarks are hardened against traditional attack vectors, such as brute-force, dictionary, and rainbow table attacks, by combining unique per-answer salts with deliberately slow hash functions, which multiply the computation and memory needed to test each candidate answer. Recovering answers by exhaustive search would therefore require prohibitive resources, disincentivizing would-be attackers.
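As an illustration, a memory-hard key derivation function such as scrypt (available in Python's standard library) can stand in for the fast hash above; the cost parameters here are illustrative, not values fixed by the paper:

```python
import hashlib
import os

salt = os.urandom(16)
digest = hashlib.scrypt(
    b"reference answer",  # placeholder plaintext
    salt=salt,
    n=2**14,   # CPU/memory cost factor; raising it slows every guess
    r=8,       # block size
    p=1,       # parallelization factor
    dklen=32,  # length of the derived digest in bytes
)
```

Each verification now costs roughly tens of milliseconds and several megabytes of memory rather than microseconds, which is negligible for an honest evaluator testing one answer but ruinous for an attacker testing billions.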

Limitations and Emerging Threats

Certain precautions are necessary when creating hashmarks. Questions should have unambiguous answers, so that a correct response reproduces the reference string exactly, yet those answers must carry enough entropy that they cannot be recovered by enumerating plausible candidates. Moreover, as AI systems become increasingly capable, novel threats may emerge: a sufficiently capable generative model might progressively concentrate probability on candidate answers until it effectively learns the correct information without it ever being disclosed. Hashmarks also cannot guard against deception by generative models or false reporting of results by third parties. Finally, attention hazards and phenomena like the Streisand effect require careful management, since merely publishing the questions can draw attention to sensitive topics.
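The entropy requirement can be seen directly: if the answer space is small, even a salted hash falls to simple enumeration. A hypothetical sketch, where the entry and candidate list are invented for illustration:

```python
import hashlib

# A published hashmark entry: salt and digest (both invented for this example).
salt = bytes.fromhex("00112233445566778899aabbccddeeff")
published_digest = hashlib.sha256(salt + b"agent x").hexdigest()

# With only a handful of plausible answers, an attacker just tries them all.
candidates = ["agent x", "agent y", "agent z"]
for guess in candidates:
    if hashlib.sha256(salt + guess.encode("utf-8")).hexdigest() == published_digest:
        print("recovered:", guess)
        break
```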

Conclusion and Future Outlook

Hashmarking represents a promising direction for safely assessing the risks and capabilities of AI systems with respect to sensitive information. While not infallible, it provides a robust protocol that balances the need for open model evaluation against the imperative of information security. Future iterations may incorporate advanced cryptographic techniques to further strengthen the protocol against novel and sophisticated attacks, and the paper positions the approach as one of many tools needed to evaluate high-stakes AI capabilities responsibly.
