I still know it's you! On Challenges in Anonymizing Source Code (2208.12553v2)
Abstract: The source code of a program not only defines its semantics but also contains subtle clues that can identify its author. Several studies have shown that these clues can be automatically extracted using machine learning and allow for determining a program's author among hundreds of programmers. This attribution poses a significant threat to developers of anti-censorship and privacy-enhancing technologies, as they become identifiable and may be prosecuted. The ideal protection against this threat would be the anonymization of source code. However, neither theoretical nor practical principles of such anonymization have been explored so far. In this paper, we tackle this problem and develop a framework for reasoning about code anonymization. We prove that the task of generating a $k$-anonymous program -- a program that cannot be attributed to one of $k$ authors -- is not computable in the general case. As a remedy, we introduce a relaxed concept called $k$-uncertainty, which enables us to measure the protection of developers. Based on this concept, we empirically study candidate techniques for anonymization, such as code normalization, coding style imitation, and code obfuscation. We find that none of the techniques provides sufficient protection when the attacker is aware of the anonymization. While we observe a notable reduction in attribution performance on real-world code, reliable protection is not achieved for all developers. We conclude that code anonymization is a hard problem that requires further attention from the research community.