Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy (2405.09306v1)
Abstract: Ensuring the effectiveness of search queries while protecting user privacy remains an open issue. When an Information Retrieval System (IRS) does not protect the privacy of its users, sensitive information may be disclosed through the queries sent to the system. Recent advances, especially in NLP, have shown the potential of using Differential Privacy to obfuscate texts while maintaining satisfactory effectiveness. However, such approaches may protect the user's privacy only from a theoretical perspective: in practice, the user's real information need can still be inferred if the perturbed terms are too semantically similar to the original ones. We overcome this limitation by proposing Word Blending Boxes (WBB), a novel differentially private mechanism for query obfuscation, which protects the words in user queries by employing safe boxes. To evaluate the proposed WBB mechanism, we measure the privacy obtained by the obfuscation process, i.e., the lexical and semantic similarity between original and obfuscated queries. Moreover, we assess the effectiveness of the privatized queries in retrieving relevant documents from the IRS. Our findings indicate that WBB can be integrated effectively into existing IRSs, offering a solution to the challenge of protecting user privacy from both a theoretical and a practical point of view.
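As a rough illustration of what a lexical similarity check between an original and an obfuscated query could look like, the sketch below computes the Jaccard overlap of their term sets. The abstract does not name the specific metrics used, so the choice of Jaccard, the whitespace tokenization, and the function name `jaccard_similarity` are all assumptions made for illustration, not the paper's evaluation procedure.

```python
def jaccard_similarity(original: str, obfuscated: str) -> float:
    """Jaccard overlap between the term sets of two queries.

    Illustrative assumption: terms are lowercased, whitespace-separated tokens.
    Returns a value in [0, 1]; lower values mean fewer shared terms.
    """
    original_terms = set(original.lower().split())
    obfuscated_terms = set(obfuscated.lower().split())
    if not original_terms and not obfuscated_terms:
        # Two empty queries are treated as identical.
        return 1.0
    shared = original_terms & obfuscated_terms
    combined = original_terms | obfuscated_terms
    return len(shared) / len(combined)


if __name__ == "__main__":
    # Hypothetical query pair, not taken from the paper.
    print(jaccard_similarity("cheap flights to rome", "cheap trains to rome"))
```

Under this reading, a lower score suggests that the obfuscated query exposes fewer of the original terms. A semantic check would additionally need word or sentence embeddings, which are not sketched here.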