Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents (2305.13303v3)
Abstract: Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels. However, all unsupervised approaches still leave a large margin for improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
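To make the idea of a token-level difference score more concrete, here is a minimal sketch of one way an alignment-based score could be computed. It is an illustration under stated assumptions, not necessarily the formulation used in the paper: each token is assumed to already have a contextual embedding (for example, from a masked language model), and a token's score is taken to be one minus its highest cosine similarity to any token in the other document. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_difference_scores(emb_a, emb_b):
    """Score each token of document A by how poorly it aligns with document B.

    Assumption for illustration: score = 1 - max cosine similarity to any
    token embedding in B. Higher scores suggest a semantic difference.
    """
    if not emb_b:
        # Nothing to align to: treat every token as maximally different.
        return [1.0 for _ in emb_a]
    return [1.0 - max(cosine(a, b) for b in emb_b) for a in emb_a]


# Toy 2-dimensional "embeddings" (hypothetical values, not model outputs).
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # two tokens in document A
doc_b = [[1.0, 0.0]]              # one token in document B

scores = alignment_difference_scores(doc_a, doc_b)
```

In this toy case the first token of document A has an exact counterpart in document B and receives a score near 0, while the second token has no similar counterpart and receives a score near 1. Scores of this kind can then be compared against gold labels with a correlation measure, in line with the regression framing described in the abstract.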