Distilling Multi-Scale Knowledge for Event Temporal Relation Extraction (2209.00568v3)
Abstract: Event Temporal Relation Extraction (ETRE) is paramount but challenging. Within a discourse, event pairs are situated at different distances, or so-called proximity bands. The temporal ordering communicated about event pairs at more remote (i.e., "long") versus less remote (i.e., "short") proximity bands is encoded differently. SOTA models have tended to perform well on events situated at either short or long proximity bands, but not both. However, real-world, natural texts contain event pairs across all proximity bands. In this paper, we present MulCo: Distilling Multi-Scale Knowledge via Contrastive Learning, a knowledge co-distillation approach that shares knowledge across multiple event pair proximity bands to improve performance on all types of temporal datasets. Our experimental results show that MulCo successfully integrates linguistic cues pertaining to temporal reasoning across both short and long proximity bands and achieves new state-of-the-art results on several ETRE benchmark datasets.
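To make the abstract's two ingredients concrete, below is a minimal PyTorch sketch of (1) bucketing an event pair into a proximity band by sentence distance and (2) a symmetric, InfoNCE-style contrastive objective in which a short-range encoder and a long-range encoder distill into each other. The sentence-distance threshold, the encoder pairing, and the exact loss form are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Illustrative sketch only: the band threshold, encoder pairing, and
# temperature are assumptions, not MulCo's published implementation.
import torch
import torch.nn.functional as F


def proximity_band(sent_idx_e1: int, sent_idx_e2: int, threshold: int = 2) -> str:
    """Bucket an event pair into a 'short' or 'long' proximity band by
    sentence distance (the threshold of 2 is a hypothetical choice)."""
    return "short" if abs(sent_idx_e1 - sent_idx_e2) <= threshold else "long"


def contrastive_codistillation_loss(z_short: torch.Tensor,
                                    z_long: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss that pulls together the two encoders' embeddings of
    the SAME event pair and pushes apart embeddings of different pairs in the
    batch. z_short / z_long: (batch, dim) representations of the same event
    pairs from a short-range and a long-range encoder, respectively."""
    z_short = F.normalize(z_short, dim=-1)
    z_long = F.normalize(z_long, dim=-1)
    logits = z_short @ z_long.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_short.size(0), device=z_short.device)
    # Symmetric form: each encoder acts as "teacher" for the other,
    # i.e., knowledge is co-distilled rather than flowing one way.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this reading, the contrastive term is what lets cues learned on one proximity band shape the representation used on the other, which is the sharing of knowledge across bands that the abstract describes.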
Authors: Hao-Ren Yao, Luke Breitfeller, Aakanksha Naik, Chunxiao Zhou, Carolyn Rose