- The paper introduces a novel knowledge distillation framework using Sinkhorn Divergence to better align monolingual and cross-lingual representations.
- It demonstrates significant performance improvements over state-of-the-art methods on distant language pairs such as English-Chinese, English-Arabic, and Japanese-English.
- Sinkhorn Divergence robustly captures geometric relationships between hidden states, effectively mitigating issues in representing languages with distinct morphological features.
The paper "Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation" (Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation, 2021) addresses challenges in cross-lingual summarization (CLS), particularly when dealing with distant languages exhibiting disparate morphological and structural features. The paper posits that contemporary state-of-the-art models, which commonly employ multi-task learning paradigms and self-attention mechanisms, are often inadequate in capturing crucial cross-lingual representations, leading to diminished performance, especially when applied to languages with distinct characteristics. To mitigate this, the paper introduces a Knowledge Distillation (KD) framework leveraging Sinkhorn Divergence, an Optimal-Transport distance, to estimate the representational discrepancy between monolingual teacher and cross-lingual student models.
Problem Statement and Motivation
Existing CLS models often rely on shared vocabulary modules and self-attention mechanisms to correlate tokens across languages. However, the paper argues that the correlations learned by self-attention are frequently loose and implicit, rendering them inefficient in capturing robust cross-lingual representations. This issue is exacerbated when processing languages with significant morphological or structural divergence, which complicates cross-lingual alignment and results in a notable performance drop. The authors motivate the use of knowledge distillation to explicitly construct cross-lingual correlations by transferring knowledge from a monolingual summarization teacher to a cross-lingual summarization student, thereby enhancing the alignment between languages.
Proposed Methodology
The proposed KD framework consists of two primary components: a monolingual summarization teacher model and a cross-lingual summarization student model. The teacher model is initially fine-tuned on monolingual document/summary pairs. Subsequently, the knowledge encapsulated within the teacher model is distilled into the student model, explicitly fostering cross-lingual correlation.
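The paper does not publish a reference implementation of this training loop, so the following is a minimal PyTorch-style sketch of one student update under the assumption that the fine-tuned teacher exposes an `encode` method and that the student returns both its hidden states and vocabulary logits; the names `distillation_step`, `teacher.encode`, `kd_weight`, and `sinkhorn_ot` are hypothetical. The `sinkhorn_ot` loss itself is sketched after the equation in the next subsection.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, mono_batch, cls_batch, kd_weight=1.0):
    """One student update: cross-lingual summarization loss + OT-based KD loss."""
    # The monolingual teacher is fine-tuned beforehand and frozen during distillation.
    with torch.no_grad():
        teacher_hidden = teacher.encode(mono_batch["input_ids"])        # (n, d)

    # The cross-lingual student summarizes the source document into the
    # target language; we assume it also returns its hidden states.
    student_hidden, logits = student(cls_batch["input_ids"],
                                     cls_batch["decoder_input_ids"])    # (m, d), (m, V)

    # Standard cross-entropy loss on the target-language summary tokens.
    ce_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              cls_batch["labels"].view(-1))

    # KD term: OT distance between the teacher's monolingual hidden states
    # and the student's cross-lingual hidden states (see sinkhorn_ot below).
    kd_loss = sinkhorn_ot(teacher_hidden, student_hidden)

    return ce_loss + kd_weight * kd_loss
```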
Sinkhorn Divergence for Knowledge Distillation
A core contribution of this work is the application of Sinkhorn Divergence, an Optimal-Transport distance, as the KD loss function. This choice is predicated on the observation that the teacher and student representations reside in distinct vector spaces (i.e., monolingual versus cross-lingual). Sinkhorn Divergence offers several advantages in this context:
- It does not necessitate that the distributions reside within the same probability space.
- It exhibits robustness to noise.
- It does not impose constraints on sample size.
- It effectively captures geometric relationships between hidden states.
These properties enable the student model to productively align its cross-lingual hidden states with the monolingual hidden states of the teacher, thus establishing a robust correlation between distant languages. The Sinkhorn divergence is computed as follows:
$$\text{Sinkhorn Divergence} = \mathrm{OT}_{\epsilon}(a, b) = \min_{P \in \Pi(a, b)} \langle P, C \rangle - \epsilon H(P)$$
where a and b are probability vectors representing the teacher and student hidden states, C is a cost matrix, ϵ is a regularization parameter, H(P) is the entropy of the transport plan P, and Π(a,b) is the set of transport plans with marginals a and b.
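As a concrete illustration of how this objective can be evaluated, below is a minimal log-domain Sinkhorn iteration in PyTorch between two sets of hidden states, assuming uniform marginals and a squared-Euclidean cost matrix; the function name, the choice of cost, the number of iterations, and the value of ε are illustrative assumptions rather than the paper's exact configuration.

```python
import math
import torch

def sinkhorn_ot(x, y, epsilon=0.1, n_iters=50):
    """Entropic OT objective <P, C> - eps * H(P) between hidden-state sets
    x (n, d) and y (m, d), computed via log-domain Sinkhorn iterations."""
    n, m = x.size(0), y.size(0)
    # Uniform marginals a and b over the teacher and student hidden states.
    log_a = torch.full((n,), -math.log(n), device=x.device)
    log_b = torch.full((m,), -math.log(m), device=x.device)
    # Cost matrix C: pairwise squared Euclidean distances.
    C = torch.cdist(x, y, p=2) ** 2
    # Dual potentials (log-domain scaling vectors), updated alternately.
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        f = epsilon * (log_a - torch.logsumexp((g[None, :] - C) / epsilon, dim=1))
        g = epsilon * (log_b - torch.logsumexp((f[:, None] - C) / epsilon, dim=0))
    # Transport plan P with marginals (approximately) a and b.
    log_P = (f[:, None] + g[None, :] - C) / epsilon
    P = log_P.exp()
    transport_cost = (P * C).sum()             # <P, C>
    entropy = -(P * (log_P - 1.0)).sum()       # H(P) = -sum_ij P_ij (log P_ij - 1)
    return transport_cost - epsilon * entropy  # OT_eps(a, b)
```

If the debiased form of the Sinkhorn Divergence is preferred, the self-transport terms OT_ε(a, a) and OT_ε(b, b) can be subtracted from this quantity; the sketch above reports only the entropic OT objective shown in the equation.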
Experimental Results
The efficacy of the proposed methodology is substantiated through experimentation on CLS datasets comprising pairs of distant languages (e.g., English-to-Chinese, English-to-Arabic, Japanese-to-English). The experimental results demonstrate that the proposed method surpasses state-of-the-art models in both high-resource and low-resource settings, as evaluated by automatic metrics such as ROUGE. Furthermore, human evaluation experiments indicate a preference for the summaries generated by the proposed model. Comparative analyses against alternative KD loss functions, such as cosine similarity and mean squared error, underscore the superiority of Sinkhorn Divergence. A case study further illustrates the model's enhanced ability to preserve key information from the original documents compared to baseline models.