- The paper introduces a novel knowledge distillation framework using Sinkhorn Divergence to better align monolingual and cross-lingual representations.
- It demonstrates significant performance improvements over state-of-the-art methods on distant language pairs such as English-Chinese, English-Arabic, and Japanese-English.
- Sinkhorn Divergence robustly captures geometric relationships between hidden states, effectively mitigating issues in representing languages with distinct morphological features.
The paper "Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation" (Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation, 2021) addresses challenges in cross-lingual summarization (CLS), particularly when dealing with distant languages exhibiting disparate morphological and structural features. The paper posits that contemporary state-of-the-art models, which commonly employ multi-task learning paradigms and self-attention mechanisms, are often inadequate in capturing crucial cross-lingual representations, leading to diminished performance, especially when applied to languages with distinct characteristics. To mitigate this, the paper introduces a Knowledge Distillation (KD) framework leveraging Sinkhorn Divergence, an Optimal-Transport distance, to estimate the representational discrepancy between monolingual teacher and cross-lingual student models.
Problem Statement and Motivation
Existing CLS models often rely on shared vocabulary modules and self-attention mechanisms to correlate tokens across languages. However, the paper argues that the correlations learned by self-attention are frequently loose and implicit, rendering them inefficient in capturing robust cross-lingual representations. This issue is exacerbated when processing languages with significant morphological or structural divergence, which complicates cross-lingual alignment and results in a notable performance drop. The authors motivate the use of knowledge distillation to explicitly construct cross-lingual correlations by transferring knowledge from a monolingual summarization teacher to a cross-lingual summarization student, thereby enhancing the alignment between languages.
Proposed Methodology
The proposed KD framework consists of two primary components: a monolingual summarization teacher model and a cross-lingual summarization student model. The teacher model is initially fine-tuned on monolingual document/summary pairs. Subsequently, the knowledge encapsulated within the teacher model is distilled into the student model, explicitly fostering cross-lingual correlation.
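The paper does not publish a reference implementation of this training loop, so the following is a minimal PyTorch-style sketch of one student update under the assumption that the fine-tuned teacher exposes an `encode` method and that the student returns both its hidden states and vocabulary logits; the names `distillation_step`, `teacher.encode`, `kd_weight`, and `sinkhorn_ot` are hypothetical. The `sinkhorn_ot` loss itself is sketched after the equation in the next subsection.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, mono_batch, cls_batch, kd_weight=1.0):
    """One student update: cross-lingual summarization loss + OT-based KD loss."""
    # The monolingual teacher is fine-tuned beforehand and frozen during distillation.
    with torch.no_grad():
        teacher_hidden = teacher.encode(mono_batch["input_ids"])        # (n, d)

    # The cross-lingual student summarizes the source document into the
    # target language; we assume it also returns its hidden states.
    student_hidden, logits = student(cls_batch["input_ids"],
                                     cls_batch["decoder_input_ids"])    # (m, d), (m, V)

    # Standard cross-entropy loss on the target-language summary tokens.
    ce_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              cls_batch["labels"].view(-1))

    # KD term: OT distance between the teacher's monolingual hidden states
    # and the student's cross-lingual hidden states (see sinkhorn_ot below).
    kd_loss = sinkhorn_ot(teacher_hidden, student_hidden)

    return ce_loss + kd_weight * kd_loss
```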
Sinkhorn Divergence for Knowledge Distillation
A core contribution of this work is the application of Sinkhorn Divergence, an Optimal-Transport distance, as the KD loss function. This choice is predicated on the observation that the teacher and student representations reside in distinct vector spaces (i.e., monolingual versus cross-lingual). Sinkhorn Divergence offers several advantages in this context:
- It does not necessitate that the distributions reside within the same probability space.
- It exhibits robustness to noise.
- It does not impose constraints on sample size.
- It effectively captures geometric relationships between hidden states.
These properties enable the student model to productively align its cross-lingual hidden states with the monolingual hidden states of the teacher, thus establishing a robust correlation between distant languages. The Sinkhorn divergence is computed as follows:
$$\text{Sinkhorn Divergence} = \mathrm{OT}_{\epsilon}(a, b) = \min_{P \in \Pi(a, b)} \langle P, C \rangle - \epsilon H(P)$$
where a and b are probability vectors representing the teacher and student hidden states, C is a cost matrix, ϵ is a regularization parameter, H(P) is the entropy of the transport plan P, and Π(a,b) is the set of transport plans with marginals a and b.
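As a concrete illustration of how this objective can be evaluated, below is a minimal log-domain Sinkhorn iteration in PyTorch between two sets of hidden states, assuming uniform marginals and a squared-Euclidean cost matrix; the function name, the choice of cost, the number of iterations, and the value of ε are illustrative assumptions rather than the paper's exact configuration.

```python
import math
import torch

def sinkhorn_ot(x, y, epsilon=0.1, n_iters=50):
    """Entropic OT objective <P, C> - eps * H(P) between hidden-state sets
    x (n, d) and y (m, d), computed via log-domain Sinkhorn iterations."""
    n, m = x.size(0), y.size(0)
    # Uniform marginals a and b over the teacher and student hidden states.
    log_a = torch.full((n,), -math.log(n), device=x.device)
    log_b = torch.full((m,), -math.log(m), device=x.device)
    # Cost matrix C: pairwise squared Euclidean distances.
    C = torch.cdist(x, y, p=2) ** 2
    # Dual potentials (log-domain scaling vectors), updated alternately.
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        f = epsilon * (log_a - torch.logsumexp((g[None, :] - C) / epsilon, dim=1))
        g = epsilon * (log_b - torch.logsumexp((f[:, None] - C) / epsilon, dim=0))
    # Transport plan P with marginals (approximately) a and b.
    log_P = (f[:, None] + g[None, :] - C) / epsilon
    P = log_P.exp()
    transport_cost = (P * C).sum()             # <P, C>
    entropy = -(P * (log_P - 1.0)).sum()       # H(P) = -sum_ij P_ij (log P_ij - 1)
    return transport_cost - epsilon * entropy  # OT_eps(a, b)
```

If the debiased form of the Sinkhorn Divergence is preferred, the self-transport terms OT_ε(a, a) and OT_ε(b, b) can be subtracted from this quantity; the sketch above reports only the entropic OT objective shown in the equation.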
Experimental Results
The efficacy of the proposed methodology is substantiated through experimentation on CLS datasets comprising pairs of distant languages (e.g., English-to-Chinese, English-to-Arabic, Japanese-to-English). The experimental results demonstrate that the proposed method surpasses state-of-the-art models in both high-resource and low-resource settings, as evaluated by automatic metrics such as ROUGE. Furthermore, human evaluation experiments indicate a preference for the summaries generated by the proposed model. Comparative analyses against alternative KD loss functions, such as cosine similarity and mean squared error, underscore the superiority of Sinkhorn Divergence. A case study further illustrates the model's enhanced ability to preserve key information from the original documents compared to baseline models.