A Re-ranker Scheme for Integrating Large Scale NLU models

Published 25 Sep 2018 in cs.CL | (1809.09605v1)

Abstract: Large scale Natural Language Understanding (NLU) systems are typically trained on large quantities of data, requiring a fast and scalable training strategy. A typical design for NLU systems consists of domain-level NLU modules (domain classifier, intent classifier and named entity recognizer). Hypotheses (NLU interpretations consisting of various intent+slot combinations) from these domain specific modules are typically aggregated with another downstream component. The re-ranker integrates outputs from domain-level recognizers, returning a scored list of cross domain hypotheses. An ideal re-ranker will exhibit the following two properties: (a) it should prefer the most relevant hypothesis for the given input as the top hypothesis and, (b) the interpretation scores corresponding to each hypothesis produced by the re-ranker should be calibrated. Calibration allows the final NLU interpretation score to be comparable across domains. We propose a novel re-ranker strategy that addresses these aspects, while also maintaining domain specific modularity. We design optimization loss functions for such a modularized re-ranker and present results on decreasing the top hypothesis error rate as well as maintaining the model calibration. We also experiment with an extension involving training the domain specific re-rankers on datasets curated independently by each domain to allow further asynchronization. %The proposed re-ranker design showcases the following: (i) improved NLU performance over an unweighted aggregation strategy, (ii) cross-domain calibrated performance and, (iii) support for use cases involving training each re-ranker on datasets curated by each domain independently.

Abstract PDF Upgrade to Chat

Authors (4)

Citations (37)

View on Semantic Scholar

Summary

The paper presents a novel re-ranker strategy that integrates domain-specific DC, IC, and NER models to improve hypothesis ranking.
It combines Expected SemER and Cross-Entropy loss functions to optimize ranking correctness and score calibration across domains.
Experimental results demonstrate that the R3 optimization method outperforms baselines, ensuring scalable asynchronous training in NLU systems.

A Re-ranker Scheme for Integrating Large Scale NLU Models

This paper introduces a novel re-ranker strategy designed to integrate large-scale NLU models, emphasizing modularity, cross-domain hypothesis ranking, and score calibration. The approach leverages domain-specific components to maintain scalable and asynchronous model training while achieving cross-domain comparability through calibrated interpretation scores.

NLU System Design

The design of the proposed NLU system involves a modular approach, where each domain is encapsulated by three statistical models: a Domain Classifier (DC), an Intent Classifier (IC), and a Named Entity Recognizer (NER). Each domain generates hypotheses by aggregating outputs from its respective DC, IC, and NER.

Figure 1: A description of a modularized NLU system design containing DC (One Vs All MaxEnt), IC (Multi class MaxEnt), and NER (CRF model) models. The re-ranker is proposed as the combination scheme to obtain a ranked list of NLU hypotheses.

These models produce domain-specific scores, used as input for a re-ranker model, which ranks hypotheses based on correctness and calibration across domains.

Re-ranker Design and Error Metrics

The re-ranker maintains modularity by employing domain-specific models and optimizing them using custom loss functions designed to ensure hypothesis correctness and score calibration. Two primary error metrics are introduced:

Semantic Error Rate (SemER): Calculated using Levenshtein distance on IC and NER labels to assess correctness among in-domain hypotheses.
Interpretation Error (IE): A binary metric indicating exact hypothesis correctness with respect to ground truth.
Figure 2: An example of SemER computation for hypotheses from the Books domain component. Note that the DC output is not used to compute Levenshtein distance (LD). N is the number of slots in the annotations.

The re-ranker model is trained using linear functions of DC, IC, and NER scores with various loss functions that incorporate expected SemER and cross-entropy losses for better rank and calibration performance.

Re-ranker Training and Optimization

The re-ranker's objective is to achieve optimal hypothesis selection and score calibration across domains. Three optimization schemes are presented:

R1: Utilizes Expected SemER loss, focusing on ranking correctness.
R2: Employs Cross-Entropy loss for calibration, addressing score comparability.
R3: Combines both R1 and R2 losses, yielding the best outcome in terms of ranking and calibration balance.

These strategies allow the system to perform well even when domains are trained asynchronously or with different data subsets.

Experimental Results and Calibration

Experiments on an Alexa-enabled device with a large annotated dataset show significant performance enhancements using this re-ranker model. The study highlights improvements in aggregated SemER using R3 optimization, outperforming the baseline and other re-ranking strategies.

Figure 3: Confidence calibration plots obtained on hypotheses scores across all domains. While baseline and R1 optimization do not yield a good calibration, optimization using R2 and R3 provide the best calibration.

The model's calibration properties are evaluated using reliability curves, demonstrating improved cross-domain score alignment critical for downstream tasks and less error in threshold-based decisions.

Conclusion

The proposed re-ranker scheme underscores the importance of maintaining modularity and calibration in large-scale NLU models, allowing effective cross-domain integration and independent updates. With substantial improvements in SemER measures and reliable hypothesis ranking, this scheme sets a foundation for future developments in scalable NLU systems. Future work may explore enhanced feature integration and non-linear scoring functions to further refine hypothesis selection accuracy.

Markdown Report Issue