
Industry Scale Semi-Supervised Learning for Natural Language Understanding (2103.15871v1)

Published 29 Mar 2021 in cs.CL, cs.AI, and cs.LG

Abstract: This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affect the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-Label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular-optimization-based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial for improving large-scale NLU systems.
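The core of the student-teacher pipeline described above can be illustrated with a minimal pseudo-labeling (PL) sketch: a trained teacher scores a large unlabeled pool, and only examples it labels with high confidence are kept as extra training data for the student. This is a generic illustration, not the paper's implementation; `teacher_predict` and the confidence threshold are hypothetical stand-ins for a real NLU teacher model.

```python
import random

random.seed(0)

def teacher_predict(x):
    # Hypothetical teacher: labels a 2-D point by the sign of its first
    # feature and reports |x[0]| as a confidence score. In practice this
    # would be a trained IC/NER model scoring an utterance.
    label = 1 if x[0] >= 0 else 0
    confidence = min(abs(x[0]), 1.0)
    return label, confidence

def pseudo_label(unlabeled, threshold=0.5):
    # Keep only examples the teacher labels confidently; these become
    # additional (input, pseudo-label) pairs for student training.
    selected = []
    for x in unlabeled:
        label, conf = teacher_predict(x)
        if conf >= threshold:
            selected.append((x, label))
    return selected

# Simulated unlabeled pool standing in for millions of utterances.
unlabeled_pool = [(random.uniform(-1, 1), random.uniform(-1, 1))
                  for _ in range(100)]
train_extra = pseudo_label(unlabeled_pool, threshold=0.5)
print(len(train_extra), "confident pseudo-labeled examples kept")
```

The data-selection methods the paper compares (committee-based and submodular) would replace or augment the simple confidence filter here, choosing which pool examples are worth pseudo-labeling at all.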

Authors (5)
  1. Luoxin Chen (4 papers)
  2. Francisco Garcia (6 papers)
  3. Varun Kumar (35 papers)
  4. He Xie (8 papers)
  5. Jianhua Lu (28 papers)
Citations (6)
