A Purely End-to-end System for Multi-speaker Speech Recognition (1805.05826v1)

Published 15 May 2018 in cs.SD, cs.CL, eess.AS, and stat.ML

Abstract: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1 % relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
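The abstract does not say how the decoded label sequences are matched to the reference transcripts during training. One common approach in multi-speaker work is to score every output-to-reference assignment and keep the cheapest one; the sketch below assumes that approach and should not be read as the authors' exact objective. The function name `permutation_free_loss` and the `pair_loss` matrix are hypothetical, and the numbers are illustrative only.

```python
from itertools import permutations


def permutation_free_loss(pair_loss):
    """Return (best_loss, best_perm) over all output-to-reference assignments.

    pair_loss[i][j] is a hypothetical precomputed loss between decoded
    output i and reference transcript j (for example a cross-entropy value).
    """
    n = len(pair_loss)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Total loss if output i is scored against reference perm[i].
        total = sum(pair_loss[i][perm[i]] for i in range(n))
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm


# Illustrative values, not taken from the paper: two outputs, two references.
example = [[2.0, 0.5],
           [0.25, 3.0]]
loss, perm = permutation_free_loss(example)
print(loss, perm)
```

The exhaustive search costs n! evaluations, which is acceptable for the two- or three-speaker mixtures typical of this task but would need a cheaper assignment method for larger n.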

Authors (5)
  1. Hiroshi Seki (4 papers)
  2. Takaaki Hori (41 papers)
  3. Shinji Watanabe (416 papers)
  4. Jonathan Le Roux (82 papers)
  5. John R. Hershey (40 papers)
Citations (84)
