Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction (2502.11946v2)

Published 17 Feb 2025 in cs.CL, cs.AI, cs.HC, cs.SD, and eess.AS

Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

Summary

  • The paper introduces Step-Audio, an open-source framework that unifies speech understanding and generation using a dual-codebook tokenizer and a 130B-parameter LLM.
  • Step-Audio employs a multi-modal dataset and methods like RLHF for AQTA chat, achieving improved ASR CER and state-of-the-art TTS results on open-source benchmarks.
  • The framework enables real-time tool calling and demonstrates superior performance in dialogue and instruction following evaluated on the StepEval-Audio-360 benchmark.

The paper introduces Step-Audio, a production-ready open-source framework designed for intelligent speech interaction, emphasizing unified understanding and generation. It addresses limitations in current open-source models related to voice data collection costs, dynamic control weaknesses, and intelligence constraints. The system integrates comprehension and generation through several key innovations: a large multi-modal model, a generative data engine, granular voice control, and enhanced intelligence through tool calling and role-playing.

The architecture of Step-Audio primarily consists of a speech tokenizer, an LLM, and a speech decoder. The framework adopts an audio input, text output (AQTA) + Text-to-Speech (TTS) approach for real-time voice dialogue, citing the scarcity of high-quality pure-voice dialogue data and the need for customizable output speech. A dual-codebook tokenization framework is used, combining parallel linguistic (16.7 Hz, 1024-entry codebook) and semantic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving. The system leverages a 130B-parameter LLM and a hybrid speech synthesizer that combines flow matching with a neural vocoder. A Voice Activity Detection (VAD) module extracts vocal segments.

The dual-codebook speech tokenizer framework employs two distinct tokenizers, linguistic and semantic, to represent speech features. The linguistic tokenizer extracts structured, high-level representations, including phonemic and linguistic features, at a rate of 16.7 Hz using a codebook size of 1024, leveraging the output from Paraformer. The semantic tokenizer encodes semantic and coarse-grained acoustic characteristics at a rate of 25 Hz with a larger codebook size of 4096, employing CosyVoice's tokenizer. Token-level interleaving is implemented with a temporal alignment ratio of 2:3.
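The 2:3 token-level interleaving described above can be sketched as follows. Because the linguistic stream runs at 16.7 Hz and the semantic stream at 25 Hz, every 2 linguistic tokens span the same ~0.12 s of audio as 3 semantic tokens; the chunkwise ordering (linguistic first, then semantic) is an illustrative assumption, since the paper specifies only the temporal ratio.

```python
def interleave_dual_codebook(linguistic, semantic, ratio=(2, 3)):
    """Interleave a linguistic (16.7 Hz) and a semantic (25 Hz) token
    stream at a 2:3 temporal ratio, so each interleaved chunk covers
    the same time span. Token values here are placeholder strings."""
    n_ling, n_sem = ratio
    merged = []
    i = j = 0
    while i < len(linguistic) or j < len(semantic):
        merged.extend(linguistic[i:i + n_ling])
        merged.extend(semantic[j:j + n_sem])
        i += n_ling
        j += n_sem
    return merged

# Example: 4 linguistic tokens and 6 semantic tokens -> chunks of 2+3
ling = [f"L{k}" for k in range(4)]
sem = [f"S{k}" for k in range(6)]
print(interleave_dual_codebook(ling, sem))
# -> ['L0', 'L1', 'S0', 'S1', 'S2', 'L2', 'L3', 'S3', 'S4', 'S5']
```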

For real-time interactions, the system uses an optimized inference pipeline. The Controller module manages state transitions and speculative response generation. Subsystems include VAD, the Streaming Audio Tokenizer, the Step-Audio LLM and Speech Decoder, and the Context Manager. Speculative response generation preemptively generates responses to reduce interaction latency. Text transcription is used for historical context, with ASR asynchronously transcribing user speech into text.
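The speculative-generation idea above can be sketched as a tiny controller: on a tentative end-of-speech signal it starts generating a reply in the background, then commits the reply if the VAD confirms the turn ended, or discards it if the user keeps talking. Class and method names are illustrative, not taken from the Step-Audio codebase.

```python
import threading

class SpeculativeController:
    """Minimal sketch of speculative response generation, assuming a
    generate_fn that wraps the LLM + speech decoder pipeline."""

    def __init__(self, generate_fn):
        self.generate_fn = generate_fn
        self._pending = None

    def on_tentative_end_of_speech(self, transcript):
        # Launch generation early to hide latency behind the VAD decision.
        result = {}
        def worker():
            result["reply"] = self.generate_fn(transcript)
        t = threading.Thread(target=worker)
        t.start()
        self._pending = (t, result)

    def on_vad_decision(self, turn_ended):
        t, result = self._pending
        t.join()
        self._pending = None
        if turn_ended:
            return result["reply"]  # commit the speculative reply
        return None                 # user resumed speaking: discard it

ctrl = SpeculativeController(lambda text: f"echo: {text}")
ctrl.on_tentative_end_of_speech("hello")
print(ctrl.on_vad_decision(turn_ended=True))  # -> echo: hello
```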

The multi-modal pretraining dataset integrates audio, text, and images. The audio section includes 1.1 trillion tokens of audio continuation data, 113 billion tokens of TTS synthesized speech data, 105 billion tokens of ASR data, and 350 billion tokens of audio-text alternating data. The text data amounts to 800 billion tokens, and the image section includes 800 billion tokens of image-text paired/alternating data. The system uses StarWeaver, an RPC-based distributed data processing library, and disaggregated model placement that allocates dedicated resources and employs tailored parallelism strategies for each sub-model.
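Tallying the reported mixture puts the audio portion at roughly 1.67 trillion tokens and the full corpus at roughly 3.27 trillion; the totals below are derived from the per-category figures above, not stated in the paper.

```python
# Pretraining mixture reported in the paper (tokens, in billions).
audio = {
    "audio continuation": 1100,
    "TTS synthesized speech": 113,
    "ASR": 105,
    "audio-text alternating": 350,
}
text = 800
image_text = 800

audio_total = sum(audio.values())        # 1668 B ≈ 1.67 T audio tokens
grand_total = audio_total + text + image_text
print(audio_total, grand_total)          # -> 1668 3268
```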

The synthetic data-driven framework for TTS systems comprises three key components: a Step-2 LLM for generating textual content, a pre-trained Step-Audio model checkpoint, and an Audio-Edit Model for generating emotional expressions and speaking styles. The SFT format comprises a system prompt, human input, and assistant response in a two-turn dialogue configuration. Instruction tags are classified into descriptive and comparative tags.
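An SFT sample in the described shape might look like the following. The field names and the concrete style tags are illustrative assumptions, not the released schema; the paper specifies only the system prompt / human / assistant structure and the descriptive-vs-comparative tag split.

```python
import json

# Illustrative two-turn SFT sample: a descriptive tag sets an absolute
# style, a comparative tag adjusts relative to the previous turn.
sample = {
    "system": "You are a voice assistant. Default style: <gentle>",  # descriptive
    "turns": [
        {"human": "Please read this sentence cheerfully.",
         "assistant": "<cheerful> Sure, here it is!"},
        {"human": "Now say it a bit faster than before.",            # comparative
         "assistant": "<faster> Sure, here it is!"},
    ],
}
print(json.dumps(sample, indent=2))
```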

Reinforcement Learning from Human Feedback (RLHF) is applied for the AQTA task, leading to the creation of the Step-Audio-Chat model. The SFT data is categorized into TQTA, AQTA, and TAQTA types. The reward model training involves TQTA single-modal preference model pretraining, followed by AQTA cross-modal fine-tuning. The PPO algorithm is employed, and measures are taken to mitigate the "deaf hacking" phenomenon.
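The paper does not spell out the reward-model objective; the standard pairwise (Bradley-Terry) loss used in most RLHF pipelines is shown below for intuition only, and should not be read as the exact Step-Audio formulation.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Standard pairwise RLHF reward-model loss:
    -log sigmoid(r_chosen - r_rejected). Low when the model already
    scores the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that prefers the chosen answer by a wide margin has low loss:
print(round(preference_loss(2.0, -1.0), 4))  # -> 0.0486
```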

The StepEval-Audio-360 benchmark covers language proficiency, emotional intelligence, logical reasoning, creativity, multi-instruction following, role-playing, and safety. The indicator system combines quantitative analysis, LLM evaluation, and human evaluation.

In ASR validation experiments with a 3B model, the Character Error Rate (CER) of the dual-codebook approach improved from 25.5 to 18.4. In TTS evaluation, the lightweight Step-Audio-TTS-3B achieved state-of-the-art CER and WER among open-source spoken models on the SEED TTS test dataset. For AQTA chat, evaluations on the StepEval-Audio-360 benchmark, with scores automatically assessed by GPT-4o, indicated that Step-Audio-Chat delivers superior performance in real-time dialogue. In audio instruction following, Step-Audio-Chat showed competitive results in both instruction-following accuracy and audio quality. The system also enables real-time tool calls during voice interactions, decoupling text-based tool processing from the audio generation pipeline.
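The CER figures above follow the standard definition: Levenshtein edit distance between reference and hypothesis characters, normalized by reference length. A minimal implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance (insertions, deletions,
    substitutions) from ref to hyp, divided by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

print(round(100 * cer("step audio", "stop audio"), 1))  # -> 10.0
```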
