Audio2Vec: Spoken Word Embedding Framework
- Audio2Vec is a framework that produces fixed-dimensional vector representations for variable-length spoken word segments by combining phonetic structure with contextual semantics.
- It employs a two-stage architecture that first disentangles phonetic and speaker characteristics and then incorporates semantic context via a skip-gram objective.
- Empirical evaluations on the LibriSpeech corpus demonstrate its effectiveness in spoken content retrieval and cross-modal alignment with text embeddings.
The Audio2Vec framework provides a principled methodology for learning fixed-dimensional vector representations of variable-length spoken word segments. These representations are designed to jointly encode phonetic structure and distributional semantics, enabling applications such as spoken content retrieval, semantic search over audio, and cross-modal alignment with text-based embeddings. Audio2Vec formalizes and extends the analog of Word2Vec for speech by introducing neural architectures and training objectives that disentangle and then recombine phonetic, speaker, and semantic information from raw acoustic features (Chen et al., 2018).
1. Conceptual Foundations and Motivations
Audio2Vec arises from the desire to replicate the success of Word2Vec-style distributional semantic embeddings in the speech domain. Unlike text, spoken language presents challenges such as variable-length acoustic realizations, fine-grained phonetic variability, and nuisance attributes including speaker and channel effects. The Audio2Vec framework targets the extraction of embeddings that are robust to these factors, capturing both the phonetic constitution of a spoken segment and its contextual semantic relationships (Chen et al., 2018).
A distinctive attribute of Audio2Vec is its explicit two-stage disentanglement and recombination process: first, it isolates phonetic information invariant to speaker identity; subsequently, it infuses semantic information by contextualizing phonetic embeddings based on local utterance context, mirroring the skip-gram approach for distributional semantics.
2. Two-Stage Architecture: Phonetic and Semantic Embedding
Stage 1: Phonetic Embedding with Speaker Disentanglement
The first stage receives a frame-level representation of each spoken word segment, $x = (x_1, \dots, x_T)$. This input is processed by two parallel 2-layer GRUs (each with hidden size 128):
- Phonetic Encoder ($E_p$): Outputs $v_p$, intended to capture pure phonetic content.
- Speaker Encoder ($E_s$): Outputs $v_s$, encoding speaker characteristics.
A two-layer GRU decoder (hidden size 256) reconstructs the input segment as $\hat{x}$.
An adversarial speaker discriminator, $D$ (3-layer feedforward, [128,128]), encourages $E_p$ to produce phonetic embeddings that cannot reveal speaker identity: the discriminator loss $L_{adv}$ is maximized with respect to $E_p$ and minimized with respect to $D$. A regularization term $L_{reg}$ pushes embeddings of the same speaker together and those of different speakers apart, using a margin $m$. The overall phonetic loss combines reconstruction, regularization, and adversarial components:
$$L_{p} = L_{rec} + \lambda_1 L_{reg} - \lambda_2 L_{adv}$$
with weighting hyperparameters $\lambda_1$ and $\lambda_2$.
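A minimal sketch of how the Stage 1 terms described above might be combined. The exact form of the speaker regularizer is not given here, so the pairwise hinge form below is an assumption, as are the function names (`speaker_margin_loss`, `stage1_encoder_loss`) and the default values of `margin`, `lam_reg`, and `lam_adv`.

```python
import math


def speaker_margin_loss(speaker_vecs, speaker_ids, margin=1.0):
    """Pairwise regularizer over speaker embeddings (assumed hinge form).

    Same-speaker pairs are penalized by their squared distance.
    Different-speaker pairs are penalized only when closer than `margin`.
    """
    total = 0.0
    n = len(speaker_vecs)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(speaker_vecs[i], speaker_vecs[j])
            if speaker_ids[i] == speaker_ids[j]:
                total += dist ** 2
            else:
                total += max(margin - dist, 0.0) ** 2
    return total


def stage1_encoder_loss(rec_loss, reg_loss, adv_loss, lam_reg=1.0, lam_adv=1.0):
    """Loss minimized on the encoder side.

    The discriminator minimizes `adv_loss`, while the phonetic encoder is
    trained to maximize it, hence the minus sign on the adversarial term.
    """
    return rec_loss + lam_reg * reg_loss - lam_adv * adv_loss
```

In an actual training loop the discriminator and the encoders would be updated in alternation, each with its own sign on the adversarial term; the sketch only shows the arithmetic of the combined objective.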
Stage 2: Semantic Embedding via Contextual Skip-gram
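Frozen phonetic embeddings $v_p$ from Stage 1 serve as input. Two separate 2-layer fully-connected networks ($E_{sem}$ and $E_{ctx}$, both with [256, 256] hidden sizes, output dimension 128) project these into semantic/target vectors $v_{sem}$ and context vectors $v_{ctx}$.
A skip-gram objective with negative sampling is imposed:
$$L_{sg} = -\sum_{i} \sum_{j \in C(i)} \Big[ \log \sigma\big(v_{sem}^{(i)\top} v_{ctx}^{(j)}\big) + \sum_{k \in N(i,j)} \log \sigma\big(-v_{sem}^{(i)\top} v_{ctx}^{(k)}\big) \Big]$$
where $j$ indexes the context window $C(i)$ (size 5 on each side) and $N(i,j)$ are negative samples (5 per positive pair).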
The semantic encoder injects distributional semantic structure “over” the speaker-invariant phonetic space, yielding embeddings that capture both phonetic similarity and contextual meaning (Chen et al., 2018).
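The skip-gram objective with negative sampling can be illustrated for a single target word. This is a sketch of the standard negative-sampling loss, not the authors' code; the function names and the plain-list vector representation are assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def skipgram_ns_loss(target_vec, context_vecs, negative_vecs):
    """Negative-sampling loss for one target word.

    target_vec:    semantic vector of the center word
    context_vecs:  context vectors of words inside the window
    negative_vecs: context vectors of randomly drawn negative words
    """
    loss = 0.0
    for c in context_vecs:
        # Pull the target toward its true context words.
        loss -= log_sigmoid(dot(target_vec, c))
    for n in negative_vecs:
        # Push the target away from sampled negatives.
        loss -= log_sigmoid(-dot(target_vec, n))
    return loss
```

With a window of 5 on each side and 5 negatives per positive pair, a word in the middle of an utterance would contribute up to 10 context terms and 50 negative terms.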
3. Training Protocol and Architectural Hyperparameters
Audio2Vec is trained on the LibriSpeech corpus (960 hours) using 39-dimensional MFCCs with forced-alignment word segmentation. The complete pipeline uses:
- GRU size for the phonetic and speaker encoders ($E_p$, $E_s$): 128 per layer (2 layers each).
- GRU decoder: 256 per layer (2 layers).
- Feed-forward net (stage 2): [256, 256] → 128.
- Embedding dimensions: 128 (semantic); the phonetic and speaker embeddings are produced by the 128-unit GRU encoders.
- Skip-gram window: 5 words on each side; negatives per positive: 5.
- Adam optimizer, batch size 200 per stage.
- Loss-weighting hyperparameters tuned on development data.
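For reference, the settings listed above could be collected in a single configuration object. The class and field names below are hypothetical; only the numeric values stated in this section are filled in, and the learning rate is left unset because no usable value is given here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Audio2VecConfig:
    """Hypothetical container for the hyperparameters listed in this section."""
    mfcc_dim: int = 39
    encoder_gru_layers: int = 2
    encoder_gru_hidden: int = 128
    decoder_gru_layers: int = 2
    decoder_gru_hidden: int = 256
    discriminator_hidden: Tuple[int, ...] = (128, 128)
    semantic_hidden: Tuple[int, ...] = (256, 256)
    semantic_dim: int = 128
    window_per_side: int = 5
    negatives_per_positive: int = 5
    batch_size: int = 200
    optimizer: str = "adam"
    learning_rate: Optional[float] = None  # assumption: must be supplied by the user

    def samples_per_target(self):
        """(positives, negatives) for a word that has a full window on both sides."""
        positives = 2 * self.window_per_side
        return positives, positives * self.negatives_per_positive
```

Keeping the object frozen avoids accidental changes between the two training stages, which share several of these values.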
The adversarial disentanglement ensures that phonetic representations are stable with respect to nuisance factors prior to semantic embedding.
4. Disentanglement of Phonetic, Speaker, and Semantic Content
Speaker disentanglement leverages the orthogonality of phonetic and speaker subspaces:
- The speaker encoder $E_s$ is trained with the regularization loss $L_{reg}$ to cluster same-speaker embeddings and repel different-speaker ones.
- The discriminator $D$ attempts to recover speaker identity from the phonetic embedding $v_p$; adversarial training of the phonetic encoder $E_p$ enforces speaker-invariance.
- Semantic context is imposed only after $v_p$ is stabilized, avoiding reintroduction of speaker-specific variance.
The skip-gram negative sampling loss overlays semantic structure on the phonetic backbone, but introduces a trade-off: improved semantic relationships come at a marginal cost to the purity of phonetic distinctions (top-1 phonetic retrieval accuracy drops from 0.637 to 0.598 when full semantic training is applied) (Chen et al., 2018).
5. Parallelization and Alignment with Text Embeddings
Audio2Vec embeddings can be mapped into text embedding spaces to facilitate cross-modal retrieval and alignment:
- Spoken-word audio embeddings (mean-pooled per word type) and pretrained text embeddings are each projected via PCA onto their top $k$ principal components.
- Two linear maps between the projected audio and text spaces are then optimized to minimize an alignment loss.
This mapping supports applications such as aligning spoken content with text queries, and it can be evaluated via nearest-neighbor accuracy between the mapped phonetic+semantic audio embeddings and the text embeddings (Chen et al., 2018).
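The pooling and evaluation steps can be sketched as follows. The example assumes the audio vectors have already been mapped into the text space, and it uses cosine similarity as the nearest-neighbor criterion, which is an assumption rather than a detail confirmed above. All function names are hypothetical.

```python
import math


def mean_pool_by_word(tokens):
    """Average all token vectors of the same word type.

    tokens: iterable of (word, vector) pairs -> {word: mean vector}
    """
    sums, counts = {}, {}
    for word, vec in tokens:
        if word not in sums:
            sums[word] = list(vec)
            counts[word] = 1
        else:
            sums[word] = [s + x for s, x in zip(sums[word], vec)]
            counts[word] += 1
    return {w: [s / counts[w] for s in total] for w, total in sums.items()}


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def top1_accuracy(mapped_audio, text):
    """Fraction of words whose mapped audio vector is nearest to its own text vector."""
    hits = 0
    for word, a_vec in mapped_audio.items():
        best = max(text, key=lambda w: cosine(a_vec, text[w]))
        hits += best == word
    return hits / len(mapped_audio)
```

A top-k variant would sort the text vocabulary by similarity and check whether the correct word appears among the first k candidates.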
6. Evaluation and Empirical Results
Audio2Vec is validated on large-scale spoken document retrieval:
- Spoken query and archive: LibriSpeech books segmented into 5,466 chapters.
- Retrieval: ranking chapters by a similarity score between the query and each chapter, computed from the final audio embeddings.
- Mean Average Precision (MAP) for retrieval: reported for phonetic+semantic versus phonetic-only embeddings, and separately on semantic-only ground truth.
- Examples demonstrate that top hits for the query “nations” include chapters about “king,” capturing semantic relatedness even in the absence of explicit keyword matches.
- Nearest-neighbor analysis shows the balance of phonetic and semantic grouping (e.g., “owned” retrieves “own, only, unknown, owner, land, …”).
These results establish that Audio2Vec enables semantic retrieval in the spoken domain, not just literal phonetic matching.
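For readers unfamiliar with the metric, Mean Average Precision over ranked chapter lists can be computed as in the sketch below. The chapter scores and identifiers in the usage note are made up for illustration and are not results from the evaluation.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """rankings: {query: ranked ids}; relevance: {query: relevant ids}."""
    aps = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(aps) / len(aps)


def rank_by_score(scores):
    """Sort chapter ids by descending similarity score."""
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rank_by_score({"ch1": 0.2, "ch2": 0.9, "ch3": 0.5})` orders the hypothetical chapters as `ch2`, `ch3`, `ch1`, and that list can be passed directly to `average_precision`.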
7. Discussion, Limitations, and Future Work
The principal limitation of the Audio2Vec instantiation described is its reliance on forced-alignment for gold-standard word segmentation. Integrating unsupervised segmentation strategies is proposed for future work. The negative-sampling skip-gram framework, while effective, may be further optimized for scalability with larger vocabularies or via graph-based objectives. Embedding models could be contextualized beyond the word level (sentence, utterance) to address polysemy and discourse-level semantics. Scaling, efficiency, and robustness to real-world variation remain active topics for ongoing research.
Audio2Vec constitutes a comprehensive blueprint for phonetic-and-semantic embedding of spoken words that rigorously addresses the complexities of the speech modality, demonstrating robust performance and laying the groundwork for broad cross-modal retrieval and understanding (Chen et al., 2018).