
Audio2Vec: Spoken Word Embedding Framework

Updated 9 April 2026
  • Audio2Vec is a framework that produces fixed-dimensional vector representations for variable-length spoken word segments by combining phonetic structure with contextual semantics.
  • It employs a two-stage architecture that first disentangles phonetic and speaker characteristics and then incorporates semantic context via a skip-gram objective.
  • Empirical evaluations on the LibriSpeech corpus demonstrate its effectiveness in spoken content retrieval and cross-modal alignment with text embeddings.

The Audio2Vec framework provides a principled methodology for learning fixed-dimensional vector representations of variable-length spoken word segments. These representations are designed to jointly encode phonetic structure and distributional semantics, enabling applications such as spoken content retrieval, semantic search over audio, and cross-modal alignment with text-based embeddings. Audio2Vec formalizes and extends the analog of Word2Vec for speech by introducing neural architectures and training objectives that disentangle and then recombine phonetic, speaker, and semantic information from raw acoustic features (Chen et al., 2018).

1. Conceptual Foundations and Motivations

Audio2Vec arises from the desire to replicate the success of Word2Vec-style distributional semantic embeddings in the speech domain. Unlike text, spoken language presents challenges such as variable-length acoustic realizations, fine-grained phonetic variability, and nuisance attributes including speaker and channel effects. The Audio2Vec framework targets the extraction of embeddings that are robust to these factors, capturing both the phonetic constitution of a spoken segment and its contextual semantic relationships (Chen et al., 2018).

A distinctive attribute of Audio2Vec is its explicit two-stage disentanglement and recombination process: first, it isolates phonetic information invariant to speaker identity; subsequently, it infuses semantic information by contextualizing phonetic embeddings based on local utterance context, mirroring the skip-gram approach for distributional semantics.

2. Two-Stage Architecture: Phonetic and Semantic Embedding

Stage 1: Phonetic Embedding with Speaker Disentanglement

The first stage receives a frame-level representation of each spoken word segment, $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,T})$. This input is processed by two parallel 2-layer GRUs (each with hidden size 128):

  • Phonetic Encoder ($E_p$): Outputs $v_{p,i} = E_p(x_i) \in \mathbb{R}^{128}$, intended to capture pure phonetic content.
  • Speaker Encoder ($E_s$): Outputs $v_{s,i} = E_s(x_i) \in \mathbb{R}^{128}$, encoding speaker characteristics.

A two-layer GRU decoder (hidden size 256) reconstructs the input segment as $x_i' = \mathrm{Dec}(v_{p,i}, v_{s,i})$.

An adversarial speaker discriminator, $D_s$ (3-layer feedforward, [128, 128]), encourages $E_p$ to produce phonetic embeddings that cannot reveal speaker identity: the loss $L_d$ is maximized with respect to $D_s$ and minimized with respect to $E_p$. A regularization term pushes embeddings of the same speaker together and those of different speakers apart, subject to a margin. The overall phonetic loss combines the reconstruction, regularization, and adversarial components, weighted by hyperparameters.
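The speaker regularization described above can be illustrated with a margin-based (hinge) loss over pairs of speaker embeddings. This is only a sketch under stated assumptions: the squared Euclidean distance, the name `speaker_margin_loss`, and the default margin value are hypothetical choices made for illustration, not the formulation of Chen et al. (2018).

```python
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def speaker_margin_loss(embeddings, speakers, margin=1.0):
    """Hypothetical pairwise speaker regularizer (assumed form).

    Same-speaker pairs are penalized by their squared distance (pulled
    together); different-speaker pairs are penalized only while they are
    closer than `margin` (pushed apart). Returns the mean over all pairs.
    """
    total, pairs = 0.0, 0
    for i, j in combinations(range(len(embeddings)), 2):
        d = squared_distance(embeddings[i], embeddings[j])
        if speakers[i] == speakers[j]:
            total += d
        else:
            total += max(0.0, margin - d)
        pairs += 1
    return total / pairs if pairs else 0.0


# Toy 2-D speaker embeddings: two segments from speaker "a", one from "b".
v_s = [[0.0, 0.0], [0.0, 0.1], [3.0, 0.0]]
loss = speaker_margin_loss(v_s, ["a", "a", "b"], margin=1.0)
```

In this toy case only the same-speaker pair contributes, because both different-speaker pairs already lie beyond the margin.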

Stage 2: Semantic Embedding via Contextual Skip-gram

Frozen phonetic embeddings $v_{p,i}$ serve as input. Two separate 2-layer fully connected networks (both with [256, 256] hidden sizes and output dimension 128) project these into semantic (target) vectors and context vectors, respectively.

A skip-gram objective with negative sampling is imposed: the target vector of each word is trained to score highly against the context vectors of words inside its context window (size 5 on each side) and to score low against negative samples (5 per positive pair).

The semantic encoder injects distributional semantic structure “over” the speaker-invariant phonetic space, yielding embeddings that capture both phonetic similarity and contextual meaning (Chen et al., 2018).
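The skip-gram objective of Stage 2 can be sketched with the standard negative-sampling loss. The block below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the vectors are plain Python lists standing in for the 128-dimensional target and context vectors, and drawing negatives uniformly from the same utterance is an assumption made to keep the example self-contained.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def context_indices(i, n_words, window=5):
    """Positions within `window` words on each side of position i."""
    lo, hi = max(0, i - window), min(n_words, i + window + 1)
    return [j for j in range(lo, hi) if j != i]


def sgns_loss(target, context, negatives):
    """Negative-sampling loss for one (target, context) pair."""
    loss = -log_sigmoid(dot(target, context))
    for neg in negatives:
        loss -= log_sigmoid(-dot(target, neg))
    return loss


def utterance_loss(targets, contexts, window=5, n_neg=5, rng=None):
    """Sum of pair losses over one utterance.

    Negatives are drawn uniformly from the utterance's own context
    vectors, which is an assumption of this sketch.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for i, t in enumerate(targets):
        for j in context_indices(i, len(targets), window):
            negs = [contexts[rng.randrange(len(contexts))] for _ in range(n_neg)]
            total += sgns_loss(t, contexts[j], negs)
    return total
```

A target aligned with its context vector and opposed to its negatives yields a lower loss than the reverse configuration, which is the behavior the objective rewards.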

3. Training Protocol and Architectural Hyperparameters

Audio2Vec is trained on the LibriSpeech corpus (960 hours) using 39-dimensional MFCCs with forced-alignment word segmentation. The complete pipeline uses:

  • GRU size for $E_p$ and $E_s$: 128 per layer (2 layers each).
  • GRU decoder: 256 per layer (2 layers).
  • Feed-forward net (stage 2): [256, 256] → 128.
  • Embedding dimensions: 128 (phonetic), 128 (speaker), 128 (semantic).
  • Skip-gram window: 5 words on each side; negatives per positive: 5.
  • Adam optimizer, batch size 200 per stage.
  • Loss-weighting hyperparameters tuned on development data.
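For reference, the settings listed above can be collected in a single configuration object. This is a hypothetical container, not part of any released code: the class name `Audio2VecConfig` and its field names are assumptions, and the learning rate and loss weights are deliberately left out so that no value is guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audio2VecConfig:
    """Hypothetical container for the hyperparameters listed above."""

    mfcc_dim: int = 39                  # 39-dimensional MFCC input frames
    encoder_hidden: int = 128           # GRU size of each encoder
    encoder_layers: int = 2
    decoder_hidden: int = 256           # GRU size of the decoder
    decoder_layers: int = 2
    semantic_hidden: tuple = (256, 256)  # stage 2 feed-forward hidden sizes
    embedding_dim: int = 128            # phonetic, speaker, and semantic vectors
    window: int = 5                     # context words on each side
    negatives: int = 5                  # negative samples per positive pair
    batch_size: int = 200               # per stage
    optimizer: str = "adam"


cfg = Audio2VecConfig()
```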

The adversarial disentanglement ensures that phonetic representations are stable with respect to nuisance factors prior to semantic embedding.

4. Disentanglement of Phonetic, Speaker, and Semantic Content

Speaker disentanglement leverages the orthogonality of phonetic and speaker subspaces:

  • $E_s$ is trained with the speaker regularization term to cluster same-speaker embeddings and repel different-speaker ones.
  • $D_s$ attempts to recover speaker identity from $v_{p,i}$; adversarial training of $E_p$ enforces speaker invariance.
  • Semantic context is imposed only after $E_p$ is stabilized (its outputs are frozen in Stage 2), avoiding reintroduction of speaker-specific variance.

The skip-gram negative sampling loss overlays semantic structure on the phonetic backbone, but introduces a trade-off: improved semantic relationships come at a marginal cost to the purity of phonetic distinctions (top-1 phonetic retrieval accuracy drops from 0.637 to 0.598 when full semantic training is applied) (Chen et al., 2018).

5. Parallelization and Alignment with Text Embeddings

Audio2Vec embeddings can be mapped into text embedding spaces to facilitate cross-modal retrieval and alignment:

  • Spoken-word audio embeddings (mean-pooled per word type) and pretrained text embeddings are each projected via PCA onto their top principal components.
  • Two linear maps are then optimized to align the two PCA-reduced spaces.

This mapping supports applications such as aligning spoken content with text queries, and it is evaluated via top-k nearest-neighbor accuracy of the phonetic+semantic embeddings (Chen et al., 2018).
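Two pieces of this pipeline are easy to illustrate: mean-pooling segment embeddings per word type, and scoring a given linear map by nearest-neighbor accuracy. The sketch below assumes the map is already available as a row-major matrix and that cosine similarity is used for ranking; the PCA step and the optimization of the map are not shown, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def mean_pool_by_word(embeddings, words):
    """Average all segment embeddings that share a word type."""
    sums, counts = {}, defaultdict(int)
    for vec, w in zip(embeddings, words):
        if w not in sums:
            sums[w] = [0.0] * len(vec)
        sums[w] = [s + x for s, x in zip(sums[w], vec)]
        counts[w] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def apply_linear(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def top_k_accuracy(audio, text, matrix, k=1):
    """Fraction of words whose own text vector is among the k text
    vectors most similar (by cosine) to the mapped audio vector."""
    hits = 0
    for word, a_vec in audio.items():
        mapped = apply_linear(matrix, a_vec)
        ranked = sorted(text, key=lambda w: cosine(mapped, text[w]), reverse=True)
        hits += word in ranked[:k]
    return hits / len(audio) if audio else 0.0


# Toy data: three audio segments, two word types, 2-D embeddings.
segments = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
audio = mean_pool_by_word(segments, ["king", "king", "land"])
text = {"king": [0.0, 1.0], "land": [1.0, 0.0]}
swap = [[0.0, 1.0], [1.0, 0.0]]  # toy map that swaps the two axes
acc = top_k_accuracy(audio, text, swap, k=1)
```

The toy text space is the audio space with its axes swapped, so the swapping map aligns every word while an identity map would align none.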

6. Evaluation and Empirical Results

Audio2Vec is validated on large-scale spoken document retrieval:

  • Spoken query and archive: LibriSpeech books segmented into 5,466 chapters.
  • Retrieval: chapters are ranked by a similarity score computed from the final audio embeddings.
  • Mean Average Precision (MAP) for retrieval is compared between phonetic+semantic and phonetic-only embeddings, both on the full ground truth and on semantic-only ground truth.
  • Examples demonstrate that top hits for the query “nations” include chapters about “king,” capturing semantic relatedness even in the absence of explicit keyword matches.
  • Nearest-neighbor analysis shows the balance of phonetic and semantic grouping (e.g., “owned” retrieves “own, only, unknown, owner, land, …”).

These results establish that Audio2Vec enables semantic retrieval in the spoken domain, not just literal phonetic matching.
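The retrieval metric can be made concrete with a short sketch of ranking and Mean Average Precision. It assumes each chapter is summarized by a single vector and that cosine similarity is the ranking score; both are simplifications for illustration, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def rank_documents(query_vec, doc_vecs):
    """Document ids sorted by decreasing cosine similarity to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def average_precision(ranked_ids, relevant):
    """Mean of precision-at-rank over the ranks where a relevant id appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(queries, doc_vecs):
    """`queries` is a list of (query_vector, set_of_relevant_doc_ids)."""
    aps = [average_precision(rank_documents(q, doc_vecs), rel) for q, rel in queries]
    return sum(aps) / len(aps) if aps else 0.0


# Toy archive of three "chapters" with 2-D vectors.
chapters = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
ranking = rank_documents([1.0, 0.1], chapters)
map_score = mean_average_precision([([1.0, 0.1], {"c1"})], chapters)
```

Average precision rewards placing relevant chapters early in the ranking, so a semantically related chapter counts only if the ground truth marks it as relevant.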

7. Discussion, Limitations, and Future Work

The principal limitation of the Audio2Vec instantiation described is its reliance on forced-alignment for gold-standard word segmentation. Integrating unsupervised segmentation strategies is proposed for future work. The negative-sampling skip-gram framework, while effective, may be further optimized for scalability with larger vocabularies or via graph-based objectives. Embedding models could be contextualized beyond the word level (sentence, utterance) to address polysemy and discourse-level semantics. Scaling, efficiency, and robustness to real-world variation remain active topics for ongoing research.

Audio2Vec constitutes a comprehensive blueprint for phonetic-and-semantic embedding of spoken words that rigorously addresses the complexities of the speech modality, demonstrating robust performance and laying the groundwork for broad cross-modal retrieval and understanding (Chen et al., 2018).
