Instruction-Phoneme Input Format
- Instruction-phoneme input format is a unified multi-modal representation that combines natural language instructions with detailed phoneme streams, bridging acoustic and orthographic modeling.
- It employs standardized preprocessing, tokenization, and embedding techniques to support transformer-based multi-task learning across speech, text, and music domains.
- Empirical results indicate minimal degradation in text-based tasks while significantly boosting phonological discrimination and error-robust ASR performance.
The instruction-phoneme input format is a unified, multi-modal sequence representation for neural models that integrates both natural-language instructions and phoneme streams. This format underpins contemporary advances in phoneme-based pre-training, multimodal audio generation, and robust ASR-augmented language modeling by standardizing how linguistic instructions and fine-grained phonetic content are structured for transformer architectures. It functions as a bridge between orthographic and acoustic modeling, supporting applications ranging from direct speech and music generation to error-robust semantic modeling.
1. Conversion of Orthographic Input to Phoneme Stream
Initial preprocessing involves normalization of raw text by lower-casing and removal of extraneous spaces and formatting artifacts through regular expressions. All punctuation marks are stripped, reflecting the absence of these cues in continuous speech streams ("From Babble to Words" (Goriely et al., 30 Oct 2024)). Grapheme-to-phoneme conversion is typically achieved using packages such as Python “phonemizer” with espeak-ng’s dictionary and rule-based IPA transcription for American English. The processed output consists of discrete atomic IPA symbols (47 for English) separated by ASCII spaces; inter-word boundaries may be stripped or denoted (e.g., "|") depending on the desired format. For example, “Summarize the causes of the French Revolution.” is transformed into a phoneme stream such as:
s ʌ m ə ɹ aɪ z | ð ə | k ɑ z ɪ z | ə v | ð ə | f ɹ ɛ n ʧ | ɹ ɛ v ə ˈl u ʃ ə n |
This approach precisely models natural spoken input and facilitates universal language representation (Goriely et al., 30 Oct 2024).
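A minimal sketch of this preprocessing pipeline, assuming the Python phonemizer package with the espeak-ng backend is installed; the normalization regex and separator settings are illustrative rather than the exact configuration of the cited work.

```python
import re

from phonemizer import phonemize
from phonemizer.separator import Separator


def text_to_phoneme_stream(text: str) -> str:
    """Normalize raw text and convert it to a space-separated IPA phoneme stream."""
    # Lower-case, strip punctuation, and collapse extraneous whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)       # drop punctuation marks
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of spaces

    # Grapheme-to-phoneme conversion with espeak-ng (American English IPA).
    # Phones are separated by spaces; word boundaries are marked with "|".
    return phonemize(
        text,
        language="en-us",
        backend="espeak",
        separator=Separator(phone=" ", word=" | "),
        strip=True,
    )


print(text_to_phoneme_stream("Summarize the causes of the French Revolution."))
```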
2. Tokenization and Special Delimiters
Tokenization is performed independently for each sequence—orthographic instruction or phoneme—using HuggingFace tokenizers, BPE, or character inventories. For phoneme-only models, the vocabulary is the set of phoneme symbols, with variants for subword and whitespace inclusion. Special boundary tokens such as UTT_BOUNDARY serve as start-of-utterance markers, prepended to each sentence or evaluation instance, enabling explicit demarcation of utterances without relying on punctuation.
In joint models (PhonemeBERT (Sundararaman et al., 2021)) or multimodal generation frameworks (InstructAudio (Qiang et al., 23 Nov 2025)), both instruction and phoneme sequences are tokenized using potentially distinct vocabularies (e.g., Qwen2.5 BPE for instructions, ∼70–100 phonemes for speech/music tasks). Speaker tokens ([S0], [S1]) may be used in dialogue generation to mark source identity (Qiang et al., 23 Nov 2025). Delimiters like <s> and </s> segment the input streams for BERT-based architectures, and <mask> tokens support masked language modeling.
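The sketch below shows one way such a phoneme-level tokenizer could be organized; the special-token names mirror those above, but the class itself is illustrative rather than the tokenizer of any cited framework.

```python
class PhonemeTokenizer:
    """Minimal phoneme tokenizer with utterance-boundary, speaker, and mask tokens."""

    SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<mask>", "UTT_BOUNDARY", "[S0]", "[S1]"]

    def __init__(self, phoneme_inventory: list[str]):
        # Special tokens occupy the lowest ids; phoneme symbols follow.
        self.vocab = {tok: i for i, tok in enumerate(self.SPECIAL_TOKENS)}
        for symbol in phoneme_inventory:
            self.vocab.setdefault(symbol, len(self.vocab))
        self.inv_vocab = {i: tok for tok, i in self.vocab.items()}

    def encode(self, phoneme_stream: str, speaker: str | None = None) -> list[int]:
        """Map a space-separated phoneme stream to ids, prepending boundary/speaker tokens."""
        tokens = ["UTT_BOUNDARY"]
        if speaker is not None:
            tokens.append(speaker)                 # e.g. "[S0]" in dialogue generation
        tokens += phoneme_stream.split()           # atomic IPA symbols and "|" markers
        return [self.vocab[tok] for tok in tokens]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.inv_vocab[i] for i in ids)


# Usage with a toy inventory covering the running example.
inventory = ["s", "ʌ", "m", "ə", "ɹ", "aɪ", "z", "ð", "k", "ɑ", "ɪ", "v",
             "f", "ɛ", "n", "ʧ", "u", "ʃ", "|"]
tok = PhonemeTokenizer(inventory)
ids = tok.encode("s ʌ m ə ɹ aɪ z | ð ə", speaker="[S0]")
```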
3. Embedding and Positional Encoding Mechanisms
Each token in the sequence is mapped to an embedding vector of dimension $d$ via lookup tables:
- Instruction tokens: $\mathbf{e}_i = E_{\text{ins}}[x_i]$, with $E_{\text{ins}} \in \mathbb{R}^{|V_{\text{ins}}| \times d}$
- Phoneme tokens: $\mathbf{e}_j = E_{\text{ph}}[p_j]$, with $E_{\text{ph}} \in \mathbb{R}^{|V_{\text{ph}}| \times d}$
Positional encodings are added to distinguish token location within the concatenated input. In InstructAudio, rotary positional embeddings (RoPE) are applied across the full sequence length (Qiang et al., 23 Nov 2025). Segment embeddings $\mathbf{s}_{\text{ins}}$ and $\mathbf{s}_{\text{ph}}$ provide modality identification for joint learning.
The final embedding for token $t$ is given by $\mathbf{h}_t = E_{m(t)}[x_t] + \mathbf{p}_t + \mathbf{s}_{m(t)}$, where $m(t)$ selects either the instruction or phoneme embedding table by token type, $\mathbf{p}_t$ is the positional encoding, and $\mathbf{s}_{m(t)}$ is the segment embedding. PhonemeBERT explicitly resets position indices for the second stream to avoid overlap and foster cross-attention-mediated alignment (Sundararaman et al., 2021).
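A minimal PyTorch sketch of this embedding stage, assuming learned absolute positional embeddings rather than RoPE; the class and dimension names are illustrative, and the per-stream position reset follows the PhonemeBERT description.

```python
import torch
import torch.nn as nn


class InstructionPhonemeEmbedding(nn.Module):
    """Token + positional + segment embeddings for a joint instruction/phoneme input."""

    def __init__(self, ins_vocab: int, ph_vocab: int, d_model: int = 256, max_len: int = 512):
        super().__init__()
        self.ins_emb = nn.Embedding(ins_vocab, d_model)   # instruction-token lookup table
        self.ph_emb = nn.Embedding(ph_vocab, d_model)     # phoneme-token lookup table
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned absolute positions
        self.seg_emb = nn.Embedding(2, d_model)           # segment id: 0 = instruction, 1 = phoneme

    def forward(self, ins_ids: torch.Tensor, ph_ids: torch.Tensor) -> torch.Tensor:
        # Position indices restart at zero for the phoneme stream, as in PhonemeBERT.
        ins_pos = torch.arange(ins_ids.size(-1), device=ins_ids.device)
        ph_pos = torch.arange(ph_ids.size(-1), device=ph_ids.device)

        ins = self.ins_emb(ins_ids) + self.pos_emb(ins_pos) + self.seg_emb(torch.zeros_like(ins_ids))
        ph = self.ph_emb(ph_ids) + self.pos_emb(ph_pos) + self.seg_emb(torch.ones_like(ph_ids))

        # Concatenate along the time axis: (L_ins + L_ph, d_model) for a single sample.
        return torch.cat([ins, ph], dim=-2)
```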
4. Sequence Concatenation and Input Matrix Construction
The instruction-phoneme format concatenates the embedding representations from each sequence along the time axis, producing a matrix $\mathbf{X} \in \mathbb{R}^{(L_{\text{ins}} + L_{\text{ph}}) \times d}$ for single-sample input, or a batch tensor in $\mathbb{R}^{B \times L \times d}$ for batch input (Qiang et al., 23 Nov 2025). No further separator tokens are needed, since the model is conditioned on the known sequence lengths. For transformer models, this architecture enables unified input for multi-task and cross-modal joint training, with attention blocks spanning both instruction and phoneme embeddings.
For continuous phoneme-only modeling (“From Babble to Words” (Goriely et al., 30 Oct 2024)), the input stream is flattened into fixed-length chunks matching the model's context window, with boundary markers maintained and padding applied to the final chunk if necessary.
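A small sketch of this chunking step; the chunk length and pad id below are placeholders, since the actual context window is configuration-dependent.

```python
def chunk_phoneme_ids(ids: list[int], chunk_len: int, pad_id: int = 0) -> list[list[int]]:
    """Split a flattened phoneme-id stream into fixed-length chunks, padding the last one."""
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    if chunks and len(chunks[-1]) < chunk_len:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_len - len(chunks[-1]))
    return chunks


# Usage: any UTT_BOUNDARY ids already present in the stream are preserved inside each chunk.
stream = list(range(1, 11))                      # stand-in for a tokenized phoneme stream
print(chunk_phoneme_ids(stream, chunk_len=4))    # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
```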
5. Masking, Segment Identification, and Special Cases
Standard masked language modeling (MLM) is performed on both instruction and phoneme tokens. In PhonemeBERT (Sundararaman et al., 2021), 15% of non-special tokens in each stream are masked, with 80% replaced by the mask token, 10% left unaltered, and 10% replaced by a random token of the same modality. Loss functions are computed separately for both streams as well as jointly. Segment identification is accomplished by segment embeddings (binary or vector), enabling the transformer architecture to distinguish source modalities throughout processing.
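A PyTorch sketch of this per-stream masking policy; the 15% selection rate and 80/10/10 split follow the description above, while the mask id, modality vocabulary range, and special-token set are parameters left to the caller. It assumes a 1-D stream of token ids.

```python
import torch


def mask_stream(ids: torch.Tensor, mask_id: int, vocab_low: int, vocab_high: int,
                special_ids: set[int], p: float = 0.15) -> tuple[torch.Tensor, torch.Tensor]:
    """BERT-style masking of one stream; random replacements stay within that stream's vocabulary."""
    ids = ids.clone()
    labels = torch.full_like(ids, -100)                    # -100 = position ignored by the loss

    special = torch.tensor([i in special_ids for i in ids.tolist()])
    candidates = (torch.rand(ids.shape) < p) & ~special    # 15% of non-special tokens
    labels[candidates] = ids[candidates]

    r = torch.rand(ids.shape)
    replace_mask = candidates & (r < 0.8)                  # 80%: replaced by the mask token
    replace_rand = candidates & (r >= 0.9)                 # 10%: random same-modality token
    ids[replace_mask] = mask_id
    ids[replace_rand] = torch.randint(vocab_low, vocab_high, (int(replace_rand.sum()),))
    return ids, labels                                     # remaining 10% left unaltered
```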
In dialogue and multi-speaker scenarios, prepended speaker tokens provide explicit voice attribution for each instruction-phoneme subsequence (Qiang et al., 23 Nov 2025).
6. Empirical Implications and Model Capabilities
Instruction-phoneme input formats confer several empirical advantages. Phoneme-based pre-training incurs only a minor performance degradation (1–4% relative) on text-centric language understanding benchmarks (BLiMP, GLUE), despite elimination of punctuation and word boundaries (Goriely et al., 30 Oct 2024). In contrast, domains requiring phonological discrimination—such as minimal-pair lexical tasks—see phonemic models outperform orthographic ones, reaching up to 89.6% accuracy (“From Babble to Words” (Goriely et al., 30 Oct 2024)). Multi-modal formats such as InstructAudio facilitate unified speech and music generation with fine-grained control over acoustic and musical attributes through joint transformer attention (Qiang et al., 23 Nov 2025). PhonemeBERT’s joint modeling demonstrates robustness to ASR transcription errors and enhances downstream sentiment, question, and intent classification under noisy conditions (Sundararaman et al., 2021).
A plausible implication is that instruction-phoneme input formats support LLM generalization across languages, domains, and modalities without retraining task-specific tokenizers or word-piece inventories, and they enable direct comparison with human phonological processing.
7. Implementation Summary and Representative Examples
The construction follows these canonical steps, integrating details from each principal framework:
- Normalize and clean raw text, strip punctuation.
- Convert to phoneme stream via G2P or phonemizer tool.
- Tokenize both instruction and phoneme streams (character, BPE, or SentencePiece).
- Assign segment and position embeddings, reset position indices for stream alignment if required.
- Concatenate embeddings to form unified input matrix, preserving modality information.
- (If applicable) Insert boundary markers, speaker tokens, or masks for special modeling purposes.
- Feed concatenated sequence into transformer-based architectures for pre-training, multi-task, or generative objectives.
For the prompt “Generate a calm female voice” and phoneme sequence /m ə g æ z i n/, tokenization, embedding, and concatenation yield a matrix composed of both instruction and phoneme semantic features, ready for conditioning in diffusion transformer layers (Qiang et al., 23 Nov 2025).
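As a toy, self-contained illustration of this final step (not the InstructAudio implementation), the sketch below tokenizes the instruction and phonemes with throwaway vocabularies, embeds each stream, and concatenates them along the time axis; the embedding dimension and vocabularies are placeholders, and a real system would use a BPE tokenizer for the instruction and the full IPA inventory for the phonemes.

```python
import torch
import torch.nn as nn

instruction = "generate a calm female voice"
phonemes = "m ə g æ z i n".split()

# Throwaway vocabularies for illustration only.
ins_vocab = {w: i for i, w in enumerate(instruction.split())}
ph_vocab = {p: i for i, p in enumerate(sorted(set(phonemes)))}

ins_ids = torch.tensor([ins_vocab[w] for w in instruction.split()])
ph_ids = torch.tensor([ph_vocab[p] for p in phonemes])

d_model = 8
ins_emb = nn.Embedding(len(ins_vocab), d_model)
ph_emb = nn.Embedding(len(ph_vocab), d_model)

# Concatenate along the time axis: shape (L_ins + L_ph, d_model).
x = torch.cat([ins_emb(ins_ids), ph_emb(ph_ids)], dim=0)
print(x.shape)   # torch.Size([12, 8])
```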
The instruction-phoneme input format is thus a rigorous, implementation-ready paradigm that standardizes the joint handling of natural-language control and low-level phonetic input across modern language, speech, and music models.