Impact of subword vs. phoneme text inputs on PM-RoPE duration control

Determine how the choice between phoneme sequences and SentencePiece subword tokens as text inputs affects the duration-control effectiveness of Progress-Monitoring Rotary Position Embedding (PM-RoPE) in encoder-decoder codec language models such as T5Gemma-TTS. Ideally, this would be done via a controlled ablation that varies only the text input representation.

Background

VoiceStar originally applied PM-RoPE using phoneme sequences, which provide a monotonic alignment between text and audio. T5Gemma-TTS instead uses T5Gemma's SentencePiece subword tokens to avoid language-specific phonemization and to leverage pretrained embeddings.

The authors note that they did not perform an ablation isolating the effect of this design difference on PM-RoPE's duration-control behavior. It therefore remains uncertain whether subword inputs change PM-RoPE's effectiveness relative to phoneme inputs.
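One way such an ablation could be scored is sketched below. It assumes each input condition (phoneme or subword) is evaluated on the same utterances with the same requested target durations, and that duration control is summarized as the mean relative error between requested and generated durations. The function names, the metric, and the condition labels are illustrative assumptions; none of them come from the T5Gemma-TTS report or from VoiceStar.

```python
from statistics import mean


def mean_relative_duration_error(target_s, generated_s):
    """Mean of |generated - target| / target over paired utterances.

    Hypothetical metric chosen for illustration; durations are in seconds.
    """
    if len(target_s) != len(generated_s):
        raise ValueError("target and generated lists must have equal length")
    if not target_s:
        raise ValueError("need at least one utterance")
    if any(t <= 0 for t in target_s):
        raise ValueError("target durations must be positive")
    return mean(abs(g - t) / t for t, g in zip(target_s, generated_s))


def compare_conditions(target_s, generated_by_condition):
    """Score every input condition against the same target durations.

    generated_by_condition maps a label (e.g. "phoneme", "subword") to the
    list of generated durations for that condition. Labels are assumptions.
    """
    return {
        label: mean_relative_duration_error(target_s, generated)
        for label, generated in generated_by_condition.items()
    }
```

Holding the target durations, evaluation texts, and model configuration fixed across conditions keeps any difference in this score attributable to the text input representation alone, which is the isolation the open question asks for.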

References

The effect of this phoneme-vs-subword choice on PM-RoPE's duration control effectiveness has not been ablated in this work and remains an open question for future investigation.

T5Gemma-TTS Technical Report (2604.01760 - Arata et al., 2 Apr 2026) in Related Work, Section 2.2 (Encoder-Decoder Architectures in TTS)