PASQA: Pitch-Accent Speech Quality Assessment

Updated 4 July 2026

PASQA is a model designed to assess lexical pitch accent correctness in Japanese speech by using a pseudo accent-quality score derived from synthetic accent-error perturbations.
It integrates wav2vec 2.0 acoustic features with mora-conditioned fusion and a Transformer encoder to capture language-specific prosodic patterns.
PASQA outperforms traditional MOS predictors on accent errors, correlating more closely with human judgments and highlighting localized prosodic issues.

Pitch-Accent-focused Speech Quality Assessment (PASQA) is a non-intrusive speech assessment model designed to score pitch-accent correctness in Japanese speech at the utterance level, rather than overall naturalness or generic mean opinion score (MOS). In the formulation introduced in "PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors" (Kawamura et al., 18 Jun 2026), the model takes a waveform together with text-derived mora information and outputs an accent-quality score in $[1,5]$, where higher values indicate a pitch-accent pattern closer to the correct pattern for the sentence. PASQA emerged in response to a broader finding in speech quality assessment: conventional MOS predictors can be effectively blind to localized pitch-accent errors even when native listeners strongly penalize them (Takagi et al., 18 Jun 2026).

1. Definition and scope

PASQA targets a specific quality dimension: lexical pitch accent correctness. In the Japanese setting used by the model, pitch accent is defined over accent phrases and mora, and correctness depends on whether the realized accent phrase boundaries and accent nucleus positions match the standard Tokyo Japanese pattern predicted for the text (Kawamura et al., 18 Jun 2026). This differs from broader speech quality assessment frameworks that aggregate signal quality, naturalness, intelligibility, and speaker similarity into a single score or into multiple scalar metrics. Uni-VERSA, for example, explicitly includes noise, naturalness, intelligibility, speaker characteristics, and a single prosody metric, F0-CORR, but does not define any specific metric for pitch accent, tone, stress, or rhythm (Shi et al., 27 May 2025).

The distinction between accent correctness and general naturalness is central. Conventional MOS models are trained to approximate overall naturalness or quality, whereas PASQA is explicitly trained on a pseudo objective derived from an accent-error rate. In the PASQA formulation, a sentence may receive a high global naturalness score from a generic MOS model even when its pitch accent is lexically wrong, because other dimensions such as fluency or voice quality dominate the prediction (Kawamura et al., 18 Jun 2026). The human–model discrepancy study on Japanese TTS perturbations makes the same point from another angle: all evaluated MOS models were effectively insensitive to controlled pitch-accent errors despite large subjective score drops (Takagi et al., 18 Jun 2026).

A plausible implication is that PASQA represents a shift from general-purpose MOS prediction toward language-specific prosodic quality assessment. This interpretation is consistent with the observation that Japanese listeners are highly sensitive to accent correctness and that pitch-accent correctness should be evaluated separately from generic naturalness (Yasuda et al., 2022).

2. Motivation and problem setting

The immediate motivation for PASQA is the demonstrated mismatch between human judgments and generic quality predictors on prosodic distortions. In the controlled Japanese perturbation study, human MOS dropped from $4.00 \pm 0.07$ for error-free TTS to $3.19 \pm 0.09$ under low accent-error rates and to $2.16 \pm 0.09$ under high accent-error rates, while no evaluated model changed by more than $0.1$ points across the three conditions (Takagi et al., 18 Jun 2026). The same study also reported a double dissociation for speaker characteristics: humans were sensitive to F0 variability and speaking rate, but most models were instead strongly biased by mean F0, a factor not strongly associated with human ratings (Takagi et al., 18 Jun 2026). PASQA was proposed against this background.

The problem is especially acute in Japanese because lexical pitch accent is both localized and linguistically consequential. The PASQA paper characterizes accent correctness in terms of the accent nucleus, the mora where the pitch falls within an accent phrase, and emphasizes that a single-mora shift can change lexical meaning (Kawamura et al., 18 Jun 2026). Related Japanese TTS work likewise treats incorrect pitch accent as a persistent failure mode of end-to-end synthesis, noting that listeners assign relatively low accent-correctness MOS even when systems are otherwise fairly natural acoustically (Yasuda et al., 2022).

PASQA therefore addresses a failure mode of mainstream speech assessment. Generic multi-metric models such as Uni-VERSA can predict many metrics simultaneously and include a prosody-related target, but their explicit prosodic target remains F0 Pearson correlation coefficient (F0-CORR) between enhanced and reference speech (Shi et al., 27 May 2025). PASQA instead centers the assessment problem on accent nucleus errors, accent phrase corruption, and severity ordering constructed specifically for Japanese pitch accent (Kawamura et al., 18 Jun 2026).

3. Synthetic accent-error corpus and pseudo accent-quality target

PASQA is trained on a synthetic Japanese accent-error dataset generated with a NANSY-TTS–based Japanese TTS system, itself trained on an internal prosodically annotated corpus (173,987 samples, 207.96 hours, 17 speakers) (Kawamura et al., 18 Jun 2026). A DNN-based prosodic label prediction model provides, for each text sentence, the mora sequence, accent phrase boundaries, and accent nucleus positions. These labels define the canonical accent pattern used for synthesis (Kawamura et al., 18 Jun 2026).

To create accent errors, PASQA modifies the accent nucleus positions in a subset of accent phrases. If an utterance has $P$ accent phrases and a target accent-error rate $r$, the data generation procedure samples $\max(1, \lfloor rP \rfloor)$ phrases uniformly at random and, for each selected phrase of length $L$, re-samples its accent type from $\{0,1,\ldots,L-1\}$ excluding the original type (Kawamura et al., 18 Jun 2026). The resulting actual accent-error rate is defined over mora coverage rather than phrase count: if $M_{\mathrm{err}}$ mora belong to modified accent phrases and $M$ is the total mora count, then the pseudo accent-quality score is

$$s = 5 - 4 \cdot \frac{M_{\mathrm{err}}}{M}.$$

This mapping anchors the target in the MOS-like range $[1,5]$: $s = 5$ for error-free utterances and $s = 1$ when all mora belong to corrupted phrases (Kawamura et al., 18 Jun 2026).
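The corruption and scoring procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names, the list-based representation of an utterance, the decision to skip corruption when the target rate is zero, and the restriction to phrases with at least two mora (so that an alternative accent type exists) are all assumptions made here for the example.

```python
import math
import random


def corrupt_accent_phrases(phrase_lengths, accent_types, rate, rng):
    """Return (new_accent_types, modified_indices) for one utterance.

    phrase_lengths[i] is the mora count L of accent phrase i and
    accent_types[i] is its accent type in {0, ..., L-1}.
    """
    new_types = list(accent_types)
    if rate <= 0:
        # Assumption: the error-free condition bypasses corruption entirely.
        return new_types, []
    num_to_modify = max(1, math.floor(rate * len(phrase_lengths)))
    # Assumption: a one-mora phrase has no alternative type, so it is skipped.
    candidates = [i for i, n in enumerate(phrase_lengths) if n >= 2]
    chosen = rng.sample(candidates, min(num_to_modify, len(candidates)))
    for i in chosen:
        alternatives = [t for t in range(phrase_lengths[i]) if t != accent_types[i]]
        new_types[i] = rng.choice(alternatives)
    return new_types, sorted(chosen)


def pseudo_accent_quality(phrase_lengths, modified_indices):
    """Map the mora-weighted error rate linearly onto [1, 5]."""
    total_mora = sum(phrase_lengths)
    corrupted_mora = sum(phrase_lengths[i] for i in modified_indices)
    return 5.0 - 4.0 * corrupted_mora / total_mora
```

Because the score is weighted by mora rather than by phrase, corrupting one long phrase lowers the target more than corrupting one short phrase, even though both count as a single modified phrase.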

The training data are generated in three severity conditions per text–speaker pair: error-free ($r = 0$), low severity, and high severity, with the low and high conditions using a smaller and a larger nonzero target accent-error rate $r$, respectively (Kawamura et al., 18 Jun 2026). The resulting corpus uses 91,157 sentences and 13 speakers, with 2,130,858 samples (2,898.79 hours) in the training set plus dedicated seen-speaker and unseen-speaker test sets (Kawamura et al., 18 Jun 2026).

This design is directly related to the perturbation methodology used to expose generic MOS-model failures. That earlier study also created Japanese TTS with accent errors by segmenting sentences into accentual phrases and applying a binary pitch-accent flip (high ↔ low) within the phrase, under None, Low, and High error conditions (Takagi et al., 18 Jun 2026). PASQA can be understood as converting that kind of controlled perturbation paradigm into a supervised assessment model.

4. Architecture and linguistic conditioning

PASQA builds on the SSL-MOS architecture with a wav2vec 2.0 acoustic backbone and then augments it with components specialized for Japanese pitch accent (Kawamura et al., 18 Jun 2026). The waveform input is sampled at 16 kHz, passed through wav2vec 2.0 to obtain frame-level features, and then processed by several task-specific modules (Kawamura et al., 18 Jun 2026).

The first specialized component is mora-conditioned fusion. From text, PASQA extracts a mora sequence $(m_1, \ldots, m_N)$, embeds each mora into a 256-dimensional vector, and contextualizes the sequence with a 1-layer Transformer encoder with 4 attention heads, FFN dimension 512, dropout 0.1, and rotary positional encoding (RoPE) (Kawamura et al., 18 Jun 2026). These mora representations are then fused with the acoustic frame sequence by cross-attention, where acoustic frames supply queries and mora representations supply keys and values. This architecture encodes a specifically Japanese prior: pitch accent is defined at the mora level, so acoustic evidence is conditioned on mora structure rather than on a generic text embedding (Kawamura et al., 18 Jun 2026).
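The direction of the cross-attention (frames as queries, mora as keys and values) can be illustrated with a small pure-Python sketch. It is deliberately simplified and is not the model's actual layer: it uses a single head, omits the learned query/key/value projections, assumes frame and mora vectors already share one dimension, and adds a residual connection, all of which are assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(frames, moras):
    """Single-head scaled dot-product cross-attention (no learned projections).

    frames: list of T acoustic frame vectors, used as queries.
    moras:  list of N mora vectors, used as keys and values.
    Returns (fused, weights): T fused vectors and a T x N attention matrix.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in frames:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in moras]
        attn = softmax(scores)
        context = [sum(a * value[d] for a, value in zip(attn, moras)) for d in range(dim)]
        # Assumed residual connection: keep acoustic evidence, add mora context.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights
```

Each output row has the same length as the frame sequence, so the fused representation stays frame-aligned and can feed both an utterance-level pooling step and a per-frame prediction head.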

The utterance-level scoring head follows a standard SSL-MOS pattern: mask-aware mean pooling converts the fused frame sequence into a single utterance embedding, then a 2-layer MLP with hidden size 64 produces a scalar score, followed by a tanh transform into the target range $[1,5]$ (Kawamura et al., 18 Jun 2026). PASQA also adds an auxiliary frame error head that predicts whether each frame belongs to a phrase whose accent nucleus was altered, using frame labels derived by aligning prosodic labels to audio through the TTS duration predictor (Kawamura et al., 18 Jun 2026).
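The pooling and range-mapping steps of such a scoring head can be sketched as follows. The affine form used here, center plus half-width times tanh, is one common way to squash a scalar into a bounded interval and is an assumption of this sketch; the source only states that a tanh transform maps the output into the target range. The function names are hypothetical and the MLP between the two steps is omitted.

```python
import math


def masked_mean_pool(frames, mask):
    """Average only the frames whose mask entry is truthy (non-padded)."""
    dim = len(frames[0])
    valid = [frame for frame, keep in zip(frames, mask) if keep]
    if not valid:
        raise ValueError("mask selects no frames")
    return [sum(frame[d] for frame in valid) / len(valid) for d in range(dim)]


def to_score_range(raw, low=1.0, high=5.0):
    """Squash an unbounded scalar into the interval between low and high."""
    center = (high + low) / 2.0
    half_width = (high - low) / 2.0
    return center + half_width * math.tanh(raw)
```

Masking matters in batched training: without it, zero-padded frames of shorter utterances would pull the pooled embedding toward zero and make the score depend on batch composition.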

A further component is speaker-invariant training. PASQA attaches a speaker classifier to the utterance embedding through a gradient reversal layer (GRL), so that the shared representation is optimized to confuse speaker identification while preserving accent-quality information (Kawamura et al., 18 Jun 2026). This component is intended to reduce reliance on speaker-dependent cues that generic quality models often misuse. The human–model discrepancy study makes the relevance of this design clear: mainstream MOS predictors showed strong biases toward mean F0, whereas human ratings did not (Takagi et al., 18 Jun 2026).

This architecture also places PASQA within a broader line of pitch-accent and prosody modeling. The Korean Dual-Glob framework, for example, showed that full-contour supervised contrastive learning over normalized $F_0$ trajectories can recover 16 AP tonal categories from continuous contours (Joo et al., 21 Apr 2026). Japanese PnG BERT work, by contrast, emphasized that grapheme context is essential for phrase-level prosody and that auxiliary tone prediction makes accent nucleus information much more linearly recoverable from encoder representations (Yasuda et al., 2022). PASQA combines these impulses: it is acoustic at inference time in the sense of scoring speech, but it uses explicit linguistic structure through mora-conditioned fusion.

5. Objectives, optimization, and evaluation protocol

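PASQA is trained with a weighted multi-objective loss. The main ranking component is a pairwise logistic ranking loss based on the Bradley–Terry model. If $s_i, s_j$ are the pseudo accent-quality scores of two utterances and $\hat{s}_i, \hat{s}_j$ are the corresponding predictions, PASQA defines

$$P(i \succ j) = \sigma(\hat{s}_i - \hat{s}_j) = \frac{1}{1 + \exp\left(-(\hat{s}_i - \hat{s}_j)\right)}$$

and minimizes

$$\mathcal{L}_{\mathrm{rank}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log P(i \succ j), \qquad \mathcal{P} = \{(i,j) : s_i > s_j\}.$$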
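A pairwise logistic ranking loss of the Bradley–Terry type can be written compactly in Python. The sketch below assumes that pairs are formed from all utterances in a batch whose pseudo scores differ, and that the loss is averaged over those pairs; the pairing strategy and normalization actually used in training are not specified here, so treat both as assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_ranking_loss(targets, preds):
    """Mean of -log sigmoid(pred_i - pred_j) over pairs with target_i > target_j."""
    total, count = 0.0, 0
    for i in range(len(targets)):
        for j in range(len(targets)):
            if targets[i] > targets[j]:
                total -= log_sigmoid(preds[i] - preds[j])
                count += 1
    # Assumption: a batch with no strictly ordered pair contributes zero loss.
    return total / count if count else 0.0
```

The loss depends only on differences between predictions, so it constrains ordering but not absolute scale, which is why a ranking term of this kind is typically paired with a regression term that pins predictions to the target range.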

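This is combined with an utterance-level L1 regression loss, a frame-level binary cross-entropy localization loss, and a speaker-classification cross-entropy loss connected through the GRL (Kawamura et al., 18 Jun 2026). The total objective is

$$\mathcal{L} = \lambda_{\mathrm{rank}} \mathcal{L}_{\mathrm{rank}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{frame}} \mathcal{L}_{\mathrm{frame}} + \lambda_{\mathrm{spk}} \mathcal{L}_{\mathrm{spk}},$$

where $\mathcal{L}_{\mathrm{rank}}$ is the pairwise ranking loss, $\mathcal{L}_{\mathrm{reg}}$ the L1 regression loss, $\mathcal{L}_{\mathrm{frame}}$ the frame-level localization loss, and $\mathcal{L}_{\mathrm{spk}}$ the speaker-classification loss, and the four $\lambda$ coefficients are fixed scalar weights specified in the paper (Kawamura et al., 18 Jun 2026). The GRL coefficient follows a schedule expressed as a function of a fixed hyperparameter and the normalized training progress $p$ (Kawamura et al., 18 Jun 2026).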
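For illustration, the sketch below shows a weighted sum of named loss terms together with a commonly used gradient-reversal ramp from the domain-adversarial training literature (Ganin and Lempitsky), in which the coefficient grows smoothly from 0 toward its maximum as training progresses. Whether PASQA uses exactly this ramp is an assumption, as are the default `gamma=10.0`, the `max_coeff` parameter, and the dictionary keys; none of these values are taken from the PASQA paper.

```python
import math


def grl_coefficient(progress, gamma=10.0, max_coeff=1.0):
    """Sigmoid-shaped ramp: 0 at progress 0, approaching max_coeff at progress 1.

    Assumption: this is the standard domain-adversarial schedule, used here
    only as an example of a GRL schedule over normalized training progress.
    """
    p = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return max_coeff * (2.0 / (1.0 + math.exp(-gamma * p)) - 1.0)


def total_loss(losses, weights):
    """Weighted sum of named loss terms; weights are hypothetical placeholders."""
    return sum(weights[name] * losses[name] for name in weights)
```

Starting the reversal strength near zero lets the speaker classifier and the shared encoder settle before the adversarial signal becomes strong, which is the usual motivation for ramping rather than fixing the coefficient.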

Evaluation centers on ordering by accent-error severity. Each text–speaker triplet contains error-free, low-severity, and high-severity samples, and Order Accuracy measures the fraction of triplets for which the predicted scores $\hat{s}$ preserve the correct ordering

$$\hat{s}_{\mathrm{none}} > \hat{s}_{\mathrm{low}} > \hat{s}_{\mathrm{high}}.$$
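Order Accuracy as described above reduces to a count over triplets. The following sketch assumes each triplet is given as a tuple of predicted scores in the order (error-free, low severity, high severity) and that ties count as failures; the tuple layout and tie handling are assumptions of this example.

```python
def order_accuracy(triplets):
    """Fraction of triplets whose scores satisfy error_free > low > high.

    triplets: iterable of (score_error_free, score_low, score_high).
    """
    triplets = list(triplets)
    if not triplets:
        raise ValueError("no triplets to evaluate")
    correct = sum(1 for clean, low, high in triplets if clean > low > high)
    return correct / len(triplets)
```

As a point of reference, a scorer whose three outputs were ordered uniformly at random would satisfy the strict ordering in one of the six possible permutations, so chance level under that assumption is about 0.167.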

Correlation with the pseudo accent-quality score is measured using LCC, SRCC, and KTAU (Kawamura et al., 18 Jun 2026). The paper also evaluates agreement with human ratings from 15 native Japanese speakers, who rate 120 speech samples on a 5-point scale based on pitch accent naturalness in the Tokyo dialect (Kawamura et al., 18 Jun 2026).
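The three correlation measures can be computed without external libraries, as in the sketch below. LCC is taken to be the Pearson coefficient and SRCC the Pearson coefficient of average ranks, which are the standard definitions; for KTAU the sketch uses the tie-corrected tau-b variant, which is an assumption because the source does not say which variant is reported.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC) as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b; pairs tied in both variables are ignored."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    base = concordant + discordant
    return (concordant - discordant) / math.sqrt((base + only_x_tied) * (base + only_y_tied))
```

Tie handling is relevant in this setting because pseudo accent-quality targets take a limited set of values within a severity condition, so many target pairs can be tied.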

The core quantitative result is that PASQA substantially outperforms both conventional MOS models and simpler accent-error-trained baselines on severity ordering and correlation. On the controlled accent-error test, generic models such as DNSMOS, NISQA, UTMOS, and UTMOSv2 achieve order accuracy close to random and correlations near zero, whereas PASQA reaches 0.754 order accuracy and 0.829 LCC on seen speakers, and 0.785 order accuracy and 0.879 LCC on unseen speakers (Kawamura et al., 18 Jun 2026). On the human-rating test, PASQA achieves the highest reported correlation with human accent-correctness judgments, with LCC 0.814, SRCC 0.828, and KTAU 0.614 (Kawamura et al., 18 Jun 2026).

6. Relation to broader pitch-accent and prosody research

PASQA sits at the intersection of speech quality assessment and pitch-accent modeling. On the assessment side, it can be viewed as a specialized response to the limitations of generic MOS prediction. The perturbation study on human–model discrepancies showed that most models track acoustic degradation well but are insensitive to prosodic errors, including Japanese pitch-accent manipulation (Takagi et al., 18 Jun 2026). Uni-VERSA addresses the multidimensionality of speech quality by predicting many objective metrics simultaneously, yet its explicit prosody target remains F0-CORR, which the authors themselves describe as not capturing duration or stress patterns (Shi et al., 27 May 2025). PASQA narrows the target space and defines a dedicated label construction and loss design for accent correctness (Kawamura et al., 18 Jun 2026).

On the linguistic and prosodic side, PASQA is consistent with work showing that pitch accent is both structurally constrained and context-dependent. In Japanese TTS, PnG BERT experiments found that pretraining yields representations helpful for inferring pitch accent, that grapheme context is critical, and that auxiliary tone prediction substantially improves recoverability of accent nucleus information (Yasuda et al., 2022). In English, context-sensitive pitch accent detection improved when models used full utterances and an LSTM, and ablation studies showed that pitch is the most important acoustic feature for the task (Nielsen et al., 2020). In Seoul Korean, Dual-Glob demonstrated that holistic $F_0$ contour representations can recover fine-grained AP tonal categories and become more accurate when combined with structural cues such as syllable count (Joo et al., 21 Apr 2026).

A further line of relevance comes from joint modeling. The ASR study "Pitch Accent Detection improves Pretrained Automatic Speech Recognition" trained a joint ASR + pitch accent detection model over wav2vec 2.0 representations and reported both improved accent detection and reduced ASR WER under limited-resource fine-tuning (Sasu et al., 6 Aug 2025). This suggests that explicit prosodic supervision can reshape SSL representations in ways useful for downstream tasks. PASQA applies the same general logic to assessment: rather than asking a generic encoder to discover pitch-accent sensitivity implicitly, it imposes that sensitivity through accent-error supervision, ranking constraints, localization, and speaker invariance (Kawamura et al., 18 Jun 2026).

This suggests a broader methodological pattern: prosody-aware representation learning becomes more effective when trained against explicit prosodic objectives rather than against scalar naturalness alone.

7. Limitations, assumptions, and future directions

PASQA’s training signal is synthetic and indirect. The training corpus is generated by a controllable TTS system, and the target score is a pseudo accent-quality score derived by a linear mapping from the mora-weighted accent-error rate rather than from human labels (Kawamura et al., 18 Jun 2026). The paper explicitly notes that real conversational speech with natural accent variation is not used for training, and that the pseudo score assumes each corrupted mora contributes equally and linearly to perceived accent quality (Kawamura et al., 18 Jun 2026). A plausible implication is that PASQA’s calibration may be strongest for controlled TTS-style accent errors and weaker for naturally occurring prosodic deviations.

The framework is also language-specific. PASQA is tightly coupled to a mora-based prosodic system and to a notion of accent nucleus in Japanese (Kawamura et al., 18 Jun 2026). The authors identify multilingual extension as future work, but such extension would require redefining the segmentation unit and the nature of “accent errors” for languages organized around lexical tone, stress accent, or other pitch-accent systems (Kawamura et al., 18 Jun 2026). Dual-Glob’s AP-based contour modeling for Seoul Korean and English word-level pitch accent detection provide possible templates for this redefinition, but they are not drop-in replacements because their label spaces and units differ (Joo et al., 21 Apr 2026, Nielsen et al., 2020).

Another limitation concerns dependence on upstream prosodic annotation and prediction. PASQA assumes that the DNN-based prosodic label predictor provides canonical accent patterns accurate enough to serve as reference structure (Kawamura et al., 18 Jun 2026). Errors in that predictor propagate into both synthetic corruption and training targets. The Japanese PnG BERT study identifies a related problem from the generation side: low lexical coverage in speech corpora makes pitch-accent generalization difficult even when text representations are strong (Yasuda et al., 2022). This suggests that PASQA-like models may benefit from better lexical and symbolic prosody resources, not only from more acoustic data.

Future directions stated or strongly implied in the literature include improving out-of-domain robustness, incorporating real human-labeled accent errors, and extending assessment beyond accent nucleus correctness to broader prosodic phenomena such as phrasing, rhythm, and intonational contour (Kawamura et al., 18 Jun 2026, Takagi et al., 18 Jun 2026). Uni-VERSA’s multi-metric formulation suggests another possibility: a future system could combine PASQA-style accent correctness with additional quality dimensions while keeping prosody as an explicitly supervised component rather than an implicit byproduct of naturalness prediction (Shi et al., 27 May 2025).