
StylePitcher: Expressive Pitch Generator

Updated 27 October 2025
  • StylePitcher is a universal pitch curve generator that captures fine-grained singer expressiveness and adapts to various singing tasks without requiring retraining.
  • It uses a rectified flow matching architecture to map simple prior noise to authentic pitch curves, incorporating symbolic scores and pitch context for conditional infilling.
  • Evaluated on tasks like APC, SVS, and SVC, StylePitcher outperforms baseline methods by preserving stylistic features such as vibrato and ornamentation while maintaining pitch accuracy.

StylePitcher is a general-purpose pitch curve generator for singing voice applications that learns singer-specific style from reference audio and produces expressive, style-following pitch curves aligned with a given melody. It addresses two prevailing limitations of previous pitch curve systems: the lack of nuanced singer expressiveness and restricted generalization due to task-specific training. By employing a rectified flow matching architecture, StylePitcher incorporates symbolic music scores and contextual pitch information as conditioning signals, enabling flexible adaptation to various singing tasks—including pitch correction, singing voice synthesis (SVS), and singing voice conversion (SVC)—without retraining.

1. Motivation and Conceptual Foundations

StylePitcher is motivated by limitations of existing pitch curve generators, which often disregard fine-grained expressive details (such as vibrato, glissandi, and idiosyncratic ornamentation) characteristic of specific performers. These methods commonly function as auxiliary modules tied to particular tasks (e.g., automatic pitch correction, SVS, or SVC), requiring separate retraining for each application and leading to poor generalization and inconsistent style preservation. StylePitcher aims to unify these functions as a task-agnostic, plug-and-play module: it learns the stylistic properties latent in reference audio while tightly aligning the generated pitch curves with the target melodic intent, enabling reuse across multiple singing voice tasks.

2. Architecture: Rectified Flow Matching

At the core of StylePitcher is the rectified flow matching (RFM) framework, which deterministically transports pitch curve samples from a simple prior distribution \pi_0 (e.g., a standard Gaussian) to the real data distribution \pi_1 (authentic pitch curves). The transformation is governed by an ODE:

dx_t = v_\theta(x_t, t, c)\, dt

with

x_t = (1 - t)\, x_0 + t\, x_1

where x_0 \sim \pi_0, x_1 \sim \pi_1, t \in [0, 1], c denotes conditional inputs (e.g., music score, pitch context), and v_\theta is the velocity field parameterized by a diffusion transformer. The learning objective is the flow-matching loss:

L(\theta) = \mathbb{E}_{x_0, x_1, t, c}\, \| v_\theta(x_t, t, c) - (x_1 - x_0) \|_2^2

This structure allows StylePitcher to learn a continuous and invertible mapping between prior noise and style-preserving, contextually aligned pitch curves, facilitating high-quality generation with few sampling steps.
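The training objective and ODE above can be sketched in a few lines of PyTorch. This is a minimal illustration of rectified flow matching, not the paper's implementation: the velocity model `v_theta` is passed in as a black box, and the tensor shapes and step count are assumptions.

```python
import torch

def rfm_loss(v_theta, x1, cond):
    """Flow-matching loss: E || v_theta(x_t, t, c) - (x1 - x0) ||_2^2."""
    x0 = torch.randn_like(x1)              # sample from the prior pi_0
    t = torch.rand(x1.shape[0], 1)         # t ~ U[0, 1], one per example
    x_t = (1 - t) * x0 + t * x1            # straight-line interpolation path
    target = x1 - x0                       # constant velocity along the path
    return ((v_theta(x_t, t, cond) - target) ** 2).mean()

@torch.no_grad()
def euler_sample(v_theta, cond, n_frames, steps=8):
    """Integrate dx_t = v_theta(x_t, t, c) dt from t = 0 to t = 1."""
    x = torch.randn(1, n_frames)           # start from prior noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        x = x + v_theta(x, t, cond) * dt   # Euler step along the learned flow
    return x
```

Because the learned transport paths are near-straight, a small number of Euler steps already yields usable samples, which is why few-step generation is feasible.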

3. Conditioning on Symbolic Music and Pitch Context

StylePitcher supports conditional infilling, enabling the model to synthesize missing or to-be-modified segments of a pitch curve while strictly following musical context and target notes. Given:

  • a fundamental frequency (F0) curve x = (x^1, \ldots, x^N)
  • a note sequence y = (y^1, \ldots, y^N) from the symbolic score
  • a binary mask m \in \{0, 1\}^N indicating masked (to-be-predicted) frames

the model conditions on x_{\text{ctx}} = (1 - m) \odot x (the unmasked context), y (the full score), and optionally an unvoiced indicator u to generate the masked pitch segment x_{\text{mask}} (inpainted pitch). The loss is computed only over the masked entries:

L_{\text{pitch}}(\theta) = \mathbb{E}_{\epsilon, p(x, y), t, m}\, \| m \odot [v_\theta(x_t, t, y, x_{\text{ctx}}) - (x - \epsilon)] \|_2^2

This formulation ensures that the output pitch not only carries singer style from the surrounding context but also remains tightly aligned with the symbolic melodic intent. The unvoiced indicator improves temporal alignment between the generated pitch and phoneme timings.

4. Generality and Applicability to Singing Tasks

A salient property of StylePitcher is its ability to serve as a universal, task-agnostic pitch curve generator. By framing generation as a conditional infilling problem and integrating both pitch context and symbolic information, the same trained model can be invoked for:

  • Automatic Pitch Correction (APC): Correcting segments to match the musical score while maintaining stylistic expressiveness (e.g., vibrato).
  • Zero-shot SVS with Style Transfer: Synthesis of new singing performances in the style of a reference singer, with reference audio determining local stylistic details.
  • Style-Informed SVC: Transferring both timbre and singer style (expressive pitch characteristics), outperforming methods that naively reuse reference pitch curves.

No architectural changes or retraining are required for adaptation across these domains.
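In this framing, switching tasks amounts to choosing a different mask and context for the same infilling model. The sketch below illustrates that idea only; the function name, argument layout, and the exact way reference audio is prepended as context are assumptions for illustration, not the paper's interface.

```python
import torch

def make_task_inputs(task, x_ref, n_target=0, bad_frames=None):
    """Return (x_ctx, mask) for one shared infilling model; illustrative only."""
    if task == "apc":
        # keep the sung curve, regenerate only the flagged off-pitch frames
        m = torch.zeros_like(x_ref)
        m[:, bad_frames] = 1.0
        x = x_ref
    elif task in ("svs", "svc"):
        # prepend reference pitch as unmasked style context and
        # mask the entire target region to be generated
        pad = torch.zeros(x_ref.shape[0], n_target)
        x = torch.cat([x_ref, pad], dim=-1)
        m = torch.cat([torch.zeros_like(x_ref), torch.ones_like(pad)], dim=-1)
    else:
        raise ValueError(f"unknown task: {task}")
    x_ctx = (1 - m) * x                    # zero out the frames to be predicted
    return x_ctx, m
```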

5. Evaluation: Objective and Subjective Performance

StylePitcher is benchmarked on the Chinese GTSinger dataset. Objective results demonstrate superior performance compared to established, task-specific baselines:

Task | Metric                    | StylePitcher   | Baseline (Diff-Pitcher / StyleSinger)
APC  | RPA (raw pitch accuracy)  | best or parity | slightly higher pitch precision
APC  | RCA (raw chroma accuracy) | superior       | lower
APC  | OA (overall accuracy)     | superior       | lower
SVS  | Style similarity          | higher         | lower
SVS  | Audio quality             | comparable     | comparable

A trained LSTM classifier achieves near-random (50%) discrimination accuracy between real and StylePitcher-generated curves, indicating generated outputs are perceptually close to natural. Subjective listening tests confirm improved expressiveness (e.g., vibrato, musical ornamentation) and style similarity, while maintaining pitch accuracy. Minor trade-offs in pitch correction precision for extremely strict tasks (as compared to Diff-Pitcher) are offset by better preservation of stylistic features.

6. Summary and Prospects

StylePitcher demonstrates that rectified flow, when augmented with conditional infilling and symbolic-musical alignment, is effective for expressive, style-preserving pitch curve generation across a diverse array of singing tasks. The design permits seamless, plug-and-play integration in APC, SVS, and SVC setups, eliminating the need for application-specific retraining.

Future directions proposed include development of content-aware generation that further leverages semantic and performance cues, and broadening the framework to encompass other facets of performance synthesis (e.g., loudness, timing, articulation). This suggests the potential for StylePitcher to form the basis of more comprehensive and holistic singing voice generation pipelines.
