Bezier Distillation

Updated 20 March 2026

Bezier Distillation is a knowledge distillation framework that uses Bezier-curve interpolation to integrate multi-teacher guidance in flow-based generative models.
It replaces straight-line ODE flows with multi-step Bezier trajectories, mitigating error accumulation and improving convergence.
The method enables efficient, high-fidelity mappings in tasks like image synthesis by reducing overall distillation error with fewer model iterations.

Bezier Distillation is a knowledge distillation framework for flow-based generative modeling that leverages Bezier-curve interpolation through intermediate “teacher” distributions to mitigate error accumulation in rectified flows. Conventional rectified-flow distillation repeatedly composes straight-line ODE flows between a base distribution and a target, then distills the result. Bezier Distillation extends this by replacing the straight-line coupling with multi-step Bezier trajectories anchored by intermediate rectified flows. This lets the student model acquire more accurate mappings between source and target distributions with reduced cumulative error, and supports multi-teacher guidance for improved convergence and sample quality (Feng et al., 20 Mar 2025).

1. Foundational Concepts and Motivation

Rectified Flow (Liu et al., 2022) is a family of continuous-time generative models that learn a transport ODE from a base noise distribution $T_0$ (e.g., Gaussian) to a target data distribution $T_1$. The goal is to learn a coupling (transport map) $T:\mathbb{R}^d \to \mathbb{R}^d$ such that $T(X_0)\sim T_1$ given $X_0\sim T_0$. The ODE, together with the straight-line interpolant $X_t$ along which it is trained, is

$\frac{dX_t}{dt} = v(X_t, t), \quad X_t = t X_1 + (1-t) X_0, \quad t\in[0,1]$

where $v$ is a neural “drift” network. The model is trained to minimize

$\min_v \int_{t=0}^1 \mathbb{E}_{X_0,X_1}\Big[\|X_1 - X_0 - v(X_t, t)\|^2\Big] dt$

constraining the flow’s tangent to match the straight-line difference $X_1-X_0$ along interpolated segments. Models can be further refined by applying $k$ sequential rectified flows, progressively straightening the induced coupling.
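The objective above can be estimated by Monte Carlo on a batch. The following is a minimal sketch, assuming NumPy and a velocity field passed in as a plain callable; the names `rectified_flow_loss`, `shift_field`, and `zero_field`, and the toy shifted-Gaussian data, are illustrative assumptions rather than anything specified by the source.

```python
import numpy as np

def rectified_flow_loss(v, x0, x1, rng):
    """Monte Carlo estimate of the rectified-flow objective for one batch."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))  # one time per sample
    xt = t * x1 + (1.0 - t) * x0                      # straight-line interpolant X_t
    target = x1 - x0                                  # straight-line velocity X_1 - X_0
    residual = target - v(xt, t)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 2))       # base samples
x1 = x0 + np.array([3.0, -1.0])          # toy "data": a shifted copy of the noise

def shift_field(x, t):
    # Hypothetical field that already equals the true displacement.
    return np.broadcast_to(np.array([3.0, -1.0]), x.shape)

def zero_field(x, t):
    # Hypothetical untrained field.
    return np.zeros_like(x)

loss_exact = rectified_flow_loss(shift_field, x0, x1, rng)
loss_zero = rectified_flow_loss(zero_field, x0, x1, rng)
```

In this toy setting the exact field drives the loss to (numerically) zero, while the zero field leaves the full squared displacement $3^2 + 1^2 = 10$ as residual.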

Rectified-flow distillation compresses multiple such rectifications into a single “student” network through supervised regression

$\min_v \mathbb{E}_{X_0, X_k}[\|X_k - X_0 - v(X_0, 0)\|^2]$

allowing direct prediction from $X_0$ to $X_k$ in one pass.

However, iterative rectification leads to error accumulation: the discretization error of each ODE solve ($O(h^p)$ for an order-$p$ integrator with step size $h$) and the model approximation error compound as $k$ increases, often degrading overall mapping fidelity. This motivates methods for reducing distillation error while retaining fast inference (Feng et al., 20 Mar 2025).

2. Bezier-Curve Guided Distillation Framework

Bezier Distillation addresses error compounding by formulating the target flow as a degree-$n$ Bezier curve in state space, parameterized by a series of control points $\{P_0, \ldots, P_n\}$ corresponding to the initial sample ($P_0=X_0$), samples from one or more intermediate “teacher” distributions ($P_i=X_{\tau_i}$), and the final data sample ($P_n=X_1$).

The general Bezier curve is given by

$B(t) = \sum_{i=0}^n \binom{n}{i} (1-t)^{n-i} t^i P_i, \quad t\in[0,1]$

ensuring smooth, convex-hull-bounded paths between endpoints. Writing $B_n(t)$ for this degree-$n$ curve, its tangent at $t$ is the time derivative $\dot{B}_n(t) = n\sum_{i=0}^{n-1} \binom{n-1}{i} (1-t)^{n-1-i} t^i (P_{i+1}-P_i)$. The control points $P_i$ are generated using teacher rectified flows at specific intermediate times $\tau_i$: $X_{\tau_i} = X_0 + U_{\tau_i}(X_0, 0)$, where $U_{\tau_i}$ is the rectified flow map at time $\tau_i$.
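As an illustration of the curve and its tangent, the sketch below evaluates a Bezier curve of arbitrary degree from its Bernstein form, and obtains the tangent from the standard identity that the derivative of a degree-$n$ Bezier curve is $n$ times the degree-$(n-1)$ curve through the control-point differences. The function names `bezier_point` and `bezier_tangent` are hypothetical; NumPy is assumed.

```python
import numpy as np
from math import comb

def bezier_point(control_points, t):
    """Evaluate B_n(t) = sum_i C(n, i) (1 - t)^(n - i) t^i P_i."""
    P = np.asarray(control_points, dtype=float)   # shape (n + 1, d)
    n = P.shape[0] - 1
    weights = np.array(
        [comb(n, i) * (1 - t) ** (n - i) * t ** i for i in range(n + 1)]
    )
    return weights @ P

def bezier_tangent(control_points, t):
    """Evaluate dB_n/dt as n times the Bezier curve through P_{i+1} - P_i."""
    P = np.asarray(control_points, dtype=float)
    n = P.shape[0] - 1
    return n * bezier_point(np.diff(P, axis=0), t)

# Quadratic example with three 2-D control points.
P = [[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]]
midpoint = bezier_point(P, 0.5)      # [1.25, 1.0]
velocity = bezier_tangent(P, 0.5)    # [3.0, 0.0]
```

For $n=1$ the tangent reduces to the constant $P_1 - P_0$, which is the straight-line velocity used by rectified flow.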

The student network $v(\cdot, t)$ is trained to match the velocity of the Bezier curve:

$\min_v \int_{t=0}^1 \mathbb{E}_{X_0, X_{\tau_1}, \ldots, X_1} \|\dot{B}_n(t) - v(B_n(t), t)\|^2 dt$

The loss reduces to the earlier rectified-flow objective when $n=1$, and admits quadratic (one teacher) and cubic (two teachers) specializations detailed below.

Quadratic (Degree-2) Path

Control points: $P_0=X_0$, $P_1=X_\tau$, $P_2=X_1$
Bezier trajectory: $B_2(t) = (1-t)^2X_0 + 2t(1-t)X_\tau + t^2X_1$
Tangent: $\dot{B}_2(t) = 2\big[t(X_1-X_0) + (1-2t)U_\tau(X_0,0)\big]$
Loss: $\min_v\int_0^1 \mathbb{E}\left[\left\|2\big[t(X_1-X_0)+(1-2t)U_\tau(X_0,0)\big]-v(B_2(t),t)\right\|^2\right]dt$
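
The tangent expression follows from differentiating $B_2(t)$ and substituting the teacher relation $X_\tau = X_0 + U_\tau(X_0,0)$. A short worked derivation, using only the definitions above:

$\dot{B}_2(t) = 2(1-t)(X_\tau - X_0) + 2t(X_1 - X_\tau)$

$X_\tau - X_0 = U_\tau(X_0,0), \quad X_1 - X_\tau = (X_1 - X_0) - U_\tau(X_0,0)$

$\dot{B}_2(t) = 2\big[(1-t)U_\tau(X_0,0) + t(X_1 - X_0) - tU_\tau(X_0,0)\big] = 2\big[t(X_1-X_0) + (1-2t)U_\tau(X_0,0)\big]$

At $t=0$ the tangent is $2U_\tau(X_0,0)$, so the curve leaves $X_0$ in the direction of the teacher displacement. At $t=\tfrac{1}{2}$ the teacher term vanishes and the tangent equals the straight-line difference $X_1 - X_0$.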

Cubic (Degree-3) Path, Multi-Teacher

Control points: $P_0=X_0$, $P_1=X_\tau$, $P_2=X_{\tau'}$, $P_3=X_1$
The cubic curve and its tangent are given in Eqs. (8)-(9) of (Feng et al., 20 Mar 2025), together with the corresponding multi-teacher loss.
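
For orientation, the degree-3 instance of the general Bezier formula with these four control points is written out below. This is the standard cubic Bezier form and its derivative, not a transcription of the cited equations, whose exact presentation may differ.

$B_3(t) = (1-t)^3 X_0 + 3t(1-t)^2 X_\tau + 3t^2(1-t) X_{\tau'} + t^3 X_1$

$\dot{B}_3(t) = 3\big[(1-t)^2 (X_\tau - X_0) + 2t(1-t)(X_{\tau'} - X_\tau) + t^2 (X_1 - X_{\tau'})\big]$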

The framework generalizes to arbitrary degree $n$, with teachers and control points at associated times $\tau_1, \ldots, \tau_{n-1}$.

3. Multi-Teacher Distillation Design

Bezier Distillation is inherently a multi-teacher knowledge distillation method. Each teacher consists of a (possibly multi-step) rectified flow map $U_{\tau_i}$ producing a distribution $T_{\tau_i}$. Teacher guidance is realized by providing intermediate couplings, allowing the Bezier student to interpolate along more accurate and smoother paths than piecewise straight-line or high-step rectified-flow distillation.

The student network is a parameterized vector field $v_\theta:\mathbb{R}^d\times[0,1]\rightarrow\mathbb{R}^d$. At inference, trajectories are produced by numerically solving the ODE

$\frac{d X_t}{dt} = v_\theta(X_t, t)$

from $t=0$ ($X_0$) towards $t=1$ ($X_1$), requiring only one or a few function evaluations for fast sampling.
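A minimal sketch of this sampling step, assuming a fixed-step forward Euler integrator (the source does not name a specific solver) and a student exposed as a plain callable; `euler_sample` is a hypothetical helper name.

```python
import numpy as np

def euler_sample(v, x0, num_steps=1):
    """Integrate dX/dt = v(X, t) from t=0 to t=1 with forward Euler."""
    x = np.array(x0, dtype=float)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        x = x + h * v(x, t)   # one function evaluation per step
    return x

# With num_steps=1 this is the one-pass prediction X_0 + v(X_0, 0).
x0 = np.array([1.0, -2.0])
one_step = euler_sample(lambda x, t: np.array([0.5, 0.5]), x0, num_steps=1)
```

When the learned field is close to constant along each trajectory, a single step is already accurate. Otherwise `num_steps` trades function evaluations for accuracy.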

The objective can incorporate additional regularization or teacher-consistency terms:

$L(\theta) = ... + \lambda\sum_i\mathbb{E}\left[\|v(X_{\tau_i}, \tau_i) - U_{\tau_i}(X_0, 0)\|^2\right]$

This provides a direct route for integrating multiple knowledge sources and controlling the tradeoff between teacher fidelity and student generalization.
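The consistency term can be sketched on a batch as follows. This assumes the teacher displacements $U_{\tau_i}(X_0, 0)$ have been precomputed as arrays and that the student is a plain callable; the function name `teacher_consistency_penalty` and its argument layout are illustrative assumptions, and only the $\lambda$-weighted sum is computed, not the elided remainder of the objective.

```python
import numpy as np

def teacher_consistency_penalty(v, x0, teacher_displacements, taus, lam):
    """lam * sum_i mean_b || v(x_tau_i, tau_i) - U_tau_i(x0, 0) ||^2."""
    total = 0.0
    for u, tau in zip(teacher_displacements, taus):
        x_tau = x0 + u                               # teacher control point X_tau_i
        t = np.full((x0.shape[0], 1), tau)           # broadcast tau_i over the batch
        residual = v(x_tau, t) - u
        total += np.mean(np.sum(residual ** 2, axis=1))
    return lam * float(total)

# Toy check with a zero student and two constant teacher displacements.
x0 = np.zeros((4, 2))
displacements = [np.ones((4, 2)), 2.0 * np.ones((4, 2))]
penalty = teacher_consistency_penalty(
    lambda x, t: np.zeros_like(x), x0, displacements, taus=[0.3, 0.6], lam=0.5
)
```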

4. Error Accumulation and Numerical Analysis

Standard rectified-flow distillation is sensitive to numerical errors accrued across repeated ODE solutions. For an integrator of order $p$ with step size $h$, the ODE solve in the $i$-th rectification contributes a discretization error $\epsilon_i$ of size $O(h^p)$, so with $k$ rectifications the cumulative deviation scales as $O(k h^p)$:

$\|\widetilde{X}_k - X_1\| \approx \sum_{i=1}^{k}\|\epsilon_i\| = O(k h^p)$

The student’s final error inherits this accumulation in expectation:

$\mathbb{E}\big[\|\widehat{T}(X_0) - X_1\|^2\big] \geq C_1 k^2 h^{2p} + C_2 \delta_{\rm model}^2$

with model fitting error $\delta_{\rm model}$. As $k$ increases (for more accurate straightening), the effect of accumulated error outweighs the benefits of more “rectified” couplings, leading to suboptimal student performance.
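The linear growth in $k$ can be seen in a toy setting that is not taken from the source: each "stage" integrates the scalar ODE $dx/dt = 2t$ over $[0,1]$ with forward Euler ($p=1$), whose exact stage map adds 1. Each stage then undershoots by exactly $h$, and composing $k$ stages gives a total error of $kh$. The names `euler_stage` and `composed_error` are hypothetical.

```python
def euler_stage(x, num_steps):
    """One stage: forward Euler on dx/dt = 2t over [0, 1] (exact map is x + 1)."""
    h = 1.0 / num_steps
    for j in range(num_steps):
        x = x + h * (2.0 * j * h)   # velocity evaluated at the left endpoint t_j = j*h
    return x

def composed_error(k, num_steps):
    """Absolute error after composing k Euler-integrated stages from x = 0."""
    x = 0.0
    for _ in range(k):
        x = euler_stage(x, num_steps)
    return abs(x - k)               # exact result after k stages is k

err_k1 = composed_error(1, 10)   # about 0.1 = h
err_k4 = composed_error(4, 10)   # about 0.4 = k * h
```

Halving $h$ halves the error for fixed $k$, while doubling $k$ doubles it, which matches the $O(k h^p)$ scaling with $p=1$ in this example.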

Bezier Distillation alleviates this by interpolating through intermediate distributions generated by finite (limited) application of teacher flows, avoiding direct dependence on repeatedly composed, error-prone mappings. This results in a more robust student with reduced total error.

5. Training Procedure and Pseudocode

Training proceeds by constructing batches of Bezier-curve paths through control points generated by teacher flows. At each iteration:

Sample noise vectors $x_0^b$ from $T_0$.
For each teacher $U_{\tau_i}$, compute $x_{\tau_i}^b = x_0^b + U_{\tau_i}(x_0^b, 0)$.
Sample $t$ uniformly in $[0,1]$, and construct the Bezier point $B^b$ with control points $x_0^b$, $x_{\tau_1}^b$, ..., $x_1^b$.
Compute the tangent $\dot{B}^b$ at $t$.
Compute the network output $v^b = v_\theta(B^b, t)$ and the loss $L = (1/N) \sum_{b=1}^{N} \|\dot{B}^b - v^b\|^2$, where $N$ is the batch size.
Update parameters: $\theta \leftarrow \theta - \eta \nabla_\theta L$.
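
The steps above can be sketched for the quadratic (one-teacher) case as follows. This is an illustrative toy, not the paper's pseudocode: the student is assumed to be a linear model $v_\theta(x,t) = [x, t, 1]\,W$ so that the gradient can be written by hand in NumPy, and the teacher displacement and data coupling are hypothetical constant shifts. All function names are assumptions.

```python
import numpy as np

def features(x, t):
    """Inputs of the toy linear student: [x, t, 1]."""
    return np.concatenate([x, t, np.ones_like(t)], axis=1)

def train_step(W, x0, x_tau, x1, rng, lr=0.05):
    """One gradient step on the quadratic Bezier velocity-matching loss."""
    n_batch = x0.shape[0]
    t = rng.uniform(0.0, 1.0, size=(n_batch, 1))
    # Bezier point and tangent for control points (x0, x_tau, x1).
    b = (1 - t) ** 2 * x0 + 2 * t * (1 - t) * x_tau + t ** 2 * x1
    b_dot = 2 * ((1 - t) * (x_tau - x0) + t * (x1 - x_tau))
    phi = features(b, t)
    residual = b_dot - phi @ W
    loss = float(np.mean(np.sum(residual ** 2, axis=1)))
    grad = -2.0 / n_batch * phi.T @ residual     # dL/dW for the linear student
    return W - lr * grad, loss

rng = np.random.default_rng(0)
W = np.zeros((4, 2))                             # (d + 2, d) with d = 2
losses = []
for _ in range(300):
    x0 = rng.standard_normal((512, 2))           # sample noise from T_0
    x_tau = x0 + np.array([1.0, 0.0])            # hypothetical teacher displacement
    x1 = x0 + np.array([2.0, 2.0])               # hypothetical final coupling
    W, loss = train_step(W, x0, x_tau, x1, rng)
    losses.append(loss)
```

In this toy the target tangent is $[2, 4t]$, which the linear student can represent exactly, so the loss decreases toward zero. A neural student would replace the hand-written gradient with automatic differentiation.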

At test time, $\dot{X} = v_\theta(X, t)$ is integrated from $t=0$ to $t=1$ starting at $X_0$. The complete pseudocode is given in (Feng et al., 20 Mar 2025).

6. Reported Empirical Observations and Open Issues

The available draft states that Bezier Distillation outperforms standard rectified-flow distillation with fewer iterations, achieves improved sample quality versus single- or two-step baselines, and exhibits strong performance in image-to-image translation tasks. The manuscript, however, does not specify:

Benchmark datasets (e.g., ImageNet, CIFAR-10, CelebA).
Quantitative performance metrics (e.g., FID, IS, PSNR, SSIM).
Detailed comparative results (baseline scores, number of function calls).
Ablation over curve degree ( $n$ ) and number of teachers.

The mathematical and algorithmic formulation provided facilitates reproducibility and independent benchmarking on standard image synthesis and translation datasets, allowing direct comparison with both classical rectified-flow models and alternative distillation or acceleration approaches.

7. Context and Significance

Bezier Distillation generalizes the distillation paradigm in ODE-based generative modeling by integrating multi-teacher supervision through Bezier-curve interpolation, providing a smoother and more robust framework for compressing deep generative flows. The formulation admits straightforward generalization to arbitrary interpolation paths and arbitrarily many teachers, and can be combined with existing consistency regularizers. A plausible implication is improved efficiency in sample synthesis and accelerated convergence for high-fidelity generative modeling, especially as multi-teacher and geometric guidance techniques gain prominence in diffusion and flow-based generative learning (Feng et al., 20 Mar 2025). Experimental completion and independent evaluation remain open for further confirmation and quantitative assessment.

References (1)

Bezier Distillation (2025)
