- The paper introduces deep transformation models, a neural network framework for probabilistic regression that estimates the full conditional probability distribution rather than just a point estimate.
- The model constructs a flexible transformation by chaining functions, including a Bernstein polynomial MLT, with parameters learned by neural networks to capture complex distribution shapes.
- Experiments demonstrate the model's ability to effectively capture complex outcomes like multimodality and heteroscedasticity, showing competitive performance on various benchmark datasets.
The paper introduces a neural network based transformation model for probabilistic regression that estimates the full conditional probability distribution (CPD) of a real-valued response rather than merely a point estimate. The approach integrates concepts from classical statistical transformation models—specifically the most likely transformation (MLT)—with ideas from deep learning’s normalizing flows (NF). The resulting framework is capable of capturing complex, non-Gaussian and heteroscedastic outcome distributions.
Model Architecture and Methodology
- Chained Transformation Functions:
The model constructs a parameterized bijective transformation by composing a sequence of transformation functions. The overall transformation is given by
$$ z = h_\theta(y) = f_{3,\alpha(x),\beta(x)} \circ f_{2,\vartheta(x)} \circ \sigma \circ f_{1,a(x),b(x)}(y), $$
where:
- f1 scales and shifts the input y, $f_{1,a(x),b(x)}(y) = a(x)\,y + b(x)$; the subsequent sigmoid layer σ then maps the result into (0,1), as the Bernstein basis requires. Positivity of the scale, a(x) > 0, is enforced via a softplus activation.
- f2 implements the MLT using a Bernstein polynomial basis of order M,
$$ h_\vartheta^{\mathrm{MLT}}(\tilde{y} \mid x) = \sum_{i=0}^{M} \mathrm{Be}_i(\tilde{y}) \, \frac{\vartheta_i(x)}{M+1}, $$
ensuring monotonicity by imposing $\vartheta_0 < \vartheta_1 < \dots < \vartheta_M$. This flexible representation allows the model to capture multimodality and non-linear shapes in the conditional distributions.
- f3 applies a second scale-and-shift transformation to align the output with the standard normal base distribution. Its parameters, α(x) and β(x), are likewise estimated adaptively from x. (A minimal code sketch of the full chain follows this list.)
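To make the chain concrete, here is a minimal NumPy sketch of $h_\theta$ for a single, fixed parameter set. All function and argument names (`h_transform`, `monotone_coeffs`, `a_raw`, ...) are ours, and the softplus constraint on α is an assumption mirroring the stated one on a(x); the Bernstein basis functions $\mathrm{Be}_i$ are taken to be Beta$(i{+}1, M{-}i{+}1)$ densities, as in the MLT literature.

```python
# Minimal NumPy sketch of the chained transformation z = f3 ∘ f2 ∘ σ ∘ f1(y).
# Names and the softplus constraint on alpha are our assumptions, not the paper's code.
import numpy as np
from scipy.stats import beta

def softplus(u):
    return np.log1p(np.exp(u))

def monotone_coeffs(theta_raw):
    """Map unconstrained parameters to increasing coefficients ϑ_0 < ϑ_1 < ... < ϑ_M."""
    increments = softplus(theta_raw[1:])          # strictly positive steps
    return np.concatenate([theta_raw[:1], theta_raw[0] + np.cumsum(increments)])

def h_transform(y, a_raw, b, theta_raw, alpha_raw, beta_shift, M=10):
    a = softplus(a_raw)                           # a(x) > 0 via softplus
    y1 = a * y + b                                # f1: scale and shift
    y_tilde = 1.0 / (1.0 + np.exp(-y1))           # σ: squash into (0, 1)
    theta = monotone_coeffs(theta_raw)            # monotone Bernstein coefficients
    i = np.arange(M + 1)
    # f2: Bernstein basis Be_i = Beta(i+1, M-i+1) densities, weighted by ϑ_i / (M+1)
    basis = beta.pdf(y_tilde[..., None], i + 1, M - i + 1)
    z2 = basis @ theta / (M + 1)
    alpha = softplus(alpha_raw)                   # assumed positive, mirroring a(x)
    return alpha * z2 + beta_shift                # f3: final scale and shift

# Toy usage with fixed (x-independent) parameters:
rng = np.random.default_rng(0)
z = h_transform(rng.normal(size=5), a_raw=1.0, b=0.0,
                theta_raw=rng.normal(size=11), alpha_raw=1.0, beta_shift=0.0)
```

In the full model, the raw parameters passed to `h_transform` would come from network heads conditioned on x rather than being fixed constants.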
- Parameter Estimation via Neural Networks:
Each set of transformation parameters (i.e., $(a, b)$, $\vartheta_0, \dots, \vartheta_M$, and $(\alpha, \beta)$) is produced by a neural network whose input is the conditioning variable x. This end-to-end architecture lets the model learn complex dependencies between the input and the shape of the target distribution. Training proceeds by maximum likelihood estimation via the standard change-of-variables formula
$$ f_y(y \mid x) = f_z\big(h_\theta(y)\big) \cdot \left| h'_\theta(y) \right|, $$
where $f_z$ is a simple base density (typically a standard Gaussian) and the Jacobian term $|h'_\theta(y)|$ ensures proper normalization.
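As a sketch of the resulting training objective, the per-sample negative log-likelihood can be written directly from the change-of-variables formula, reusing `h_transform` from the sketch above. A finite-difference derivative stands in for the Jacobian here; in practice an autodiff framework (or the analytic Bernstein derivative) would supply it.

```python
# Hedged sketch of the NLL objective under the change-of-variables formula.
import numpy as np
from scipy.stats import norm

def neg_log_likelihood(y, params, eps=1e-5):
    z = h_transform(y, **params)
    # |h'(y)| approximated by central finite differences (sketch only)
    dz = (h_transform(y + eps, **params) - h_transform(y - eps, **params)) / (2 * eps)
    # -E[ log f_z(h(y)) + log |h'(y)| ]
    return -np.mean(norm.logpdf(z) + np.log(np.abs(dz)))

# Toy usage with fixed parameters:
rng = np.random.default_rng(1)
params = dict(a_raw=1.0, b=0.0, theta_raw=rng.normal(size=11),
              alpha_raw=1.0, beta_shift=0.0)
print(neg_log_likelihood(rng.normal(size=100), params))
```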
Experimental Evaluation
- Simulated One-Dimensional Data:
- A sinusoidal process with heteroscedastic, exponentially distributed noise, where the deep transformation model (DL_MLT with M=10) produces CPDs that adapt to the changing variance and the skewed, potentially bimodal structure, achieving an NLL of -2.00 compared to -0.85 for a simple linear transformation model (LTM). (A hypothetical generator for such data is sketched after these bullets.)
- A challenging bimodal outcome where the spread between modes is x-dependent, with DL_MLT effectively capturing such complexity while the LTM fails to adapt sufficiently.
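The exact simulation settings are not reproduced in this summary; the following is a purely hypothetical generator with the stated ingredients (a sinusoidal signal plus exponentially distributed noise whose scale depends on x), intended only to illustrate the kind of data involved.

```python
# Hypothetical data generator: sinusoidal mean, heteroscedastic exponential noise.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=2000)
noise_scale = 0.2 + 0.3 * np.abs(np.sin(x))   # assumed x-dependent noise scale
y = np.sin(x) + rng.exponential(noise_scale)  # skewed, heteroscedastic outcome
```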
- UCI Benchmark Datasets:
- Numerical results demonstrate that DL_MLT is competitive and, in certain cases (e.g., Naval, Wine), it clearly outperforms competing methods.
- For instance, on the Wine dataset, the model’s capability to predict multimodal CPDs (accounting for the inherent discrete-like quality of subjective wine quality ratings) leads to a notable improvement in NLL. Increasing the Bernstein polynomial order from M=10 to M=20 in this case further reduces the test NLL from an already competitive value to 0.40±0.025.
- Age Estimation from Facial Images:
The model is also applied to the UTKFace dataset for age estimation. Here, a convolutional neural network (CNN) extracts features from images, which then drive the transformation model. The method accounts for both the non-negativity of age and the increasing uncertainty in age estimation for older subjects. After training, the model achieves a test NLL of 3.83 and produces CPDs that are narrow for infants but progressively broaden with increasing age, aligning with expected domain-specific uncertainty.
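A plausible PyTorch sketch of this setup is shown below: a small CNN maps an image to a feature vector, and separate linear heads emit the raw transformation parameters (a, b), $\vartheta_0, \dots, \vartheta_M$, and (α, β). The architecture details (layer sizes, input resolution) are our assumptions; the paper's exact CNN is not reproduced here.

```python
# Hypothetical CNN backbone with parameter heads for the transformation model.
import torch
import torch.nn as nn

M = 10  # Bernstein polynomial order

class ParamCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.scale_shift_in = nn.Linear(32, 2)   # raw (a, b) for f1
        self.bernstein = nn.Linear(32, M + 1)    # raw ϑ parameters for f2
        self.scale_shift_out = nn.Linear(32, 2)  # raw (α, β) for f3

    def forward(self, img):
        h = self.features(img)
        return self.scale_shift_in(h), self.bernstein(h), self.scale_shift_out(h)

# One forward pass on a dummy batch of 64x64 RGB face crops:
params = ParamCNN()(torch.randn(8, 3, 64, 64))
```

The raw outputs would then be passed through the softplus and cumulative-sum constraints described earlier before entering the transformation chain.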
Discussion and Future Directions
- Interpretability and Flexibility:
The integration of Bernstein-polynomial-based transformations ensures that the estimated CPDs are smooth and monotonic, avoiding the overfitting tendencies sometimes observed with mixture density networks. The approach is flexible enough to model arbitrary distribution shapes while requiring only light regularization on moderately sized datasets.
- Extensions to Discrete, Truncated, and Censored Outcomes:
The paper discusses potential extensions toward handling discrete outcomes, as well as truncated or censored data, an important characteristic in survival analysis and other fields. For example, adapting the likelihood for ordered categories or right-censoring would further generalize the model; the censored case is sketched below.
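As a gloss on the right-censoring case (standard in the MLT literature, not an equation taken from the paper), the likelihood contribution would replace the density term by a survivor term:

$$
\ell_i(\theta) =
\begin{cases}
\log f_z\big(h_\theta(y_i \mid x_i)\big) + \log \big| h'_\theta(y_i \mid x_i) \big| & \text{for an exactly observed } y_i, \\
\log \Big( 1 - F_z\big(h_\theta(c_i \mid x_i)\big) \Big) & \text{for an observation right-censored at } c_i,
\end{cases}
$$

where $F_z$ is the CDF of the base distribution.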
- Training Considerations:
The model requires careful attention to training duration; certain datasets (e.g., Naval) require significantly more training iterations to capture very narrow spikes in the CPD. Nonetheless, the overall architecture performs robustly without extensive hyperparameter tuning.
In summary, the paper presents a technically rigorous and flexible probabilistic regression model that leverages deep learning for estimating complex conditional distributions. The method’s ability to integrate classical transformation model concepts with modern normalizing flows makes it a promising tool for applications where uncertainty quantification is critical.