
Deep transformation models: Tackling complex regression problems with neural network based transformation models (2004.00464v1)

Published 1 Apr 2020 in stat.ML and cs.LG

Abstract: We present a deep transformation model for probabilistic regression. Deep learning is known for outstandingly accurate predictions on complex data, but in regression tasks it is predominantly used to predict just a single number. This ignores the non-deterministic character of most tasks. Especially if crucial decisions are based on the predictions, as in medical applications, it is essential to quantify the prediction uncertainty. The presented deep learning transformation model estimates the whole conditional probability distribution, which is the most thorough way to capture uncertainty about the outcome. We combine ideas from a statistical transformation model (most likely transformation) with recent transformation models from deep learning (normalizing flows) to predict complex outcome distributions. The core of the method is a parameterized transformation function which can be trained with the usual maximum likelihood framework using gradient descent. The method can be combined with existing deep learning architectures. On small machine learning benchmark datasets, we report state-of-the-art performance for most datasets and in some cases even surpass it. Our method works for complex input data, which we demonstrate by employing a CNN architecture on image data.

Citations (25)

Summary

  • The paper introduces deep transformation models, a neural network framework for probabilistic regression that estimates the full conditional probability distribution rather than just a point estimate.
  • The model constructs a flexible transformation by chaining functions, including a Bernstein polynomial MLT, with parameters learned by neural networks to capture complex distribution shapes.
  • Experiments demonstrate the model's ability to effectively capture complex outcomes like multimodality and heteroscedasticity, showing competitive performance on various benchmark datasets.

The paper introduces a neural network based transformation model for probabilistic regression that estimates the full conditional probability distribution (CPD) of a real-valued response rather than merely a point estimate. The approach integrates concepts from classical statistical transformation models—specifically the most likely transformation (MLT)—with ideas from deep learning’s normalizing flows (NF). The resulting framework is capable of capturing complex, non-Gaussian and heteroscedastic outcome distributions.

Model Architecture and Methodology

  • Chained Transformation Functions:

The model constructs a parameterized bijective transformation by composing a sequence of transformation functions. The overall transformation is given by

$$z = h_\theta(y) = f_{3,\alpha(x),\beta(x)} \circ f_{2,\vartheta(x)} \circ \sigma \circ f_{1,a(x),b(x)}(y),$$

where:

  • $f_1$ scales and shifts the input $y$ and is followed by a sigmoid layer $\sigma$ to map the result into the interval $[0, 1]$. Here, $a(x) > 0$ is enforced via a softplus activation.
  • $f_2$ implements the MLT using a Bernstein polynomial basis of order $M$,

$$h^{\operatorname{MLT}}_{\vartheta}(\tilde{y} \mid x) = \sum_{i=0}^{M} \operatorname{Be}_i(\tilde{y}) \, \frac{\vartheta_i(x)}{M+1},$$

    ensuring monotonicity by imposing $\vartheta_0 < \vartheta_1 < \ldots < \vartheta_M$. This flexible representation allows the model to capture multimodality and non-linear shapes in the conditional distributions.
  • $f_3$ applies a second scale-and-shift transformation to align the output with a standard normal distribution. Its parameters, $\alpha(x)$ and $\beta(x)$, are also estimated adaptively from $x$.
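To make the chain concrete, here is a minimal NumPy/SciPy sketch of evaluating $h_\theta$ for fixed parameter values. In the actual model the parameters `a`, `b`, `theta`, `alpha`, and `beta` are produced by neural networks conditioned on $x$; the function names and values below are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import comb, expit  # expit is the logistic sigmoid

def bernstein_basis(y_tilde, M):
    """Bernstein basis of order M on [0, 1]; equals Be_i(y~) / (M + 1)
    when Be_i is the Beta(i+1, M-i+1) density from the MLT formula."""
    i = np.arange(M + 1)
    return comb(M, i) * y_tilde[:, None] ** i * (1.0 - y_tilde[:, None]) ** (M - i)

def h_theta(y, a, b, theta, alpha, beta):
    """Chain: f1 (scale/shift) -> sigmoid -> f2 (Bernstein MLT) -> f3 (scale/shift)."""
    y_tilde = expit(a * y + b)                            # f1 + sigma maps y into (0, 1)
    z = bernstein_basis(y_tilde, len(theta) - 1) @ theta  # f2, monotone for increasing theta
    return alpha * z + beta                               # f3 aligns with a standard normal

# Illustrative parameters: cumulative softplus increments enforce
# theta_0 < theta_1 < ... < theta_M, hence a monotone transformation.
rng = np.random.default_rng(0)
theta = np.cumsum(np.log1p(np.exp(rng.normal(size=11))))  # M = 10
y = np.linspace(-3.0, 3.0, 5)
print(h_theta(y, a=1.0, b=0.0, theta=theta, alpha=1.0, beta=0.0))
```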

  • Parameter Estimation via Neural Networks:

Each set of transformation parameters (i.e., $(a, b)$, $\vartheta_0, \ldots, \vartheta_M$, and $(\alpha, \beta)$) is generated by a neural network whose input is the conditioning variable $x$. This end-to-end architecture allows the model to learn complex dependencies between the input and the shape of the target distribution. The training objective is maximum likelihood estimation, using the standard change-of-variable formula

$$f_y(y \mid x) = f_z(h_\theta(y)) \cdot \left| h'_\theta(y) \right|,$$

where $f_z$ is a simple base density (typically Gaussian) and the Jacobian term $\left| h'_\theta(y) \right|$ ensures proper normalization.
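Training minimizes the negative log-likelihood implied by this formula. The sketch below (NumPy/SciPy; names and the closed-form derivative are illustrative assumptions, reusing the parameterization from the previous snippet) evaluates the per-sample NLL with the analytic derivative of the Bernstein polynomial and a standard-normal base density.

```python
import numpy as np
from scipy.special import comb, expit
from scipy.stats import norm

def nll(y, a, b, theta, alpha, beta):
    """Per-sample NLL: -log f_y(y|x) = -log f_z(h(y)) - log|h'(y)|."""
    M = len(theta) - 1
    y_tilde = expit(a * y + b)
    i = np.arange(M + 1)
    basis = comb(M, i) * y_tilde[:, None] ** i * (1 - y_tilde[:, None]) ** (M - i)
    z = alpha * (basis @ theta) + beta
    # d/dy~ of a Bernstein polynomial: M * sum_j (theta_{j+1} - theta_j) * B_{j,M-1}(y~)
    j = np.arange(M)
    dbasis = M * comb(M - 1, j) * y_tilde[:, None] ** j * (1 - y_tilde[:, None]) ** (M - 1 - j)
    # Chain rule: f3 scale * Bernstein derivative * d(sigmoid(a*y + b))/dy
    dz_dy = alpha * (dbasis @ np.diff(theta)) * a * y_tilde * (1 - y_tilde)
    return -(norm.logpdf(z) + np.log(np.abs(dz_dy)))

# Example with the illustrative parameters from the previous snippet:
rng = np.random.default_rng(0)
theta = np.cumsum(np.log1p(np.exp(rng.normal(size=11))))
print(nll(np.array([0.5]), 1.0, 0.0, theta, 1.0, 0.0))
```

In the paper's setup, this loss is minimized by gradient descent with respect to the weights of the networks that emit the transformation parameters.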

Experimental Evaluation

  • Simulated One-Dimensional Data:
    • A sinusoidal process with heteroscedastic, exponentially distributed noise, where the deep transformation model (DL_MLT with $M = 10$) produces CPDs that adapt to changes in variance and potentially bimodal structure, achieving an NLL of −2.00 compared to −0.85 for a simple linear transformation model (LTM).
    • A challenging bimodal outcome where the spread between modes is $x$-dependent, with DL_MLT effectively capturing this complexity while the LTM fails to adapt sufficiently.
  • UCI Benchmark Datasets:
    • Numerical results demonstrate that DL_MLT is competitive and, in certain cases (e.g., Naval, Wine), it clearly outperforms competing methods.
    • For instance, on the Wine dataset, the model’s capability to predict multimodal CPDs (accounting for the inherently discrete quality of subjective wine ratings) leads to a notable improvement in NLL. Increasing the Bernstein polynomial order from $M = 10$ to $M = 20$ in this case further reduces the test NLL from an already competitive value to $0.40 \pm 0.025$.
  • Age Estimation from Facial Images:

The model is also applied to the UTKFace dataset for age estimation. Here, a convolutional neural network (CNN) extracts features from images, which then drive the transformation model. The method accounts for both the non-negativity of age and the increasing uncertainty in age estimation for older subjects. After training, the model achieves a test NLL of 3.83 and produces CPDs that are narrow for infants but progressively broaden with increasing age, aligning with expected domain-specific uncertainty.
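As a rough illustration of this setup, the PyTorch sketch below shows a small CNN backbone whose output head emits the transformation parameters, with softplus and cumulative-sum constraints securing positivity and monotonicity. The layer sizes and the positivity constraint on $\alpha$ are assumptions for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParamCNN(nn.Module):
    """Maps an RGB image to transformation parameters (a, b, theta_0..theta_M, alpha, beta)."""
    def __init__(self, M=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, (M + 1) + 4)  # theta increments plus a, b, alpha, beta

    def forward(self, img):
        raw = self.head(self.backbone(img))
        # Cumulative softplus increments give theta_0 < ... < theta_M (monotone h).
        theta = torch.cumsum(F.softplus(raw[:, :-4]), dim=1)
        a = F.softplus(raw[:, -4])      # a(x) > 0, as in the paper
        b = raw[:, -3]
        alpha = F.softplus(raw[:, -2])  # keeping the final scale positive (assumption)
        beta = raw[:, -1]
        return a, b, theta, alpha, beta
```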

Discussion and Future Directions

  • Interpretability and Flexibility:

The integration of Bernstein-polynomial-based transformations ensures that the CPDs are smooth and monotonic, thereby avoiding the overfitting tendencies sometimes observed with mixture density networks. The approach is flexible enough to model arbitrary distribution shapes while requiring only light regularization on moderately sized datasets.

  • Extensions:

The paper discusses potential extensions toward handling discrete outcomes, as well as truncated or censored data—an important characteristic in survival analysis and other fields. For example, adapting the likelihood for ordered categories or right-censoring would further generalize the model.

  • Training Considerations:

The model requires careful attention to training duration; certain datasets (e.g., Naval) require significantly more training iterations to capture very narrow spikes in the CPD. Nonetheless, the overall architecture demonstrates robust performance without extensive hyperparameter tuning.

In summary, the paper presents a technically rigorous and flexible probabilistic regression model that leverages deep learning for estimating complex conditional distributions. The method’s ability to integrate classical transformation model concepts with modern normalizing flows makes it a promising tool for applications where uncertainty quantification is critical.