Softmax-Relaxed Assignments Overview
- Softmax-relaxed assignments are continuous relaxations of discrete combinatorial problems, enabling gradient-based learning in models with discrete latent structures.
- They utilize smooth convex surrogates like negative entropy regularization and the Sinkhorn operator to approximate hard selection processes.
- Empirical studies report improved precision, lower-variance gradient estimates, and scalable performance in applications such as neural attention and variational inference.
Softmax-relaxed assignments are continuous relaxations of discrete assignment distributions, in which combinatorial “hard” choices (such as permutation matrices in the assignment problem) are approximated by differentiable surrogates. These surrogates enable gradient-based optimization in models containing discrete latent structures by smoothing the sampling or selection process and supporting reparameterized gradients. Such relaxations are fundamental in scalable structured latent variable models, neural attention mechanisms, and variational inference frameworks.
1. Foundations of Softmax-Relaxed Assignments
Softmax-relaxed assignment distributions are grounded in the stochastic perturb-and-max (or “Gumbel-Max trick”) framework. Given a finite combinatorial set $\mathcal{X} \subset \mathbb{R}^d$ (e.g., all permutation matrices for the assignment problem), the generative model proceeds as follows: draw a random utility $U \in \mathbb{R}^d$ with density $p_\theta$, then set
$$X = \arg\max_{x \in \mathcal{X}} U^\top x.$$
This yields an exact discrete sample. However, the discontinuity of the map $U \mapsto X$ prohibits backpropagation.
The softmax relaxation replaces the hard $\arg\max$ with a smooth convex program. Define $P = \mathrm{conv}(\mathcal{X})$ as the convex hull of $\mathcal{X}$, and $f: P \to \mathbb{R}$ as a strongly convex regularizer (e.g., entrywise negative Shannon entropy). For temperature $t > 0$,
$$X_t = \arg\max_{x \in P} \; U^\top x - t f(x).$$
Because $f$ is strongly convex, the maximizer $X_t$ is unique, continuous, and almost-everywhere differentiable; as $t \to 0^+$, $X_t \to X$ almost surely (Paulus et al., 2020).
In the assignment case ($\mathcal{X}$ the set of $n \times n$ permutation matrices), the Birkhoff polytope serves as $P$, and the relaxation becomes doubly stochastic.
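As a sanity check on the relaxation, consider the simplest instance, chosen here purely for illustration: the discrete set consists of the one-hot vectors in $\mathbb{R}^K$, so its convex hull is the probability simplex, the regularizer is the negative Shannon entropy $f(x) = \sum_i x_i \log x_i$, $U$ is the utility vector, and $t > 0$ is the temperature. A short Lagrangian argument, sketched below with multiplier $\nu$ for the sum constraint, suggests that the relaxed maximizer is an ordinary tempered softmax.

```latex
\begin{aligned}
&\text{maximize } \sum_{i=1}^{K} U_i x_i - t \sum_{i=1}^{K} x_i \log x_i
  \quad \text{subject to } \sum_{i=1}^{K} x_i = 1,\; x_i \ge 0, \\
&\text{stationarity: } U_i - t(\log x_i + 1) - \nu = 0
  \;\Longrightarrow\; x_i = \exp\!\big((U_i - \nu)/t - 1\big), \\
&\text{normalization: } \sum_{i=1}^{K} x_i = 1
  \;\Longrightarrow\; X_{t,i} = \frac{\exp(U_i / t)}{\sum_{j=1}^{K} \exp(U_j / t)} .
\end{aligned}
```

The nonnegativity constraints are inactive because the solution is strictly positive. As $t \to 0^+$ the mass concentrates on $\arg\max_i U_i$, which matches the hard sample whenever that maximizer is unique.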
2. Stochastic Softmax Trick and Sinkhorn Operator
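For assignments, let $\mathcal{X}$ be the set of $n \times n$ permutation matrices, with $P = \mathrm{conv}(\mathcal{X})$ the Birkhoff polytope. Generate an $n \times n$ matrix $G$ of i.i.d. Gumbel noise, $G_{ij} \sim \mathrm{Gumbel}(0, 1)$, and set $U = \theta + G$ (with $\theta \in \mathbb{R}^{n \times n}$ the parameter logits). The “hard” stochastic argmax sample is
$$X = \arg\max_{x \in \mathcal{X}} \sum_{i,j} U_{ij} x_{ij}.$$
The softmax relaxation is
$$X_t = \arg\max_{x \in P} \; \sum_{i,j} U_{ij} x_{ij} - t \sum_{i,j} x_{ij} \log x_{ij},$$
which is equivalent to applying the Sinkhorn operator $S$ to $U/t$: $X_t = S(U/t)$, i.e., iteratively normalizing the exponentiated matrix $\exp(U/t)$ along rows and columns until it becomes doubly stochastic.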
The resulting distribution of the relaxed sample $X_t$ is the marginal of an exponential family over the permutation set $\mathcal{X}$ (Paulus et al., 2020).
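The row-and-column normalization described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the implementation used by Paulus et al.; the function names (`sinkhorn`, `gumbel_sinkhorn`), the fixed iteration budget, and the example logits are assumptions made here. Normalization is carried out in log space for numerical stability.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, n_iters=100):
    """Approximate S(scores): alternately normalize rows and columns of
    exp(scores), working in log space. n_iters is an assumed fixed budget."""
    n = len(scores)
    log_x = [list(row) for row in scores]
    for _ in range(n_iters):
        for i in range(n):  # make each row of exp(log_x) sum to 1
            z = logsumexp(log_x[i])
            log_x[i] = [v - z for v in log_x[i]]
        for j in range(n):  # make each column of exp(log_x) sum to 1
            z = logsumexp([log_x[i][j] for i in range(n)])
            for i in range(n):
                log_x[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_x]


def sample_gumbel(rng):
    """Standard Gumbel(0, 1) draw by inverse-CDF sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_sinkhorn(theta, t, rng, n_iters=100):
    """Relaxed assignment S((theta + G) / t) with i.i.d. Gumbel noise G."""
    n = len(theta)
    scores = [[(theta[i][j] + sample_gumbel(rng)) / t for j in range(n)]
              for i in range(n)]
    return sinkhorn(scores, n_iters)


rng = random.Random(0)
theta = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # illustrative logits
x_t = gumbel_sinkhorn(theta, t=1.0, rng=rng)
row_sums = [sum(row) for row in x_t]
col_sums = [sum(x_t[i][j] for i in range(3)) for j in range(3)]
```

Lowering `t` pushes the output toward a permutation matrix, but in practice it tends to require more iterations before the rows and columns are both close to normalized, so the iteration budget and the temperature are usually tuned together.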
3. Gradient Estimation and Differentiation
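Softmax-relaxed assignments facilitate low-variance, reparameterized gradient estimation for functions of latent discrete structures. To optimize $\mathbb{E}[\mathcal{L}(X)]$ with respect to the parameters $\theta$, the usual surrogate is $\mathbb{E}[\mathcal{L}(X_t)]$, with gradients computed via
$$\nabla_\theta \, \mathbb{E}[\mathcal{L}(X_t)] = \mathbb{E}\!\left[ \left( \frac{\partial X_t}{\partial U} \, \frac{\partial U}{\partial \theta} \right)^{\!\top} \nabla_{X_t} \mathcal{L}(X_t) \right].$$
The Jacobian $\partial X_t / \partial U$ can be calculated by implicit differentiation of Sinkhorn or by finite differences along the direction $\nabla_{X_t} \mathcal{L}(X_t)$ (Paulus et al., 2020). The estimator is unbiased for $\nabla_\theta \mathbb{E}[\mathcal{L}(X_t)]$, the gradient of the relaxed objective (though not, in general, for the gradient of the original discrete objective) and, due to smooth reparameterization, typically exhibits much lower variance than REINFORCE-type estimators.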
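The finite-difference route can be illustrated on the categorical special case, where the relaxed sample is a tempered softmax and its Jacobian is known in closed form. The sketch below is an assumption-laden toy: the squared-error loss, the utility vector `u`, and the step size `eps` are chosen here for illustration, and the trick relies on the Jacobian being symmetric, which holds for the softmax. It compares the analytic gradient with respect to `u` against a central difference of the relaxed sample along the loss gradient.

```python
import math


def softmax(u, t):
    """Tempered softmax: the relaxed sample for one-hot vectors."""
    m = max(u)
    exps = [math.exp((v - m) / t) for v in u]
    z = sum(exps)
    return [e / z for e in exps]


def loss_grad(x, target):
    """Gradient of the toy loss L(x) = sum_i (x_i - target_i)^2 w.r.t. x."""
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


def grad_u_analytic(u, t, target):
    """dL/dU via the closed-form Jacobian dX_i/dU_j = (x_i [i == j] - x_i x_j) / t."""
    x = softmax(u, t)
    g = loss_grad(x, target)
    dot = sum(xi * gi for xi, gi in zip(x, g))
    return [xi * (gi - dot) / t for xi, gi in zip(x, g)]


def grad_u_finite_diff(u, t, target, eps=1e-5):
    """dL/dU approximated by a central difference of the relaxed sample
    along the direction g = dL/dX (valid because the Jacobian is symmetric)."""
    g = loss_grad(softmax(u, t), target)
    plus = softmax([ui + eps * gi for ui, gi in zip(u, g)], t)
    minus = softmax([ui - eps * gi for ui, gi in zip(u, g)], t)
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


u = [1.2, -0.3, 0.5]        # one fixed draw of the perturbed utilities (illustrative)
target = [0.0, 1.0, 0.0]
analytic = grad_u_analytic(u, 0.5, target)
approx = grad_u_finite_diff(u, 0.5, target)
max_gap = max(abs(a - b) for a, b in zip(analytic, approx))
```

The appeal of the finite-difference form is that it needs only two extra evaluations of the relaxation and no explicit Jacobian, which matters when the relaxation is an iterative solver such as Sinkhorn.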
4. Relation to Gumbel-Softmax and Categorical Relaxations
The softmax-relaxed assignment (via Sinkhorn and the Birkhoff polytope) generalizes the well-studied Gumbel-Softmax (Concrete) relaxation used for categorical variables:
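$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{K} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad i = 1, \dots, K,$$
where $\pi_i$ are the class probabilities, $g_i \sim \mathrm{Gumbel}(0, 1)$ are i.i.d. noise draws, and $\tau > 0$ is the temperature (Oh et al., 2022). As $\tau \to 0$, true one-hot categorical sampling is recovered. For assignment problems, the continuous relaxation requires iterated normalization (Sinkhorn), not a single-step softmax; Gumbel-Softmax operates on simplex one-hot vectors, whereas the assignment relaxation acts over the Birkhoff polytope (doubly stochastic matrices).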
Both approaches introduce a temperature controlling the proximity to “hard” selection; both enjoy differentiability and reparameterization. However, the Sinkhorn-based relaxation can exploit combinatorial structure and achieves lower variance than naive independent softmax relaxations on the matrix entries (Paulus et al., 2020).
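The role of the temperature in the categorical relaxation can be seen by reusing one noise draw at two temperatures. The sketch below is illustrative only: the function name `gumbel_softmax`, the class probabilities, and the two temperature values are assumptions made here, and the noise is passed in explicitly so that the hard and relaxed samples share it.

```python
import math
import random


def sample_gumbel(rng):
    """Standard Gumbel(0, 1) draw by inverse-CDF sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(log_pi, tau, noise):
    """Relaxed sample: y_i proportional to exp((log_pi_i + noise_i) / tau)."""
    scaled = [(lp + g) / tau for lp, g in zip(log_pi, noise)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(0)
pi = [0.6, 0.3, 0.1]                      # illustrative class probabilities
log_pi = [math.log(p) for p in pi]
noise = [sample_gumbel(rng) for _ in pi]

# Gumbel-Max: the exact discrete sample is the argmax of the perturbed logits.
hard_index = max(range(len(pi)), key=lambda i: log_pi[i] + noise[i])

soft = gumbel_softmax(log_pi, tau=1.0, noise=noise)    # smooth, far from one-hot
sharp = gumbel_softmax(log_pi, tau=0.05, noise=noise)  # close to one-hot
```

With shared noise, the largest entry of the relaxed sample sits at `hard_index` for every temperature; the temperature only controls how much mass leaks to the other entries.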
5. Analytical Bounds and Variational Objectives
Estimating divergences involving relaxed discrete distributions is analytically challenging: the exact KL divergence between two RelaxedCategorical distributions is intractable. The ReCAB framework provides a closed-form, temperature-aware upper bound on $\mathrm{KL}(q \,\|\, p)$, where $q$ is the relaxed categorical posterior and $p$ the relaxed categorical prior.
The bound is an explicit function of the logits and temperature of the posterior, the logits and temperature of the prior, and the Euler–Mascheroni constant $\gamma$; its full expression is given in Oh et al. (2022).
This bound is deterministic, analytic, and explicitly temperature-aware, avoiding stochastic estimation noise and remaining a valid upper bound for the true relaxed-categorical KL. When integrated into variational objectives (e.g., ReCAB-VAE), it leads to more stable and accurate training compared to either naive categorical approximations or Monte Carlo KL estimates. Empirical evidence shows ReCAB closely matches the KL estimated from large-sample Monte Carlo (Oh et al., 2022).
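For context, the Monte Carlo baseline mentioned above can be sketched as follows. This is not ReCAB itself and does not reproduce the bound; it is a plain sampling estimator of the KL divergence between two relaxed categorical distributions. It assumes the standard Concrete (Gumbel-Softmax) density, which is not stated in this article, and the function names, logits, and sample count are illustrative choices.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sample_gumbel(rng):
    """Standard Gumbel(0, 1) draw by inverse-CDF sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def sample_log_relaxed(logits, tau, rng):
    """Draw log(x) for a relaxed categorical sample x with the given logits."""
    scaled = [(l + sample_gumbel(rng)) / tau for l in logits]
    z = logsumexp(scaled)
    return [s - z for s in scaled]


def log_density(log_x, logits, tau):
    """Log of the assumed Concrete density
    (n-1)! tau^(n-1) prod_k [alpha_k x_k^(-tau-1)] / (sum_i alpha_i x_i^(-tau))^n,
    with alpha = exp(logits), evaluated at x = exp(log_x)."""
    n = len(logits)
    terms = sum(l - (tau + 1.0) * lx for l, lx in zip(logits, log_x))
    norm = logsumexp([l - tau * lx for l, lx in zip(logits, log_x)])
    return math.lgamma(n) + (n - 1) * math.log(tau) + terms - n * norm


def mc_kl(q_logits, q_tau, p_logits, p_tau, n_samples, rng):
    """Monte Carlo estimate of KL(q || p) from samples of q."""
    total = 0.0
    for _ in range(n_samples):
        log_x = sample_log_relaxed(q_logits, q_tau, rng)
        total += (log_density(log_x, q_logits, q_tau)
                  - log_density(log_x, p_logits, p_tau))
    return total / n_samples


rng = random.Random(0)
kl_same = mc_kl([1.0, 0.0, -1.0], 0.5, [1.0, 0.0, -1.0], 0.5, 2000, rng)
kl_diff = mc_kl([1.0, 0.0, -1.0], 0.5, [0.0, 0.0, 0.0], 0.5, 2000, rng)
```

An estimator of this kind is noisy from one run to the next, which is the practical motivation for a deterministic analytic bound inside a training objective.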
6. Empirical Evaluations and Practical Implementation
Empirical studies validate the efficacy of softmax-relaxed assignments in structured latent models. In neural relational inference with latent graphs of 10 nodes, Gumbel-Sinkhorn (Birkhoff-SST) recovers true spanning-tree graphs with ∼99% precision/recall and ELBO improved by ≈200 nats versus independent-edge or REINFORCE baselines. In unsupervised parsing (ListOps), structured SSTs yield higher task accuracy (∼95% vs 89–91%) and edge precision relative to simpler relaxations. For learning-to-explain (L2X), structure-aware SSTs discover more contiguous and precise subsets with slightly lower MSE than bespoke relaxations (Paulus et al., 2020).
The practical routine is:
- Select the combinatorial domain $\mathcal{X}$ (e.g., permutation matrices).
- Sample Gumbel noise $G$ and form $U = \theta + G$, with $\theta$ the parameter logits.
- Compute the relaxed assignment $X_t = S(U/t)$ with the Sinkhorn operator $S$.
- Evaluate the loss $\mathcal{L}(X_t)$ and backpropagate through the Sinkhorn operator (via autodiff or finite differences).
- Either anneal the temperature $t$ or treat it as a hyperparameter; optimize $\theta$ and evaluate with “hard” assignments (Paulus et al., 2020).
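The steps above can be sketched end to end on a toy problem. Everything specific in this example is an assumption made for illustration: the squared-error loss against a known target permutation, the fixed temperature, the learning rate and step count, the finite-difference gradient (used here in place of autodiff to stay dependency-free), and the brute-force hard assignment, which is only sensible for very small matrices.

```python
import itertools
import math
import random


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, n_iters=50):
    """Log-space row/column normalization approximating S(scores)."""
    n = len(scores)
    log_x = [list(row) for row in scores]
    for _ in range(n_iters):
        for i in range(n):
            z = logsumexp(log_x[i])
            log_x[i] = [v - z for v in log_x[i]]
        for j in range(n):
            z = logsumexp([log_x[i][j] for i in range(n)])
            for i in range(n):
                log_x[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_x]


def sample_gumbel(rng):
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def hard_assignment(theta):
    """Brute-force argmax over permutations; only sensible for tiny n."""
    n = len(theta)
    best = max(itertools.permutations(range(n)),
               key=lambda p: sum(theta[i][p[i]] for i in range(n)))
    return [[1.0 if best[i] == j else 0.0 for j in range(n)] for i in range(n)]


def train(target, t=1.0, lr=0.5, steps=300, eps=1e-3, seed=0):
    """Stochastic gradient descent on theta with a fixed temperature t."""
    rng = random.Random(seed)
    n = len(target)
    theta = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        # Perturb the logits with Gumbel noise and relax with Sinkhorn.
        u = [[theta[i][j] + sample_gumbel(rng) for j in range(n)] for i in range(n)]
        x_t = sinkhorn([[v / t for v in row] for row in u])
        # Gradient of the toy loss ||X_t - target||^2 with respect to X_t.
        g = [[2.0 * (x_t[i][j] - target[i][j]) for j in range(n)] for i in range(n)]
        # Central finite difference of the relaxation along g approximates dL/dU.
        plus = sinkhorn([[(u[i][j] + eps * g[i][j]) / t for j in range(n)]
                         for i in range(n)])
        minus = sinkhorn([[(u[i][j] - eps * g[i][j]) / t for j in range(n)]
                          for i in range(n)])
        # Since U = theta + G, dL/dtheta equals dL/dU.
        for i in range(n):
            for j in range(n):
                theta[i][j] -= lr * (plus[i][j] - minus[i][j]) / (2.0 * eps)
    return theta


target = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
theta = train(target)
recovered = hard_assignment(theta)  # evaluate with a hard assignment, no noise
```

For realistic sizes, the brute-force search would be replaced by a linear assignment solver and the finite difference by automatic differentiation through the Sinkhorn iterations; both substitutions leave the structure of the loop unchanged.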
7. Extensions: Sparse and Tunable Alternatives
Beyond the negative-entropy regularized softmax relaxation, a unified projection framework (“sparsegen”) yields controllable sparse relaxations that include Softmax, Sparsemax, Spherical Softmax, and tunable variants as special cases:
$$\rho(z) = \operatorname{sparsegen}(z; g, \lambda) = \arg\min_{p \in \Delta^{K-1}} \; \|p - g(z)\|_2^2 - \lambda \|p\|_2^2,$$
where the component-wise transformation $g$ and the coefficient $\lambda < 1$ are tunable. By appropriately choosing $g$ and $\lambda$, the projection recovers Softmax, Sparsemax, and newly introduced sparsegen-lin and sparsehourglass mappings, which allow explicit control over sparsity and support closed-form solutions and subgradients. These are empirically shown to achieve crisper, more interpretable neural attention and improved sequence-to-sequence metrics, with the average number of encoder positions attended per decoder step decreasing as sparsity is increased, without sacrificing BLEU or ROUGE score (Laha et al., 2018).
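The projection view can be made concrete for the linear case $g(z) = z$. Completing the square in the objective shows that the minimizer is the Euclidean projection of $z / (1 - \lambda)$ onto the simplex, i.e., sparsemax applied to a rescaled input. The sketch below is a minimal illustration under that assumption; the function names and the example scores are chosen here and are not taken from Laha et al.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        # Keep the threshold from the largest k whose k-th value stays in the support.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(v - tau, 0.0) for v in z]


def sparsegen_lin(z, lam):
    """Projection with g(z) = z: sparsemax applied to z / (1 - lam), lam < 1."""
    if lam >= 1.0:
        raise ValueError("lam must be strictly less than 1")
    return sparsemax([v / (1.0 - lam) for v in z])


z = [0.1, 1.1, 0.2]                     # illustrative attention scores
p_sparsemax = sparsegen_lin(z, 0.0)     # lam = 0 recovers sparsemax
p_sparser = sparsegen_lin(z, 0.5)       # larger lam gives a sparser output
p_denser = sparsegen_lin(z, -1.0)       # negative lam gives a denser output
```

On this input the support shrinks from two entries to one as `lam` moves from 0 to 0.5 and grows to all three entries at `lam = -1.0`, which is the kind of explicit sparsity control the framework is designed to expose.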
| Relaxation | Domain | Normalization |
|---|---|---|
| Gumbel-Softmax | Probability simplex | Softmax |
| Softmax-relaxed assignment (SST) | Birkhoff polytope (doubly stochastic) | Sinkhorn (iterative row/column normalization) |
| Sparsegen-lin/hourglass | Probability simplex | Tunable (projection view) |
8. Summary and Outlook
Softmax-relaxed assignments provide a mathematically principled, flexible framework for gradient-based optimization in models involving discrete combinatorial structures. Through the use of smooth convex relaxations, such as those defined by negative entropy or projection-based frameworks, as well as efficient Sinkhorn normalization for permutation-based problems, these methods bridge combinatorial and continuous spaces. Analytical advances (e.g., ReCAB) further enhance the tractability and stability of variational objectives, while empirical results confirm the scalability and performance benefits of structure-aware relaxations in multiple domains (Paulus et al., 2020; Oh et al., 2022; Laha et al., 2018).