Birkhoff Polytope

Updated 1 January 2026

The Birkhoff polytope is the set of all n×n doubly stochastic matrices; its vertices are the permutation matrices, and it serves as a cornerstone of combinatorics and optimization.
It underlies efficient algorithms like the Sinkhorn–Knopp scaling method, enabling precise projections in entropy-regularized optimal transport problems.
Its rich geometric and combinatorial structure supports advances in machine learning, quantum computing, and matrix balancing through scalable, fast algorithms.

The Birkhoff polytope, also known as the polytope of doubly stochastic matrices, is a central object in algebraic combinatorics, matrix analysis, convex geometry, and optimization. It is closely linked to matrix scaling, to optimal transport (OT) and its entropy-regularized variant, and to a class of scaling algorithms, most notably the Sinkhorn–Knopp algorithm.

1. Definition and Fundamental Properties

The Birkhoff polytope $B_n$ is the convex polytope whose points are the $n\times n$ doubly stochastic matrices: $B_n = \left\{ X \in \mathbb{R}^{n \times n}: X_{ij} \ge 0, ~ \sum_{j=1}^n X_{ij} = 1 ~ \forall i, ~ \sum_{i=1}^n X_{ij} = 1 ~ \forall j \right\}.$ Its vertices are precisely the $n\times n$ permutation matrices. The Birkhoff–von Neumann theorem states that every doubly stochastic matrix is a convex combination of permutation matrices, so $B_n$ is their convex hull.

Key properties include:

$\dim B_n = (n-1)^2$
$B_n$ is a convex, compact polytope in $\mathbb{R}^{n^2}$
The extreme points correspond to the $n!$ permutation matrices.
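The properties above can be checked numerically for small $n$. The following is a minimal sketch using only the Python standard library; the helper names (`permutation_matrix`, `is_doubly_stochastic`, `convex_combination`) are hypothetical and introduced here for illustration. It enumerates the $n!$ vertices for $n = 3$ and confirms that their uniform convex combination, the matrix with every entry $1/n$, lies in $B_n$.

```python
from itertools import permutations


def permutation_matrix(perm):
    """Return the n x n 0/1 matrix with a 1 at (i, perm[i]) for each row i."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def is_doubly_stochastic(X, tol=1e-9):
    """Check nonnegativity and unit row/column sums up to a tolerance."""
    n = len(X)
    nonneg = all(X[i][j] >= -tol for i in range(n) for j in range(n))
    rows_ok = all(abs(sum(X[i]) - 1.0) <= tol for i in range(n))
    cols_ok = all(abs(sum(X[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return nonneg and rows_ok and cols_ok


def convex_combination(weights, matrices):
    """Weighted sum of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


n = 3
vertices = [permutation_matrix(p) for p in permutations(range(n))]

# Uniform weights over all n! vertices give the barycenter of B_n,
# the matrix whose entries all equal 1/n.
weights = [1.0 / len(vertices)] * len(vertices)
center = convex_combination(weights, vertices)
```

Any other choice of nonnegative weights summing to 1 would also pass `is_doubly_stochastic`, which is the easy direction of Birkhoff's theorem.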

The Birkhoff polytope $B_n$ is the intersection of the affine space of matrices whose rows and columns all sum to 1 with the nonnegative orthant of $\mathbb{R}^{n \times n}$. It is simple to describe by linear constraints and has a rich combinatorial structure.

In entropic OT and matrix scaling settings, one works with either $B_n$ itself or generalized transportation polytopes (allowing arbitrary prescribed positive row and column sums); $B_n$ is the special case in which every prescribed row and column sum equals 1.

3. Matrix Scaling, Entropic Regularization, and Sinkhorn–Knopp

The most algorithmically significant connection of $B_n$ arises in entropy-regularized optimal transport: $\min_{P \in B_n} \langle P, C \rangle + \varepsilon \sum_{i,j} P_{ij} \log P_{ij},$ where $C$ is a cost matrix and $\varepsilon>0$ is the regularization strength (Cuturi, 2013).
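A brief derivation sketch shows why this problem reduces to matrix scaling. It assumes the minimizer has strictly positive entries, so that only the equality constraints are active; the multipliers $f_i$, $g_j$ and the symbols $K$, $u$, $v$ are notation introduced here for illustration. Setting to zero the derivative with respect to $P_{ij}$ of the Lagrangian $\langle P, C \rangle + \varepsilon \sum_{i,j} P_{ij} \log P_{ij} - \sum_i f_i \left( \sum_j P_{ij} - 1 \right) - \sum_j g_j \left( \sum_i P_{ij} - 1 \right)$ gives

$C_{ij} + \varepsilon (\log P_{ij} + 1) - f_i - g_j = 0 \quad \Longrightarrow \quad P_{ij} = u_i K_{ij} v_j,$

with $K_{ij} = e^{-C_{ij}/\varepsilon}$, $u_i = e^{f_i/\varepsilon - 1/2}$, and $v_j = e^{g_j/\varepsilon - 1/2}$. The minimizer is therefore a diagonal scaling $\mathrm{diag}(u) K \mathrm{diag}(v)$ of the strictly positive kernel $K$, with $u$ and $v$ determined by the row- and column-sum constraints.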

The Sinkhorn–Knopp algorithm provides a practical means for projecting a strictly positive matrix to $B_n$ via diagonal scaling:

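Given a matrix $A > 0$ (all entries strictly positive), alternately rescale its rows and its columns so that they sum to 1. For strictly positive $A$, convergence is geometric, i.e., the error shrinks by a constant factor per iteration (Cuturi, 2013).
The limit is doubly stochastic, i.e., a point in $B_n$, and has the form $D_1 A D_2$ for positive diagonal matrices $D_1$ and $D_2$.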
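The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a tuned implementation: the function name `sinkhorn_knopp`, the stopping rule (maximum row-sum deviation), and the tolerance are assumptions made here, and the input is assumed to be a strictly positive square matrix given as a list of lists.

```python
def sinkhorn_knopp(A, tol=1e-10, max_iter=10_000):
    """Alternately normalize rows and columns of a strictly positive matrix.

    Returns (P, iterations). After each column step the column sums are 1
    up to rounding, so the stopping test only inspects the row sums.
    """
    n = len(A)
    P = [row[:] for row in A]
    for it in range(1, max_iter + 1):
        # Row step: divide each row by its sum.
        for i in range(n):
            s = sum(P[i])
            P[i] = [x / s for x in P[i]]
        # Column step: divide each column by its sum.
        col = [sum(P[i][j] for i in range(n)) for j in range(n)]
        P = [[P[i][j] / col[j] for j in range(n)] for i in range(n)]
        # Only the row sums can now deviate from 1.
        err = max(abs(sum(P[i]) - 1.0) for i in range(n))
        if err <= tol:
            return P, it
    return P, max_iter


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 10.0]]
P, iters = sinkhorn_knopp(A)
```

Each pass touches every entry a constant number of times, which matches the $O(n^2)$ per-iteration cost of the method.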

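When $\varepsilon \to 0$, the solution approaches an optimal solution of the unregularized assignment problem; if the optimal assignment is unique, this limit is the corresponding vertex (permutation matrix). For $\varepsilon > 0$ the minimizer is unique and lies in the relative interior of $B_n$, i.e., all of its entries are strictly positive.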
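The dependence on $\varepsilon$ can be observed on a small example. The sketch below assumes the standard fact that the entropic minimizer is the doubly stochastic scaling of the kernel $K_{ij} = e^{-C_{ij}/\varepsilon}$; the cost matrix and the helper names `sinkhorn_knopp` and `entropic_assignment` are hypothetical choices made for illustration. The cost matrix is chosen so that the identity assignment is the unique optimum.

```python
import math


def sinkhorn_knopp(A, tol=1e-12, max_iter=10_000):
    """Scale a strictly positive square matrix to (near) doubly stochastic."""
    n = len(A)
    P = [row[:] for row in A]
    for _ in range(max_iter):
        for i in range(n):                      # row normalization
            s = sum(P[i])
            P[i] = [x / s for x in P[i]]
        col = [sum(P[i][j] for i in range(n)) for j in range(n)]
        P = [[P[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns
        if max(abs(sum(P[i]) - 1.0) for i in range(n)) <= tol:
            break
    return P


def entropic_assignment(C, eps):
    """Scale the Gibbs kernel K = exp(-C / eps) onto the Birkhoff polytope."""
    K = [[math.exp(-c / eps) for c in row] for row in C]
    return sinkhorn_knopp(K)


# Identity assignment costs 3; every other permutation costs at least 5.
C = [[1.0, 2.0, 3.0],
     [2.0, 1.0, 3.0],
     [3.0, 3.0, 1.0]]

P_smooth = entropic_assignment(C, eps=1.0)   # diffuse plan, all entries positive
P_sharp = entropic_assignment(C, eps=0.05)   # close to the identity permutation
```

For much smaller values of `eps` the kernel entries underflow in floating point; practical solvers typically work in the log domain, which this sketch does not attempt.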

4. Geometric and Optimization-Theoretic Interpretation

From the perspective of convex geometry:

$B_n$ is the feasible region for matrix balancing and the constraint polytope for entropy-regularized assignment problems.
The relative-entropy (KL) projection of a positive matrix onto $B_n$ can be computed by alternating Bregman projections onto the row-sum and column-sum constraint sets; these projections are exactly the row- and column-scaling steps of Sinkhorn–Knopp (Cuturi, 2013).
Up to the normalization factor $1/n$, the Birkhoff polytope is the set of couplings between two uniform distributions on $n$ points, so its structure governs the space of feasible transport plans in that setting.

Optimal transport solvers compute projections onto $B_n$ or its generalizations, and the geometry of $B_n$ underlies the behavior and guarantees of such methods (Cuturi, 2013).
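As a sketch of why the scaling steps are Bregman projections, take a strictly positive matrix $Q$ (notation introduced here for illustration) and the generalized KL divergence $\mathrm{KL}(P \| Q) = \sum_{i,j} \left( P_{ij} \log(P_{ij}/Q_{ij}) - P_{ij} + Q_{ij} \right)$. Minimizing over matrices with unit row sums, with one multiplier $\lambda_i$ per row, gives the stationarity condition

$\log(P_{ij}/Q_{ij}) - \lambda_i = 0 \quad \Longrightarrow \quad P_{ij} = Q_{ij} e^{\lambda_i} = \frac{Q_{ij}}{\sum_{k} Q_{ik}},$

where the last equality enforces $\sum_j P_{ij} = 1$. The KL projection onto the row-sum constraint is thus row normalization, and the same argument with one multiplier per column gives column normalization; Sinkhorn–Knopp alternates the two.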

5. Algorithmic and Computational Complexity Connections

Matrix scaling to $B_n$ (or to transportation polytopes) is a core routine in several domains, and its computational cost is well characterized:

Each Sinkhorn–Knopp iteration costs $O(n^2)$ operations.
For dense input matrices (uniform density above $1/2$), the overall complexity to reach a point in $B_n$ that is accurate in $\ell_1$ or KL divergence is $O(n^2 \log(n/\varepsilon))$, which is reported to be information-theoretically optimal (He, 13 Jul 2025). Here $\varepsilon$ denotes the target accuracy, not the regularization strength.
Sparsity and zero patterns in $A$ can move a problem outside the class for which $B_n$ is computationally easily accessed.
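One way a zero pattern can hurt is illustrated below, as a sketch with hypothetical names and a loose tolerance chosen for illustration. The $2 \times 2$ matrix `sparse` has a positive entry at position $(0, 1)$ that lies on no positive diagonal, so that entry must vanish in the limit. Working the recursion by hand, it equals $1/(2k+1)$ after $k$ iterations, so convergence is sublinear, whereas the strictly positive matrix `dense` converges in a handful of iterations.

```python
def sinkhorn_iterations(A, tol=1e-3, max_iter=100_000):
    """Run row/column normalization; return (P, number of iterations used)."""
    n = len(A)
    P = [row[:] for row in A]
    for it in range(1, max_iter + 1):
        # Row step.
        for i in range(n):
            s = sum(P[i])
            P[i] = [x / s for x in P[i]]
        # Column step.
        col = [sum(P[i][j] for i in range(n)) for j in range(n)]
        P = [[P[i][j] / col[j] for j in range(n)] for i in range(n)]
        # Stop once every row sum is within tol of 1.
        if max(abs(sum(row) - 1.0) for row in P) <= tol:
            return P, it
    return P, max_iter


dense = [[1.0, 1.0], [0.5, 1.0]]    # strictly positive
sparse = [[1.0, 1.0], [0.0, 1.0]]   # entry (0, 1) lies on no positive diagonal

P_dense, it_dense = sinkhorn_iterations(dense)
P_sparse, it_sparse = sinkhorn_iterations(sparse)
```

Comparing `it_dense` with `it_sparse` shows the gap: the sparse case needs hundreds of iterations even for this modest tolerance, and the count grows in proportion to `1 / tol`.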

Vectorized, GPU-parallel, and large-scale implementations are practical because each Sinkhorn update reduces to row and column normalizations, which can be expressed as matrix-vector operations (Cuturi, 2013).

6. Applications Across Fields

The Birkhoff polytope underpins:

Entropy-regularized assignment and matching problems.
Preconditioning and balancing of matrices for solving linear systems.
Kernel normalization in machine learning (balancing Gram matrices, e.g., for SMILES string analysis (Ali et al., 2024)).
Structured kernel methods, as balancing to $B_n$ gives every row and column the same total weight and prevents individual entries from dominating similarity measures.
Quantum information and representation theory (unitary variants of the Birkhoff polytope).

In OT, the Birkhoff–von Neumann theorem (decomposition into permutations) is exploited in the design and certification of assignment algorithms.
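The decomposition itself can be sketched with a greedy procedure: repeatedly find a permutation supported on the positive entries of the residual, subtract as much of it as possible, and continue until nothing is left. The version below is an illustration under stated assumptions, not a production algorithm. The function name `birkhoff_decomposition` is hypothetical, entries are exact `Fraction` values to avoid tolerance issues, and the permutation search is brute force, so it is only suitable for small $n$ (a bipartite matching routine would replace it in practice).

```python
from fractions import Fraction
from itertools import permutations


def birkhoff_decomposition(X):
    """Greedy Birkhoff-von Neumann decomposition by brute-force search.

    X is an n x n doubly stochastic matrix with exact (Fraction) entries.
    Returns a list of (weight, perm) pairs, where perm[i] is the column
    holding the 1 in row i of the corresponding permutation matrix.
    """
    n = len(X)
    # The residual stays a nonnegative multiple of a doubly stochastic matrix.
    R = [row[:] for row in X]
    terms = []
    while any(R[i][j] > 0 for i in range(n) for j in range(n)):
        # Birkhoff's theorem guarantees a permutation inside the support of R.
        perm = next(p for p in permutations(range(n))
                    if all(R[i][p[i]] > 0 for i in range(n)))
        theta = min(R[i][perm[i]] for i in range(n))
        for i in range(n):
            R[i][perm[i]] -= theta  # zeroes at least one entry per pass
        terms.append((theta, perm))
    return terms


F = Fraction
X = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 4), F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 2)]]
terms = birkhoff_decomposition(X)

# Rebuild X from the weighted permutations as a consistency check.
rebuilt = [[sum(w for w, p in terms if p[i] == j) for j in range(3)]
           for i in range(3)]
```

Because each pass zeroes at least one entry of the residual, the loop terminates, and the weights returned sum to 1.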

7. Advanced Topics and Recent Developments

Recent research explores several directions:

Improved complexity and phase transition results depending on matrix density and error norm (He, 13 Jul 2025).
Extensions to constrained transportation polytopes, introducing zeros into the support, leading to faces or lower-dimensional analogues of $B_n$ (Corless et al., 2024).
Differentiation through Sinkhorn layers (i.e., projections onto $B_n$) in deep learning, leveraging the analytic structure for efficient backpropagation (Eisenberger et al., 2022).
Connections to stochastic mirror descent and convex duality, where projection onto $B_n$ is viewed as Bregman (KL) projection, and the full iteration corresponds to alternating minimization in composite entropy formulations (Mishchenko, 2019).
Generalization to the “unitary” Birkhoff polytope (scaling unitary matrices to have prescribed line sums) for applications in quantum circuit decomposition (Vos et al., 2014).

Summary Table: Core Structural Facts

Feature	Description	Reference / Context
Definition	$B_n=\{\text{doubly stochastic } n\times n \text{ matrices}\}$	Birkhoff’s theorem
Vertices	Permutation matrices ($n!$ total)	Convex hull characterization
Dimensionality	$(n-1)^2$	Polytope geometry
Algorithmic projection	Sinkhorn–Knopp scaling (alternating row/col normalization)	(Cuturi, 2013)
Role in OT	Feasible set for assignment and OT; support of entropy-regularized plans	(Cuturi, 2013)
Complexity	$O(n^2)$ per iteration; $O(n^2\log(n/\varepsilon))$ total (dense case)	(Cuturi, 2013, He, 13 Jul 2025)
Applications	Optimal transport, kernel normalization, preconditioning, assignments	(Cuturi, 2013, Ali et al., 2024)

The Birkhoff polytope forms the mathematical, algorithmic, and geometric core of a wide spectrum of problems in computational mathematics, machine learning, combinatorial optimization, and theoretical computer science, providing both a canonical feasible set and an anchor for fast approximation and regularization methods (Cuturi, 2013, He, 13 Jul 2025, Ali et al., 2024, Vos et al., 2014, Eisenberger et al., 2022, Mishchenko, 2019, Corless et al., 2024).