Incremental Delay Update (IDU) in DQMC
- IDU is a computational optimization technique that postpones the application of accepted local Hubbard–Stratonovich updates to the Green's function and applies them in large blocks via efficient BLAS-3 operations.
- It replaces sequential SMW updates with block updates while maintaining the formal $O(\beta N^3)$ complexity of a finite-temperature sweep, significantly improving cache and vectorization performance.
- Empirical benchmarks show up to 10× update speedup in finite-T simulations, making IDU crucial for scalable DQMC studies of correlated electron systems.
Incremental Delay Update (IDU), sometimes referred to simply as “delay update,” is a computational optimization scheme for determinant quantum Monte Carlo (DQMC) simulations of strongly correlated electron systems. IDU restructures the application of local Hubbard-Stratonovich field updates so that their cumulative effect is applied infrequently and in large blocks, rather than through immediate single-site updates, leveraging efficient matrix–matrix operations to achieve substantial speedups while preserving the underlying complexity of DQMC simulations (Sun et al., 2023).
1. Conceptual Framework
The core principle of Incremental Delay Update is the postponement of explicit modifications to the equal-time Green's function matrix during Monte Carlo sweeps. In conventional DQMC, accepted local moves invoke the Sherman–Morrison–Woodbury (SMW) identity at each step, resulting in repeated application of small outer products (rank-$k$ updates, where $k$ is the number of sites coupled to a single auxiliary field). Although each such update is individually efficient, when performed in rapid succession they fail to exploit modern cache and vectorization capabilities, as they are limited by BLAS-1 or BLAS-2 performance.
IDU accumulates accepted local updates over a delay window of length $n_d$ and defers their application to the Green's function $G$ until the buffer fills. During each step, only the determinant ratio (a low-cost operation on a $k \times k$ block of the virtually updated Green's function) is computed for accept/reject decisions, and the relevant update matrices $u^{(j)}$, $s^{(j)}$, $v^{(j)}$ are stored. Upon reaching $n_d$ accepted updates, all changes are applied in a single block via an efficient matrix–matrix multiplication, making use of BLAS-3 routines, which deliver substantially higher computational throughput on modern hardware (Sun et al., 2023).
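The bookkeeping can be organized as a small buffer object that collects the per-move factors and applies them all at once. The following minimal sketch (NumPy; the class and method names `DelayBuffer`, `push`, and `flush` are illustrative choices, not notation from Sun et al., 2023) shows the accumulate-then-flush structure:

```python
import numpy as np

class DelayBuffer:
    """Collects accepted local updates and applies them to G in one block.

    Each accepted move j contributes an (N x k) factor u_j, a (k x k) factor s_j,
    and a (k x N) factor v_j; flush() applies G <- G - U S V via BLAS-3 products.
    """

    def __init__(self, N, k, n_delay):
        self.N, self.k, self.n_delay = N, k, n_delay
        self.reset()

    def reset(self):
        self.us, self.ss, self.vs = [], [], []

    def push(self, u, s, v):
        self.us.append(u)
        self.ss.append(s)
        self.vs.append(v)

    @property
    def full(self):
        return len(self.us) >= self.n_delay

    def flush(self, G):
        """Apply all buffered updates with two matrix-matrix multiplications."""
        if not self.us:
            return G
        U = np.hstack(self.us)                    # N x (k * n)
        V = np.vstack(self.vs)                    # (k * n) x N
        S = np.zeros((U.shape[1], U.shape[1]))    # block-diagonal coupling matrix
        for j, s in enumerate(self.ss):
            S[j * self.k:(j + 1) * self.k, j * self.k:(j + 1) * self.k] = s
        G -= U @ (S @ V)
        self.reset()
        return G
```

Because $S$ is block-diagonal, production codes typically fold $s^{(j)}$ into $u^{(j)}$ or $v^{(j)}$ at push time, reducing the flush to a single GEMM.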
2. Mathematical Formalism and Update Accumulation
For a local update at time slice $\ell$ acting on a set $\mathcal{I}$ of $k$ sites, the change is characterized by a diagonal matrix
$$
\Delta = \operatorname{diag}(\delta_1, \ldots, \delta_N),
$$
where $N$ is the system size and only the $k$ entries with $i \in \mathcal{I}$ are nonzero. The Metropolis acceptance ratio is computed as
$$
R = \det\!\left[\mathbb{1}_k + \Delta_{\mathcal{I}}\left(\mathbb{1}_k - G_{\mathcal{I}\mathcal{I}}\right)\right],
$$
with the corresponding SMW update of the Green's function
$$
G \;\to\; G - u\,s\,v, \qquad u = G_{:,\mathcal{I}}, \quad v = \left(\mathbb{1} - G\right)_{\mathcal{I},:}, \quad s = \Delta_{\mathcal{I}}\left[\mathbb{1}_k + \left(\mathbb{1} - G\right)_{\mathcal{I}\mathcal{I}}\,\Delta_{\mathcal{I}}\right]^{-1},
$$
where $\Delta_{\mathcal{I}}$ is the $k \times k$ diagonal matrix of update amplitudes and $G_{\mathcal{I}\mathcal{I}}$ is the corresponding $k \times k$ block of the equal-time Green's function.
After $n_d$ accepted moves, the net change to the Green's function is
$$
G \;\to\; G - U\,S\,V,
$$
where the $N \times k n_d$ matrix $U$ and the $k n_d \times N$ matrix $V$ stack the collected $u^{(j)}$ and $v^{(j)}$ matrices over all delay steps (each built from the virtually updated Green's function at the moment of acceptance), and $S$ is block-diagonal in the individual $k \times k$ matrices $s^{(j)}$. This reproduces the sequential effect of the SMW updates in one operation, at the cost of a single large matrix–matrix multiply (Sun et al., 2023).
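The equivalence between the sequential and the blocked form can be checked numerically. The sketch below (NumPy; a random matrix stands in for a physical Green's function, and the helper `smw_factors` is our own naming) applies the same set of accepted moves once sequentially and once through a single block flush:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, n_d = 64, 2, 8
G0 = rng.standard_normal((N, N)) / np.sqrt(N)   # stand-in for an equal-time Green's function

def smw_factors(G, idx, delta):
    """Factors (u, s, v) of one accepted k-site update with diagonal amplitudes delta."""
    u = G[:, idx]                                 # N x k
    v = np.eye(N)[idx, :] - G[idx, :]             # k x N, rows of (1 - G)
    s = np.diag(delta) @ np.linalg.inv(np.eye(k) + v[:, idx] @ np.diag(delta))
    return u, s, v

moves = [(rng.choice(N, size=k, replace=False), rng.uniform(-0.5, 0.5, size=k))
         for _ in range(n_d)]

# Path 1: conventional fast update, applying each SMW update immediately.
G_seq = G0.copy()
for idx, delta in moves:
    u, s, v = smw_factors(G_seq, idx, delta)
    G_seq = G_seq - u @ s @ v

# Path 2: delay update, accumulating factors and flushing once.
us, ss, vs = [], [], []
for idx, delta in moves:
    G_virtual = G0.copy()                         # demo only: real codes reconstruct
    for u, s, v in zip(us, ss, vs):               # just the rows/columns they need
        G_virtual -= u @ s @ v
    u, s, v = smw_factors(G_virtual, idx, delta)
    us.append(u); ss.append(s); vs.append(v)
U, V = np.hstack(us), np.vstack(vs)
S = np.zeros((k * n_d, k * n_d))
for j, s in enumerate(ss):
    S[j * k:(j + 1) * k, j * k:(j + 1) * k] = s
G_del = G0 - U @ S @ V

print("max deviation:", np.max(np.abs(G_seq - G_del)))   # agrees to machine precision
```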
3. Computational Complexity and Scaling
The algorithmic scaling of IDU remains formally identical to the conventional fast update scheme, $O(\beta N^3)$ per sweep for finite-temperature DQMC, with $\beta$ the inverse temperature. For each accepted proposal, calculating the determinant ratio and extracting $u^{(j)}$, $s^{(j)}$, $v^{(j)}$ from the partially filled buffer costs at most $O(k^2 N n_d)$; the flush step (block update) after $n_d$ accepted moves costs $O(k N^2 n_d)$.
IDU’s practical performance derives from the blockwise conversion of sequential outer-product updates (BLAS-1/2) into a handful of BLAS-3 matrix–matrix operations, which achieve several-fold ($5\times$ or more) higher throughput due to better cache locality and vectorization (Sun et al., 2023). The method’s memory footprint increases by the need to store the buffered $u^{(j)}$, $s^{(j)}$, $v^{(j)}$ (a total cost of $O(k n_d N)$ additional storage), but this is negligible for onsite ($k=1$) updates and remains manageable for $k>1$.
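The BLAS-level effect is easy to reproduce in isolation. The toy comparison below (NumPy; sizes are arbitrary, and the measured ratio depends on the BLAS library and hardware, so it isolates the memory-traffic argument rather than reproducing end-to-end DQMC speedups) performs the same arithmetic once as $n_d$ rank-1 updates and once as a single blocked product:

```python
import time
import numpy as np

N, n_d = 2048, 64
rng = np.random.default_rng(1)
G = rng.standard_normal((N, N))
U = rng.standard_normal((N, n_d))
V = rng.standard_normal((n_d, N))

# Sequential rank-1 (BLAS-2) updates: each outer product streams all of G through memory.
G1 = G.copy()
t0 = time.perf_counter()
for j in range(n_d):
    G1 -= np.outer(U[:, j], V[j, :])
t_seq = time.perf_counter() - t0

# Blocked (BLAS-3) update: identical floating-point work, one cache-friendly GEMM.
G2 = G.copy()
t0 = time.perf_counter()
G2 -= U @ V
t_blk = time.perf_counter() - t0

print(f"rank-1 loop: {t_seq:.4f} s   blocked GEMM: {t_blk:.4f} s   "
      f"ratio ~{t_seq / t_blk:.1f}x   max diff {np.max(np.abs(G1 - G2)):.1e}")
```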
4. Implementation Methodology
A typical implementation cycle in finite-temperature DQMC involves the following steps:
- Select the IDU buffer size $n_d$, typically a power of two, tuned to optimize cache usage.
- For each time slice $\ell$, iterate over all sites, proposing new auxiliary field values and calculating the associated determinant ratio.
- For each accepted proposal, extract and store $u^{(j)}$, $s^{(j)}$, $v^{(j)}$ for this move.
- When $n_d$ accepted updates accrue, perform the flush: build the block matrices $U$, $S$, $V$ and apply $G \to G - U S V$.
- At the end of the time slice, any remaining updates are flushed.
- Propagate the Green’s function to the next slice using the usual DQMC matrix formalism.
Pseudocode for the update step is directly provided in (Sun et al., 2023). Flushing is always performed at buffer-capacity or time-slice boundaries; determinant ratios and virtual update extractions require only lightweight operations.
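A minimal single-slice sweep with delayed $k = 1$ updates might look as follows (NumPy sketch for one spin sector of a Hubbard-type model; the second spin sector, sign handling, and inter-slice propagation are omitted, and all variable names are our own rather than the notation of Sun et al., 2023):

```python
import numpy as np

def delayed_sweep_slice(G, fields, alpha, n_delay, rng):
    """Sweep one time slice with delayed (k = 1) updates; returns updated G and fields."""
    N = G.shape[0]
    U_buf = np.zeros((N, n_delay))    # columns u_j = G_virtual[:, i_j]
    V_buf = np.zeros((n_delay, N))    # rows    v_j = (1 - G_virtual)[i_j, :]
    s_buf = np.zeros(n_delay)         # scalars s_j = delta_j / R_j
    n = 0                             # current buffer fill

    def flush():
        nonlocal G, n
        if n:
            G = G - U_buf[:, :n] @ (s_buf[:n, None] * V_buf[:n, :])   # BLAS-3 block update
            n = 0

    for i in range(N):
        delta = np.exp(-2.0 * alpha * fields[i]) - 1.0   # amplitude of the proposed flip
        # Virtually updated diagonal element (O(n) dot product with the buffer):
        G_ii = G[i, i] - U_buf[i, :n] @ (s_buf[:n] * V_buf[:n, i])
        R = 1.0 + delta * (1.0 - G_ii)
        if rng.random() < min(1.0, abs(R)):              # one-sector Metropolis step
            fields[i] = -fields[i]
            # Build u, v from the virtually updated Green's function (O(N n) work):
            u = G[:, i] - U_buf[:, :n] @ (s_buf[:n] * V_buf[:n, i])
            v = -G[i, :] + U_buf[i, :n] @ (s_buf[:n, None] * V_buf[:n, :])
            v[i] += 1.0
            U_buf[:, n], V_buf[n, :], s_buf[n] = u, v, delta / R
            n += 1
            if n == n_delay:
                flush()                                  # buffer-capacity flush
    flush()                                              # time-slice-boundary flush
    return G, fields
```

A call such as `G, fields = delayed_sweep_slice(G, fields, alpha, 32, np.random.default_rng())` processes one slice; the buffer size 32 is an arbitrary illustrative choice.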
Numerical stability is maintained by ensuring that "virtually updated" Green's functions are used for determinant calculations, and by periodic recomputation of $G$ via full $B$-matrix propagation and stabilization.
5. Compatibility with DQMC Variants
IDU is agnostic to both finite- and zero-temperature formulations of DQMC:
- Finite-$T$ DQMC: IDU directly replaces the traditional SMW-based update pathway, as described, for both onsite and multi-site (extended-interaction) updates.
- Projector (zero-$T$) DQMC: the method reformulates the computation in terms of the equal-time Green's function, analogous to the finite-$T$ case, with identical accumulation and flush logic. If the number of particles $N_p$ is significantly smaller than $N$, SMW-based approaches may remain preferable, but for large $N_p$ the cache-friendly nature of IDU dominates performance.
IDU is fully compatible with onsite ($k=1$; e.g., Hubbard model) and extended bond ($k=2$; e.g., the spinless $t$-$V$ model) interactions, provided the local update matrix $\Delta$ can be diagonalized in an appropriate basis.
6. Performance Benchmarks
Quantitative results for IDU in simulations of square-lattice models on a single CPU socket demonstrate substantial empirical advantages (Sun et al., 2023):
| Model | $k$ | Fast Update (s) | Delay Update (s) | Speedup |
|---|---|---|---|---|
| Finite-$T$ Hubbard | 1 | 26 (85% of total runtime) | 2.5 | ≈10× update; full-run gain also reported |
| Zero-$T$ Hubbard | 1 | — | — | ≥1.5× update |
| Finite-$T$ spinless $t$-$V$ | 2 | 128 | 20 | ≈6× update |
Speedups are most pronounced for large $N$ and block sizes that fit the cache hierarchy, enabling simulations of considerably larger lattices for the Hubbard and spinless $t$-$V$ models than are attainable with conventional updates. The asymptotic scaling remains $O(\beta N^3)$; the improvement manifests as a substantial constant-factor reduction, typically $6$–$10\times$ for the update phase and $3$–$7\times$ for end-to-end simulation runtime.
7. Practical Considerations and Limitations
The effectiveness of IDU is contingent on the careful choice of the delay length $n_d$, with the optimal value dictated by cache size and system dimensions. An insufficient $n_d$ underuses the BLAS-3 potential, while an excessive $n_d$ risks overrunning the cache, incurring memory penalties, or amplifying numerical instability. The method relies on the numerical stability of virtual updates, with periodic restabilization mandated to control floating-point drift.
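In practice the delay length can be chosen empirically. A simple scan such as the toy harness below (NumPy, $k = 1$; the function `cost_per_move` and all sizes are illustrative assumptions) times the amortized cost per accepted move, exposing the trade-off between GEMM efficiency at small $n_d$ and growing accumulation overhead at large $n_d$:

```python
import time
import numpy as np

def cost_per_move(N, n_d, rng, reps=3):
    """Toy amortized cost of one accepted k = 1 move for a given delay length n_d."""
    G = rng.standard_normal((N, N))
    best = np.inf
    for _ in range(reps):
        U = np.zeros((N, n_d)); V = np.zeros((n_d, N)); s = np.zeros(n_d)
        t0 = time.perf_counter()
        for n in range(n_d):
            i = n                                           # distinct sites, for simplicity
            # accumulation: build u, v from the virtually updated G (O(N n) work)
            u = G[:, i] - U[:, :n] @ (s[:n] * V[:n, i])
            v = -G[i, :] + U[i, :n] @ (s[:n, None] * V[:n, :]); v[i] += 1.0
            U[:, n], V[n, :], s[n] = u, v, 0.01             # fixed dummy amplitude
        G = G - U @ (s[:, None] * V)                        # flush: one GEMM, O(N^2 n_d)
        best = min(best, (time.perf_counter() - t0) / n_d)
    return best

rng = np.random.default_rng(2)
for n_d in (1, 8, 32, 128, 512):
    print(f"n_d = {n_d:4d}: {1e6 * cost_per_move(2048, n_d, rng):9.1f} us per accepted move")
```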
For onsite ($k=1$) interactions, the overhead is minimal. For extended interactions ($k>1$), additional memory is required but remains practical. IDU's structure is simple and transparent to end users, requiring few interface changes and offering universal compatibility with standard DQMC codebases (Sun et al., 2023).