
Incremental Delay Update (IDU) in DQMC

Updated 2 December 2025
  • IDU is a computational optimization technique that postpones local Hubbard-Stratonovich updates and applies them in large blocks via efficient BLAS-3 operations.
  • It replaces sequential SMW updates with block updates while maintaining the formal O(N³) complexity, significantly improving cache and vectorization performance.
  • Empirical benchmarks show up to 10× update speedup in finite-T simulations, making IDU crucial for scalable DQMC studies of correlated electron systems.

Incremental Delay Update (IDU), sometimes referred to simply as “delay update,” is a computational optimization scheme for determinant quantum Monte Carlo (DQMC) simulations of strongly correlated electron systems. IDU restructures the application of local Hubbard–Stratonovich field updates so that their cumulative effect is applied infrequently and in large blocks, rather than through immediate single-site updates, leveraging efficient matrix–matrix operations to achieve substantial speedups while preserving the underlying O(N³) complexity of DQMC simulations (Sun et al., 2023).

1. Conceptual Framework

The core principle of Incremental Delay Update is the postponement of explicit modifications to the equal-time Green’s function matrix G during Monte Carlo sweeps. In conventional DQMC, accepted local moves invoke the Sherman–Morrison–Woodbury (SMW) identity at each step, resulting in repeated application of small outer products (rank-k updates, where k is the number of sites coupled to a field). Although each such update is individually efficient, when performed in rapid succession they fail to exploit modern cache and vectorization capabilities, as they are limited to BLAS-1 or BLAS-2 performance.

IDU accumulates accepted local updates over a delay window of length n_d and defers their application to G until the buffer fills. During each step, only the determinant ratio (a k×k operation) is computed for accept/reject decisions, and the relevant update matrices U^{(i)}, S^{(i)}, V^{(i)} are stored. Upon reaching n_d accepted updates, all changes are applied in a single block via an efficient matrix–matrix multiplication using BLAS-3 routines, which deliver substantially higher computational throughput on modern hardware (Sun et al., 2023).
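As a minimal illustration of why batching pays off (a sketch with made-up values, not from the paper), the following NumPy snippet applies the same n_d rank-1 corrections once sequentially and once as a single blocked product. The results are identical, but the blocked form executes as one BLAS-3 GEMM call instead of n_d BLAS-2 outer-product updates:

```python
import numpy as np

rng = np.random.default_rng(0)
N, nd = 512, 64
G = rng.standard_normal((N, N))
A = rng.standard_normal((N, nd))   # stacked column vectors
B = rng.standard_normal((nd, N))   # stacked row vectors

# sequential rank-1 updates (BLAS-2 "ger"-like pattern)
G1 = G.copy()
for j in range(nd):
    G1 += np.outer(A[:, j], B[j, :])

# one blocked update (a single BLAS-3 "gemm" call)
G2 = G + A @ B

assert np.allclose(G1, G2)
```

Both loops perform the same arithmetic; the blocked form simply exposes it to the BLAS as one large matrix–matrix multiply with far better cache reuse.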

2. Mathematical Formalism and Update Accumulation

For a local update at time slice ℓ and sites x_1, …, x_k, the change is characterized by a diagonal matrix

\Delta = \mathrm{diag}(\ldots, \Delta_{x_1 x_1}, \ldots, \Delta_{x_k x_k}, \ldots)_{N \times N}

where N is the system size and only k entries are nonzero. The Metropolis acceptance ratio is computed as

R = \det[I + \Delta(I - G)] = \det[S]

with

S = I_k + V D

where D is the k×k diagonal matrix of update amplitudes and V_{ij} = \delta_{ij} - G_{x_i x_j}.
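To make the reduction concrete, here is a small NumPy check (with illustrative values, not from the paper) that the k×k determinant det S reproduces the full N×N ratio for a k = 2 update:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
sites = [2, 5]                    # k = 2 update touching sites x1, x2
k = len(sites)
G = rng.standard_normal((N, N)) * 0.1
d = rng.uniform(0.1, 0.5, k)      # diagonal entries of D

# full N x N determinant (what IDU avoids computing)
Delta = np.zeros((N, N))
Delta[sites, sites] = d
R_full = np.linalg.det(np.eye(N) + Delta @ (np.eye(N) - G))

# reduced k x k determinant: S = I_k + V D, V_ij = delta_ij - G_{x_i x_j}
V = np.eye(k) - G[np.ix_(sites, sites)]
S = np.eye(k) + V @ np.diag(d)
R_small = np.linalg.det(S)

assert np.isclose(R_full, R_small)
```

The identity follows from det(I + UV) = det(I + VU), which collapses the N×N determinant onto the k sites touched by the update.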

After n_d accepted moves, the net change to the Green’s function is

G^{(\mathrm{after}\ n_d)} = G^{(\mathrm{before})} + \mathcal{A}\, \mathcal{S}^{-1} \mathcal{B}

where \mathcal{A} and \mathcal{B} stack the collected U^{(i)} and V^{(i)} matrices over all delay steps, and \mathcal{S} is block-diagonal in the individual S^{(i)} matrices. This reproduces the sequential effect of n_d SMW updates in one operation, at the cost of a single large matrix–matrix multiply (Sun et al., 2023).
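As a numerical check of this stacked form (an illustrative sketch with assumed values, not from the paper), the following snippet accumulates three onsite (k = 1) updates, folding each S^{(i)-1} into the stacked columns, and verifies that one block update matches three sequential updates of the exact Green’s function:

```python
import numpy as np

rng = np.random.default_rng(1)
N, nd = 10, 3
sites = [1, 4, 7]                    # one accepted onsite move each
d = rng.uniform(0.1, 0.5, nd)        # update amplitudes Delta_{xx}
G0 = rng.standard_normal((N, N)) * 0.1
I = np.eye(N)

# reference: nd sequential updates, G <- G [I + Delta (I - G)]^{-1}
Gseq = G0.copy()
for x, dx in zip(sites, d):
    Delta = np.zeros((N, N))
    Delta[x, x] = dx
    Gseq = Gseq @ np.linalg.inv(I + Delta @ (I - Gseq))

# delayed form: stack A (columns U^(i) with S^(i)^{-1} folded in) and
# B (rows V^(i)), built from "virtually updated" rows/columns of G
A = np.zeros((N, nd))
B = np.zeros((nd, N))
for m, (x, dx) in enumerate(zip(sites, d)):
    g_col = G0[:, x] + A[:, :m] @ B[:m, x]   # virtual column of G
    g_row = G0[x, :] + A[x, :m] @ B[:m, :]   # virtual row of G
    R = 1.0 + dx * (1.0 - g_col[x])          # S^(m) is the 1x1 matrix [R]
    A[:, m] = dx * g_col / R
    B[m, :] = g_row
    B[m, x] -= 1.0                           # V^(m) = G[x, :] - e_x^T
Gblk = G0 + A @ B                            # one BLAS-3 flush

assert np.allclose(Gseq, Gblk)
```

In exact arithmetic the two paths coincide; the delayed path just replaces n_d rank-1 updates with one GEMM at flush time.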

3. Computational Complexity and Scaling

The algorithmic scaling of IDU remains formally identical to the conventional fast-update scheme, O(βN³) for finite-temperature DQMC, with β the inverse temperature. For each accepted proposal, calculating the determinant ratio and extracting U, V, S costs O(k³ + kN); the flush step (block update) after n_d accepted moves costs O(n_d k N²).

IDU’s practical performance derives from converting n_d sequential outer-product updates (BLAS-1/2) into a handful of BLAS-3 matrix–matrix operations, which achieve 5–10× higher throughput due to better cache locality and vectorization (Sun et al., 2023). The memory footprint increases through the need to store n_d copies of U^{(i)}, V^{(i)}, S^{(i)} (a total cost of O(n_d k N)), but this is negligible for onsite (k = 1) updates and remains manageable for k > 1.

4. Implementation Methodology

A typical implementation cycle in finite-temperature DQMC involves the following steps:

  • Select the IDU buffer size n_d, typically a power of two (e.g., n_d = 64) or n_d ≈ min(64, ⌊N/20⌋), to optimize cache usage.
  • For each time slice ℓ, iterate over all sites, proposing new auxiliary-field values and computing the associated determinant ratio.
  • For each accepted proposal, extract and store U, V, S for that move.
  • When n_d accepted updates have accrued, perform the flush: build the block matrices \mathcal{A}, \mathcal{B} and apply G ← G + \mathcal{A} \mathcal{S}^{-1} \mathcal{B}.
  • At the end of the time slice, flush any remaining buffered updates.
  • Propagate the Green’s function to the next slice using the usual DQMC matrix formalism.
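The cycle above can be sketched as a single-slice sweep for onsite (k = 1) updates. This is an illustrative NumPy sketch, not the paper’s code; `delayed_sweep`, its arguments, and the acceptance rule min(1, |R|) are assumptions for the example:

```python
import numpy as np

def delayed_sweep(G, deltas, nd=8, rng=None):
    """One sweep of onsite (k = 1) proposals with delayed updates.

    G        : (N, N) equal-time Green's function at the current slice
    deltas   : deltas[x] is the proposed Delta_{xx} for site x
    nd       : delay buffer size
    Returns the updated G and the list of accepted sites.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    N = G.shape[0]
    A = np.zeros((N, nd))      # buffered columns (with S^{-1} folded in)
    B = np.zeros((nd, N))      # buffered rows
    m = 0                      # number of buffered accepted updates
    accepted = []

    def flush():
        nonlocal m
        if m:
            G[:, :] += A[:, :m] @ B[:m, :]   # one BLAS-3 (GEMM) flush
            m = 0

    for x in range(N):
        d = deltas[x]
        # "virtually updated" element: G_xx including buffered updates
        g_xx = G[x, x] + A[x, :m] @ B[:m, x]
        R = 1.0 + d * (1.0 - g_xx)           # k = 1 determinant ratio
        if rng.random() < min(1.0, abs(R)):
            g_col = G[:, x] + A[:, :m] @ B[:m, x]
            g_row = G[x, :] + A[x, :m] @ B[:m, :]
            A[:, m] = (d / R) * g_col        # column, S^{-1} = 1/R folded in
            B[m, :] = g_row
            B[m, x] -= 1.0                   # row = virtual G[x, :] - e_x^T
            m += 1
            accepted.append(x)
            if m == nd:
                flush()                      # flush at buffer capacity
    flush()                                  # flush remainder at slice end
    return G, accepted
```

In exact arithmetic the accepted moves reproduce the result of applying the corresponding SMW updates one at a time; only the grouping of the work changes.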

Pseudocode for the update step is provided in (Sun et al., 2023). A flush is always performed when the buffer reaches capacity or at a time-slice boundary; between flushes, determinant ratios and the extraction of the update matrices require only lightweight operations.

Numerical stability is maintained by ensuring that "virtually updated" Green’s-function elements (i.e., including the effect of buffered but not-yet-applied updates) are used for determinant calculations, and by periodic recomputation of G via full B-matrix propagation and stabilization.
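The periodic recomputation can be sketched with the standard QR-based stabilization of the B-matrix product, a common DQMC technique. The function below is an illustrative sketch under that assumption, not the paper’s implementation; it keeps the large and small scales of the product in a separate diagonal factor D so they never mix in the inversion:

```python
import numpy as np

def green_from_scratch(B_list):
    """Recompute G = (I + B_L ... B_1)^{-1} with QR stabilization.

    B_list = [B_1, ..., B_L]. The running product is kept factored as
    Q D T with Q orthogonal, D diagonal (the dangerous scales), and T
    well conditioned, so the final inversion stays numerically safe.
    """
    Q, R = np.linalg.qr(B_list[0])
    D = np.diag(R).copy()              # scales, kept separate
    T = R / D[:, None]                 # well-conditioned remainder
    for B in B_list[1:]:
        M = (B @ Q) * D                # reattach scales, multiply next slice
        Q, R = np.linalg.qr(M)         # re-orthogonalize
        D = np.diag(R).copy()
        T = (R / D[:, None]) @ T
    # (I + Q D T)^{-1} = T^{-1} (Q^T T^{-1} + D)^{-1} Q^T
    Tinv = np.linalg.inv(T)
    return Tinv @ np.linalg.inv(Q.T @ Tinv + np.diag(D)) @ Q.T
```

For well-conditioned slice matrices this agrees with the naive inverse; its value lies in remaining accurate when the product’s singular values span many orders of magnitude, as happens at large β.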

5. Compatibility with DQMC Variants

IDU applies to both finite- and zero-temperature formulations of DQMC:

  • Finite-T DQMC: IDU directly substitutes for the traditional SMW-based update pathway, as described, for both onsite and multi-site (extended-interaction) updates.
  • Projector (zero-T) DQMC: the method reformulates the computation in terms of the G(τ, τ) Green’s function, analogous to the finite-T case, with identical accumulation and flush logic. If the number of particles N_p is significantly smaller than N, SMW-based approaches may remain preferable, but for large N the cache-friendly nature of IDU dominates performance.

IDU is fully compatible with onsite (k = 1; e.g., Hubbard model) and extended bond (k = 2; e.g., spinless t-V) interactions, provided the local update matrix can be diagonalized in an appropriate basis.

6. Performance Benchmarks

Quantitative results for IDU in simulations of square-lattice models on a single CPU socket demonstrate substantial empirical advantages (Sun et al., 2023):

Model                    N     k   Fast update (s)     Delay update (s)   Overall speedup
Finite-T Hubbard         42²   1   26 (85% of total)   2.5                ~10× update, ~7× full
Zero-T Hubbard           42²   1   —                   —                  1.5–2× update
Finite-T spinless t-V    42²   2   128                 20                 7× update

Speedups are most pronounced for large N and for block sizes that fit in cache, enabling simulations to scale to N = 58² for the Hubbard model and N = 54² for the spinless t-V model, compared with the smaller sizes attainable with conventional updates. The asymptotic scaling remains O(βN³); the improvement manifests as a substantial constant-factor reduction, typically 6–10× for the update phase and 3–7× for end-to-end simulation runtime.

7. Practical Considerations and Limitations

The effectiveness of IDU is contingent on careful choice of the delay length n_d, with the optimal value dictated by cache size and system dimensions. Too small an n_d underuses BLAS-3 potential, while too large an n_d risks overrunning the cache, incurring memory penalties, or amplifying numerical instability. The method relies on the numerical stability of virtual updates, with periodic restabilization required to control floating-point drift.

For onsite (k = 1) interactions, the overhead is minimal. For extended interactions (k > 1), additional memory is required but remains practical. IDU’s structure is simple and transparent to end users, requiring few interface changes and offering broad compatibility with standard DQMC codebases (Sun et al., 2023).
