Programmable Probabilistic Computer with 1,000,000 p-bits

Published 24 Jun 2026 in cs.DC and cs.AR | (2606.25313v1)

Abstract: Probabilistic computers built from p-bits have been proposed as hardware accelerators for sampling and optimizing Ising models, but existing systems have been confined to a single chip, capped by its capacity and memory bandwidth. Here we break this limit by networking FPGAs into a single Ising machine far larger than any one device could hold, realizing a programmable probabilistic computer with one million p-bits. The machine performs Gibbs sampling at over a trillion flips per second while keeping every coupling weight in local on-chip memory. During execution, devices exchange nothing but 1-bit boundary states. This architecture exposes a question fundamental to any distributed sampler: how frequently boundary information must be refreshed for a partitioned machine to behave as an unpartitioned one. Using three-dimensional Edwards-Anderson spin glasses, we show that the answer is set by a single timing ratio, eta = f_comm/f_p-bit, of the boundary-exchange frequency to the local p-bit update frequency. Above a topology-dependent threshold, the distributed machine matches a monolithic GPU reference. Below it, residual energy still decays as a power law but with a reduced exponent, turning parallelism into a quantifiable throughput-accuracy tradeoff. A theoretical cluster mean-field model reproduces the same behavior, showing that this tradeoff is a universal property of partitioned stochastic dynamics. These results provide a programmable million-p-bit platform, demonstrated across spin glasses, Max-Cut, and Boolean satisfiability, together with a quantitative design rule for scaling probabilistic computers beyond the single-chip limit.

Authors (13)

Summary

The paper introduces a distributed architecture that leverages 1,000,000 p-bits on multiple FPGAs to achieve high-throughput sampling and combinatorial optimization.
It demonstrates that the solution quality depends solely on the communication-to-p-bit-update ratio, allowing a tunable tradeoff between speed and accuracy.
Empirical benchmarks confirm significant performance gains, including up to 62× faster processing and state-of-the-art results on spin glass, Max-Cut, and 3SAT problems.

Programmable Probabilistic Computing with One Million p-bits: Architecture, Scaling, and Theoretical Principles

Introduction

The work "Programmable Probabilistic Computer with 1,000,000 p-bits" (2606.25313) demonstrates a distributed architecture for scalable stochastic computation by networking field-programmable gate arrays (FPGAs) into a programmable Ising machine with one million probabilistic bits (p-bits). This system represents a significant advance in hardware realizations of probabilistic computers targeting high-throughput sampling and combinatorial optimization tasks, pushing the limits of prior single-chip designs by orders of magnitude in both system size and Gibbs sampling throughput.

Architecture and Design Principles

The principal technical innovation is the construction of a distributed sparse Ising machine (DSIM) in which p-bits and their interactions are partitioned across multiple FPGAs. Each subgraph's coupling weights are stored locally, and the only run-time communication consists of exchanging 1-bit boundary states across device boundaries at controlled rates. The architecture ensures all local computation is performed at maximal on-chip bandwidth without incurring the off-chip memory wall that limits monolithic designs.

A critical architectural insight is the decoupling of processing (local p-bit flips) from communication (boundary state exchanges). The propagation of boundary information is managed independently of the local p-bit update clocks, enabling both precise experimentation and the establishment of a single dimensionless ratio, $\eta = f_{\mathrm{comm}} / f_{\mathrm{p\text{-}bit}}$, as the governing parameter for distributed stochastic sampling. To mitigate the congestion and bandwidth penalties inherent in physical non-all-to-all topologies, hardware-aware partitioning with a Potts cost function is introduced, emphasizing co-location of highly coupled nodes and minimizing long-range cut edges.
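The decoupling described above can be illustrated in software. The following is a minimal sketch, assuming a small dense coupling matrix and a sequential loop standing in for parallel FPGAs; it is not the authors' implementation. The names `partitioned_gibbs`, `part`, and `sweeps_per_exchange` are hypothetical, and `sweeps_per_exchange` acts as a rough inverse of the timing ratio: larger values mean staler boundary states.

```python
import numpy as np

def partitioned_gibbs(J, part, beta, n_sweeps, sweeps_per_exchange, seed=0):
    """Gibbs-sample an Ising model whose spins are split across devices.

    J    : symmetric (N, N) coupling matrix with zero diagonal.
    part : length-N integer array; part[i] is the device that owns spin i.
    Spins on other devices are read from a snapshot that is refreshed only
    every `sweeps_per_exchange` sweeps (illustrative stand-in for low eta).
    """
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    s = rng.choice([-1, 1], size=n)
    snapshot = s.copy()                       # last exchanged states
    for sweep in range(n_sweeps):
        if sweep % sweeps_per_exchange == 0:
            snapshot = s.copy()               # boundary exchange
        for i in rng.permutation(n):
            local = part == part[i]
            # Local neighbours are live; remote neighbours are stale.
            view = np.where(local, s, snapshot)
            field = J[i] @ view
            p_up = 0.5 * (1.0 + np.tanh(beta * field))
            s[i] = 1 if rng.random() < p_up else -1
    return s

def ising_energy(J, s):
    """E = -sum_{i<j} J_ij s_i s_j for a symmetric J."""
    return -0.5 * (s @ J @ s)
```

Only the spin states cross the partition boundary in this sketch; the couplings in `J` are never exchanged, which mirrors the 1-bit boundary traffic described above.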

Timing Ratio as the Fundamental Scaling Law

The empirical analysis reveals that the solution quality of the distributed Gibbs sampler, as measured by residual energy decay in 3D spin glass and other benchmarks, is not an explicit function of either the communication clock ($f_{\mathrm{comm}}$) or the local p-bit update clock ($f_{\mathrm{p\text{-}bit}}$) but solely of their ratio $\eta = f_{\mathrm{comm}} / f_{\mathrm{p\text{-}bit}}$. There exists a topology- and partition-dependent threshold value of $\eta$ above which the distributed machine is statistically indistinguishable from a monolithic GPU-based reference. This scaling collapse is robust: it is corroborated across architectures (DSIM-1 and DSIM-2), system sizes (50K–1M p-bits), and problem topologies (regular lattices, Gset Max-Cut, Pegasus/Zephyr graphs, and large irregular 3SAT instances).

When $\eta$ is reduced below threshold, distributed systems maintain power-law convergence but with a reduced exponent, providing a quantifiable and tunable tradeoff between throughput (sampling speed) and solution quality. Overclocked configurations with intentionally stale boundary states reach easy targets sooner but, for sufficiently hard targets, experience crossovers where more conservative, communication-balanced schedules overtake them.
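One way to quantify the reduced exponent is a straight-line fit in log-log space. The sketch below assumes the residual energy follows $\rho(t) = A\,t^{-\kappa}$ over the fitted range; the symbols $A$ and $\kappa$ and the function name `fit_decay_exponent` are illustrative choices, not notation taken from the paper.

```python
import numpy as np

def fit_decay_exponent(sweeps, residual_energy):
    """Fit rho = A * t**(-kappa) by least squares on log(rho) vs log(t).

    Returns (kappa, A). Assumes all inputs are strictly positive.
    """
    log_t = np.log(np.asarray(sweeps, dtype=float))
    log_rho = np.log(np.asarray(residual_energy, dtype=float))
    slope, intercept = np.polyfit(log_t, log_rho, 1)
    return -slope, np.exp(intercept)
```

Fitting the same instances at several boundary-refresh rates and comparing the returned `kappa` values would show whether a given configuration sits above or below the threshold.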

Theoretical Modeling: Universality of the Tradeoff

To separate hardware artifacts from algorithmic properties, the authors implement parallel cluster mean-field theory (CMFT) simulations that mimic the partitioned Gibbs dynamics of the hardware, but on a GPU and with tunably stale boundary averages. The same functional dependence of the convergence exponent on the refresh interval is recovered, up to a monotonic mapping between the number of Monte Carlo sweeps per boundary exchange and $\eta$. This result establishes the throughput-exponent tradeoff as a generic property of partitioned, parallel stochastic dynamical systems, independent of hardware substrate, random number generation, or arithmetic precision.

Empirical Performance and Benchmark Results

The DSIM-2 system scales to $10^6$ p-bits distributed across 18 high-end FPGAs, performing $\mathcal{O}(10^{12})$ flips/s with all coupling data on-chip. Statistical results align within error bounds with monolithic baselines both for Ising ground state residuals and for practical combinatorial problems, specifically:

On $L^3 = 37^3$ spin glass benchmarks, the DSIM reaches a fixed residual-energy target up to $62\times$ faster under overclocking than under conservative settings.
On the G81 Max-Cut problem (20,000 spins), the DSIM attains the certified-optimal cut value, with solution statistics matching state-of-the-art heuristic CPU solvers.
On planted instances from the D-Wave Pegasus/Zephyr graphs (up to 80,800 p-bits), the DSIM achieves the planted ground state.
The architecture tackles a large 3SAT instance encoded in 250,011 p-bits (via a sparse Ising reduction), satisfying 99.7% of clauses, with scaling comparable to a reference GPU.
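For the Max-Cut result, the standard correspondence between cut value and Ising energy can be sketched as follows. This assumes the energy convention $E = -\sum_{i<j} J_{ij} s_i s_j$ and sets $J_{ij} = -w_{ij}$, so that the cut equals $(W - E)/2$ with $W$ the total edge weight. The function names are hypothetical, and the paper's own encoding may differ in sign or scaling.

```python
import numpy as np

def maxcut_to_ising(n, edges):
    """Build J with J_ij = -w_ij so that low Ising energy means a large cut.

    edges: iterable of (i, j, weight) tuples for an undirected graph.
    """
    J = np.zeros((n, n))
    for i, j, w in edges:
        J[i, j] -= w
        J[j, i] -= w
    return J

def ising_energy(J, s):
    """E = -sum_{i<j} J_ij s_i s_j for a symmetric J with zero diagonal."""
    s = np.asarray(s, dtype=float)
    return -0.5 * (s @ J @ s)

def cut_value(edges, s):
    """Total weight of edges whose endpoints carry opposite spins."""
    return sum(w for i, j, w in edges if s[i] != s[j])
```

With this mapping, any sampler that lowers the Ising energy is raising the cut value by the same amount, up to the factor of two.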

Projected ASIC implementations at the 7 nm node suggest that local update bandwidth (up to 100 MHz, 0.66 mm$^2$, and about 250 mW per 8,442-p-bit partition) is not the limiting factor; the bottleneck is the bandwidth for boundary exchanges, but the required rates are compatible with contemporary die-to-die interconnects such as UCIe and BoW.

Implications and Directions

The results provide a quantitative design rule for scaling stochastic, sampling-based computation: as long as the communication-to-local-update ratio $\eta$ exceeds the architecture-dependent threshold (a function of coloring, congestion, and hop-distance via Eq. (2)), parallelism remains algorithmically efficient. This finding directly informs the architecture of future multi-chip, multi-die probabilistic machines, enabling independent local clocks and aggressive scaling independent of monolithic memory limits or power bottlenecks. The universality of the $\eta$-controlled tradeoff means that designers can predict and select throughput/accuracy regimes in simulation before investing in hardware implementation.

From a computing systems perspective, probabilistic computing architectures such as the DSIM fundamentally differ from deterministic accelerators. Because the underlying stochastic dynamics are robust to boundary staleness—in the manner of asynchronous Hogwild-style Gibbs sampling—the scaling bottleneck shifts from strict synchrony to a soft tradeoff, opening new flexibilities in architectural choices and system scaling. This robustness is leveraged to extend the envelope of tractable problem sizes for sampling-based inference, optimization, and beyond.

The generality of the partition/communication scaling law should have direct relevance for the integration of nanomagnetic p-bits, monolithic 3D integration, and mesh-based interconnect scaling, and is poised to influence the architectural design of next-generation hardware for inference, sampling, and possibly neural emulation.

Conclusion

This work establishes both a scalable architecture and a quantitative theoretical framework for distributed probabilistic computers with up to one million p-bits, overcoming the single-chip limit. The solution quality and throughput are governed by a single ratio $\eta$, universal across hardware and algorithmic parallelization. These insights constitute a concrete basis for future hardware design and system-level scaling of probabilistic machines for large-scale inference and combinatorial optimization.

Explain it Like I'm 14

What is this paper about?

This paper shows how to build a very large “probabilistic computer” made of simple, coin‑flip-like units called p-bits. These p-bits work together to solve tough problems by exploring many possibilities quickly. The authors connected many chips so the whole system behaves like one much bigger machine. They reached one million p-bits and ran them at over a trillion updates per second.

What questions did the researchers ask?

They focused on three simple questions:

If you split a big problem across many chips, how often do those chips need to “talk” to each other so the whole system still works like one big machine?
Can you keep almost all the data local on each chip and send only tiny messages between chips to go faster?
Will this approach solve real, hard problems (like finding low-energy states in spin-glass models, Max-Cut, and SAT) as well as a single big device?

How did they do it?

First, a few ideas in everyday language:

P-bits: Think of each p-bit like a tiny coin that flips between −1 and +1. It’s not random for no reason—its chance to be +1 or −1 depends on its neighbors. This is like a friend who tends to copy the mood of their group, but still sometimes changes their mind.
Ising model: Imagine a huge network of these coins connected by “friend or foe” links (positive or negative weights). The goal is to find a setting of all coins that makes the whole network as “calm” (low energy) as possible.
Gibbs sampling: This is the rule each coin uses to update itself—look at your neighbors, then flip with a probability that fits the situation. Do this over and over and the whole network drifts toward low-energy states.
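The coin-flip picture above corresponds to a simple update rule. The sketch below uses the commonly cited p-bit form $m = \mathrm{sgn}(\tanh(\beta I) - r)$, with $r$ drawn uniformly from $[-1, 1)$, which gives $P(m = +1) = (1 + \tanh(\beta I))/2$. The function name `pbit_update` and its arguments are illustrative, not the paper's.

```python
import numpy as np

def pbit_update(input_current, beta, rng):
    """Return +1 or -1 for a single p-bit.

    input_current : weighted sum of neighbour states (plus any bias).
    beta          : inverse temperature; larger means less random.
    rng           : a numpy.random.Generator supplying the randomness.
    """
    r = rng.uniform(-1.0, 1.0)
    return 1 if np.tanh(beta * input_current) > r else -1
```

With zero input the p-bit behaves like a fair coin; with a strongly positive or negative input it almost always follows its neighbours.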

Here’s the trick to scale up:

Splitting the network: The giant network is cut into pieces (subgraphs). Each piece sits on its own chip (an FPGA).
Keep weights local: All the connection strengths (weights) are stored on the chip that needs them. That way, each chip can update fast without constantly fetching data from off-chip memory.
Tiny boundary messages: Chips only send 1-bit states for the border p-bits that touch other chips—just “my current state is +1 or −1.” They don’t send the big weight tables back and forth.
How often to talk: The key quantity is the timing ratio $\,\eta = f_{\mathrm{comm}} / f_{\mathrm{p\text{-}bit}}\,.$
- $f_{\mathrm{p\text{-}bit}}$ is how fast each chip updates its own p-bits.
- $f_{\mathrm{comm}}$ is how often chips exchange those 1-bit border states.
- Big $\eta$ means border information is fresh; small $\eta$ means border information gets stale.
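To get a feel for how small the boundary messages are, the sketch below counts cut edges and boundary p-bits for one simple layout. It assumes an $L \times L \times L$ cubic lattice with periodic boundaries, cut into slabs along one axis; this layout and the name `slab_partition_stats` are assumptions for illustration and are not necessarily the partitioning used in the paper.

```python
import numpy as np

def slab_partition_stats(L, n_parts):
    """Count cut edges and boundary p-bits for a slab partition.

    The lattice is split along z into n_parts contiguous slabs.
    Returns (cut_edges, boundary_pbits); each boundary p-bit contributes
    one bit per exchange.
    """
    z = np.arange(L)
    owner = z * n_parts // L                     # device owning each z-layer
    differs_up = owner != np.roll(owner, -1)     # layer z vs layer z+1
    differs_down = owner != np.roll(owner, 1)    # layer z vs layer z-1
    cut_planes = int(np.sum(differs_up))
    boundary_layers = int(np.sum(differs_up | differs_down))
    return cut_planes * L * L, boundary_layers * L * L
```

Calling `slab_partition_stats(100, 18)` gives the counts for a hypothetical million-spin lattice on 18 devices under these assumptions; the coupling weights themselves never leave their home device.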

They built two systems:

DSIM‑1: 6 chips with independent clocks. This let them freely adjust how fast chips talk versus how fast they update, to test different $\eta$ values.
DSIM‑2: 18 chips on a commercial platform with a shared clock. This one hit 1,000,000 p-bits and up to 3 trillion flips per second.

They also used a smart way of splitting the network so most “busy” links stay short (nearby chips), which helps communication.

Finally, to double‑check their ideas, they made a math/algorithm model (cluster mean‑field theory) that mimics the same “update locally, exchange boundary info sometimes” behavior. This ran on a GPU and let them compare with the hardware.

What did they find?

Here are the main results, explained simply:

One ratio rules everything: The quality of solutions depends mainly on the single ratio $\,\eta\,$ (how often chips talk divided by how fast they update). If $\,\eta\,$ is big enough, the many‑chip system behaves almost exactly like a single, un-split machine.
If chips talk too rarely (small $\,\eta$ ): The system still improves over time, but more slowly. In math terms, the “residual energy” (how far you are from the best answer) drops like a power law—still going down, just with a smaller slope. That means accuracy improves, but at a slower pace.
Speed vs. accuracy tradeoff: Updating faster (higher $f_{\mathrm{p\text{-}bit}}$ ) but talking less often can reach easy targets quickly (because you do more total flips per second), but for very hard targets the slower, more accurate mode can win in the end.
Big milestone: The 18‑chip system (DSIM‑2) ran one million p-bits at over a trillion flips per second and matched a strong single‑device reference when $\,\eta\,$ was high enough.
Universal behavior: Their theory model (cluster mean‑field) showed the same patterns as the hardware. That means this tradeoff isn’t a hardware quirk—it’s a general rule for this kind of distributed, probabilistic updating.
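The speed-versus-accuracy crossover can be made concrete with a toy calculation. The sketch assumes residual energy falls as $\rho = A \cdot \text{sweeps}^{-\kappa}$ with the same amplitude $A$ in both modes; the function names and any numbers passed to them are hypothetical and are not measurements from the paper.

```python
def time_to_target(target, amplitude, kappa, flips_per_second, n_pbits):
    """Seconds needed to reach `target`, assuming rho = amplitude * sweeps**(-kappa)."""
    sweeps = (amplitude / target) ** (1.0 / kappa)
    return sweeps * n_pbits / flips_per_second

def crossover_ratio(kappa_fast, rate_fast, kappa_slow, rate_slow):
    """Value of amplitude/target at which both modes take equal time.

    'fast' is the overclocked mode (higher flip rate, smaller exponent);
    'slow' is the conservative mode. For targets harder than this ratio,
    the conservative mode finishes first.
    """
    power = 1.0 / kappa_fast - 1.0 / kappa_slow
    return (rate_fast / rate_slow) ** (1.0 / power)
```

Because the exponent sits in the power of the required sweep count, even a tenfold advantage in flips per second is eventually outweighed once the target is hard enough.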

They also demonstrated the approach on different tough problems and graphs (including Max‑Cut, D‑Wave‑style Pegasus and Zephyr topologies, and large 3SAT), always using the same “exchange only 1‑bit borders” idea.

Why does this matter?

This work gives a clear, practical rule for building very large probabilistic computers out of many smaller chips:

Design rule: Set the talk/update ratio $\,\eta\,$ high enough and your many‑chip system will behave like one big chip. If you push updates faster than communication can keep up, you’ll trade accuracy (slower improvement) for speed (more flips per second).
Plan before you build: Because the simple theory model matches the hardware, engineers can estimate the needed $\,\eta\,$ in software before investing in complex hardware.
Scales to future tech: The same idea applies to chiplet systems, 3D‑stacked chips, and ultra‑fast p-bit devices. As long as you keep most data local and send only tiny border messages often enough, you can scale to huge problems.
Bigger problems, predictable behavior: Many important tasks in optimization and probabilistic inference (including parts of AI) fit this style of computing. This paper shows how to scale them beyond a single chip while knowing exactly what tradeoffs you’re making.

In short, the authors built a million‑p‑bit machine, ran it extremely fast, and discovered a simple, powerful rule—controlled by the single ratio $\,\eta\,$ —that tells you how to scale probabilistic computers without losing their problem‑solving quality.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Ground-state certainty: For large EA instances, “putative” ground energies are defined empirically. Assess how this choice biases residual-energy exponents and time-to-target; develop or integrate certification methods (e.g., exact solvers, tight lower bounds) for larger sizes.
Generality of the η rule: The η threshold is shown for 3D EA lattices and a few demos; quantify how the threshold and decay-exponent function depend on graph topology (degree distribution, community structure), weight distributions, and boundary conditions beyond cubic lattices.
Predictive modeling of η: The conservative bound with C_max and N_color matches one mapping; derive tighter, predictive models that link η to partition metrics (cut size distribution, hop latency, buffering) across diverse interconnect topologies and validate across many problem classes.
Partitioning and topology co-design: The Potts-based, distance-aware partitioner is introduced but not comprehensively evaluated. Benchmark against state-of-the-art multi-constraint partitioners across real interconnect graphs (torus, fat-tree, Clos, chiplet meshes) and quantify gains in η, throughput, and solution quality.
Chromatic number effects: The framework assumes small N_color; analyze scenarios where graph coloring grows (irregular, high-degree graphs), and quantify how larger N_color tightens the η requirement and impacts throughput.
Dense or long-range graphs: The approach targets sparse graphs; assess feasibility and η scaling for dense or long-range-coupled problems where cut edges and shadow-weight memory may explode.
Sampling correctness vs optimization: Results focus on residual energy and exponents. Evaluate sampling fidelity (e.g., stationary distribution, autocorrelation, effective sample size, KL divergence) under stale boundaries to determine bias in probabilistic inference workloads.
Algorithmic breadth: Only (block) Gibbs-style updates, simulated annealing, and one replica-based method are tested. Investigate how η interacts with Metropolis/Glauber dynamics, parallel tempering schedules, cluster updates, message passing, or learning loops that mutate weights.
Schedule dependence: Study how η thresholds vary with annealing schedules (nonlinear, adaptive, instance-aware) and temperature regimes (near criticality where correlation lengths grow).
Adaptive communication policies: Explore dynamic η control (e.g., higher f_comm at low temperatures, boundary-activity-aware refresh, error-triggered bursts) to improve the throughput–accuracy tradeoff.
Latency vs bandwidth disentanglement: η conflates refresh frequency and end-to-end latency. Isolate and measure effects of fixed latency, multi-hop delays, and jitter on optimization quality for different network fabrics.
Heterogeneous partitions: Real graphs may be imbalanced in size/boundary load. Quantify per-partition η variability, load-balancing strategies, and global performance when the worst-case C_max dominates.
Error resilience: Overclocking failures first hit boundary paths; characterize robustness to link errors, bit flips, and clock drift. Evaluate lightweight error detection/correction or handshake protocols and their impact on η and throughput.
RNG and numeric precision: LFSR vs Philox and fixed-point vs floating-point cause small but measurable differences. Systematically quantify how RNG quality and numeric formats affect exponents, convergence, and sampling bias across instances and sizes.
Mixing-time scaling: Measure mixing/hitting times and their dependence on η and system size to complement the observed power-law decay of residual energy.
Beyond 1-bit boundaries: Evaluate whether exchanging richer boundary summaries (e.g., short histories, time-stamped states, partial averages) mitigates staleness at modest bandwidth cost, and quantify gains relative to η.
Real-device p-bits: Results use FPGA pseudo-random updates. Validate the η framework with physical stochastic devices (e.g., sMTJs) whose fluctuation statistics, variability, and analog nonidealities may alter effective update rules.
ASIC and chiplet scaling: The ASIC projection assumes UCIe-class links at 6–12 GHz. Experimentally validate end-to-end η, power/thermal limits, latency, and synchronization in multi-chiplet packaging at scale; assess how package- and board-level topology impacts C_max and η.
Energy-to-solution: Power is reported, but not energy-per-target or comparisons with GPUs/annealers across targets. Provide normalized energy-to-solution and cost-performance metrics vs η to guide architectural choices.
Communication/scheduling overheads: Quantify runtime overheads for boundary exchange (serialization, buffering, arbitration) and their impact on effective η and utilization under contention.
Finite-size scaling: Examine how the η threshold and exponent degradation scale with N (beyond 37³ and 100³), and whether the collapse with η persists asymptotically.
Universality claim: CMFT matches hardware on tested cases, but a rigorous theory linking partitioned Gibbs dynamics, stale boundaries, and exponent degradation across arbitrary graphs/dynamics remains open.
Benchmark breadth and reproducibility: Experiments use 10 instances per size; expand instance sets, release full datasets/bitstreams, and include additional real-world graphs (e.g., ML factor graphs, logistics networks) to stress-test the η framework.
Dynamic/learned weights: The design assumes static, duplicated shadow weights. Investigate workloads with online learning or time-varying couplings where cross-partition weight updates would increase communication beyond 1-bit states.
Multi-objective tuning: Develop co-optimization methods that jointly choose partitioning, coloring, schedule, and η to meet target accuracy/latency/energy constraints under given interconnect budgets.

Practical Applications

Overview

This paper presents a programmable, distributed probabilistic computer (p-computer) that scales to 1,000,000 p-bits by networking FPGAs into a single Ising machine. The key innovation is a communication-efficient architecture that keeps all coupling weights in local on-chip memory and exchanges only 1‑bit boundary states across devices. The authors establish a universal design rule: the performance of distributed stochastic sampling is governed by a single timing ratio, η = f_comm / f_p-bit. Above a topology-dependent η threshold, the distributed machine matches a monolithic reference; below it, solution quality degrades smoothly with a quantifiable throughput–accuracy tradeoff. A parallel cluster mean-field theory (CMFT) reproduces this behavior, enabling software prediction of distributed performance before hardware deployment.

Below are practical applications derived from these findings, methods, and innovations.

Immediate Applications

These are deployable now (with existing FPGA/GPU infrastructure and standard partitioning tools), primarily for R&D, prototyping, and pilot use in industry.

Distributed Ising/QUBO sampling accelerator for hard optimization
- Sectors: logistics (vehicle routing heuristics, shift scheduling), telecommunications (graph partitioning, clustering), finance (approximate portfolio optimization), EDA (placement/routing pre-optimization), operations research (Max-Cut, Max‑CSP).
- Tools/products/workflows:
  - FPGA-based DSIM appliances or cloud instances offering “Ising Sampling as a Service.”
  - Compiler toolchains that map QUBO/Ising models to colored sparse graphs, apply topology-aware partitioning (METIS/KaHIP + Potts cost), and deploy on FPGA clusters.
  - Integration into existing pipelines (e.g., feeding high-quality initial solutions to MIP/SAT/Max-Cut solvers or LNS workflows).
- Assumptions/dependencies:
  - Problem must be representable as a sparse Ising/QUBO with acceptable overhead.
  - Sufficient interconnect to meet η above the topology-dependent threshold for the desired accuracy; otherwise use η-aware time-to-target tradeoffs.
  - Quality of solutions depends on annealing schedules and encodings.
High-throughput probabilistic inference and Monte Carlo sampling
- Sectors: machine learning (energy-based models, RBMs/DBMs), computer vision (MRF/CRF inference), scientific computing (Bayesian inference on graphical models).
- Tools/products/workflows:
  - p-computer backends for Gibbs sampling in energy-based models and MRF/CRF-based vision pipelines.
  - Hybrid CPU/GPU/p-computer sampling workflows for faster mixing or burn-in.
- Assumptions/dependencies:
  - Inference tasks admit Ising-compatible formulations and benefit from Gibbs-like samplers.
  - Accuracy sensitivity to η must be profiled; stochastic pipelines can often tolerate staleness.
Research platform for spin glasses and statistical mechanics at scale
- Sectors: academia (physics, CS theory), HPC labs.
- Tools/products/workflows:
  - Million-spin simulations on cubic lattices or other sparse topologies to study scaling laws and phase behavior.
  - Use CMFT to pre-screen partitioning and communication schedules, then validate on DSIM.
- Assumptions/dependencies:
  - Benchmarks and annealing schedules must be carefully controlled to ensure fair comparisons to monolithic baselines.
Cross-validation and benchmarking for quantum and Ising hardware
- Sectors: quantum computing industry, benchmarking bodies.
- Tools/products/workflows:
  - Use DSIM to cross-validate results on Pegasus/Zephyr graphs (D‑Wave-native) and planted instances.
  - Establish shared benchmarks and protocols where η and partition details are reported alongside results.
- Assumptions/dependencies:
  - Consistent encodings, temperature schedules, and RNG choices are needed to compare fairly across platforms.
SAT, Max-Cut, and graph optimization heuristics in existing toolchains
- Sectors: EDA (formal verification assistance), network design, operations research.
- Tools/products/workflows:
  - DSIM used to generate high-quality candidate solutions (e.g., Gset Max-Cut G81 optimum achieved) that seed exact solvers or screening pipelines.
  - CNF-to-Ising transforms for SAT instances, with partial solves to accelerate clause pruning.
- Assumptions/dependencies:
  - Not all SAT/Max‑CSP instances map efficiently; clause-to-Ising encodings may incur overhead.
  - For near-threshold random SAT, high-percentage satisfaction can be achieved, but guarantees require integration with exact verification.
η-based design rule and pre-silicon performance prediction
- Sectors: semiconductor design, chiplet systems, HPC architecture, neuromorphic computing.
- Tools/products/workflows:
  - Use η = f_comm / f_p-bit and the congestion bound to set device clocks and link budgets for multi-die designs.
  - CMFT-based software emulation to predict throughput–accuracy tradeoffs before hardware builds.
- Assumptions/dependencies:
  - Partitioning quality (cut size, hop distances, coloring) and platform topology strongly affect the feasible η; Potts-aware mapping mitigates long-hop congestion.
Topology-aware partitioning and mapping for distributed accelerators
- Sectors: EDA, FPGA design, system integration.
- Tools/products/workflows:
  - Adopt Potts cost augmentation to standard partitioners to co-optimize logical min-cut and physical link distances.
  - Deploy in proFPGA-like platforms and multi-SLR mappings to place high-traffic boundaries on short-hop links.
- Assumptions/dependencies:
  - Requires visibility into physical interconnect graphs and link constraints; benefits are topology-dependent.
Education and training in distributed stochastic computing
- Sectors: academia, training programs.
- Tools/products/workflows:
  - Hands-on labs for distributed Gibbs sampling, Ising model encodings, partitioning, and η tuning.
- Assumptions/dependencies:
  - Access to FPGA/GPU resources and open-source code/data (as promised by the authors) for reproducibility.
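Several of the workflows above rely on mapping an Ising model to a colored sparse graph. The reason coloring matters is that p-bits of the same color share no coupling, so they can be updated in parallel without breaking the Gibbs sampling rule. The following is a minimal sketch using a plain greedy heuristic; the function name `greedy_coloring` and the largest-degree-first ordering are illustrative assumptions, not the toolchain described in the paper.

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color not used by its neighbours.

    adjacency: dict mapping node -> collection of neighbouring nodes
               (undirected graph, no self-loops).
    Returns a dict mapping node -> non-negative integer color.
    Greedy coloring never uses more than (max degree + 1) colors.
    """
    color = {}
    # Visit high-degree nodes first; ties keep the dict's insertion order.
    for node in sorted(adjacency, key=lambda v: -len(adjacency[v])):
        used = {color[nb] for nb in adjacency[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color
```

The number of distinct colors returned corresponds to N_color: fewer colors means more p-bits can flip in each parallel step.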

Long-Term Applications

These require further research, scaling, ASIC integration, or ecosystem development before widespread deployment.

Chiplet-based probabilistic coprocessors for data centers and HPC
- Sectors: cloud, HPC, semiconductor.
- Tools/products/workflows:
- ASIC p-bit tiles (e.g., projected 7 nm at 100 MHz per partition) networked via UCIe/BoW, delivering multi‑trillion flips/s at improved energy efficiency.
- Runtime schedulers that dynamically tune η (via DVFS on compute or adaptive link rate) to meet SLA targets for accuracy vs latency.
- Assumptions/dependencies:
- Die-to-die links meeting multi‑GHz effective boundary-exchange rates; mature chiplet ecosystems.
- Robust SW stack (compilers, profilers, autotuners) and standardized APIs.
Ultra‑low‑power p-bit devices (e.g., sMTJ) for edge optimization and inference
- Sectors: IoT, robotics, mobile.
- Tools/products/workflows:
- Nanosecond, sub‑femtojoule sMTJ-based p-bit arrays integrated with CMOS for on-device scheduling, routing, and probabilistic control under uncertainty.
- η-centric co-design linking device fluctuation rates to inter-die bandwidth in stacked/3D packages.
- Assumptions/dependencies:
- Manufacturing maturity of stochastic MTJs and reliable integration with BEOL processes.
- Packaging/integration to sustain required boundary refresh rates with low latency.
Scalable industrial solvers for SAT/Max‑CSP/graph problems
- Sectors: EDA, verification, cybersecurity, logistics.
- Tools/products/workflows:
- Co-solver architectures where p-computers deliver candidate solutions that are polished or certified by exact solvers.
- Domain‑specific encodings (higher-order terms, penalty shaping) and adaptive schedules combining parallel tempering, cluster moves, and η‑aware orchestration.
- Assumptions/dependencies:
- High-quality encodings that preserve problem structure while maintaining sparsity.
- Verification and certification pipelines to ensure correctness when needed.
Probabilistic AI accelerators for energy-based models and MRF/CRF pipelines
- Sectors: AI/ML, vision, scientific ML.
- Tools/products/workflows:
- Hardware-accelerated negative phase sampling for training EBMs or RBMs at scale.
- Real-time MRF/CRF inference for imaging and segmentation (medical, satellite, industrial inspection).
- Assumptions/dependencies:
- Renewed adoption of energy-based modeling in production stacks.
- Efficient mappings to sparse Ising formulations; tolerance to η-driven tradeoffs.
Real-time resource optimization in networks and grids
- Sectors: telecommunications (6G RAN), power systems (grid reconfiguration, unit commitment approximations), datacenter scheduling.
- Tools/products/workflows:
- DSIM/ASIC backends integrated into controllers to solve approximate combinatorial subproblems (e.g., channel assignment, cell clustering) in near-real-time.
- Control-plane policies that exploit η to trade solution quality for response time during transients.
- Assumptions/dependencies:
- Valid Ising approximations of domain problems and predictable behavior under bounded staleness.
- Determinism requirements may limit direct use; hybrid control schemes likely needed.
3D-integrated probabilistic stacks and wafer-scale systems
- Sectors: advanced packaging, heterogeneous integration.
- Tools/products/workflows:
- Vertical integration of p-bit layers with short inter-die links to push η beyond current limits without sacrificing clock rates.
- Co-design of thermal, signal integrity, and synchronization to maintain boundary freshness.
- Assumptions/dependencies:
- Manufacturing yield and thermal design for dense, stochastic compute fabrics.
- Standards and toolchains for mapping across 3D fabrics.
Standardization, benchmarking, and procurement policies
- Sectors: standards bodies, government labs, industry consortia.
- Tools/products/workflows:
- Define reporting conventions for η, partition statistics (cut size, hop distribution, N_color), and congestion metrics alongside performance.
- Benchmark suites spanning spin glasses, Max‑Cut (e.g., Gset), SAT, Pegasus/Zephyr graphs for cross-platform comparison.
- Assumptions/dependencies:
- Community consensus on metrics and open datasets; vendor participation.
Compiler and runtime ecosystems for constraint-to-Ising mapping at scale
- Sectors: software, EDA, AI.
- Tools/products/workflows:
- High-level languages/DSLs that compile constraints to Ising with automated sparsification, partitioning, and η-aware scheduling.
- Autotuning runtimes that adapt communication frequency, temperature schedules, and cluster moves to user targets (time-to-target, energy-to-target).
- Assumptions/dependencies:
- Mature libraries for robust, domain‑specific encodings and automated validation.
- Integration with mainstream toolchains (e.g., OR-Tools, PyTorch, EDA flows).
Urban planning, mobility, and supply-chain optimization pilots
- Sectors: public sector, transportation, retail.
- Tools/products/workflows:
- Pilot deployments for high-frequency re-optimization (e.g., micro‑routing, hub clustering) where near‑optimal solutions suffice and η can be tuned to latency constraints.
- Assumptions/dependencies:
- Problem formulations that maintain sparsity; data governance and interoperability with existing systems.
- Careful validation to ensure stability and fairness in decisions.
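
The partition statistics named in the standardization item above (cut size, N_color) can be made concrete with a small script. The following is a minimal sketch, not the paper's tooling: it assumes a toy 4x4x4 cubic lattice with open boundaries, a hypothetical slab partitioner standing in for a real tool such as METIS or KaHIP, and a plain greedy coloring. All function names and the report layout are illustrative assumptions.

```python
from itertools import product


def lattice_edges(L):
    """Nearest-neighbour edges of an L x L x L cubic lattice (open boundaries)."""
    edges = []
    for x, y, z in product(range(L), repeat=3):
        if x + 1 < L:
            edges.append(((x, y, z), (x + 1, y, z)))
        if y + 1 < L:
            edges.append(((x, y, z), (x, y + 1, z)))
        if z + 1 < L:
            edges.append(((x, y, z), (x, y, z + 1)))
    return edges


def slab_partition(L, n_parts):
    """Hypothetical partitioner: slice the lattice into n_parts slabs along x."""
    return {(x, y, z): x * n_parts // L
            for x, y, z in product(range(L), repeat=3)}


def cut_size(edges, part):
    """Number of edges whose endpoints live on different devices."""
    return sum(1 for u, v in edges if part[u] != part[v])


def greedy_coloring(nodes, edges):
    """Give each node the smallest color not used by an already-colored neighbour."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for n in nodes:
        used = {color[m] for m in adj[n] if m in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    return color


L, n_parts = 4, 2
edges = lattice_edges(L)
part = slab_partition(L, n_parts)
nodes = list(part)
colors = greedy_coloring(nodes, edges)

# Illustrative report layout (an assumption, not a published convention).
report = {
    "n_spins": len(nodes),
    "n_edges": len(edges),
    "cut_size": cut_size(edges, part),
    "n_color": max(colors.values()) + 1,
}
print(report)
```

For this toy case the report shows 64 spins, 144 edges, a cut of 16 edges (one 4x4 plane between the two slabs), and 2 colors, since the cubic lattice is bipartite. Hop distribution and congestion metrics would need a model of the physical device topology, which this sketch leaves out.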

Notes across both categories:

- Feasibility is highest for sparse graphs or problems that can be sparsified; dense problems may require embedding/multiplexing with additional overhead.
- Achieving “monolithic-equivalent” behavior depends on keeping η above a topology-dependent threshold (influenced by partitioning, coloring, and link budgets).
- The CMFT-based prediction workflow reduces risk by estimating performance before hardware investment, but its accuracy hinges on alignment between algorithmic and hardware partitions/schedules.
- Energy/performance claims versus GPUs/CPUs will depend on mature ASICs and end-to-end system integration; current FPGA results demonstrate functionality and scaling rather than definitive cost/performance superiority.
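
The role of η in these notes can be explored with a toy software model. The sketch below is an illustration under stated assumptions, not the paper's hardware or its CMFT model: a small periodic 2D lattice with random ±1 couplings is split into two column-wise partitions, each spin reads live values inside its own partition and a stale snapshot across the cut, and the snapshot is refreshed every `refresh_every` sweeps. Under the assumption that every p-bit updates once per sweep, this corresponds to η = 1/`refresh_every`. The function name, lattice size, and sweep counts are hypothetical.

```python
import math
import random


def anneal_partitioned(L=8, refresh_every=1, sweeps_per_beta=20, seed=0):
    """Anneal a 2D +/-J lattice split in two; return final energy per spin."""
    rng = random.Random(seed)
    # J_right[r][c] couples (r, c) to (r, c+1); J_down[r][c] couples (r, c) to (r+1, c).
    J_right = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    J_down = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    spins = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    snapshot = [row[:] for row in spins]  # stale view used across the cut

    def owner(c):
        # Two partitions: left half and right half of the columns.
        return 0 if c < L // 2 else 1

    def neighbours(r, c):
        yield r, (c + 1) % L, J_right[r][c]
        yield r, (c - 1) % L, J_right[r][(c - 1) % L]
        yield (r + 1) % L, c, J_down[r][c]
        yield (r - 1) % L, c, J_down[(r - 1) % L][c]

    sweep = 0
    for beta in [0.5 * k for k in range(1, 11)]:  # beta = 0.5, 1.0, ..., 5.0
        for _ in range(sweeps_per_beta):
            if sweep % refresh_every == 0:
                snapshot = [row[:] for row in spins]  # boundary exchange
            for r in range(L):
                for c in range(L):
                    field = 0
                    for nr, nc, J in neighbours(r, c):
                        src = spins if owner(nc) == owner(c) else snapshot
                        field += J * src[nr][nc]
                    # Gibbs conditional for E = -sum_ij J_ij s_i s_j.
                    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
                    spins[r][c] = 1 if rng.random() < p_up else -1
            sweep += 1

    energy = -sum(
        J_right[r][c] * spins[r][c] * spins[r][(c + 1) % L]
        + J_down[r][c] * spins[r][c] * spins[(r + 1) % L][c]
        for r in range(L) for c in range(L)
    )
    return energy / (L * L)


# Same instance and seed, different boundary-refresh intervals.
results = {k: anneal_partitioned(refresh_every=k) for k in (1, 10, 50)}
for k, e in results.items():
    print(f"refresh_every={k:3d}  eta~{1.0 / k:.2f}  energy/spin={e:.3f}")
```

Even with `refresh_every=1` the cross-cut reads can lag by up to one sweep, so this is not an exact monolithic baseline. A single small instance is also too noisy to show the power-law behavior reported in the paper; averaging over many instances and seeds would be needed for any quantitative comparison.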


Glossary

Adaptive parallel tempering: A replica-based Monte Carlo technique that runs multiple replicas at different temperatures and swaps them to enhance exploration. "Running adaptive parallel tempering with isoenergetic cluster moves~\cite{chowdhury2025pushing} on DSIM-1"
Annealing schedule: The prescribed sequence of inverse temperatures used during simulated annealing. "with identical partitioning, instances, and annealing schedule (Supplementary Sec.~\ref{sec:CMFT})."
ASAP7 PDK: A predictive 7 nm process design kit used for ASIC modeling and projections. "A representative partition implemented in a 7~nm predictive process (ASAP7 PDK~\cite{clark2016asap7}) closes timing at 100~MHz"
BoW: Bunch of Wires; a die-to-die interconnect standard for high-bandwidth chiplet links. "such as UCIe~\cite{ucie_spec} and BoW~\cite{ardalan2020bunch}"
C_max (worst-case congestion metric): A metric capturing the maximum communication congestion due to boundary sizes, hop distances, and pins, used to bound feasible update clocks. "a worst-case congestion metric $C_{\max}$ combined with the coloring schedule bounds the feasible local update clock"
Cluster mean-field theory (CMFT): An approximation that runs exact local dynamics within clusters while exchanging mean-field boundary information at intervals. "The same behavior emerges from cluster mean-field theory (CMFT)~\cite{oguchi1951statistics,bethe1935statistical,pelizzola2005cluster,yamamoto2009ccmf,xing2012gmf}"
Distributed sparse Ising machine (DSIM): A multi-device Ising architecture that partitions a sparse graph across hardware, keeps weights on-chip, and exchanges only boundary states. "Distributed sparse Ising machines (DSIMs)."
Duplex boundary exchange: A communication scheme duplicating cut-edge weights on both sides so only 1-bit boundary states are exchanged bidirectionally. "(d) Duplex boundary exchange: cut-edge weights are duplicated as shadow weights on both sides of each cut, so only 1-bit boundary states ever cross device boundaries."
Edwards–Anderson spin glass: A canonical disordered Ising model on a lattice used to study spin-glass physics and hard optimization. "A canonical benchmark is the three-dimensional Edwards--Anderson (EA) spin glass~\cite{edwards1975theory}, whose ground-state search is NP-hard~\cite{barahona1982computational}"
Gibbs sampling: A Markov chain Monte Carlo method that updates variables from their conditional distributions. "The machine performs Gibbs sampling at over a trillion flips per second"
Graph coloring: A scheduling technique that partitions nodes into color groups so non-conflicting updates can proceed in parallel. "updated in parallel through graph coloring so that capacity and throughput grow with every added device"
Hogwild-style parallel Gibbs sampling: An asynchronous sampling approach that tolerates stale reads across threads without locks. "Hogwild-style parallel Gibbs sampling survives stale reads across threads~\cite{johnson2013analyzing}"
Isoenergetic cluster moves: Monte Carlo updates that flip a cluster of spins in a pair of replicas while leaving their combined energy unchanged, which accelerates equilibration. "adaptive parallel tempering with isoenergetic cluster moves~\cite{chowdhury2025pushing}"
KaHIP: A high-quality graph partitioning suite used to produce balanced cuts. "METIS~\cite{karypis1998software} or KaHIP~\cite{Sanders2013KaHIP}"
METIS: A widely used graph partitioning tool for min-cut and balancing. "METIS~\cite{karypis1998software} or KaHIP~\cite{Sanders2013KaHIP}"
Min-cut partition: A graph partition minimizing the number of cut edges while balancing partition sizes. "A balanced min-cut partition, obtained with standard tools such as METIS~\cite{karypis1998software} or KaHIP~\cite{Sanders2013KaHIP}, keeps the number of cut edges small"
Monolithic GPU reference: A single-device, unpartitioned baseline used for accuracy comparison. "Above a topology-dependent threshold, the distributed machine matches a monolithic GPU reference."
Monte Carlo sweeps (MCS): Units of work where each variable is updated once; used to measure progress in MCMC. "a fixed budget of $10^6$ Monte Carlo sweeps (MCS) per run"
Overclocking: Operating hardware beyond its verified timing limits to increase throughput at the risk of timing violations. "overclocking beyond timing closure, which lowers the effective ratio, reproduces the predicted speed--accuracy tradeoff."
Pegasus P41: A sparse hardware-native topology used by D-Wave quantum annealers. "On the Pegasus P41 and Zephyr Z50 graphs native to current and next-generation D-Wave quantum annealers"
Potts cost function: A partitioning objective that penalizes placing strongly coupled subgraphs far apart in the physical topology. "a Potts cost function penalizes placing strongly connected partitions on distant devices"
p-bit: A probabilistic bit that fluctuates between two states with tunable probability, used as a stochastic computing primitive. "p-bits, stochastic units that fluctuate between two states with tunable probability~\cite{camsari2019p,camsari2017stochastic,kaiser2021probabilistic,camsari2015modular,camsari2017implementing,borders2019integer,kaiser2019subnanosecond}"
Probabilistic computer (p-computer): A computing paradigm built from p-bits to perform sampling and optimization. "Our vehicle is the probabilistic computer: p-bits, stochastic units that fluctuate between two states with tunable probability"
Putative ground energy: The best-known energy used as a surrogate for the true ground state when the exact minimum is unknown. "with $E^f$ the final energy of a run and $E_{\mathrm{ground}}$ a putative ground energy (Methods)"
Replica-based Monte Carlo algorithms: Methods that run multiple replicas (often at different temperatures) to overcome energy barriers. "replica-based Monte Carlo algorithms have recently matched and exceeded these scaling exponents on the same instances~\cite{chowdhury2025pushing}"
Residual energy per spin: The energy above the (putative) ground state normalized by system size. "The final residual energy per spin, $\rho_E^f=(E^f-E_{\mathrm{ground}})/N$"
Shadow weights: Duplicated copies of cut-edge couplings stored on both sides of a partition boundary. "cut-edge weights are duplicated as shadow weights on both sides of each cut"
Simulated annealing: A stochastic optimization technique that lowers an effective temperature according to a schedule to escape local minima. "EA results use simulated annealing with $\beta=0.5,1.0,\ldots,5.0$"
Simulated bifurcation machine: A specialized analog/digital Ising solver based on dynamical systems that emulate bifurcation behavior. "simulated bifurcation machines face the same ceiling~\cite{goto2019combinatorial,goto2021high}"
Source-synchronous duplex links: Communication links that forward a clock with data in both directions to align transfers across devices. "with independent local clocks and source-synchronous duplex links (Supplementary Sec.~\ref{sec:bus_chain_app} and Fig.~\ref{fig:supp_dsim1_photo})"
Spin Hamiltonian: The energy function of an Ising/spin system specifying interactions and fields. "find low-energy states of spin Hamiltonians and, more generally, sample from their Boltzmann distributions"
Stochastic magnetic tunnel junction (sMTJ): A nanoscale device that exhibits random telegraph switching usable as a physical p-bit. "sub-femtojoule stochastic magnetic tunnel junctions~\cite{kaiser2019subnanosecond,singh2024cmos}"
Super Logic Region (SLR): A physical subdivision within large FPGAs used for hierarchical placement and partitioning. "across the Super Logic Regions (SLRs) within each FPGA as well as across FPGAs"
Timing closure: The state in hardware design where all timing constraints are satisfied at a given clock frequency. "overclocking beyond timing closure"
Time-to-target: The wall-clock time required to reach a specified solution quality threshold. "Time-to-target at one million p-bits ($L^3=100^3$, DSIM-2)."
UCIe (Universal Chiplet Interconnect Express): A standard for high-speed die-to-die interconnects used to link chiplets. "Universal Chiplet Interconnect Express (UCIe)~\cite{ucie_spec}"
Zephyr Z50: A sparse hardware-native topology for next-generation D-Wave quantum annealers. "On the Pegasus P41 and Zephyr Z50 graphs native to current and next-generation D-Wave quantum annealers"
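
Several glossary terms (p-bit, Gibbs sampling, graph coloring, annealing schedule, putative ground energy, residual energy per spin) fit together in one short program. The sketch below is an illustration under assumptions, not the paper's implementation: it uses a tiny 4x4x4 periodic Edwards-Anderson instance with ±1 couplings, the two-color checkerboard of the cubic lattice, and the p-bit update m_i = sgn(tanh(β I_i) − r) with r uniform in (−1, 1), a form commonly used in the p-bit literature. The putative ground energy is simply the best energy over a few runs, and all function names and run sizes are hypothetical.

```python
import math
import random
from itertools import product


def make_ea_instance(L, rng):
    """+/-1 couplings; J[(site, axis)] links site to its +1 neighbour on that axis."""
    sites = list(product(range(L), repeat=3))
    J = {(s, a): rng.choice((-1, 1)) for s in sites for a in range(3)}
    return sites, J


def shift(site, axis, step, L):
    s = list(site)
    s[axis] = (s[axis] + step) % L
    return tuple(s)


def local_field(site, m, J, L):
    """I_i = sum_j J_ij m_j over the six lattice neighbours."""
    total = 0
    for a in range(3):
        total += J[(site, a)] * m[shift(site, a, +1, L)]
        back = shift(site, a, -1, L)
        total += J[(back, a)] * m[back]
    return total


def energy(m, J, L):
    return -sum(Jij * m[s] * m[shift(s, a, +1, L)] for (s, a), Jij in J.items())


def anneal(L, sites, J, rng, betas, sweeps_per_beta):
    assert L % 2 == 0 and L > 2, "checkerboard coloring needs even L > 2"
    m = {s: rng.choice((-1, 1)) for s in sites}
    # Two color groups: sites of equal parity never share an edge.
    groups = [[s for s in sites if sum(s) % 2 == c] for c in (0, 1)]
    for beta in betas:
        for _ in range(sweeps_per_beta):  # one pass over both colors = 1 MCS
            for group in groups:
                # Fields are computed first, then the whole color is updated,
                # mimicking a parallel update of one color group.
                fields = [local_field(s, m, J, L) for s in group]
                for s, I in zip(group, fields):
                    r = rng.uniform(-1.0, 1.0)
                    m[s] = 1 if math.tanh(beta * I) > r else -1
    return energy(m, J, L)


L = 4
N = L ** 3
betas = [0.5 * k for k in range(1, 11)]  # beta = 0.5, 1.0, ..., 5.0
sites, J = make_ea_instance(L, random.Random(1234))

runs = [anneal(L, sites, J, random.Random(seed), betas, sweeps_per_beta=10)
        for seed in range(4)]
E_ground = min(runs)  # putative ground energy: best found, not proven optimal
rho = [(E_f - E_ground) / N for E_f in runs]  # residual energy per spin
print(rho)
```

Because the putative ground energy here is just the best of four short runs, the smallest residual is zero by construction. In practice a stronger solver (for example parallel tempering with much longer runs) would supply the reference energy.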
