
CMX Architecture: Integrated Memory and Fusion

Updated 19 August 2025
  • CMX architecture is a hardware-software co-design paradigm that fuses memory-centric computation with sensor data fusion for neural inference and quantum algorithms.
  • It employs advanced techniques like analog matrix–vector multiplication, polyhedral compilation, and pipelined execution to maximize throughput and ensure computation correctness.
  • The framework supports flexible programming models and durable linearizability, enabling efficient multi-modal segmentation and quantum energy estimation.

The CMX architecture designates a class of computational memory accelerators and fusion frameworks that integrate hardware and software for memory-centric computation and sensor fusion, with explicit support for neural network inference, quantum chemistry algorithms, and robust multi-modal perception. This term encompasses architectures where computational primitives are executed directly within or near memory structures, and systems where modality fusion is performed for semantic segmentation across diverse sensors. The CMX paradigm unifies hardware (crossbar arrays, digital processors, SRAM, and disaggregated memory pools) and software (compiler interfaces, dataflow graph mapping, and durable programming models), supported by formal operational semantics and advanced compilation strategies.

1. Hardware–Software Co-Design

The computational memory (CM) accelerator diverges from traditional architectures by integrating crossbar arrays (XBARs) for analog matrix–vector multiplication, lightweight digital processing units (DPUs) for operations such as activation and pooling, and local SRAM (MEM) for input/output activations and intermediate data. The software interface targets frameworks like ONNX, TensorFlow, or PyTorch, partitioning the neural network's dataflow graph such that each convolution or similar operation is mapped to an individual CM core. Partitioning is guided by the hardware’s memory, connectivity constraints, and explicit exposure of the interconnect topology. The compiler "lowers" the NN graph and issues configuration, instruction, and control sequences for the global/local control units and DPU, resulting in a system where every core executes in close coordination with its neighbors.
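The partitioning step described above can be sketched in a few lines. This is an illustrative toy, not a real toolchain: the `Layer` record, the crossbar dimensions, and the one-tile-per-core policy are all assumptions standing in for the actual compiler's constraint model.

```python
# Hypothetical sketch of the compiler's partitioning step: each layer's
# weight matrix is tiled to fit the crossbar, and each tile is assigned
# its own CM core. Names and sizes are illustrative, not from a real tool.

from dataclasses import dataclass

XBAR_ROWS, XBAR_COLS = 256, 256  # assumed crossbar dimensions

@dataclass
class Layer:
    name: str
    in_ch: int   # crossbar rows consumed (inputs)
    out_ch: int  # crossbar columns consumed (outputs)

def cores_needed(layer: Layer) -> int:
    """Number of CM cores required to hold the layer's weight matrix."""
    row_tiles = -(-layer.in_ch // XBAR_ROWS)   # ceiling division
    col_tiles = -(-layer.out_ch // XBAR_COLS)
    return row_tiles * col_tiles

def partition(graph: list[Layer]) -> dict[str, list[int]]:
    """Assign consecutive core IDs to each layer, one tile per core."""
    mapping, next_core = {}, 0
    for layer in graph:
        n = cores_needed(layer)
        mapping[layer.name] = list(range(next_core, next_core + n))
        next_core += n
    return mapping

net = [Layer("conv1", 3, 64), Layer("conv2", 64, 128), Layer("conv3", 512, 512)]
print(partition(net))  # conv3 spans four cores: 512x512 exceeds one crossbar
```

A real compiler would additionally weigh SRAM capacity and interconnect distance between communicating tiles; this sketch only captures the crossbar-capacity constraint.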

2. Dataflow Engine and Pipeline Architecture

The CMX architecture programs the accelerator as an explicit dataflow engine. Each core aligns with a node in the pipeline corresponding to a neural network layer. Execution is cyclical: the local control unit (LCU) loads activations from SRAM to the crossbar, performs MxV multiplication in analog, then the DPU carries out additional transformations. Cores are linked so that outputs of one serve as inputs to the next, with data dependencies mapped directly to control vectors and handshakes via the accelerator’s exposed interconnect. This structure enables pipelined, overlapped execution, maximizing throughput and minimizing latency by ensuring concurrent work across layers during every cycle.
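The overlapped execution pattern above can be modeled as a toy cycle-accurate pipeline, where each "core" is a stage function standing in for a crossbar MxV plus DPU post-processing. Everything here (stage functions, latch model) is a simplification for illustration.

```python
# Minimal sketch of the pipelined dataflow model: each stage consumes the
# value its predecessor latched in the previous cycle, so all layers work
# concurrently once the pipeline fills. Stage functions stand in for the
# per-core XBAR matrix-vector step plus DPU post-processing.

def run_pipeline(stages, inputs):
    """Toy cycle model: returns (outputs, total_cycles)."""
    n = len(stages)
    latches = [None] * n          # latches[i] feeds stage i this cycle
    outputs, cycle = [], 0
    pending = list(inputs)
    while len(outputs) < len(inputs):
        new = [None] * n
        if pending:
            new[0] = pending.pop(0)       # inject next activation
        for i, f in enumerate(stages):    # all stages fire "in parallel"
            if latches[i] is not None:
                res = f(latches[i])
                if i + 1 < n:
                    new[i + 1] = res      # hand off to the next core
                else:
                    outputs.append(res)   # last layer emits a result
        latches = new
        cycle += 1
    return outputs, cycle

out, cyc = run_pipeline([lambda x: x + 1, lambda x: 2 * x], [1, 2, 3])
print(out, cyc)  # [4, 6, 8] in 5 cycles, vs 6 sequential stage-executions
```

The cycle count illustrates the throughput argument: after an n-stage fill, one result completes per cycle, so total cycles grow as inputs + stages rather than inputs × stages.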

3. Polyhedral Control Logic and Correctness

Data dependency management and execution-order enforcement rely on polyhedral compilation techniques. Computations and data accesses in each core are modeled as loop nests with affine bounds, permitting rigorous mapping of producer–consumer relations. Affine relations (e.g., R₂ for reads over the iteration domain J, W₁ for writes over the iteration domain I) enable precise coordination:

\mathcal{S} = \text{lexmax}\{ W_1(\mathcal{K}(D'))^{-1} \}

where \mathcal{S} maps array locations to the maximal iteration in J for safe execution, and D' encodes order constraints. State machines generated via this model ensure that each core computes only when all precursor data are available, protecting correctness and avoiding hazards. This mechanism is realized in both local and global control units and is integral to safe, efficient pipelined computation.
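The lexmax computation can be illustrated concretely. A real compiler would use a polyhedral library (such as isl) over symbolic affine maps; the toy below enumerates a small iteration domain explicitly, with the write map W₁: (i, j) → i + j chosen purely for illustration.

```python
# Toy illustration of the lexmax dependence query: for each array location,
# find the lexicographically last producer iteration that writes it, so a
# consumer may fire only after that point. A production compiler would use
# a polyhedral library (e.g. isl) on symbolic affine maps instead.

def last_writer(location, writers):
    """lexmax over all producer iterations writing `location`."""
    hits = [it for it, loc in writers if loc == location]
    return max(hits) if hits else None  # tuples compare lexicographically

# Producer nest: for i in 0..3: for j in 0..3: write A[i+j]
# i.e. the affine write map W1: (i, j) -> i + j, enumerated explicitly.
writers = [((i, j), i + j) for i in range(4) for j in range(4)]

# A[3] is written at (0,3), (1,2), (2,1), (3,0); lexmax is (3, 0).
print(last_writer(3, writers))  # -> (3, 0)
```

The control state machine then gates each consumer iteration on the completion of the `last_writer` iteration for every location it reads.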

4. Multi-Core Accelerator and Disaggregated Memory

A complete CMX accelerator integrates multiple computational memory cores, global control logic (GCU), and global memory buffers (GMEM) to interface with external systems. Inter-core connections—and the topology thereof—are exposed to the compiler for optimal mapping of NN graph partitions. This supports pipelining across specialized cores, concurrent execution, and flexible routing for both shallow and deep networks. Data transfers can traverse direct, indirect, or routed interconnects, enhancing both throughput and adaptability. In contexts employing CXL disaggregated memory (as in the CXL0 programming model (Assa et al., 23 Jul 2024)), CMX systems become accessible to general-purpose algorithms with durable persistence guarantees and operational semantics ensuring correct recovery from crashes or partial failures.
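Since the interconnect topology is exposed to the compiler, routed transfers reduce to path queries over the core graph. The sketch below uses a breadth-first search on an assumed 2×2 mesh; the topology and the hop-count objective are illustrative, not from any specific CMX device.

```python
# Sketch of routed inter-core transfers over an exposed topology: BFS over
# the core graph yields a minimum-hop path for each producer/consumer pair.
# The 2x2 mesh below is an illustrative assumption, not a real device map.

from collections import deque

def route(topology, src, dst):
    """Shortest hop path from core src to core dst (BFS)."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:    # walk predecessors back to src
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in topology.get(u, []):
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None  # destination unreachable over this interconnect

# 2x2 mesh of CM cores:  0 - 1
#                        |   |
#                        2 - 3
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(route(mesh, 0, 3))  # a 2-hop route, e.g. [0, 1, 3]
```

Direct neighbors yield 1-hop transfers; anything longer becomes an indirect or routed transfer, which the compiler can weigh when placing communicating graph partitions.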

5. Fusion Framework for RGB-X Semantic Segmentation

In a distinct but convergent application domain, CMX also refers to a transformer-based fusion framework for RGB-X semantic segmentation (Zhang et al., 2022). Architecturally, it is modality-agnostic, incorporating a two-stream backbone for RGB and complementary sensors (depth, thermal, polarization, event, LiDAR). Key modules include:

  • Cross-Modal Feature Rectification: calibrates each stream using channel-wise and spatial-wise cues recovered from the complementary modality; the rectified RGB feature is

\text{RGB}_{\text{out}} = \text{RGB}_{\text{in}} + \lambda_C \cdot \text{RGB}_{\text{rec}}^C + \lambda_S \cdot \text{RGB}_{\text{rec}}^S

  • Feature Fusion Module (FFM): exchanges information by flattening feature maps, applying cross-attention based on global context vectors, and recombining local and global cues for robust segmentation.
  • Supported Modalities: depth (HHA representation), thermal (infrared), polarization (DoLP/AoLP), event camera data (voxel grid representation), and LiDAR (range view projections).

Benchmarks demonstrate state-of-the-art performance across nine datasets, with mIoU improvements of 3–4% on select benchmarks. The fusion architecture efficiently suppresses sensor noise and exploits cross-modal complementarity for challenging scenes.
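The rectification update is a simple weighted residual correction and can be sketched directly. The equal λ weights and the list-based "tensors" below are illustrative assumptions; the actual model operates on feature maps with learned rectification terms.

```python
# Sketch of the cross-modal rectification update: the RGB feature is
# corrected by channel-wise (C) and spatial-wise (S) terms recovered from
# the complementary modality. Plain lists stand in for feature tensors;
# the lambda weights and values below are illustrative assumptions.

LAMBDA_C, LAMBDA_S = 0.5, 0.5  # assumed equal weights for illustration

def rectify(rgb, rec_c, rec_s, lc=LAMBDA_C, ls=LAMBDA_S):
    """RGB_out = RGB_in + lambda_C * RGB_rec^C + lambda_S * RGB_rec^S."""
    return [x + lc * c + ls * s for x, c, s in zip(rgb, rec_c, rec_s)]

print(rectify([1.0, 2.0], [0.2, -0.4], [0.6, 0.0]))  # ~ [1.4, 1.8]
```

The same update is applied symmetrically to the X-modality stream, so each branch is denoised using evidence from the other before fusion.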

6. Programming Models and Durable Transformations

Within memory-disaggregated CMX systems, the CXL0 programming model introduces abstractions for cache-coherent shared memory (LStore, RStore, MStore) and flush operations (LFlush, RFlush) (Assa et al., 23 Jul 2024). Operational semantics guarantee that cached values across machines are consistent:

\forall i, j, \forall x \in Loc, (\text{Cache}_i(x) \neq \perp \land \text{Cache}_j(x) \neq \perp) \Rightarrow \text{Cache}_i(x) = \text{Cache}_j(x)

A suite of code transformations, such as replacing volatile stores with MStore, or with RStore paired with RFlush, and leveraging the FliT transformation, upgrades conventional linearizable algorithms to guarantee crash durability (durable linearizability). This formal correctness is critical for CMX deployments in persistent and disaggregated memory environments.
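The store-then-flush transformation can be illustrated with a toy crash model. The `Memory` class below only mimics the cache-versus-durable-media distinction of the operational semantics; its method names are illustrative stand-ins for the LStore/RStore and LFlush/RFlush primitives, not a real API.

```python
# Toy model of the store-then-flush transformation: a volatile write is
# upgraded to a store plus an explicit flush, so its effect survives a
# crash. The class only mimics the cache-vs-durable-media distinction of
# the CXL0 semantics; names are illustrative, not a real API.

class Memory:
    def __init__(self):
        self.cache = {}    # volatile: lost on crash
        self.durable = {}  # persistent media: survives a crash

    def store(self, loc, val):
        """Like (L/R)Store: the write lands only in cache."""
        self.cache[loc] = val

    def flush(self, loc):
        """Like (L/R)Flush: persist the cached value for loc."""
        if loc in self.cache:
            self.durable[loc] = self.cache[loc]

    def crash(self):
        """Model a failure: cache contents vanish, durable state remains."""
        self.cache.clear()

mem = Memory()
mem.store("x", 1)   # untransformed volatile store: not crash-safe
mem.store("y", 2)
mem.flush("y")      # transformed store: store + flush
mem.crash()
print(mem.durable)  # {'y': 2} -- only the flushed write survived
```

Recovery code then reconstructs a consistent state from `durable` alone, which is exactly what durable linearizability requires of the transformed algorithm.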

7. Quantum Algorithm Acceleration via Connected Moments Expansion

In quantum computation, CMX denotes the "Connected Moments Expansion" architecture (Claudino et al., 2021). Accurate ground-state energies are determined via recursive measurement of connected moments of the Hamiltonian for a trial state |\Phi\rangle:

E(t) = \frac{\langle \Phi|H e^{-tH}|\Phi \rangle}{\langle \Phi|e^{-tH}|\Phi \rangle} = \sum_k \frac{(-t)^k}{k!} I_{k+1}

with moments recursively defined by

I_k = \langle \Phi|H^k|\Phi \rangle - \sum_{i=0}^{k-2} {k-1 \choose i} I_{i+1} \langle \Phi|H^{k-i-1}|\Phi \rangle

Low-order truncations (\text{CMX}(2) and above) converge to the exact energy under suitable overlap conditions. Coupled with ADAPT-VQE state preparation, measurement caching (reusing Pauli-term expectation values across powers of H), and thresholding (discarding negligible coefficients), CMX enables efficient, accurate inference within hardware limits.
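The recursion for I_k and the lowest-order energy estimate can be checked numerically on a toy problem. The sketch below uses a 2×2 matrix as a stand-in Hamiltonian (on hardware the raw moments ⟨H^k⟩ would come from measured Pauli expectation values) and applies the Knowles-type truncation E ≈ I₁ − I₂²/I₃ for CMX(2).

```python
# Sketch of the connected-moments recursion and a CMX(2) energy estimate,
# E ~ I_1 - I_2^2 / I_3 (Knowles-type truncation). A small dense matrix
# stands in for raw moments <H^k> measured as Pauli expectation values.

from math import comb

def matvec(H, v):
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(H))]

def raw_moments(H, phi, kmax):
    """<phi|H^k|phi> for k = 0..kmax (phi assumed real and normalized)."""
    mom, w = [], phi[:]
    for _ in range(kmax + 1):
        mom.append(sum(p * x for p, x in zip(phi, w)))
        w = matvec(H, w)
    return mom

def connected_moments(mom, kmax):
    """I_k = <H^k> - sum_{i=0}^{k-2} C(k-1,i) * I_{i+1} * <H^{k-i-1}>."""
    I = [None, mom[1]]  # I_1 = <H>; index 0 unused
    for k in range(2, kmax + 1):
        I.append(mom[k] - sum(comb(k - 1, i) * I[i + 1] * mom[k - i - 1]
                              for i in range(k - 1)))
    return I

# Toy 2x2 Hamiltonian; exact ground energy is (1 - sqrt(2))/2 ~ -0.2071.
H = [[0.0, 0.5], [0.5, 1.0]]
phi = [1.0, 0.0]  # trial state
mom = raw_moments(H, phi, 3)
I = connected_moments(mom, 3)
E2 = I[1] - I[2] ** 2 / I[3]  # CMX(2) estimate
print(E2)  # -0.25; note CMX is not variational, so it can undershoot
```

One can verify the low-order identities directly: I₁ = ⟨H⟩ and I₂ = ⟨H²⟩ − ⟨H⟩², matching the recursion's first two steps.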

Conclusion

CMX architecture embodies a paradigm of hardware–software co-design for computational memory accelerators, robust fusion frameworks, and programming models for memory disaggregation. By leveraging advanced compilation (polyhedral techniques), operational semantics, and adaptive fusion strategies, CMX systems achieve efficient, durable, and generalizable performance across neural inference, quantum computation, and multi-modal perception. The principle of directly encoding computation into memory and fusing signals through formalized, channel– and spatial-wise mechanisms delineates a technically rigorous, extensible pathway for high-performance intelligent systems.
