CMX Architecture: Integrated Memory and Fusion
- CMX architecture is a hardware–software co-design paradigm that couples memory-centric computation with sensor data fusion for neural inference and quantum algorithms.
- It employs techniques such as analog matrix–vector multiplication, polyhedral compilation, and pipelined execution to maximize throughput while guaranteeing computational correctness.
- The framework supports flexible programming models and durable linearizability, enabling efficient multi-modal segmentation and quantum ground-state energy estimation.
The CMX architecture designates a class of computational memory accelerators and fusion frameworks that integrate hardware and software for memory-centric computation and sensor fusion, with explicit support for neural network inference, quantum chemistry algorithms, and robust multi-modal perception. This term encompasses architectures where computational primitives are executed directly within or near memory structures, and systems where modality fusion is performed for semantic segmentation across diverse sensors. The CMX paradigm unifies hardware (crossbar arrays, digital processors, SRAM, and disaggregated memory pools) and software (compiler interfaces, dataflow graph mapping, and durable programming models), supported by formal operational semantics and advanced compilation strategies.
1. Hardware–Software Co-Design
The computational memory (CM) accelerator diverges from traditional architectures by integrating crossbar arrays (XBARs) for analog matrix–vector multiplication, lightweight digital processing units (DPUs) for operations such as activation and pooling, and local SRAM (MEM) for input/output activations and intermediate data. The software interface targets frameworks like ONNX, TensorFlow, or PyTorch, partitioning the neural network's dataflow graph such that each convolution or similar operation is mapped to an individual CM core. Partitioning is guided by the hardware’s memory, connectivity constraints, and explicit exposure of the interconnect topology. The compiler "lowers" the NN graph and issues configuration, instruction, and control sequences for the global/local control units and DPU, resulting in a system where every core executes in close coordination with its neighbors.
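The lowering step described above can be illustrated as a capacity-checked assignment of dataflow-graph nodes to CM cores. The following is a hypothetical sketch, not the compiler's actual algorithm: the node names, byte sizes, and greedy packing policy are illustrative stand-ins for the real partitioning heuristics, which also account for interconnect topology.

```python
def partition(nodes, core_capacity):
    """Greedy sketch of NN-graph partitioning onto CM cores.

    nodes: list of (name, weight_bytes) pairs in topological order.
    core_capacity: per-core weight-memory budget in bytes.
    Returns a list of cores, each a list of node names.
    """
    cores, current, used = [], [], 0
    for name, size in nodes:
        if size > core_capacity:
            # A node whose weights cannot fit on any single core must be
            # split upstream (e.g., by tiling the convolution).
            raise ValueError(f"{name} exceeds a single core's capacity")
        if used + size > core_capacity:
            # Close the current core and open a fresh one.
            cores.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        cores.append(current)
    return cores
```

In the text each convolution maps to an individual core; the sketch generalizes slightly by letting small adjacent nodes share a core when the budget allows, to make the capacity constraint visible.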
2. Dataflow Engine and Pipeline Architecture
The CMX architecture programs the accelerator as an explicit dataflow engine. Each core corresponds to a pipeline node implementing a neural network layer. Execution is cyclical: the local control unit (LCU) loads activations from SRAM into the crossbar, the crossbar performs the matrix–vector product in the analog domain, and the DPU then carries out additional transformations. Cores are linked so that the outputs of one serve as inputs to the next, with data dependencies mapped directly to control vectors and handshakes over the accelerator's exposed interconnect. This structure enables pipelined, overlapped execution, maximizing throughput and minimizing latency by keeping every layer busy with concurrent work in each cycle.
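The overlapped, cyclical execution can be pictured with a toy cycle-level model in which each "core" is a stage function (abstracting the XBAR product plus DPU post-op) and values advance one latch per cycle. This is an illustrative abstraction of pipelining, not the accelerator's actual control protocol.

```python
def run_pipeline(stages, inputs):
    """Toy cycle-level pipeline: one stage per layer, one latch between stages.

    stages: list of per-layer functions (crossbar MxV + DPU post-op, abstracted).
    Returns outputs in input order; at steady state every stage works each cycle.
    """
    latches = [None] * (len(stages) + 1)  # latches[0] is the input latch
    outputs = []
    feed = iter(inputs)
    while len(outputs) < len(inputs):
        # Drain the final latch first.
        if latches[-1] is not None:
            outputs.append(latches[-1])
            latches[-1] = None
        # Advance stages back-to-front so each value moves one stage per cycle.
        for i in reversed(range(len(stages))):
            if latches[i] is not None and latches[i + 1] is None:
                latches[i + 1] = stages[i](latches[i])
                latches[i] = None
        # Feed a new input if the first latch is free.
        if latches[0] is None:
            latches[0] = next(feed, None)
    return outputs
```

For example, `run_pipeline([f, g], xs)` applies `g(f(x))` to each input while keeping both stages busy once the pipeline fills.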
3. Polyhedral Control Logic and Correctness
Data dependency management and execution-order enforcement are facilitated via polyhedral compilation techniques. Computations and data accesses in each core are modeled as nested loops with affine bounds, permitting rigorous mapping of producer–consumer relations. Affine relations (e.g., a read relation R₂ mapping consumer iterations in J to the array locations they read, and a write relation W₁ mapping producer iterations in I to the locations they write) enable precise coordination: composing them yields, for each array location, the maximal iteration in J that can safely execute, and encodes the order constraints between producers and consumers. State machines generated from this model ensure that each core computes only when all precursor data are available, protecting correctness and avoiding hazards. This mechanism is realized in both local and global control units and is integral to safe, efficient pipelined computation.
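The dependence-driven firing rule can be made concrete with a toy producer–consumer example. The affine maps below are illustrative (a producer loop over I writing A[i], a consumer loop over J reading A[j-1] and A[j]); the names W1 and R2 mirror the text, but the actual relations generated by a polyhedral compiler depend on the loop nest.

```python
def last_writer(addr):
    """Map an array location to the producer iteration that writes it
    (the inverse of the illustrative write relation W1: i -> A[i])."""
    return addr

def reads(j):
    """Illustrative affine read relation R2: consumer iteration j reads
    A[j-1] and A[j] (valid for j >= 1)."""
    return [j - 1, j]

def ready(j, producer_done):
    """Consumer iteration j may fire once every location it reads exists
    and its last writer has completed."""
    return all(a >= 0 and last_writer(a) in producer_done for a in reads(j))

# Simulate interleaved pipelined execution driven by the dependence check.
N = 4
producer_done, consumer_done = set(), []
for step in range(2 * N + 2):
    if len(producer_done) <= N:
        producer_done.add(len(producer_done))  # producer advances one iteration
    for j in range(1, N + 1):
        if j not in consumer_done and ready(j, producer_done):
            consumer_done.append(j)
print(consumer_done)  # consumers fire strictly in dependence order
```

The state machines in the control units play the role of `ready` here: a core's computation is gated on completion signals for exactly the producer iterations its affine read relation touches.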
4. Multi-Core Accelerator and Disaggregated Memory
A complete CMX accelerator integrates multiple computational memory cores, global control logic (GCU), and global memory buffers (GMEM) to interface with external systems. Inter-core connections—and the topology thereof—are exposed to the compiler for optimal mapping of NN graph partitions. This supports pipelining across specialized cores, concurrent execution, and flexible routing for both shallow and deep networks. Data transfers can traverse direct, indirect, or routed interconnects, enhancing both throughput and adaptability. In contexts employing CXL disaggregated memory (as in the CXL0 programming model (Assa et al., 23 Jul 2024)), CMX systems become accessible to general-purpose algorithms with durable persistence guarantees and operational semantics ensuring correct recovery from crashes or partial failures.
5. Fusion Framework for RGB-X Semantic Segmentation
In a distinct but convergent application domain, CMX also refers to a transformer-based fusion framework for RGB-X semantic segmentation (Zhang et al., 2022). Architecturally, it is modality-agnostic, incorporating a two-stream backbone for RGB and complementary sensors (depth, thermal, polarization, event, LiDAR). Key modules include:
- Cross-Modal Feature Rectification Module (CM-FRM): Rectifies features along two dimensions, channel-wise (via pooled attention vectors and an MLP) and spatial-wise (through 1×1 convolutions).
- Feature Fusion Module (FFM): Exchanges information by flattening feature maps, applying cross-attention based on global context vectors, and recombines local and global cues for robust segmentation.
- Supported Modalities: Depth (HHA representation), thermal (infrared), polarization (DoLP/AoLP), event camera data (voxel grid representation), and LiDAR (range-view projections).
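The channel-wise rectification idea in CM-FRM can be sketched in NumPy: pool each modality to a channel descriptor, gate the concatenated descriptors through a small MLP with a sigmoid, and let each modality recalibrate the other. The random weights are stand-ins for trained parameters, and the actual CM-FRM layout (two-layer MLP, separate spatial branch, learned residual scales) differs in detail.

```python
import numpy as np

def channel_rectify(x_rgb, x_aux, seed=0):
    """Channel-wise cross-modal rectification, CM-FRM-style sketch.

    x_rgb, x_aux: feature maps of shape (C, H, W) from the two-stream backbone.
    Returns rectified (rgb, aux) features of the same shape.
    """
    C = x_rgb.shape[0]
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((C, 2 * C)) * 0.1  # hypothetical MLP weights

    # Global-average-pooled attention vectors, one descriptor per modality.
    desc = np.concatenate([x_rgb.mean(axis=(1, 2)), x_aux.mean(axis=(1, 2))])

    # Sigmoid gate derived from BOTH modalities, shape (C,).
    w = 1.0 / (1.0 + np.exp(-(W1 @ desc)))

    # Each stream is rectified by gated features from the other stream.
    out_rgb = x_rgb + w[:, None, None] * x_aux
    out_aux = x_aux + w[:, None, None] * x_rgb
    return out_rgb, out_aux
```

The key design point survives the simplification: the gate is computed jointly from both modalities, so a noisy sensor channel is attenuated using evidence from its counterpart.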
Benchmarks demonstrate state-of-the-art mIoU across nine datasets, with improvements of 3–4% on select benchmarks. The fusion architecture efficiently suppresses sensor noise and exploits cross-modal complementarity in challenging scenes.
6. Programming Models and Durable Transformations
Within memory-disaggregated CMX systems, the CXL0 programming model introduces abstractions for cache-coherent shared memory (LStore, RStore, MStore) and flush operations (LFlush, RFlush) (Assa et al., 23 Jul 2024). Operational semantics guarantee that cached values remain consistent across machines.
A suite of code transformations, such as replacing volatile stores with MStore or RStore paired with RFlush and leveraging the FliT transformation, upgrades conventional linearizable algorithms to guarantee crash durability (durable linearizability). This formal correctness is critical for CMX deployments in persistent and disaggregated memory environments.
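The store-upgrade transformation can be illustrated with a toy model. The names below mirror the text (MStore/RStore, RFlush), but the semantics are simplified assumptions rather than the CXL0 API: the durable region is a dict, and a flush is modeled as immediate durability.

```python
class DurableCell:
    """Toy model contrasting volatile stores with durably upgraded stores."""

    def __init__(self):
        self.cache = {}  # volatile cached state, lost on crash
        self.pmem = {}   # durable region (stand-in for an MStore/RStore target)

    def volatile_store(self, key, value):
        self.cache[key] = value  # untransformed store: not crash durable

    def store(self, key, value):
        # Transformed store: write through to the durable region and flush
        # before the operation can linearize.
        self.cache[key] = value
        self.pmem[key] = value   # MStore/RStore analogue
        self.flush(key)          # RFlush analogue

    def flush(self, key):
        pass  # durability is modeled as immediate in this sketch

    def crash_and_recover(self):
        # Recovery reconstructs state from the durable region only.
        self.cache = dict(self.pmem)

    def load(self, key):
        return self.cache.get(key)
```

Running a crash through this model shows the point of the transformation: flushed stores survive recovery, unflushed volatile stores do not.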
7. Quantum Algorithm Acceleration via Connected Moments Expansion
In quantum computation, CMX denotes the "Connected Moments Expansion" architecture (Claudino et al., 2021). Accurate ground-state energies are determined via recursive measurement of Hamiltonian connected moments $I_k$ for a trial state $|\psi\rangle$; at the lowest nontrivial order (CMX(2)),

$$E_{\mathrm{CMX}(2)} = I_1 - \frac{I_2^2}{I_3},$$

with moments recursively defined by

$$I_{k+1} = \langle\psi|H^{k+1}|\psi\rangle - \sum_{i=0}^{k-1} \binom{k}{i}\, I_{i+1}\, \langle\psi|H^{k-i}|\psi\rangle, \qquad I_1 = \langle\psi|H|\psi\rangle.$$

Low-order truncations (CMX(2) and above) converge to the exact energy under suitable overlap conditions. Coupled with ADAPT-VQE state preparation, measurement caching (of Pauli terms shared across powers of $H$), and thresholding (discarding negligible coefficients), CMX enables efficient, accurate inference within hardware limits.
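A minimal numerical sketch of the moments machinery, assuming the standard Cioslowski recursion for connected moments and the second-order truncation CMX(2); the paper's measurement strategy on quantum hardware differs, but the classical arithmetic is the same.

```python
import numpy as np
from math import comb

def connected_moments(H, psi, kmax):
    """Connected moments I_1..I_kmax from raw moments M_j = <psi|H^j|psi>."""
    M = [1.0]
    v = psi.copy()
    for _ in range(kmax):
        v = H @ v
        M.append(float(np.real(psi.conj() @ v)))
    I = [None, M[1]]  # 1-indexed so I[k] is the k-th connected moment
    for k in range(1, kmax):
        # I_{k+1} = M_{k+1} - sum_{i=0}^{k-1} C(k, i) I_{i+1} M_{k-i}
        I.append(M[k + 1] - sum(comb(k, i) * I[i + 1] * M[k - i]
                                for i in range(k)))
    return I

def cmx2_energy(H, psi):
    """Second-order connected moments expansion: E = I1 - I2^2 / I3."""
    I = connected_moments(H, psi, 3)
    return I[1] - I[2] ** 2 / I[3]

# Toy two-level Hamiltonian with a crude trial state.
H = np.array([[0.0, 0.5],
              [0.5, 2.0]])
psi = np.array([1.0, 0.0])
print(cmx2_energy(H, psi))  # lies below the raw expectation <psi|H|psi> = 0
```

Even at second order, the estimate improves markedly on the raw energy expectation of the trial state, which is the practical appeal of CMX under tight hardware measurement budgets.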
Conclusion
CMX architecture embodies a paradigm of hardware–software co-design for computational memory accelerators, robust fusion frameworks, and programming models for memory disaggregation. By leveraging advanced compilation (polyhedral techniques), operational semantics, and adaptive fusion strategies, CMX systems achieve efficient, durable, and generalizable performance across neural inference, quantum computation, and multi-modal perception. The principle of directly encoding computation into memory and fusing signals through formalized, channel– and spatial-wise mechanisms delineates a technically rigorous, extensible pathway for high-performance intelligent systems.