Papers
Topics
Authors
Recent
Search
2000 character limit reached

GEM3D CIM General Purpose Matrix Computation Using 3D Integrated SRAM eDRAM Hybrid Compute In Memory on Memory Architecture

Published 15 Apr 2026 in cs.AR | (2604.13969v1)

Abstract: With the rapid growth of deep neural networks (DNNs), compute-in-memory (CIM) has emerged as a promising energy-efficient paradigm for accelerating multiply-and-accumulate (MAC) operations. Yet, current CIM architectures are largely limited to dot-product computations and struggle to efficiently support general-purpose matrix operations, such as transpose, element-wise addition, and multiplication. This work presents a 3D-integrated, memory-on-memory SRAM-eDRAM hybrid CIM architecture, implemented in GlobalFoundries 22~nm FDSOI technology, capable of performing general matrix operations directly within the memory crossbar with 4-bit precision. By leveraging a specialized transpose-based architecture, in-memory arithmetic operations, peripheral-aware design, and 3D SRAM--eDRAM integration, the proposed architecture balances latency, energy efficiency, and compute density for general purpose matrix operations while remaining compatible with the conventional CIM dot product architectures. Overall, this memory-on-memory CIM framework generalizes CIM beyond dot products, enabling versatile matrix processing and paving the way for broader applications in AI acceleration and general-purpose high performance computing.

Summary

  • The paper introduces a novel GEM3D-CIM architecture that extends compute-in-memory from dot products to full matrix operations, including transpose, multiplication, and addition.
  • It utilizes monolithic 3D integration of SRAM and eDRAM with specialized bit-cell designs and routing to achieve high throughput, minimal latency, and improved energy efficiency.
  • Extensive simulations and numerical analyses validate the design's performance, demonstrating significant improvements over traditional CIM systems for AI and high-performance computing workloads.

General-Purpose Matrix Computation with GEM3D-CIM: A Monolithic 3D SRAM-eDRAM Compute-in-Memory Architecture

Introduction and Motivation

The "GEM3D-CIM" architecture (2604.13969) tackles a fundamental limitation in existing compute-in-memory (CIM) systems: the restriction to dot-product (MAC) operations, which impedes efficient support for general matrix transformations and arithmetic. The architecture introduces a monolithic 3D-integrated hybrid of SRAM and eDRAM arrays that enables not only dot product but also matrix transpose, element-wise (Hadamard) multiplication, and addition directly within the crossbar at 4-bit precision, supporting dense, high-throughput analog and digital processing.

The stacking of SRAM (Layer A) and eDRAM (Layer B) through fine-pitch monolithic vias achieves high density, bandwidth, and low-latency inter-tier communication, setting the stage for versatile in-memory processing suitable for AI acceleration and scientific computing workloads where elementwise and structural matrix operations are critical. Figure 1

Figure 1: The proposed architecture leverages monolithic 3D integration of a top-tier SRAM array and a bottom-tier eDRAM array, partitioned into functional sub-arrays for in-memory transpose, multiplication, and accumulation.

Architectural Design and Circuit Innovations

Hybrid Bit-Cell Array Structure

The GEM3D-CIM design implements specialized bit-cells for each class of operation. For matrix transpose, both transposable SRAM (T-SRAM) and transposable eDRAM (T-eDRAM) bit-cells introduce supplemental circuitry—T-SRAMs add five transistors to enable dual-mode (conventional and transpose), while T-eDRAMs employ additional buffers and write logic to ensure proper isolation and integrity under diagonal reordering.

For arithmetic operations, the architecture adopts 8T MA-SRAM bit-cells serving as DACs (with weighted access transistor sizing for MSB-to-LSB digital-to-analog conversion) and 9T MA-eDRAM bit-cells supporting analog-to-digital conversion via LFSR-based counters. The strategic partitioning of sub-arrays by function avoids area penalties associated with over-integration while enhancing scalability. Figure 2

Figure 2: Layouts of transposable and arithmetic-enabled SRAM/eDRAM bit-cells used in the architected crossbar.

In-Memory Matrix Transpose

The matrix transpose operation is a three-step algorithm exploiting vertical connections between layers and custom crossbar routing. The process involves (1) transferring upper-diagonal elements from SRAM (Layer A) to eDRAM (Layer B), (2) simultaneous internal swaps within each layer, and (3) restoration of lower-diagonal elements back to Layer A. Added transmission-gate-based "blockers" facilitate controlled parallel or sequential enablement, supporting N×NN\times N transpose in N+1N+1 cycles—nearly halving the cycles required by traditional sequential architectures. Figure 3

Figure 3: (a) 3×3 transpose dataflow. (b-e) Topologies of T-SRAM, T-eDRAM sub-arrays, their specialized routing, and inter-layer interconnects for efficient parallel transpose support.

Transient simulations confirm robustness under process, voltage, and temperature variation, and the area overhead is kept minimal by specializing layout only for necessary operations. Figure 4

Figure 4: Monte Carlo transient results highlight reliable state transitions during in-memory transpose in both T-SRAM and T-eDRAM under 1000 samples.

In-Memory Element-wise Arithmetic

Analog Multiplication and Addition

To support element-wise matrix multiplication and addition, the architecture utilizes current-based DACs implemented with MA-SRAM cells and LFSR-based ADCs within MA-eDRAM. DAC precision and signal margins are boosted transiently to 1.8V supply during conversion. Element pairs are mapped adjacently for direct analog computation.

The multiplication path leverages a capacitive C2C analog multiplier, followed by a PMOS differential amplifier that transforms the analog result into a time-delay encoded as an LFSR digital output. The addition path aggregates currents in Layer A, with the analog sum similarly converted and digitized in Layer B. Calibration sequences account for comparator offset across sub-arrays. Figure 5

Figure 5: MA-SRAM and MA-eDRAM sub-array configuration for DAC, multiplication, addition, and LFSR-based ADC operation.

Figure 6

Figure 6: (a) Analog multiplication and (b) addition output behaviors as a function of operand input levels.

Linearity of the LFSR-based ADC supports effective matrix arithmetic. The architecture achieves an effective number of bits (ENOB) of 4.78, validated by extensive Monte Carlo simulation. Figure 7

Figure 7: The LFSR-based ADC maintains strong linearity for analog multiplication and addition conversions, essential for low-error in-memory arithmetic.

Performance, Energy, and Area Analysis

Numerical Results

  • Matrix transpose throughput: 15.51 GOPS (12.77 GOPS/W) for a 32×3232\times32 4-bit matrix, with a latency of 264 ns and total energy of 320.55 nJ.
  • Elementwise arithmetic: Multiplication achieves 13.93 GOPS (436.61 GOPS/W), while addition reaches 27.86 GOPS (432.25 GOPS/W) over similar matrices, with sub-microsecond latency.
  • These results significantly outperform prior CIM designs on throughput and energy efficiency for non-dot-product matrix ops. Figure 8

    Figure 8: Comprehensive energy breakdowns for transpose, multiplication, and addition operations, indicating contributions from SRAM (Layer A) and eDRAM (Layer B).

Area and Integration Considerations

  • The T-SRAM cell occupies 2.93 μm², T-eDRAM 1.04 μm², MA-eDRAM 6.36 μm² due to computational logic; all layouts based on conservative logic design rules for compatibility.
  • Area overheads for functional bit-cells are offset by savings in data transfer and system energy for general matrix operations.
  • Monolithic 3D stacking enables high interconnect density, bandwidth, and reduced energy, but introduces thermal coupling and variability challenges, which are addressed by redundancy and calibration strategies.

Implications and Future Directions

From a theoretical standpoint, GEM3D-CIM extends the computational semantics of CIM architectures from traditional MAC-centric to full matrix algebra, enabling workloads with a high density of transpositions and element-wise ops to be realized in-memory with low latency and energy. This broadens the use of CIM from DNN inference/training to scientific simulations, RNNs/GRUs, and general-purpose matrix-centric computation.

Practically, the work demonstrates that foundry-compatible, monolithic 3D integration is a feasible and robust path to high-density, high-bandwidth CIM macros amenable to future advances in BEOL transistor and vertical interconnect technologies. Open challenges include scaling to higher digital precision, mitigating thermal coupling, and further optimizing bit-cell area efficiency. As foundry support for dense inter-tier integration matures, the paradigm demonstrated here can serve as the foundation for next-generation AI accelerators and heterogeneous high-performance compute fabrics where CIM subsumes a broader class of matrix and tensor primitives.

Conclusion

GEM3D-CIM realizes a versatile, monolithic 3D SRAM-eDRAM compute-in-memory architecture that natively supports matrix transpose, elementwise multiplication, addition, and MAC within memory crossbars, showcasing strong throughput and energy metrics across kernels with rigorous transient and variation-aware validation. The architectural techniques and integration strategies implemented portend a shift toward general-purpose in-dataflow computation, expanding both the practical and theoretical boundaries of mainstream CIM and AI/HPCC acceleration platforms.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

What this paper is about (big picture)

Computers spend a lot of energy and time shuttling data back and forth between memory (where data is stored) and processors (where math happens). This paper proposes a new kind of “memory that can also compute,” so some math happens right where the data already sits. Even better, it isn’t limited to just one kind of math (dot products for neural networks). It can also do other common matrix chores—like flipping a matrix (transpose), adding two matrices, and multiplying them element-by-element—directly inside the memory.

To do this, the authors stack two kinds of memory on top of each other in 3D:

  • Layer A: fast and stable SRAM (like a quick-access shelf)
  • Layer B: dense and energy‑efficient eDRAM (like a deep storage shelf)

Tiny vertical connections between the layers let them “talk” very quickly.

What questions the paper tries to answer

  • Can we build memory that not only stores data but also performs a variety of matrix operations inside it, not just the usual dot product?
  • Can this be done efficiently (fast and low energy) using real, industry-compatible chip technology?
  • Can we design the memory so it plays nicely with normal digital systems, even if some of the math happens using analog signals?

How the researchers approached it (in simple terms)

Think of a library:

  • Normally, you check out books (data), carry them to a study room (processor), do your work (compute), then carry them back (data movement). That’s slow and tiring.
  • The authors set up study desks right between the shelves. Now you can read and write (compute) without hauling books around.

Here’s how that looks in hardware:

  • 3D stacking: Two “floors” (SRAM above, eDRAM below) are connected by lots of tiny, short “elevators.” Shorter trips mean faster and more energy‑efficient communication.
  • Special memory cells:
    • T‑SRAM and T‑eDRAM: These are “transposable” cells that can read and write along different directions, making it easy to swap positions of numbers for fast transpose (flipping rows and columns).
    • MA‑SRAM and MA‑eDRAM: These are “math‑ready” cells for element‑wise addition and multiplication.
  • Mixing digital and analog smartly:
    • Digital numbers stored in memory are turned into analog voltages (like using a dimmer switch) using the SRAM (“DAC” function).
    • After doing analog math (fast and compact), the result is turned back into digital using the eDRAM (an “ADC” function that uses a simple counter trick called an LFSR).
  • Compatibility:
    • Even though the math inside uses analog signals, both the inputs and the stored results are digital, so standard computer parts can use them.
  • Still supports dot products:
    • The same setup can also perform normal multiply‑and‑accumulate (MAC) operations used in neural networks.

What each main operation means (with analogies)

  • Matrix transpose: Like flipping a spreadsheet across its main diagonal so rows become columns. The design copies elements across layers and within the same layer in a few organized steps, finishing in about N+1 steps for an N×N matrix.
  • Element‑wise addition: Add matching entries from two matrices (like adding two pictures pixel by pixel).
  • Element‑wise multiplication (Hadamard product): Multiply matching entries from two matrices (like darkening or blending two images pixel by pixel).

What they built and tested

  • Technology: Implemented with GlobalFoundries 22 nm FDSOI (a real, industry chip process).
  • Precision: 4‑bit inputs (small numbers, enough for many AI tasks).
  • Transpose engine:
    • Special T‑SRAM and T‑eDRAM cells can copy and swap values in place.
    • Finishes transpose in N+1 cycles (faster than doing separate read/write steps).
  • Addition and multiplication engine:
    • MA‑SRAM turns stored 4‑bit numbers into analog voltages (DAC).
    • Simple analog circuits add or multiply those voltages.
    • MA‑eDRAM converts the analog result back to digital using an LFSR‑based ADC (like timing how long it takes a ramp to catch up to a voltage).
    • A quick calibration step improves accuracy; the effective accuracy is around 5 bits.

Main results and why they matter

For a 32×32 matrix (a common small test case):

  • Transpose:
    • Time: about 264 nanoseconds
    • Energy: about 320 nanojoules
    • Throughput: ~15.5 GOPS (billion operations per second)
    • Energy efficiency: ~12.8 GOPS per watt
  • Element‑wise multiplication (4‑bit × 4‑bit):
    • Time: ~588 nanoseconds
    • Energy: ~18.8 nanojoules
    • Throughput: ~13.9 GOPS
    • Energy efficiency: ~437 GOPS per watt
  • Element‑wise addition:
    • Time: ~294 nanoseconds
    • Energy: ~19.0 nanojoules
    • Throughput: ~27.9 GOPS
    • Energy efficiency: ~432 GOPS per watt

Why this is important:

  • Much less data movement: The math happens where the data lives, saving energy and time.
  • More than dot products: Many algorithms (RNNs, physics simulations, masking, tensor tricks) need transpose and element‑wise ops; this system supports them directly.
  • Works with standard chips: Inputs/outputs are digital and the design is foundry‑compatible, making it more practical to build and use.
  • 3D stacking boosts bandwidth and density: Short, dense vertical links make the two layers act like one tightly integrated calculator.

What this could change (implications)

  • Faster, greener AI hardware: Especially for tasks that use lots of matrix operations beyond just dot products.
  • Broader scientific computing on memory-centric chips: Simulations and signal processing could benefit from element‑wise operations and transposes done in place.
  • New chip designs: Shows a path to “memory-on-memory” compute blocks that mix analog math with digital storage using realistic manufacturing methods.
  • Future directions: Improve precision, scale to larger matrices, and integrate more functions while keeping energy low.

In short, the paper shows that we can turn memory from a passive “storage closet” into an active “workbench” for many common matrix operations, all while saving energy and time—an important step for faster and more efficient AI and scientific computing.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and open questions that future work could address to strengthen and generalize the proposed GEM3D-CIM architecture.

  • Precision and dynamic range limits: All core matrix operations are demonstrated at 4-bit inputs with an LFSR-ADC ENOB of ~4.78 bits; it remains unclear how to scale to higher bit depths (e.g., 8–12 bits) without prohibitive energy/area overhead, and how quantization/linearity errors impact end-to-end workloads.
  • Inconsistent product resolution: Element-wise 4b×4b multiplication theoretically yields 8-bit products, yet the architecture reports converting to a 6-bit digital result (via 8-bit LFSR code). The rationale, precision loss, and application-level accuracy impact are not quantified.
  • ADC architecture trade-offs: The choice of LFSR-based time-to-digital conversion (64-cycle) is not compared against alternatives (e.g., SAR, flash, cyclic) for speed, area, linearity, and energy across PVT; no path is provided to reduce ADC cycles or pipeline ADC for higher throughput.
  • DAC/ADC non-idealities: Comprehensive INL/DNL, noise, jitter, ramp-generator distribution error, and inter-word mismatch analyses are missing; only limited Monte Carlo/linearity plots are shown without bit-error rates or failure probabilities at array scale.
  • Comparator calibration robustness: The proposed per-word comparator offset calibration lacks details on storage overhead, frequency of recalibration, drift over temperature/aging, calibration convergence time, and its effect on throughput and energy at scale.
  • C2C multiplier variability: Capacitor mismatch, parasitic coupling, and temperature dependence for the capacitive multiplier are not quantified; no calibration/trim strategy is proposed to maintain linearity across PVT and across large arrays.
  • Analog signal distribution across tiers: The impact of monolithic inter-tier via (MIV) resistance/capacitance, crosstalk, and supply/ground noise on analog accuracy (for VDAC, VMUL, comparator inputs) is not characterized at crossbar-scale frequencies.
  • eDRAM retention and refresh overhead: The retention time, refresh policies, and their energy/latency impact during compute—especially when eDRAM is used both for storage and as an LFSR counter—are not analyzed under PVT and temperature.
  • Thermal management in 3D: Thermal gradients and hotspots in stacked SRAM–eDRAM, and their effects on eDRAM retention, comparator offsets, DAC/ADC linearity, and timing margins, are not modeled or experimentally validated.
  • 1.8 V domain integration: The reliability (BTI/HCI), level shifters, inrush current control when switching between 0.8 V and 1.8 V, and the associated PDN/decap design and energy costs are not quantified; long-term aging and safe operating limits are not discussed.
  • Energy accounting scope: Reported energy/throughput figures exclude or only partially include peripheral overheads (e.g., global ramp generator, level shifters, wordline/bitline drivers, blockers, control FSMs, inter-tier drivers). A full-chip energy budget and sensitivity to control activity are missing.
  • Scalability beyond 32×32: Only a 32×32 case is parasitic-aware; scalability to larger arrays (e.g., 256×256, 1k×1k) is not validated for RC-induced delay, signal integrity, blocker resistance, read/write disturbance, and area/energy growth.
  • Blocker network overheads: Transmission-gate “blockers” used for transpose introduce area, leakage, and resistance; their on/off resistance variation, timing skew, and impact on write margins and error rates across large arrays are not evaluated.
  • Transpose generality: The method targets square matrices with “proper padding”; the cycle count, dataflow, and utilization for rectangular or very large matrices (with tiling across sub-arrays) remain unspecified.
  • Data movement and duplication: Element-wise multiplication requires matrix B to reside in both layers (and co-location with A in Layer A). The one-time and repeated data-loading costs, bandwidth, and host–CIM I/O energy are not quantified.
  • Utilization and fragmentation: Separating transpose and MA functions into dedicated sub-arrays reduces flexibility; the resulting capacity fragmentation, scheduling constraints, and system-level utilization under mixed workloads are not explored.
  • Error accumulation in algorithms: No end-to-end accuracy or convergence studies (e.g., RNNs, physics kernels) quantify how analog errors from multiple transpose/add/multiply steps compound, nor how mixed-signal noise/quantization propagates through real workloads.
  • Dot-product path benchmarking: While claiming compatibility with MAC, no quantitative comparison (latency/energy/accuracy) is provided versus state-of-the-art CIM dot-product arrays under identical PVT and precision assumptions.
  • Throughput normalization: GOPS figures are partly estimated and use different operation definitions across works; standardized benchmarking (per-bit, per-operand, per-array) and inclusion of all control/ADC/DAC overheads are needed for fair comparisons.
  • Area using logic rules: All custom bit-cells use logic design rules (not memory rules), inflating area; the feasibility, yield, and benefits of migrating to memory-rule cells or leveraging foundry SRAM/eDRAM macros are not assessed.
  • Reliability and fault tolerance: There is no discussion of ECC/BISR, soft-error susceptibility (especially in eDRAM used as counters), or design-for-test strategies for monolithic 3D stacks (pre-/post-bond test, isolation of faulty tiers).
  • Supply and clock distribution: Global distribution of the analog ramp, multi-GHz-equivalent timing accuracy for time-to-digital conversion, clock skew/jitter limits, and inter-tier clocking strategies are not specified.
  • Control microarchitecture and programming model: The work lacks a clear ISA/runtime model, compiler mapping of high-level matrix kernels to the proposed sub-arrays, and scheduling of cross-layer operations (transpose → multiply/add → MAC) at scale.
  • Concurrent operation and interference: The noise/throughput impact when transpose and arithmetic sub-arrays operate simultaneously (digital and analog activities coexisting across tiers) is uncharacterized.
  • Lifetime and aging: Long-term drift of comparator offsets, DAC weights (8:4:2:1), eDRAM leakage, and MIV resistance over temperature and aging, and their maintenance/calibration strategies, remain open.
  • Non-ideal line parasitics: The RC of shared R/W lines that span rows and their connection “yellow” buses is not quantified for delay, crosstalk, write disturbance, and analog corruption, especially for larger word widths and arrays.
  • ADC/LFSR code mapping overhead: The LUT to convert 8-bit LFSR codes to 6-bit results imposes area/latency; its implementation details, distribution, and scalability are omitted.
  • Technology portability: The design is shown in GF 22 nm FDSOI; portability to bulk FinFET nodes, different device options (e.g., no thick-oxide at advanced nodes), and changes in eDRAM availability are not discussed.
  • Fabrication and yield of monolithic 3D: Practical process integration constraints (thermal budgets, inter-tier alignment tolerances, MIV yield/defects) and their impact on bit-cell yield and large-array viability are not addressed.
  • Operation timing choices: The chosen cycle times (e.g., 8 ns for transpose, 3–6 ns for ADC cycles) lack justification versus worst-case PVT, array size, and noise; maximum achievable frequency and guardbands remain unclear.
  • Security and isolation: With analog paths across tiers, potential side-channel leakage, data remanence in eDRAM, and cross-tenant isolation (for multi-user accelerators) are unexplored.

These gaps outline concrete directions for validating, scaling, and systematizing the proposed architecture—from circuit-level robustness and calibration, through 3D-integration and thermal/reliability considerations, to software mapping and end-to-end workload accuracy.

Practical Applications

Immediate Applications

The following use cases can be deployed now by leveraging the paper’s demonstrated 22 nm FDSOI implementation, 3D-stacked SRAM–eDRAM macros, and mixed-signal in-memory kernels (transpose, element-wise add/multiply, MAC) with 4-bit inputs and ~6-bit effective outputs via LFSR-based ADC.

  • Semiconductor IP: in-situ matrix transpose macro for dataflow-heavy accelerators
    • Sector: Semiconductors, AI hardware, EDA/SoC design
    • What it does: Drop-in T-SRAM/T-eDRAM sub-arrays that perform in-memory transpose in N+1 cycles (vs ~2N traditionally) with digital storage and cross-tier transfer through monolithic inter-tier vias (MIVs).
    • Why it matters: Reduces on-chip data movement for tensor reordering common in GEMM tiling, FFT/DCT pipelines, batched image/video transforms, and attention tensor reshapes.
    • Tools/products/workflows: Macro library for T-SRAM/T-eDRAM; physical design collaterals (LEF/Lib/Verilog); RTL/firmware scheduling template for 3-step transpose across tiers; mapping pass in TVM/MLIR to fuse “transpose+compute” and minimize round-trips to cores; characterization models (latency/energy for tile sizes such as 32×32).
    • Assumptions/dependencies: Foundry support for monolithic 3D or equivalent fine-pitch stacking; thermal/yield guard-banding for stacked tiers; SRAM write overdrive margin; eDRAM refresh controller; DRC with “logic rules” cell sizes (area vs density trade-off).
  • Edge AI co-processor for low-precision inference
    • Sector: Mobile/IoT, wearables, embedded vision/audio, robotics
    • What it does: Mixed-signal, in-memory element-wise addition and Hadamard product, in-situ transpose, and conventional MAC for low-precision inference (e.g., 4-bit activations/weights, ~6-bit accumulation via LFSR ADC).
    • Why it matters: Core kernels in RNN/LSTM/GRU gating (Hadamard), attention scaling/masking, residual adds, and pre/post-processing can be executed with high energy efficiency (e.g., up to ~436 GOPS/W for add/mul reported).
    • Tools/products/workflows: Quantization-aware training recipes for 4-bit activations/weights; TFLite/ONNX/TensorRT custom ops for hadamard/add/transpose mapped to GEM3D-CIM; runtime that schedules DAC/ADC calibration, LFSR counting, and tier transfers; per-tile LUTs to decode LFSR codes to 6-bit values; health monitoring of PVT with in-field recalibration.
    • Assumptions/dependencies: Models must tolerate 4–6 bit effective precision (ENOB ~4.8); stable analog margins (DAC signal margins, comparator offsets); periodic LFSR-based ADC calibration; 1.8 V thick-oxide devices for DAC buffers; eDRAM refresh timing integrated with compute schedule.
  • Camera and ISP offload: transpose and pixel-wise ops
    • Sector: Mobile imaging, AR/VR devices, autonomous sensors
    • What it does: Offloads frequent transposes/reshapes between tiled color/depth planes and pixel-wise gain maps, masks, and blending (element-wise ops) directly in memory.
    • Why it matters: Reduces ISP bus traffic and latency in pipelines for HDR fusion, demosaicing, and feature-map rearrangement ahead of DNN processing.
    • Tools/products/workflows: ISP driver hooks to route tiles through T-SRAM/T-eDRAM sub-arrays; scheduling firmware that batches analog addition (gain/bias) and Hadamard (masking) with a single ADC pass; QC harness to validate image quality under PVT drift.
    • Assumptions/dependencies: Acceptable IQ with low-precision ops in pre/post stages; careful tiling to keep within 32×32 or similar array sizes; dark-silicon budget for local comparators and ramp circuits.
  • Robotics and drones: energy-efficient control and perception primitives
    • Sector: Robotics, UAVs, industrial automation
    • What it does: Executes element-wise gating, residual adds, and small-tile transposes in control pipelines (e.g., EKF updates, gating in RNNs for policies) and in perception backbones.
    • Why it matters: Extends battery life and enables untethered operation by reducing data movement and improving GOPS/W for matrix kernels.
    • Tools/products/workflows: ROS/MCU integration with a GEM3D-CIM coprocessor; microkernel library for element-wise ops and transposes; autotuner to pick tile sizes and schedule calibration cycles during idle windows.
    • Assumptions/dependencies: Control/perception algorithms must be robust to low precision; environmental PVT variation management (temperature compensation).
  • Academic testbed for mixed-signal CIM and 3D integration
    • Sector: Academia, research labs, graduate education
    • What it does: Provides a concrete, foundry-compatible platform to study algorithm–architecture co-design for CIM beyond dot products: transpose, Hadamard, addition, MAC in a single stack.
    • Why it matters: Enables reproducible research on low-precision algorithms, comparator offset calibration strategies, LFSR-based ADC linearity, and monolithic 3D reliability.
    • Tools/products/workflows: Open testbench RTL and SPICE decks; compiler passes for tensor tiling and operation fusion; coursework modules on CIM and 3D IC design; PVT/Monte Carlo datasets for benchmarking.
    • Assumptions/dependencies: Access to GF 22 nm FDSOI PDK or a suitable academic PDK; IP licensing for educational use; lab infrastructure for analog/mixed-signal validation.
  • Sustainability and procurement pilots emphasizing energy-per-op
    • Sector: Public sector, enterprise IT sustainability programs
    • What it does: Targets energy-intensive AI inference at the edge (e.g., smart cameras, kiosks) using in-memory compute macros to meet ESG goals.
    • Why it matters: Demonstrable GOPS/W improvements on common kernels (transpose/add/mul/MAC) reduce device energy footprints.
    • Tools/products/workflows: Pilot RFPs specifying energy/throughput metrics per operation (e.g., GOPS/W for element-wise ops); evaluation harnesses; runtime telemetry for energy accounting.
    • Assumptions/dependencies: Availability of COTS development kits or partner SoCs embedding GEM3D-CIM macros; validation on target workloads.

Long-Term Applications

These use cases require further research, scaling, or development—especially higher precision, larger arrays, richer operation sets, and broader manufacturing availability of monolithic 3D.

  • General-purpose in-memory linear algebra accelerator for HPC
    • Sector: High-performance computing, scientific simulation, energy, defense
    • Vision: Extend from 4–6 bit to 8–16 bit effective precision (e.g., via improved ADCs, calibration, and multi-pass accumulation), larger arrays (e.g., ≥256×256 tiles), and richer ops (reductions, scaling, fused multiply-add across tiles).
    • Potential products/workflows: Memory-on-memory blades co-stacked with CPUs/GPUs to accelerate transpose-heavy kernels (blocked GEMM, FFTs, tensor permutations); compiler/runtime that orchestrates compute/refresh/calibration across hundreds of tiles; mixed-signal verification tooling at scale.
    • Assumptions/dependencies: Advanced monolithic 3D processes with strong yield/thermal solutions; improved ADC linearity and offset handling; ECC for eDRAM; standardized software stacks.
  • Transformer training and large-model on-device inference
    • Sector: AI accelerators (datacenter and edge), mobile, AR/VR, automotive
    • Vision: Scale precision and array sizes to support attention operations (QKT transposes, masking via Hadamard, residual adds) and partial training (optimizers need higher precision).
    • Potential products/workflows: GEM3D-CIM backends in TVM/DeepSpeed/TensorRT; operator fusion to keep attention reorders, masking, and scaling in-memory; dynamic precision control (4–8 bits) with periodic calibration.
    • Assumptions/dependencies: Quantization strategies that preserve accuracy at low precision; reliable analog compute under datacenter thermal loads; built-in self-test for continual calibration; safety features for automotive-grade silicon.
  • Near-memory compute inside stacked DRAM/HBM packages
    • Sector: Cloud/enterprise compute, memory vendors
    • Vision: Adapt the memory-on-memory approach to high-bandwidth stacks (HBM/eDRAM-on-logic) using fine-pitch vertical interconnects to execute transposes and element-wise operations inside memory cubes.
    • Potential products/workflows: HBM with “smart banklets” offering in-situ transpose and element-wise kernels exposed via CXL/PCIe extensions; host drivers and libraries that offload tensor permutations and masking to memory.
    • Assumptions/dependencies: Manufacturing maturity for monolithic or hybrid bonding at HBM scale; coherence and security models for near-memory compute; standards for programming models and telemetry.
  • Domain-specific signal processing appliances
    • Sector: Communications (5G/6G), radar/LiDAR, audio
    • Vision: In-memory acceleration of transpose- and element-wise-heavy stages in MIMO detection, beamforming weight updates, and block-based transforms with strict latency targets.
    • Potential products/workflows: RF front-end companions that pre/post-process baseband tensors near memory; fixed-function pipelines using CIM macros with tunable precision; runtime that interleaves refresh, transpose, and multiply/add windows.
    • Assumptions/dependencies: Enhanced precision for link budgets; deterministic timing with calibrated analog front-ends; extended operating temperature range.
  • In-memory analytics for columnar databases and real-time BI
    • Sector: Data analytics, finance, retail
    • Vision: Offload column-wise element-wise transforms (scaling, masking, normalization steps) and frequent transposes/reshapes within memory-resident caches to reduce memory bandwidth usage.
    • Potential products/workflows: Smart cache controllers with GEM3D-CIM tiles; SQL engine rewrites that map vectorized operators to in-memory kernels; correctness/precision guards for financial-grade workloads.
    • Assumptions/dependencies: Precision expansion and verifiable error bounds; transactional integrity and recovery in mixed-signal paths; integration with existing cache hierarchies.
  • Platformization: software, verification, and standardization
    • Sector: EDA, standards bodies, compiler ecosystems
    • Vision: Mature toolchains covering algorithm–architecture co-design, analog-aware compilers, mixed-signal verification at wafer scale, and standard instruction sets for in-memory ops (transpose/hadamard/add/MAC).
    • Potential products/workflows: MLIR dialects for in-memory compute; analog-error models for training; ISO-style standards for monolithic 3D reliability, safety (ISO 26262), and sustainability reporting.
    • Assumptions/dependencies: Broad industry alignment on abstractions; PDK access for academia/consortia; robust IP protection for mixed-signal macros.

Cross-cutting assumptions and dependencies

  • Manufacturing and integration
    • Availability of monolithic 3D integration (fine-pitch MIVs), thermal management for stacked tiers, and acceptable yield.
    • Use of thick-oxide devices (1.8 V) for DAC buffers and reliable overdrive margins in T-SRAM write paths.
    • eDRAM refresh management co-scheduled with compute and robust isolation during transpose/multiply/add operations.
  • Precision, calibration, and robustness
    • Current design demonstrates 4-bit inputs with ~6-bit effective outputs (LFSR ADC ENOB ~4.8 bits); scaling to ≥8 bits requires better ADCs, calibration schemes, or multi-pass aggregation.
    • Comparator offsets and analog non-idealities must be calibrated (LFSR-based initial code, per-tile LUTs); PVT sensors and periodic recalibration needed for long-term stability.
  • Software and workflow readiness
    • Compiler/runtime support to tile tensors (e.g., 32×32), orchestrate 3-phase transposes, schedule DAC/ADC cycles, and fuse operations to amortize conversion overheads.
    • Quantization-aware training and validation harnesses to ensure accuracy under low precision and analog noise.
  • System integration
    • Dataflow co-design to minimize host-memory shuttling; DMA engines and APIs to expose in-memory ops as first-class kernels.
    • Verification and safety: mixed-signal signoff, ECC/parity for eDRAM, watchdogs for calibration drift, and fallbacks to digital paths when needed.

Glossary

Below is an alphabetical list of advanced domain-specific terms from the paper, each with a brief definition and a verbatim usage example.

  • 3T eDRAM: A three-transistor embedded DRAM cell used for dense, dynamic storage with explicit refresh. "a conventional 3T eDRAM core"
  • 6T SRAM: A six-transistor static RAM bit-cell used for stable, fast digital storage. "conventional 6T SRAM"
  • 8T SRAM: An eight-transistor SRAM bit-cell that adds read/write decoupling or functionality (e.g., DAC). "leverages an 8T SRAM bit-cell"
  • Analog to digital converter (ADC): Circuit that converts analog signals to digital codes. "analog to digital converter (ADC)"
  • Bit-cell: The smallest repeatable memory storage circuit used to build arrays. "The proposed bit-cell extends the conventional 6T SRAM structure"
  • Capacitive C2C multiplier: An analog multiplication technique using capacitor-to-capacitor charge redistribution. "A capacitive C2C multiplier is employed"
  • Charge-domain CIM: Compute-in-memory style where computations accumulate charge on capacitors. "Charge-domain CIM offers better linearity and higher compute density and energy efficiency"
  • Compute-in-memory (CIM): Performing computation directly within memory arrays to reduce data movement. "compute-in-memory (CIM) has emerged as a promising energy-efficient paradigm"
  • Current-domain CIM: Compute-in-memory style where computations sum currents on wires/devices. "Current-domain CIM provides improved energy efficiency, compute density, and SNR"
  • Digital-to-analog converter (DAC): Circuit that converts digital codes into analog voltages or currents. "digital-to-analog converter (DAC)"
  • Differential amplifier: An amplifier that compares two input voltages to produce a differential output. "PMOS-based differential amplifier"
  • Dot-product reduction: The aggregation step limiting parallelism when summing per-column partial MAC results. "performance is ultimately constrained by dot-product reduction"
  • eDRAM: Embedded DRAM offering higher density than SRAM with dynamic storage behavior. "Layer B consists of eDRAM arrays"
  • FDSOI: Fully-Depleted Silicon-On-Insulator CMOS technology with improved electrostatics and variability. "GlobalFoundries 22~nm FDSOI technology"
  • Hadamard product: Element-wise matrix multiplication used in many algorithms. "element-wise multiplication, also known as the Hadamard product"
  • Linear feedback shift register (LFSR): A shift-register with XOR feedback that cycles through pseudo-random sequences. "linear feedback shift register (LFSR) counter"
  • Lookup table (LUT): A precomputed mapping used to convert codes or accelerate functions. "using a lookup table (LUT)"
  • Memory crossbar: Matrix-like interconnect of memory cells enabling parallel access/compute. "within the memory crossbar"
  • Memory wall: A performance limit caused by slower memory bandwidth vs processor speed. "giving rise to the well-known memory wall"
  • Mixed-signal: Systems combining analog and digital circuits/processing. "mixed-signal CIM arrays"
  • Monolithic 3D integration: Fabricating multiple device layers on a single wafer with fine-pitch vertical connections. "monolithic 3D integrated CIM architectures"
  • Monolithic inter-tier vias (MIVs): Ultra-fine vertical vias connecting tiers in monolithic 3D stacks. "monolithic inter-tier vias (MIVs)"
  • Non-volatile memories (NVMs): Memories that retain data without power (e.g., ReRAM, PCM). "non-volatile memories (NVMs)"
  • ON/OFF conductance ratio: The contrast between a device’s conductive and non-conductive states. "reduced ON/OFF conductance ratio"
  • Overdriven: Driving a control line above nominal supply to improve write/drive strength. "WWL lines (overdriven) are activated"
  • Parasitic-aware simulations: Circuit simulations that include extracted parasitic resistances/capacitances. "parasitic-aware simulations"
  • Process corners: Representative variations of manufacturing parameters used to verify robustness. "across different process corners"
  • Process Design Kit (PDK): Foundry-provided models, rules, and files enabling technology-accurate design. "GlobalFoundries 22\,nm PDK"
  • PVT variations: Variations in Process, supply Voltage, and Temperature affecting circuit behavior. "essential for reliable transpose operations under PVT variations"
  • Ramp generator: A circuit producing a time-varying ramp voltage for time-encoding analog values. "a globally shared ramp generator"
  • Read word-line (RWL): Control line that enables reading data from a memory row. "read word-lines (RWL)"
  • Signal-to-noise ratio (SNR): Measure comparing signal strength to background noise level. "signal-to-noise ratio (SNR)"
  • Thick-oxide devices: Transistors with thicker gate oxide for higher-voltage operation. "thick-oxide devices capable of high-voltage operation"
  • Transmission gate (TG): A bidirectional CMOS switch used for low-resistance signal gating. "transmission gate (TG)"
  • TSV-based: Using through-silicon vias for vertical chip stacking/interconnect. "TSV-based or flip-chip bonding approaches"
  • Write word-line (WWL): Control line that enables writing data into a memory row. "write word-lines (WWL)"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 87 likes about this paper.