NM-TOS: Near-Memory Architecture for TOS Updates

Updated 9 December 2025
  • The paper introduces NM-TOS, a near-memory framework using an 8T SRAM design and pipelined row-level updates to achieve a 24.7× speedup in TOS updates for corner detection.
  • It employs dynamic voltage and frequency scaling with a three-counter sliding window for adaptive, energy-efficient processing of event-driven data.
  • Quantitative results show latency reduced from 392 ns to 16 ns, with detection accuracy remaining robust and BER impact minimal under aggressive voltage scaling.

Near-Memory Architecture for Efficient TOS Updates (NM-TOS) is a hardware-centric framework devised for accelerating Threshold-Ordinal Surface (TOS) updates used in corner detection tasks for Event-based Cameras (EBCs). By integrating a read-write decoupled 8-transistor (8T) SRAM cell architecture, row-level pipelining, and dynamic voltage and frequency scaling (DVFS), NM-TOS delivers per-event threshold updates with substantially reduced latency and energy, while sustaining robust corner detection accuracy—even under aggressive voltage scaling. These properties make NM-TOS particularly suited for low-power, high-throughput applications on edge devices where rapid event-driven computation is essential (Shang et al., 2 Dec 2025).

1. System Pipeline and Architectural Overview

NM-TOS operates in a multi-stage processing pipeline that transforms raw event streams from an Address-Event Representation (AER) sensor into corner classifications. Events, represented as $v=(v_x,v_y,v_p,v_t)$, are first filtered by a Spatio-Temporal Correlation Filter (STCF) to attenuate isolated noise. In parallel, a lightweight controller measures the event rate with a three-counter sliding window, yielding the short-term event throughput $f_e$. This rate indexes a lookup table (LUT) that selects the supply voltage $V_{\text{dd}}$ and clock frequency $f_{\text{clk}}$ for downstream processing.
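The rate measurement and LUT selection described above can be sketched in software. This is only one plausible reading of a "three-counter sliding window", using the 10 ms window and 50% stride reported for NM-TOS: two counters hold the last two completed half-window bins and a third accumulates the current bin. The class `EventRateEstimator`, the function `select_operating_point`, and every entry in `DVFS_LUT` are hypothetical placeholders, not values or names from the paper.

```python
WINDOW_US = 10_000            # 10 ms window (value reported for NM-TOS)
STRIDE_US = WINDOW_US // 2    # 50% stride, so a new estimate every 5 ms

# Hypothetical LUT: (max event rate in events/s, Vdd in V, f_clk in Hz).
# All thresholds and frequencies are illustrative placeholders.
DVFS_LUT = [
    (1e5, 0.6, 5e6),
    (1e6, 0.9, 20e6),
    (float("inf"), 1.2, 100e6),
]


class EventRateEstimator:
    """Assumed scheme: a ring of three half-window event counters."""

    def __init__(self, stride_us=STRIDE_US):
        self.stride_us = stride_us
        self.counters = [0, 0, 0]
        self.active = 0            # counter currently accumulating events
        self.bin_start_us = None
        self.rate_eps = 0.0        # latest estimate, events per second

    def _rotate(self):
        # The bin that just closed plus the one before it span one window.
        prev = (self.active - 1) % 3
        window_count = self.counters[self.active] + self.counters[prev]
        self.rate_eps = window_count / (2 * self.stride_us / 1e6)
        # Recycle the oldest counter as the new accumulating bin.
        self.active = (self.active + 1) % 3
        self.counters[self.active] = 0
        self.bin_start_us += self.stride_us

    def push(self, t_us):
        """Register one event timestamp (microseconds); return the rate."""
        if self.bin_start_us is None:
            self.bin_start_us = t_us
        while t_us - self.bin_start_us >= self.stride_us:
            self._rotate()
        self.counters[self.active] += 1
        return self.rate_eps


def select_operating_point(rate_eps, lut=DVFS_LUT):
    """Return (Vdd, f_clk) for the first LUT row that covers the rate."""
    for max_rate, vdd, f_clk in lut:
        if rate_eps <= max_rate:
            return vdd, f_clk
    return lut[-1][1], lut[-1][2]
```

Feeding the estimator one event every 10 µs settles at about 100 000 events/s after the first full window, which the placeholder LUT maps to its lowest operating point. The first estimate after start-up covers only half a window and therefore under-reports the rate.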

After denoising and dynamic adaptation, the NM-TOS core executes an event-by-event (EBE) TOS update over a $P\times P$ spatial neighborhood ($P=7$ by default). Once updated, the TOS surface is input to a frame-by-frame (FBF) Harris LUT, which classifies corners at candidate event positions.
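As a functional reference for the per-event patch update, here is a minimal software sketch. It follows the read/"minus-one", compare, and write-back steps that the hardware performs per row, but it assumes an 8-bit surface, and the `threshold` value of 225 is an illustrative placeholder rather than a parameter taken from the paper. The function name `tos_update` is also hypothetical.

```python
def tos_update(tos, x, y, patch=7, threshold=225, max_level=255):
    """Apply one event at (x, y) to a TOS stored as a list of rows.

    Every pixel in the patch is decremented by one; values that fall
    below `threshold` are reset to zero; the event pixel is then set
    to `max_level`. Patches are clipped at the surface borders.
    """
    r = patch // 2
    height, width = len(tos), len(tos[0])
    for row in range(max(0, y - r), min(height, y + r + 1)):
        for col in range(max(0, x - r), min(width, x + r + 1)):
            value = tos[row][col] - 1      # read and "minus-one"
            if value < threshold:          # compare against threshold
                value = 0
            tos[row][col] = value          # write-back
    tos[y][x] = max_level
    return tos


# Usage: a 16x16 surface receiving two neighbouring events.
surface = [[0] * 16 for _ in range(16)]
tos_update(surface, 8, 8)
tos_update(surface, 9, 8)
```

The outer loop visits one row at a time, which mirrors the row-wise access pattern of the SRAM array; the hardware overlaps these row operations, whereas this sketch runs them strictly in sequence.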

Pipeline Overview

| Stage | Function | Output |
|---|---|---|
| STCF denoising | Suppress isolated noise in event stream | Filtered events |
| DVFS controller | Adapt $V_{\text{dd}}$, $f_{\text{clk}}$ | Speed/energy config |
| NM-TOS patch update | EBE TOS update over $P\times P$ | Refreshed TOS |
| Harris LUT | Classify corners | Corner outputs |

This dataflow enables real-time, adaptive corner detection with minimized latency and energy overhead (Shang et al., 2 Dec 2025).

2. 8T Read-Write-Decoupled SRAM Cell Structure

Fundamental to NM-TOS is a read-write-decoupled 8T SRAM cell ("type A") that separates the Read BitLine (RBL) from the Write BitLine (WBL) via dedicated NMOS access transistors (M_A1/M_A2 for RBL; M_W1/M_W2 for WBL). The remaining four transistors form the back-to-back inverters that retain the stored state.

This architecture allows simultaneous read and write operations: reading from row $i$ while writing to row $i-1$. Such decoupling removes the dependency between the read and write phases of adjacent rows, supporting a pipelined update strategy rather than sequential per-row read-compute-write cycles.

Key cell features:

  • Dedicated access transistors per read/write vector.
  • Full retention using conventional cross-coupled inverter pairs.
  • Functional completeness and cell robustness validated via 65 nm CMOS SPICE simulations, with zero bit error rate at $V_{\text{dd}} \geq 0.62\,\text{V}$ and controlled performance degradation below this voltage (Shang et al., 2 Dec 2025).

3. Row-Level Pipelining and Timing Optimization

Patch updates in TOS require sequential manipulation of $P$ rows per input event. Traditional methods incur latency:

$$L_{\text{serial}} = P \times (t_1 + t_2 + t_3 + t_4)$$

with $t_1, t_2, t_3, t_4$ representing precharge, read/“minus-one”, compare, and write-back delays. By leveraging decoupled RBL/WBL circuitry, these four phases are organized in a classic four-stage pipeline:

$$L_{\text{pipelined}} = P \times (t_1 + t_2) + t_3 + t_4$$

This reduces the per-row cost to $t_1 + t_2$, with the compare and write-back delays $t_3 + t_4$ paid only once per patch. For $P=7$ in a 65 nm implementation, pipelined operation achieves approximately $16\,\text{ns}$ latency at $V_{\text{dd}} = 1.2\,\text{V}$, equating to a throughput of $63.1\,\text{Meps}$, a 24.7× speedup over conventional serial digital implementations (~392 ns).
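The two latency expressions are easy to compare numerically. The sketch below evaluates both for a set of stage delays that are purely hypothetical (the per-stage values are not given here), and then checks that the reported throughput and speedup figures agree with each other. Note that the 392 ns baseline refers to a conventional digital implementation, so the two reported latencies need not share the same stage delays.

```python
def serial_latency(p, t1, t2, t3, t4):
    """All four phases repeated for each of the p rows."""
    return p * (t1 + t2 + t3 + t4)


def pipelined_latency(p, t1, t2, t3, t4):
    """Per-row cost t1 + t2, with t3 + t4 paid once per patch."""
    return p * (t1 + t2) + t3 + t4


def throughput_meps(latency_ns):
    """Events per second, in millions, for one patch update per event."""
    return 1e3 / latency_ns


# Hypothetical stage delays in ns, chosen only to illustrate the formulas.
delays = dict(t1=2.0, t2=3.0, t3=1.0, t4=2.0)
l_serial = serial_latency(7, **delays)        # 7 * 8 ns
l_pipelined = pipelined_latency(7, **delays)  # 7 * 5 ns + 3 ns

# Consistency check on the reported figures: 63.1 Meps corresponds to
# roughly 15.8 ns per update, and 392 ns divided by that is about 24.7.
reported_latency_ns = 1e3 / 63.1
reported_speedup = 392 / reported_latency_ns
```

With these placeholder delays the pipelined schedule takes 38 ns against 56 ns for the serial one. The benefit grows as the compare and write-back phases take a larger share of the per-row time.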

4. Hardware–Software Co-Optimization and DVFS Integration

The data-dependent nature of EBC throughput motivates real-time adaptation of power-performance envelopes. NM-TOS employs a three-counter sliding window (window $=10\,\text{ms}$, stride $=50\%$) to derive $f_e$ and select configuration parameters via LUT. Dynamic energy per update adheres to:

$$E_{\text{dynamic}} \propto C \cdot V_{\text{dd}}^2 \cdot \alpha$$

where $C$ is net capacitance and $\alpha$ is switching activity. Thus, reducing $V_{\text{dd}}$ from $1.2\,\text{V}$ to $0.6\,\text{V}$ can, in principle, cut dynamic energy to 0.25× its original value (a 4× reduction). With clock adaptation, reported savings reach up to 6.6×.
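The ideal voltage-scaling factor follows directly from the quadratic dependence on supply voltage, assuming that $C$ and $\alpha$ stay the same at both operating points (an idealization; leakage and other non-dynamic contributions are ignored):

```latex
\frac{E_{\text{dynamic}}(0.6\,\text{V})}{E_{\text{dynamic}}(1.2\,\text{V})}
  = \frac{C \cdot (0.6)^2 \cdot \alpha}{C \cdot (1.2)^2 \cdot \alpha}
  = \left(\frac{0.6}{1.2}\right)^2
  = 0.25
```

Halving the supply voltage therefore quarters the dynamic energy per update under this model.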

5. Quantitative Performance and Robustness Characterization

Performance characterization of NM-TOS utilizes 65 nm CMOS SPICE benchmarks with $P=7$ patch size:

  • Latency: At $V_{\text{dd}} = 1.2\,\text{V}$, serial digital patch update incurs ~392 ns; NM-TOS pipelined yields ~16 ns.
  • Throughput: Conventional digital methods achieve $2.6\,\text{Meps}$; NM-TOS pipelined supports $63.1\,\text{Meps}$.
  • Energy: Per patch update energy is $166\,\text{pJ}$ ([email protected]), $139\,\text{pJ}$ ([email protected]), $26\,\text{pJ}$ ([email protected]).
  • Bit Error Rate (BER): Robust operation ($0\%$ BER) at $V_{\text{dd}} \geq 0.62\,\text{V}$; BER rises to $0.2\%$ at $0.61\,\text{V}$ and $2.5\%$ at $0.6\,\text{V}$.

Since only the three most significant stored bits (levels 8:5) are utilized, the impact of BER on practical corner detection outcome is marginal (Shang et al., 2 Dec 2025).

6. Corner Detection Accuracy and Application Impact

Corner detection performance was evaluated using precision-recall AUC on two Prophesee datasets ("shapes_dof" and "dynamic_dof"). Even under worst-case BER ($2.5\%$ at $0.6\,\text{V}$), the degradation in detection is minor:

  • “shapes_dof”: $\Delta$AUC ≈ 0.027
  • “dynamic_dof”: $\Delta$AUC ≈ 0.015

This signifies that hardware-induced imperfections, even under aggressive DVFS, incur negligible real-world reduction in event-based corner detection quality.

7. Significance and Future Perspectives

NM-TOS establishes a template for integrating near-memory architectures, pipelined microarchitectures, and adaptive DVFS in resource-constrained environments that demand rapid response to event-based sensory data. The combination of 8T SRAM topology, pipelined patch updates, peripheral co-optimization, and flexible power scaling demonstrates a practical way to bridge algorithmic advances in event-driven computer vision and the hardware limitations of edge deployment. A plausible implication is that NM-TOS strategies could carry over to other real-time applications involving patch-based updates of ordinal surfaces or similar representations.

For comprehensive method details and implementation, see (Shang et al., 2 Dec 2025).