TiM: In-Memory Computation & LLM Memory

Updated 26 November 2025
  • TiM (Think-in-Memory) is a paradigm that integrates memory and computation to eliminate redundant data transfers in both hardware accelerators and LLM algorithms.
  • TiM-DNN leverages ternary processing cells within a tile-based architecture to perform parallel multiply-accumulate operations, significantly enhancing throughput and energy efficiency.
  • In LLMs, TiM employs memory structures to cache and evolve inductive thoughts, enabling efficient long-term dialogue reasoning and reducing repetitive inference.

TiM (Think-in-Memory) designates two distinct, influential paradigms at the intersection of memory and computation: (1) in-memory computation accelerators with a focus on ternary deep neural network inference ("TiM-DNN"), and (2) memory structures and algorithms enabling LLMs to maintain and evolve abstract thoughts for long-term reasoning and interaction. In both hardware and algorithmic contexts, TiM eliminates redundant data flows or computation, leveraging memory as the locus of thinking—whether for thousands of parallel multiplies in hardware or for long-term, semantically organized abstraction in dialogue agents (Jain et al., 2019, Liu et al., 2023).

1. Elimination of Redundancy: TiM Paradigm across Domains

The core principle common to TiM-DNN and TiM for LLMs is the movement of "thinking" into memory: computation of critical results, whether multiplications or abstractions, is performed directly where the data resides rather than through repeated transfer or re-computation.

  • In TiM-DNN, multiply-and-accumulate (MAC) operations are executed inside the memory array, directly within the specialized Ternary Processing Cells (TPCs), circumventing the von Neumann bottleneck associated with separate compute and memory subsystems. This enables massively parallel signed ternary vector-matrix multiplications with a single memory access (Jain et al., 2019).
  • In LLMs, repeated recall–reason loops are replaced by a mechanism where the agent persists its own inductive thoughts as memory entries. Later reasoning is performed by retrieving, reusing, and evolving these thoughts, eliminating redundant inference on past context (Liu et al., 2023).

This paradigm shift is supported by architecture-specific innovations—bit-cell co-design and tile-based organization in hardware; algorithmic primitives including ‘insert’, ‘forget’, and ‘merge’ for LLM memory.

2. TiM-DNN: Architecture and Hierarchical Structure

TiM-DNN implements the Think-in-Memory paradigm for deep neural network inference by tightly coupling storage and ternary computation at the bit-cell level.

  • Ternary Processing Cell (TPC): Encodes and stores ternary weights ($w \in \{-1, 0, +1\}$ or $w \in \{-a, 0, +b\}$) and performs scalar multiplication with ternary inputs ($x \in \{-1, 0, +1\}$) directly.
    • Two cross-coupled inverters and access transistors encode two bits, $A$ and $B$, with specific combinations representing each ternary state.
    • In the multiply phase, selective wordline activation and bitline discharge encode the ternary product $w \cdot x$ as a voltage drop, which is digitized by a flash ADC.
  • Tile: A two-dimensional array of TPCs, organized into $K$ blocks of $L$ rows and $N$ columns, allowing $L \cdot K \cdot N$ parallel scalar MACs per access. Each tile produces partial sums for a vector-matrix multiply.
  • Bank and Accelerator-Level Organization: Tiles within a bank share instruction/control logic and buffers, while the accelerator may comprise multiple banks orchestrated by a scheduler.

The tile’s simultaneous activation of multiple rows (parallel vector-matrix product) fundamentally distinguishes TiM from traditional memory and compute architectures (Jain et al., 2019).
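
To make the dataflow concrete, the following is a small functional model of one tile access for the unweighted signed ternary case (weights in $\{-1, 0, +1\}$; the weighted case is covered in the next section). It is a behavioral sketch in NumPy, under the assumption that each column's bitline pair effectively counts the $+1$ and $-1$ products (the quantities the flash ADC digitizes); the analog voltage behavior and block organization are abstracted away.

```python
import numpy as np

def tile_ternary_vmm(x, W):
    """Behavioral sketch of one TiM-DNN tile access (unweighted ternary case).

    x : ternary input vector in {-1, 0, +1}, length L (rows activated together)
    W : ternary weight matrix in {-1, 0, +1}, shape (L, N), one TPC per entry

    Returns the N column outputs y = n - k, where n and k are the counts of
    +1 and -1 products on each column (what the flash ADC digitizes).
    """
    products = x[:, None] * W              # elementwise ternary products w * x
    n = np.sum(products == +1, axis=0)     # contributions discharging one bitline
    k = np.sum(products == -1, axis=0)     # contributions discharging the other
    return n - k                           # equals x @ W for ternary operands

# One access computes a full signed ternary vector-matrix product:
x = np.array([+1, -1, 0, +1])
W = np.array([[+1, -1,  0],
              [ 0, +1, -1],
              [-1,  0, +1],
              [+1, +1, -1]])
print(tile_ternary_vmm(x, W))              # [ 2 -1  0]
```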

3. Computational Models and Dataflows

TiM-DNN supports both unweighted and weighted ternary computation:

  • Unweighted Case ($w \in \{-1, 0, +1\}$): Each BL/BLB line accumulates the analog sum of ternary products in parallel, with digital conversion to $y_i = n - k$, where $n$ and $k$ are the counts of $+1$ and $-1$ terms, respectively.
  • Weighted Case ($w \in \{-a, 0, +b\}$): Additional scaling logic after analog-to-digital conversion enables correct accumulation of asymmetric or symmetric ternary multiplications. Multi-pass evaluation supports input and weight scaling as required.
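
For the weighted case, one plausible reading of the scheme above is a multi-pass evaluation: the rows storing $+b$ and the rows storing $-a$ are accumulated in separate unweighted passes, and the scale factors are applied by digital logic after conversion. The sketch below illustrates that interpretation only; it is not the paper's circuit-level schedule.

```python
import numpy as np

def weighted_ternary_column(x, w_sign, a, b):
    """Two-pass sketch of a weighted ternary column, w in {-a, 0, +b}.

    w_sign holds the stored pattern in {-1, 0, +1}, where +1 stands for +b
    and -1 stands for -a. Each pass is an ordinary unweighted accumulation
    (n - k) over the rows holding one weight value; a and b are applied
    after analog-to-digital conversion.
    """
    s_b = np.sum(x[w_sign == +1])          # pass over rows storing +b
    s_a = np.sum(x[w_sign == -1])          # pass over rows storing -a
    return b * s_b - a * s_a

# Example column: weights [+b, -a, 0, +b] with a = 0.5, b = 1.5
x = np.array([+1, -1, +1, 0])
w_sign = np.array([+1, -1, 0, +1])
print(weighted_ternary_column(x, w_sign, a=0.5, b=1.5))   # 2.0 (= x . [1.5, -0.5, 0, 1.5])
```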

For convolutional layers, entire or partitioned filter matrices are mapped spatially or temporally across tiles. Recurrent layers, with smaller matrices, are mapped such that each tile operates concurrently on distinct submatrices per timestep, maximizing parallelism.
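
One way to picture the spatial mapping, as a rough sketch only (the tile dimensions, zero-padding, and digital reduction of partial sums are illustrative assumptions, not the scheduler described in the paper):

```python
import numpy as np

def partition_for_tiles(W, L, N):
    """Split a weight matrix into L x N sub-blocks, one per tile.

    Each tile computes a partial vector-matrix product on its sub-block;
    partial sums from tiles along the row (input) dimension are then
    reduced digitally to form the full output. Zero-padding handles
    matrices that are not exact multiples of the tile dimensions.
    """
    R = -(-W.shape[0] // L) * L          # padded row count
    C = -(-W.shape[1] // N) * N          # padded column count
    Wp = np.zeros((R, C))
    Wp[:W.shape[0], :W.shape[1]] = W
    return [[Wp[i:i + L, j:j + N] for j in range(0, C, N)]
            for i in range(0, R, L)]

# Example: a 100 x 70 filter matrix on 32 x 32 tiles -> a 4 x 3 grid of tiles
blocks = partition_for_tiles(np.ones((100, 70)), L=32, N=32)
print(len(blocks), len(blocks[0]))       # 4 3
```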

4. Performance, Comparative Evaluation, and Impact

Quantitative assessment of the TiM-DNN accelerator indicates substantial advancements over both conventional GPUs and existing in/near-memory accelerators:

  • 32-tile TiM-DNN:
    • Peak throughput: 114 TOPS
    • Power consumption: 0.9 W
    • Chip area: 1.96 mm²
    • Energy efficiency: 126 TOPS/W (≈300× V100 GPU)
    • Areal efficiency: 58 TOPS/mm² (≈388× V100 GPU)
  • Compared to state-of-the-art alternatives (Neural-Cache, BRein, TNN), normalized to TiM-DNN:

| Accelerator | TOPS/W (relative) | TOPS/mm² (relative) |
|---|---|---|
| TiM-DNN | 1.00 | 1.00 |
| NVIDIA V100 | 0.0033 | 0.0026 |
| Neural-Cache | 0.015 | 0.032 |
| BRein (65 nm) | 0.018 | 0.0063 |
| TNN (65 nm) | 0.039 | 0.0034 |

TiM-DNN achieves a 3.2×–4.2× speedup and a 3.9×–4.7× energy reduction relative to near-memory baselines; it supports accurate execution of fully signed and weighted ternary DNNs, avoiding the accuracy limitations that binary-only in-memory multiplies impose on more complex tasks (Jain et al., 2019).

5. TiM in LLMs: Thought-Based Memory Structures

In the LLM domain, TiM introduces an explicit memory cache $\mathcal{M}$ of inductive thoughts. The framework consists of two interleaved stages:

  • Recall & Generation:
    • On each user query $Q_x$, the system embeds $Q_x$, routes it via locality-sensitive hashing (LSH) to one of $b$ memory buckets, and retrieves the top-$k$ thoughts by within-bucket similarity.
    • These retrieved thoughts augment the LLM prompt, producing a response $R_y$.
  • Post-thinking & Update:
    • After generating $R_y$, the LLM agent is prompted in "post-thinking" mode to derive new inductive thoughts $T_\text{new}$ summarizing or deducing one-hop relations from $(Q_x, R_y)$.
    • $\mathcal{M}$ is updated via insert, forget (via an LLM-prompted score $\omega(T)$ with threshold $\theta_f$), and merge (a threshold $\theta_m$ on embedding similarity triggers synthesis of new thought entries).
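
The loop can be summarized in a short sketch. Everything below is a schematic stand-in: the bucketing follows the sign-random-projection hash described in Section 6, while `merge_thoughts`, the scoring input, and the threshold values are hypothetical placeholders for the paper's LLM-prompted operations; embeddings are assumed L2-normalized so dot products act as cosine similarity.

```python
import numpy as np

def merge_thoughts(old, new):
    # Placeholder for the LLM-prompted synthesis of two similar thoughts.
    return f"{old}; {new}"

class TiMMemory:
    """Schematic thought cache with LSH bucketing and insert/forget/merge."""

    def __init__(self, dim, n_buckets, theta_f=0.3, theta_m=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_normal((dim, n_buckets // 2))    # random projections
        self.buckets = [[] for _ in range(n_buckets)]           # (thought, embedding)
        self.theta_f, self.theta_m = theta_f, theta_m

    def _bucket(self, e):
        proj = e @ self.R
        return int(np.argmax(np.concatenate([proj, -proj])))   # argmax[xR; -xR]

    def recall(self, q_emb, k=3):
        """Retrieve the top-k stored thoughts from the query's bucket."""
        cand = self.buckets[self._bucket(q_emb)]
        cand = sorted(cand, key=lambda t: float(q_emb @ t[1]), reverse=True)
        return [text for text, _ in cand[:k]]

    def update(self, thought, emb, score):
        """Insert a new inductive thought, applying the forget and merge rules."""
        if score < self.theta_f:                    # forget: low omega(T) is dropped
            return
        bucket = self.buckets[self._bucket(emb)]
        for i, (text, e) in enumerate(bucket):
            if float(emb @ e) > self.theta_m:       # merge: near-duplicate thoughts
                bucket[i] = (merge_thoughts(text, thought), (e + emb) / 2)
                return
        bucket.append((thought, emb))               # insert
```

In use, an agent would call `recall` before generating each response and `update` after each post-thinking step, so the cache accumulates evolved thoughts rather than raw dialogue turns.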

This approach ensures that the system persists and evolves higher-level inferences rather than redundantly recomputing them, supporting consistency and memory efficiency in long-range conversational reasoning (Liu et al., 2023).

6. Efficient Retrieval and Empirical Validation in LLMs

The TiM framework uses sign-random-projection LSH for scalable retrieval:

  • The hash index $H_\text{idx} = \mathbf{F}(x) = \arg\max\bigl[xR; -xR\bigr]$ quickly routes embeddings of queries or thoughts to their memory buckets.
  • Sublinear retrieval complexity $O(d\sqrt{N_\text{mem}})$ (for $b \approx \sqrt{N_\text{mem}}$ buckets) is achieved, balancing LSH computation and within-bucket search.
  • Empirical evaluation demonstrates improvements over prior and ablation baselines in multi-turn dialogue datasets across languages (English, Chinese). Metrics include retrieval accuracy, response correctness, and contextual coherence.
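
To make the complexity balance concrete (with purely illustrative sizes): hashing a $d$-dimensional embedding into one of $b$ buckets costs on the order of $d \cdot b$ operations, scanning the roughly $N_\text{mem}/b$ thoughts in that bucket costs about $d \cdot N_\text{mem}/b$, and choosing $b \approx \sqrt{N_\text{mem}}$ balances the two at roughly $d\sqrt{N_\text{mem}}$.

```python
import math

d, N_mem = 768, 10_000             # illustrative embedding size and memory size
b = round(math.sqrt(N_mem))        # ~100 buckets balances hashing vs. scanning

hash_cost = d * b                  # order of magnitude of the projection cost
scan_cost = d * N_mem / b          # similarity against the ~N_mem / b bucket entries
brute_force = d * N_mem            # exhaustive search over all stored thoughts
print(hash_cost, scan_cost, brute_force)   # 76800 76800.0 7680000
```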

Selected results:

| Dataset (LLM) | Memory | Retrieval Acc. | Correctness | Coherence |
|---|---|---|---|---|
| GVD-En (ChatGLM) | SiliconFriend | 0.809 | 0.438 | 0.680 |
| GVD-En (ChatGLM) | TiM (ours) | 0.820 | 0.450 | 0.735 |
| KdConv-Film (ChatGLM) | no memory | n/a | 0.657 | 0.923 |
| KdConv-Film (ChatGLM) | TiM (ours) | 0.920 | 0.827 | 0.943 |
| RMD-Medical (ChatGLM) | no memory | n/a | 0.806 | 0.893 |
| RMD-Medical (ChatGLM) | TiM (ours) | 0.900 | 0.843 | 0.943 |

Additionally, average per-query retrieval time is reduced by ~15% relative to a baseline exhaustively searching all history (Liu et al., 2023).

7. Open Problems and Future Directions

TiM architectures and frameworks remain active research areas, with open questions including:

  • For TiM-DNN: scaling to ever-larger models and supporting additional forms of network quantization or hybrid precision.
  • For LLM-based TiM: automatic tuning of $\theta_f$, $\theta_m$, and $b$; learned or data-driven forget/merge strategies; handling unbounded memory growth via budgeted storing or memory compression; extension from single-hop to multi-hop or hierarchical thought graphs; and exploration of cross-modal memory (e.g., visual-linguistic).

These directions reflect the broader vision of Think-in-Memory: integrating memory, computation, and abstraction as a unified substrate, whether in physical or algorithmic architectures (Jain et al., 2019, Liu et al., 2023).
