Thousand-Brains Systems: Modular Sensorimotor AI
- Thousand-Brains Systems are modular sensorimotor AI architectures where repeated cortical columns learn complete object models through movement and local evidence accumulation.
- They utilize sensor modules, body-centric reference frames, and a standardized messaging protocol to enable rapid object recognition, pose estimation, and active hypothesis testing.
- Empirical results demonstrate enhanced convergence speed, high classification accuracy, and efficiency with significantly reduced learning FLOPs compared to traditional deep models.
Thousand-Brains Systems are modular sensorimotor AI architectures derived from the Thousand Brains Theory, which treats the neocortex as a collection of many repeated, semi-independent functional units analogous to cortical columns. In this framework, each module can learn complete object models through movement, represent those models in reference frames, and communicate concise hypotheses to other modules rather than relying on a single monolithic latent model. The Thousand Brains Project presents this as a new paradigm for sensorimotor intelligence, and Monty is described as its first implementation for 3D object perception, object recognition, and pose estimation (Clay et al., 2024, Leadholm et al., 6 Jul 2025).
1. Conceptual basis and theoretical lineage
The defining claim of thousand-brains systems is that intelligence should be organized around repeated sensorimotor modules rather than around a single end-to-end function approximator. The motivating neuroscience picture is the Thousand Brains Theory: cortical columns are treated as repeated functional units; each receives local sensory data; each learns “its own model of the world” or complete object models; and coherent perception arises through communication, voting, and composition across many such modules. This view is explicitly linked to Vernon Mountcastle’s canonical-circuit idea, to grid-cell- and place-cell-inspired reference-frame mechanisms, and to long-range cortical communication carrying concise object-and-pose information (Clay et al., 2024).
This architecture is contrasted with standard deep learning along several axes. The cited papers emphasize active sensing over passive i.i.d.-style datasets, local sensor patches over full-input processing, structured object models over latent feature bags, rapid associative updates over slow global backpropagation, and heterarchical information flow over strict feedforward hierarchy. In Monty, intelligence is grounded in small, local glimpses obtained through movement; the resulting representations are object-centered, spatially structured, and intended to support pose estimation, active hypothesis testing, continual learning, and multimodal integration (Leadholm et al., 6 Jul 2025).
A recurrent theme across the literature is that the “thousand brains” effect appears only partly in any single module. A single learning module can already perform recognition through sensorimotor accumulation, but multiple modules provide faster convergence, redundancy across perspectives, and the ability to exchange pose-sensitive votes. This suggests that the plural in “systems” refers not merely to replication, but to a design principle in which distributed agreement is a computational primitive rather than an implementation detail (Clay et al., 2024).
2. Canonical architecture: modules, messages, and reference frames
The first-generation architecture described for thousand-brains systems comprises Sensor Modules, Learning Modules, a Motor System, and the Cortical Messaging Protocol. Sensor Modules convert modality-specific raw input into a common internal representation. Learning Modules are the repeating cortical-column-inspired units that accumulate sensorimotor evidence, build object models, maintain hypotheses over object identity and pose, communicate with other modules, and can emit goal states. The Motor System converts goal states into actuator commands. CMP is the standardized interface that binds the architecture together (Clay et al., 2024).
CMP is central because it imposes a common feature-at-pose abstraction. The white paper states that a CMP-compliant message must contain location, morphological features given by orthonormal vectors defining orientation, non-morphological features, confidence in $[0, 1]$, a use flag, sender ID, and sender type. In the RGB-D instantiation used by Monty, the message at step $t$ is
$\phi_t = \{\prescript{B}{}{x_t}, \prescript{B}{S}{\mathbf{R}_t}, n_t\},$
where $\prescript{B}{}{x_t} \in \mathbb{R}^3$ is sensed location in body coordinates, $\prescript{B}{S}{\mathbf{R}_t} \in SO(3)$ is local sensed orientation, and $n_t$ contains non-pose features such as HSV or curvature magnitudes (Leadholm et al., 6 Jul 2025).
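The message fields listed above can be pictured as a small record type. The following is a minimal sketch, not Monty's actual data structure: the class name `CMPMessage`, the field names, the `"SM"` sender-type label, and the validation tolerances are all assumptions made for illustration, as is the `[0, 1]` confidence range.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CMPMessage:
    """Hypothetical container for one feature-at-pose message."""

    location: Vec3                        # sensed location in body coordinates
    orientation: Tuple[Vec3, Vec3, Vec3]  # three orthonormal vectors
    features: Dict[str, float]            # non-morphological features
    confidence: float                     # assumed to lie in [0, 1]
    use_state: bool                       # whether receivers should use it
    sender_id: str
    sender_type: str                      # label values are assumed

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Check that the three orientation vectors are orthonormal.
        for i, a in enumerate(self.orientation):
            for j, b in enumerate(self.orientation):
                dot = sum(p * q for p, q in zip(a, b))
                expected = 1.0 if i == j else 0.0
                if abs(dot - expected) > 1e-6:
                    raise ValueError("orientation vectors must be orthonormal")


msg = CMPMessage(
    location=(0.1, 0.0, 0.3),
    orientation=((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
    features={"hue": 0.6, "curvature": 0.02},
    confidence=0.9,
    use_state=True,
    sender_id="patch_0",
    sender_type="SM",
)
```

Validating the pose fields at construction time is a design choice of this sketch: it keeps every downstream consumer free to assume a well-formed pose.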
The use of a shared body-centric coordinate frame is what allows communication across modules. Sensor Modules may be modality-specific, but downstream Learning Modules are designed to operate on modality-agnostic feature-at-pose messages. This same protocol is reused for bottom-up sensing, lateral voting, top-down bias, and motor goal specification. The architecture is therefore explicitly modular in a software-engineering sense as well as in a neuro-inspired sense (Clay et al., 2024).
Reference frames are equally foundational. Objects are represented not as bags of features, but as features at locations and orientations. Incoming observations arrive in body-centric coordinates and are transformed into object-centered coordinates inside a module. This is the basis for pose generalization, movement-based disambiguation, and the possibility of hierarchical composition in which higher-level modules can treat lower-level object identities at pose as new “features” (Leadholm et al., 6 Jul 2025).
3. Object models, associative learning, and hypothesis-based inference
Monty’s object model for object $m$ is stored in an object-centered frame as
$\mathcal{M}^m = \left\{ (\prescript{M}{}{x_i}, \prescript{M}{S}{\mathbf{R}_i}, n_i) \right\}.$
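Each stored point contains a location, a local surface orientation, and non-pose features. If the object’s global rotation in body coordinates is $\prescript{B}{M}{\mathbf{R}}$, then the sensed orientation is transformed into object coordinates by
$\prescript{M}{S}{\mathbf{R}_t} = \prescript{B}{M}{\mathbf{R}}^{-1}\,\prescript{B}{S}{\mathbf{R}_t}.$
Movement between steps is path-integrated in the same object frame:
$\prescript{M}{}{x_t} = \prescript{M}{}{x_{t-1}} + \prescript{B}{M}{\mathbf{R}}^{-1}\left(\prescript{B}{}{x_t} - \prescript{B}{}{x_{t-1}}\right).$
These operations implement object-centered path integration rather than viewpoint-specific template matching (Leadholm et al., 6 Jul 2025).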
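A small numerical sketch may make the frame change concrete. It assumes the object's rotation in body coordinates is given as a 3x3 rotation matrix, so that its inverse is its transpose. The function names `to_object_orientation` and `integrate_step`, and the helper `rot_z`, are hypothetical and not taken from Monty.

```python
import numpy as np


def rot_z(angle_rad: float) -> np.ndarray:
    """Rotation matrix for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_object_orientation(R_body_obj: np.ndarray,
                          R_body_sensed: np.ndarray) -> np.ndarray:
    """Express a sensed orientation in the object frame."""
    # The inverse of a rotation matrix is its transpose.
    return R_body_obj.T @ R_body_sensed


def integrate_step(x_obj_prev: np.ndarray, x_body_prev: np.ndarray,
                   x_body_now: np.ndarray,
                   R_body_obj: np.ndarray) -> np.ndarray:
    """Path-integrate one body-frame movement in the object frame."""
    return x_obj_prev + R_body_obj.T @ (x_body_now - x_body_prev)


# Object rotated by 90 degrees about z relative to the body frame.
R_body_obj = rot_z(np.pi / 2)
x_obj = integrate_step(
    np.zeros(3),
    np.array([1.0, 0.0, 0.0]),
    np.array([1.0, 1.0, 0.0]),
    R_body_obj,
)
# A step along the body y axis is a step along the object x axis here.
```

Because only displacements are rotated, the sketch never needs the object's absolute position in the body frame, only its rotation.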
Learning is described as Hebbian-like associative binding rather than global gradient-based optimization. In the white paper, quick memory updates are said to occur by associating co-occurring feature-pose observations, with no sharp long-term separation between learning and inference in the intended design. In Monty’s current implementation, this becomes sparse, local updating of the active object frame and active location during an episode, followed by graph-memory incorporation. The system is designed so that when an object is recognized it can immediately enrich that object’s stored model; when no object matches, it can create a new model from the current episode’s observations (Clay et al., 2024).
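Inference is joint over object identity, object rotation, and current location on the object. Each learning module maintains a set of hypotheses, each combining a candidate object, a candidate rotation, and a current location on that object, together with an accumulated evidence value. Movement updates every hypothesis by path-integrating the sensed displacement under that hypothesis's rotation. Evidence is then updated by comparing current sensed pose and non-pose features against learned points in a neighborhood around the hypothesized location. Pose mismatch can decrease evidence, whereas non-pose features only add support. The most likely hypothesis is the one with the highest accumulated evidence, and convergence is declared when a terminal criterion on the hypothesis set is met.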
This evidential asymmetry operationalizes the claim that object identity is primarily determined by spatial structure rather than by incidental appearance (Leadholm et al., 6 Jul 2025).
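The asymmetry between pose and non-pose evidence can be illustrated with a toy update rule. This is a sketch under stated assumptions, not Monty's scoring function: the normalization of `pose_error` to `[0, 1]`, the weights, and the function names `update_evidence` and `most_likely` are all hypothetical.

```python
from typing import Dict, Tuple

# (object id, rotation label); both labels are illustrative placeholders.
Hypothesis = Tuple[str, str]


def update_evidence(evidence: float, pose_error: float, feature_match: float,
                    pose_weight: float = 1.0,
                    feature_weight: float = 0.5) -> float:
    """Add one observation's evidence to a hypothesis.

    pose_error is assumed to lie in [0, 1]: 0 is a perfect pose match and
    1 a complete mismatch, so the pose term can be positive or negative.
    """
    pose_term = pose_weight * (1.0 - 2.0 * pose_error)
    # Non-pose features are clipped at zero, so they can only add support.
    feature_term = feature_weight * max(0.0, feature_match)
    return evidence + pose_term + feature_term


def most_likely(evidence: Dict[Hypothesis, float]) -> Hypothesis:
    """Return the hypothesis with the highest accumulated evidence."""
    return max(evidence, key=evidence.get)


evidence = {("mug", "rot_0"): 0.0, ("bowl", "rot_0"): 0.0}
evidence[("mug", "rot_0")] = update_evidence(0.0, pose_error=0.1, feature_match=0.8)
evidence[("bowl", "rot_0")] = update_evidence(0.0, pose_error=0.9, feature_match=0.8)
best = most_likely(evidence)
```

Both hypotheses see the same feature match, so the ranking here is decided entirely by the pose term, which is the behavior the asymmetry is meant to produce.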
4. Action selection, module interaction, voting, and symmetry
Movement is intrinsic to thousand-brains systems rather than appended after perception. Monty formalizes an agent as sensors plus actuator and studies both a distant agent, analogous to an eye or rotating camera, and a surface agent, analogous to a fingertip. The distant agent uses a spiral-like learning scan and a random-walk inference policy with a reflex that reverses when it moves off the object. The surface agent uses a curvature-guided model-free policy that follows informative local geometry such as mug rims, handles, edges, and contours; this improves overall accuracy, increases confident convergence, and reduces steps to convergence relative to less directed motion (Leadholm et al., 6 Jul 2025).
Monty also includes model-based action via each learning module’s Goal State Generator. The hypothesis-testing policy aligns the current top hypotheses in a common coordinate frame and moves toward the most discriminative location. Given a leading object hypothesis and a competing object hypothesis, the selected point is the location that best distinguishes the two aligned models.
In the reported surface-agent experiments, adding this policy improved accuracy and reduced the median number of steps to convergence (Leadholm et al., 6 Jul 2025).
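One plausible way to score "most discriminative" is to pick the point on the leading model that lies farthest from its nearest neighbor on the competing model. The sketch below assumes that scoring rule and assumes both point sets are already aligned in a common frame; the function name and the toy point sets are hypothetical, and the source does not confirm that Monty uses exactly this criterion.

```python
import numpy as np


def most_discriminative_point(points_top: np.ndarray,
                              points_rival: np.ndarray):
    """Pick the leading-model point farthest from the rival model.

    Both arrays have shape (n, 3) and are assumed to be expressed in the
    same coordinate frame.
    """
    # Pairwise distances, shape (n_top, n_rival), via broadcasting.
    diffs = points_top[:, None, :] - points_rival[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance from each leading-model point to its nearest rival point.
    nearest = dists.min(axis=1)
    idx = int(np.argmax(nearest))
    return points_top[idx], float(nearest[idx])


# Toy models: the third "mug" point stands in for a handle the cup lacks.
mug = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.5]])
cup = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
target, gap = most_discriminative_point(mug, cup)
```

Sensing at `target` is informative because the two hypotheses make different predictions there: one expects a surface, the other expects nothing nearby.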
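Module interaction is not limited to hierarchy. The white paper characterizes Monty as a heterarchy with bottom-up links, top-down links, lateral voting, skip connections, and direct motor outputs from each learning module. Voting is pose-sensitive rather than bag-of-features consensus. If one module senses body-centric point $\prescript{B}{}{x^{(i)}_t}$ and another senses $\prescript{B}{}{x^{(j)}_t}$, then the instantaneous displacement
$\prescript{B}{}{d^{(ij)}_t} = \prescript{B}{}{x^{(j)}_t} - \prescript{B}{}{x^{(i)}_t}$
is transformed under each hypothesis $h$, with hypothesized object rotation $\prescript{B}{M}{\mathbf{R}}_h$, as
$\prescript{M}{}{d^{(ij)}_{h,t}} = \prescript{B}{M}{\mathbf{R}}_h^{-1}\,\prescript{B}{}{d^{(ij)}_t}.$
Adding this object-frame displacement to the sender's hypothesized location gives the location that is voted for.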
The transmitted vote is therefore a prediction of where the receiving sensor should be on the object if the sender’s hypothesis is correct. Multi-module experiments showed that increasing the number of Learning Modules and Sensor Modules reduced convergence steps while maintaining similar classification accuracy (Leadholm et al., 6 Jul 2025).
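The geometry of a single vote can be sketched in a few lines. The example assumes each hypothesis carries a rotation matrix for the object in body coordinates and a location on the object; the function name `vote_location` and the example values are hypothetical.

```python
import numpy as np


def vote_location(x_obj_sender: np.ndarray, R_body_obj: np.ndarray,
                  x_body_sender: np.ndarray,
                  x_body_receiver: np.ndarray) -> np.ndarray:
    """Predict where the receiver is on the object under one hypothesis."""
    # Displacement between the two sensors, measured in the body frame.
    displacement_body = x_body_receiver - x_body_sender
    # Rotate it into the object frame (inverse rotation = transpose).
    return x_obj_sender + R_body_obj.T @ displacement_body


# Hypothesis: the object is rotated 180 degrees about z in the body frame.
R_hyp = np.diag([-1.0, -1.0, 1.0])
vote = vote_location(
    x_obj_sender=np.array([0.5, 0.0, 0.0]),
    R_body_obj=R_hyp,
    x_body_sender=np.array([1.0, 1.0, 0.0]),
    x_body_receiver=np.array([1.2, 1.0, 0.0]),
)
```

The same sensor displacement yields a different predicted location under each hypothesized rotation, which is why such votes are pose-sensitive rather than a simple tally of object labels.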
Symmetry detection is another distinctive property. For a given object, Monty defines the set of rotation hypotheses that currently carry high evidence.
If this set remains stable for enough consecutive steps and all hypotheses belong to the same object, the object is declared sensorimotor symmetric. Rotation error is then reported as the minimum geodesic distance to ground truth among the symmetric candidates. The reported validation used Chamfer distance between rotated object point clouds and is presented as evidence that the grouped rotations are geometrically indistinguishable in practice (Leadholm et al., 6 Jul 2025).
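The rotation-error convention for symmetric objects can be written down directly: compute the geodesic angle between each candidate rotation and the ground truth, and report the minimum. The sketch uses the standard trace formula for the angle between two rotation matrices; the function names and the example candidates are hypothetical.

```python
import numpy as np


def rot_z(angle_rad: float) -> np.ndarray:
    """Rotation matrix for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def geodesic_angle(R_a: np.ndarray, R_b: np.ndarray) -> float:
    """Angle in radians of the relative rotation between R_a and R_b."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def symmetric_rotation_error(candidates, R_true: np.ndarray) -> float:
    """Minimum geodesic distance to ground truth among the candidates."""
    return min(geodesic_angle(R, R_true) for R in candidates)


# An object with two-fold symmetry about z: both candidates are equivalent.
candidates = [rot_z(0.0), rot_z(np.pi)]
error = symmetric_rotation_error(candidates, rot_z(np.pi - 0.1))
```

Without the minimum, a system that picked the equivalent rotation `rot_z(0.0)` would be charged an error of nearly 180 degrees for a pose that is indistinguishable by sensing.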
5. Extensions and reinterpretations
Several papers extend or reinterpret thousand-brains systems without discarding the core motif of many semi-independent modules coordinated by structured communication.
A spiking reinterpretation argues that Monty’s current dense floating-point contact vectors are mismatched to the temporal premises of the Thousand Brains Theory. In that proposal, each contact is encoded as a rank-order spike packet whose internal order reflects activation strength, and the inter-packet latency serves as an implicit displacement signal. Directional structure is learned by STDP, and the evidence-accumulation rule is modified accordingly, with a heuristic adaptation rule. The implementation is specified in NumPy and accompanied by passing unit tests. In a synthetic arrangement-discrimination task, temporal coding is reported to achieve higher overall accuracy than dense accumulation and to maintain a percentage-point advantage under the tested noise levels, and the adapted quantity is reported to converge for uniform, moderate, and complex synthetic objects. End-to-end evaluation on Monty’s YCB benchmark is explicitly left for future work (Bose, 21 May 2026).
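Rank-order coding itself is easy to illustrate: units fire in order of decreasing activation, so the packet is the list of unit indices sorted by strength. The sketch below shows only that encoding step, with a hypothetical function name; it does not reproduce the cited paper's latency signal, STDP rule, or evidence update.

```python
from typing import List, Sequence


def rank_order_packet(activations: Sequence[float]) -> List[int]:
    """Return unit indices ordered from strongest to weakest activation.

    Ties keep their original index order because Python's sort is stable.
    """
    return sorted(range(len(activations)), key=lambda i: -activations[i])


packet = rank_order_packet([0.2, 0.9, 0.5])  # unit 1 first, then 2, then 0
```

The packet discards absolute magnitudes and keeps only their ordering, which is what makes the code invariant to uniform rescaling of the activations.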
Neo-FREE transports the thousand-brains idea from perception to control. It interprets cortical-column-like units as functional units returning stochastic control primitives, and composes them through a gating mechanism that minimizes variational free energy. The resulting policy is a weighted combination of these primitives, with the weights selected by the gating mechanism at each step.
The paper proves that the per-step optimization over primitive weights is convex even when the environment is nonlinear, stochastic, and non-stationary and when the state/action cost is non-convex. In a Robotarium obstacle-avoidance task using four directional Gaussian primitives, the authors report five experiments from different initial conditions and state that in all experiments Neo-FREE enabled the robot to reach the goal while avoiding obstacles (Rossi et al., 2024).
Another extension targets association and surprise. Building on the Numenta/Hawkins thousand-brains formulation, it proposes similarity-finding algorithms that relax exact feature matching by replacing a feature sparse distributed representation (SDR) with a neighborhood of similar SDRs.
Similarity is then interpreted through whether another learned object can sustain a corresponding path of similar features under the same movement sequence. The same paper also introduces a surprise-triggered active-inference mechanism. Surprise is detected when a sufficient fraction of active sensory mini-columns are unpredicted in enough cortical columns, and the inference algorithm then resets prior continuity constraints, recomputes object hypotheses from current evidence, and sets the next movement vector to zero so the system re-samples or fixates the surprising cause. The authors explicitly connect this procedure to Friston’s free-energy and active-inference ideas, while noting that the connection is conceptual and procedural rather than a full variational derivation (Kawakami, 11 Jun 2025).
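Both mechanisms can be sketched with SDRs modeled as sets of active bit indices. The overlap-count similarity, the thresholds, and the function names `neighborhood` and `is_surprised` are assumptions made for illustration; the cited paper's exact definitions may differ.

```python
from typing import FrozenSet, Iterable, List, Tuple

SDR = FrozenSet[int]  # indices of active bits


def neighborhood(query: SDR, stored: Iterable[SDR],
                 min_overlap: int) -> List[SDR]:
    """Return stored SDRs sharing at least min_overlap active bits."""
    return [s for s in stored if len(query & s) >= min_overlap]


def is_surprised(columns: Iterable[Tuple[SDR, SDR]],
                 unpredicted_frac: float = 0.5,
                 min_columns: int = 2) -> bool:
    """Detect surprise across columns given (active, predicted) sets.

    A column counts as surprised when the fraction of its active
    mini-columns that were not predicted reaches unpredicted_frac.
    """
    surprised = 0
    for active, predicted in columns:
        if not active:
            continue  # a silent column carries no evidence either way
        frac = len(active - predicted) / len(active)
        if frac >= unpredicted_frac:
            surprised += 1
    return surprised >= min_columns


stored = [frozenset({1, 2, 3, 9}), frozenset({7, 8, 9, 10}), frozenset({2, 3, 4, 5})]
similar = neighborhood(frozenset({1, 2, 3, 4}), stored, min_overlap=3)

columns = [
    (frozenset({1, 2}), frozenset({1, 2})),  # fully predicted
    (frozenset({3, 4}), frozenset()),        # fully unpredicted
    (frozenset({5, 6}), frozenset({5})),     # half unpredicted
]
surprise = is_surprised(columns)
```

In the procedure described above, a positive surprise signal would then reset the continuity constraints and set the next movement vector to zero.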
Taken together, these works indicate that “thousand-brains systems” function as a family of architectures rather than a single frozen implementation. The shared invariants are repeated semi-independent modules, object- or task-centered internal structure, and coordination by message passing, voting, or gating. The representational substrate, learning rule, and task domain remain open design variables.
6. Empirical profile, limitations, and research agenda
On the YCB household-object benchmark, Monty was evaluated on joint object recognition and pose estimation using narrow RGB-D patches in Habitat. The main reported recognition-and-rotation results are as follows (Leadholm et al., 6 Jul 2025):
| Condition | Classification accuracy | Median rotation error |
|---|---|---|
| Baseline | 98.6% | 0° |
| Noise only | 95.1% | 3° |
| Novel rotations | 93.0% | 4.5° |
| Noise + novel rotations | 88.1% | 6° |
| Noise + novel rotations + all objects recolored blue | 73.1% | 7° |
These results are used to argue that Monty learns structured, shape-centric object models rather than texture-driven shortcuts. The color-replacement condition is especially significant in that interpretation because all observed patches were replaced by a uniform intense blue HSV value not seen during training, yet the system still classified most objects correctly (73.1% in the table above, where the recoloring is combined with noise and novel rotations). The paper also reports dendrogram analyses in which objects clustered into morphologically meaningful groups such as cutlery, boxes, and cups, supporting the claim that evidence patterns reflect global shape (Leadholm et al., 6 Jul 2025).
Few-shot and continual-learning results are similarly central. Monty is reported to reach strong classification accuracy and low mean rotation error after only a small number of views per object, and its single-view classification accuracy is compared against a from-scratch ViT and against chance. In continual learning, the YCB objects were split into tasks with one object per task, and Monty retained strong performance across the sequence with only mild degradation due to interference among similar objects. The compared pretrained ViT exhibited catastrophic forgetting. The paper further reports that Monty uses substantially fewer learning FLOPs than either a ViT trained from scratch or a pretrained-plus-fine-tuned ViT, and it also compares the parameter count of the ViT with that of Monty after learning the full dataset (Leadholm et al., 6 Jul 2025).
The cited literature is explicit that these results do not amount to a complete theory or mature general-purpose system. The white paper and Monty evaluation both describe the current implementation as early-stage. Stated limitations include narrow concentration on 3D object perception; lack of dynamic objects or richer object behaviors; incomplete exploration of full hierarchical composition; limited action-policy repertoire; unsupervised learning not evaluated in the main empirical study; inference cost that currently scales linearly with the number of learned models; and the use of explicit 3D graph memories as an implementation scaffold rather than a biologically literal claim (Clay et al., 2024, Leadholm et al., 6 Jul 2025).
The forward agenda is correspondingly broad. Proposed directions include richer sensorimotor tasks, multiple agents and modalities, hierarchical compositional representations, modeling object behavior and dynamics, more natural unsupervised learning, model merging and part reuse to mitigate inference scaling, richer multimodal and hierarchical planning systems, more neural implementations of learning modules, and deployment in domains such as agriculture, infrastructure maintenance, and medical ultrasound. A plausible implication is that thousand-brains systems are best understood, at present, as an architectural research program centered on reference-frame-based sensorimotor intelligence, with Monty as the primary perception instantiation and later works extending the same principles toward temporal coding, association, active inference, and control (Clay et al., 2024, Leadholm et al., 6 Jul 2025).