Image-Computable Perceptual Decision Model

Updated 18 November 2025
  • Image-computable perceptual decision models translate raw visual data into predictions of human choices through hierarchical neural transformations and evidence accumulation.
  • They integrate deep neural networks with biologically-inspired architectures to replicate perceptual dynamics, reaction times, and trial variability.
  • Personalization and targeted stimulus synthesis enable these models to precisely predict, manipulate, and analyze individual perceptual behaviors.

An image-computable model of perceptual decision making is a computational architecture that takes raw visual stimuli as input and produces predictions of human perceptual decisions—including choices, reaction times, and trial-by-trial variability—by explicitly modeling the perceptual, neural, and decisional processes in a manner directly anchored to the image data. These models integrate deep neural networks, probabilistic inference, and evidence accumulation principles, and they are evaluated and refined based on large-scale human behavioral data. Recent developments enable not only the prediction of aggregate behavior but also personalized alignment and even targeted manipulation of human perceptual choices through precise image synthesis.

1. Mathematical and Algorithmic Foundations

Image-computable models of perceptual decision making formalize the process as a sequence of transformations from pixels to decisions. The pipeline typically involves encoding visual input into latent neural representations, evidence accumulation toward categorical alternatives, and a stopping rule that terminates the process upon sufficient certainty.

  • Perceptual boundaries in ANNs: For a neural classifier $f: x \rightarrow (f_0(x), \ldots, f_{K-1}(x))$, perceptual uncertainty is maximal near decision boundaries $B_{ij} = \{ x : f_i(x) = f_j(x) \}$, where $f_i(x)$ and $f_j(x)$ are the softmax outputs for classes $i$ and $j$ (Wei et al., 6 May 2025).
  • Drift-diffusion and accumulator models: Evidence accumulation is typically modeled as a (multi- or uni-dimensional) stochastic process, e.g., $A(t) = vt + W(t)$ with first-passage time $T = \inf\{ t : A(t) = a \}$, yielding inverse-Gaussian latency distributions (Duinkharjav et al., 2022), or via a linear ballistic accumulator (LBA) for multi-alternative settings (Jaffe et al., 25 Mar 2024); a minimal simulation sketch follows this list.
  • Entropic/uncertainty-based stopping criteria: In spike-based or recurrent architectures, decision time is assigned as the earliest $t$ at which the entropy $H(t)$ of the categorical posterior drops below a threshold $\theta_H$ (Johnson et al., 14 Nov 2025).
  • Meta-Bayesian and POMDP approaches: Visuo-motor tasks such as gaze-contingent search are formulated as partially observable Markov decision processes, where saccades and categorical reports are planned to optimize expected free energy (Cullen et al., 2020).
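
As a concrete illustration of the accumulator bullet above, the following minimal NumPy sketch simulates the process $A(t) = vt + W(t)$ with Euler-Maruyama steps and records the first-passage time at the bound $a$. The drift, bound, and noise values are illustrative choices, not parameters fitted in any of the cited papers; repeating the simulation yields the right-skewed, approximately inverse-Gaussian latency distribution mentioned above.

```python
import numpy as np

def first_passage_time(v=0.8, a=1.0, sigma=1.0, dt=1e-3, max_t=5.0, rng=None):
    """Simulate A(t) = v*t + W(t) and return T = inf{t : A(t) >= a}.

    Returns max_t if the bound is not reached within the simulated horizon."""
    rng = np.random.default_rng() if rng is None else rng
    A, t = 0.0, 0.0
    while t < max_t:
        A += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()  # Euler-Maruyama step
        t += dt
        if A >= a:
            return t
    return max_t

rng = np.random.default_rng(0)
rts = np.array([first_passage_time(rng=rng) for _ in range(1000)])
# Right-skewed (approximately inverse-Gaussian) latency distribution:
print(f"mean RT = {rts.mean():.2f} s, median RT = {np.median(rts):.2f} s")
```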

These foundations enable models to map any given stimulus through explicitly defined computation to a predicted decision and associated metrics.

2. Model Architectures and Biological Realism

Recent image-computable models instantiate biologically-inspired architectures to bridge the gap between artificial neural networks and the primate visual system.

  • Hierarchical neural dynamics models: Multi-stage leaky integrate-and-fire (LIF) spiking networks mimic the LGN→V1→MT→LIP dorsal pathway, with task-relevant connectivity, receptive fields, and synaptic dynamics. Decision formation emerges in attractor modules that accumulate input-driven population activity to threshold, reproducing drift-diffusion-like evidence accumulation at the population level (Su et al., 4 Sep 2024).
  • Convolutional and recurrent networks: Convolutional neural networks (CNNs), sometimes initialized from ImageNet-trained weights, serve as feature extractors that parameterize downstream evidence accumulation rates. Biologically-plausible recurrence and horizontal interactions are incorporated via hGRU or similar architectures to capture temporal integration and uncertainty dynamics (Jaffe et al., 25 Mar 2024, Goetschalckx et al., 2023).
  • Poisson spike-based coding: Integrating spiking statistics, a Poisson variational autoencoder (PVAE) encodes images as spike vectors, with Bayesian decoders operating sequentially on cumulative spike counts to infer action posteriors and stopping points (Johnson et al., 14 Nov 2025); a minimal decoding sketch follows the summary table below.

A summary of representative architectures is provided below:

| Model Type | Biological Substrate | Decision Mechanism |
| --- | --- | --- |
| LIF attractor network | Dorsal stream (LGN→V1→MT→LIP) | Population-rate threshold crossing |
| CNN + LBA | Front-end CNN | Multi-accumulator LBA |
| cRNN/hGRU | Recurrent visual cortex | Area-under-uncertainty |
| PVAE + Bayesian decoder | Poisson spike coding | Entropy-based stopping |
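
As a concrete illustration of the entropy-based stopping rule in the last row of the table (and the Poisson spike-based bullet above), the sketch below accumulates Poisson spike counts under assumed class-conditional firing rates, updates a Bayesian posterior over classes, and commits once the posterior entropy falls below $\theta_H$. The rate matrix, threshold, and generative setup are illustrative stand-ins, not the PVAE of Johnson et al.; only the sequential stopping logic is reproduced.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def entropy_stopping_decision(rates, true_class, theta_H=0.2, max_steps=500, rng=None):
    """Accumulate Poisson spike counts and stop once posterior entropy < theta_H.

    rates: (K, D) assumed class-conditional firing rates per time step.
    Returns (chosen_class, decision_time_in_steps)."""
    rng = np.random.default_rng() if rng is None else rng
    K, D = rates.shape
    log_prior = np.full(K, -np.log(K))          # uniform prior over the K classes
    counts = np.zeros(D)                        # cumulative spike counts
    post = np.full(K, 1.0 / K)
    for t in range(1, max_steps + 1):
        counts += rng.poisson(rates[true_class])     # one step of observed spikes
        # Poisson log-likelihood of cumulative counts (mean t * rate);
        # the log(n!) term is constant across classes and dropped.
        log_lik = counts @ np.log(t * rates).T - t * rates.sum(axis=1)
        log_post = log_prior + log_lik
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        if entropy(post) < theta_H:
            break
    return int(np.argmax(post)), t

# Two classes, ten "neurons": class 1 fires slightly faster on half the units.
rates = np.array([[0.5] * 10, [0.5] * 5 + [0.8] * 5])
choice, rt = entropy_stopping_decision(rates, true_class=1, rng=np.random.default_rng(0))
print(f"choice = {choice}, decision time = {rt} steps")
```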

Models fitted to behavioral and neuroimaging data can incorporate subject-specific differences by adjusting connectivities (informed by diffusion MRI, resting-state functional MRI) or fine-tuning network weights based on individual performance (Su et al., 4 Sep 2024, Wei et al., 6 May 2025).

3. Stimulus Generation and Human Validation

Image-computable models enable both the analysis and active synthesis of stimuli designed to probe or manipulate human decision making.

  • Perceptual boundary sampling: By optimizing stimuli (e.g., via guided diffusion) to lie near the ANN decision boundaries $B_{ij}$, one can generate images that maximize human choice variability or elicit controversial decisions between matched participant pairs (Wei et al., 6 May 2025).
  • Loss functions for boundary targeting: Uncertainty guidance, $L_{\mathrm{uncertainty}}(x) = -\tfrac{1}{2}\log p_i(x) - \tfrac{1}{2}\log p_j(x)$, and controversial guidance for aligning adversarial responses across models enable targeted synthesis of ambiguous or disagreement-eliciting stimuli (a gradient sketch follows this list).
  • Empirical dataset creation: Large-scale behavioral datasets like variMNIST (19,943 images, 116,715 human trials) support fine-grained mapping of decision variability, entropy, and response times (Wei et al., 6 May 2025).
  • Controversial manipulation experiments: Dual-subject targeted synthesis enables reliable manipulation of perceptual decisions, increasing the probability of obtaining experimenter-specified divergent choices between observers (Wei et al., 6 May 2025).
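
To make the uncertainty-guidance loss concrete, the PyTorch sketch below evaluates $L_{\mathrm{uncertainty}}(x)$ for a chosen class pair and returns its gradient with respect to the image, the signal a guided sampler could follow toward the boundary $B_{ij}$. The toy linear classifier and the single gradient step are placeholders for illustration; the diffusion-based synthesis pipeline of Wei et al. is not reproduced here.

```python
import torch
import torch.nn.functional as F

def uncertainty_guidance(classifier, x, class_i, class_j):
    """Return L_uncertainty(x) = -0.5*log p_i(x) - 0.5*log p_j(x) and its
    gradient w.r.t. the image, which pushes x toward the i/j decision boundary."""
    x = x.clone().detach().requires_grad_(True)
    log_probs = F.log_softmax(classifier(x), dim=-1)       # (batch, K)
    loss = (-0.5 * log_probs[:, class_i] - 0.5 * log_probs[:, class_j]).sum()
    grad, = torch.autograd.grad(loss, x)
    return loss.detach(), grad

# Minimal usage with a toy linear "classifier" on 28x28 grayscale images.
toy_classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.randn(1, 1, 28, 28)
loss, grad = uncertainty_guidance(toy_classifier, x, class_i=3, class_j=5)
x_guided = x - 0.1 * grad      # one illustrative gradient step toward the boundary
print(loss.item(), grad.shape)
```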

These mechanisms allow for systematic probing and manipulation of human visual decisions at both the group and individual level.

4. Personalization and Model Alignment

To accurately capture individual perceptual variability, image-computable models are increasingly equipped with subject-specific adaptation protocols.

  • Group and individual alignment: Baseline networks are first fine-tuned on group-level behavioral data (GroupNet), and then further customized to individual observers (IndivNet) by mixing in each subject's own trial responses at a specific data ratio (typically individual:group:original = 2:1:1) (Wei et al., 6 May 2025); a data-mixing sketch follows this list.
  • Performance and entropy alignment: Personalized alignment yields improvements in prediction accuracy and, crucially, raises the Spearman correlation between model-predicted and human response entropy from negligible ($\approx 0.08$) to high ($\approx 0.74$), particularly for high-entropy (difficult) images (Wei et al., 6 May 2025).
  • Neuroimaging-informed tuning: Structural (diffusion MRI tractography) and functional (resting-state fMRI) measures guide the adjustment of simulated connection strengths at biologically corresponding network edges, further explaining intersubject behavioral variability (Su et al., 4 Sep 2024).
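
As a sketch of the 2:1:1 data-mixing step, the helper below assembles a subject-specific fine-tuning set from individual, group, and original training examples in the stated ratio. The function name, sampling scheme, and placeholder data are assumptions made for illustration; Wei et al. specify the ratio, not this exact procedure.

```python
import random

def build_individual_finetune_set(individual_trials, group_trials, original_data,
                                  ratio=(2, 1, 1), seed=0):
    """Mix individual:group:original data in the given ratio (default 2:1:1),
    with the total size anchored to the number of individual trials."""
    rng = random.Random(seed)
    n_unit = len(individual_trials) // ratio[0]   # one "unit" of the ratio
    mixed = (
        rng.sample(individual_trials, ratio[0] * n_unit)
        + rng.sample(group_trials, min(ratio[1] * n_unit, len(group_trials)))
        + rng.sample(original_data, min(ratio[2] * n_unit, len(original_data)))
    )
    rng.shuffle(mixed)
    return mixed

# Example: 200 individual trials -> 200 individual + 100 group + 100 original items.
subject_set = build_individual_finetune_set(
    individual_trials=list(range(200)),          # placeholders for (image, response) pairs
    group_trials=list(range(1000, 2000)),
    original_data=list(range(5000, 10000)),
)
print(len(subject_set))     # 400
```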

A plausible implication is that full image-computable personalization, combining behavioral and neuroimaging alignment, is necessary to match the diversity of human perceptual strategies under ambiguous or adversarial stimulus conditions.

5. Temporal Dynamics and Speed-Accuracy Tradeoffs

Beyond choice prediction, comprehensive models capture the trial-to-trial temporal dynamics of perceptual decision making.

  • Spike-based variability and response times: Stochastic spiking architectures generate right-skewed response time distributions, matching empirical characteristics such as the dependence of response time on stimulus difficulty and number of alternatives (Hick’s law) (Johnson et al., 14 Nov 2025).
  • Uncertainty accumulation and RT proxies: In recurrent models, the area under the uncertainty curve provides a stimulus-computable proxy for reaction time that robustly correlates with human RTs across various paradigms, without explicit fitting to RT data (Goetschalckx et al., 2023); a minimal proxy sketch follows this list.
  • Speed–accuracy tradeoffs: Manipulation of the stopping criterion (entropy threshold $\theta_H$ or evidence bound $a$) recapitulates the classic speed–accuracy tradeoff, with lower thresholds yielding faster but less accurate decisions and vice versa (Johnson et al., 14 Nov 2025, Duinkharjav et al., 2022, Jaffe et al., 25 Mar 2024); a bound-sweep sketch closes this section.
  • Drift-diffusion and accumulator mechanisms: Models based on analytic first-passage distributions (inverse-Gaussian for DDM, closed-form for LBA) quantitatively capture the distributional shape and scaling of RTs as a function of stimulus features and task manipulations (Duinkharjav et al., 2022, Jaffe et al., 25 Mar 2024).
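
As a minimal illustration of the area-under-uncertainty proxy, the sketch below sums per-timestep entropy of a recurrent readout's class probabilities; a stimulus whose posterior stays uncertain for longer accrues a larger area and is predicted to have a longer RT. The toy readout trajectories are fabricated for illustration and stand in for the evidential-uncertainty signal of the cRNN in Goetschalckx et al.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def auc_uncertainty_rt_proxy(step_probs):
    """Area under the per-timestep uncertainty (entropy) curve of a recurrent
    model's class posteriors, used as a stimulus-computable RT proxy.

    step_probs: (T, K) array of softmax outputs, one row per timestep."""
    return float(entropy(np.asarray(step_probs)).sum())

# Toy readouts for an "easy" and a "hard" stimulus over 10 recurrent steps:
easy = np.linspace([0.5, 0.5], [0.99, 0.01], 10)   # confidence rises quickly
hard = np.linspace([0.5, 0.5], [0.70, 0.30], 10)   # confidence stays low
print(auc_uncertainty_rt_proxy(easy) < auc_uncertainty_rt_proxy(hard))  # True
```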

These dynamics ensure that the models can recreate not only average accuracy but also the higher moments and full distributions of human perceptual response times.
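
Finally, the bound-sweep sketch below (referenced from the speed–accuracy bullet above) runs a two-boundary diffusion race at several evidence bounds and reports accuracy and mean RT, reproducing the qualitative tradeoff: lower bounds are faster but less accurate. Drift, noise, and bound values are illustrative only.

```python
import numpy as np

def two_choice_trial(v=0.5, a=1.0, sigma=1.0, dt=1e-3, max_t=10.0, rng=None):
    """One two-boundary diffusion trial: drift is toward +a ('correct');
    reaching -a first counts as an error. Returns (correct, reaction_time)."""
    rng = np.random.default_rng() if rng is None else rng
    A, t = 0.0, 0.0
    while t < max_t:
        A += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if A >= a:
            return True, t
        if A <= -a:
            return False, t
    return A > 0, max_t

rng = np.random.default_rng(1)
for a in (0.5, 1.0, 2.0):                       # sweep the evidence bound
    trials = [two_choice_trial(a=a, rng=rng) for _ in range(500)]
    acc = np.mean([c for c, _ in trials])
    rt = np.mean([t for _, t in trials])
    print(f"bound a={a:.1f}: accuracy={acc:.2f}, mean RT={rt:.2f} s")
```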

6. Theoretical and Practical Implications

Image-computable models of perceptual decision making provide a rigorous platform for relating visual representations, neural circuit mechanisms, and behavioral variability.

  • Mechanistic insight: Explicit mapping from image properties through hierarchical visual processing to decision formation enables mechanistic interpretation of both normal and atypical decision behaviors.
  • Individual-difference analysis: Systematic alignment to individual choices and entropy, with optional neuroimaging-informed tuning, connects latent circuit parameters to observed variability within and across populations (Su et al., 4 Sep 2024, Wei et al., 6 May 2025).
  • Dataset generation and model-based experiment design: The capacity to synthesize adversarial and boundary stimuli allows experimenters to efficiently sample the space of stimuli with maximal diagnostic power, reducing trial requirements and maximizing the informativeness of behavioral assays.
  • Bridging vision science, computational neuroscience, and AI: These models operationalize canonical and recent theories (evidence accumulation, attractor dynamics, Bayesian inference) in architectures amenable to both biological interpretation and engineering applications.

Limitations remain, including the current focus on object recognition tasks, the need for broader demographic and cultural sampling, and the challenge of scaling to richer cognitive functions. However, the trajectory of this work points toward increasingly precise, general-purpose, and personalized models for probing and manipulating human perceptual decisions in complex naturalistic domains (Wei et al., 6 May 2025, Johnson et al., 14 Nov 2025, Jaffe et al., 25 Mar 2024, Su et al., 4 Sep 2024).
