- The paper introduces a model-free RL algorithm that eliminates fixed decision boundaries by learning optimal policies through trial and error.
- It demonstrates that increased evidence coherence leads to higher decision accuracy and faster reaction times, replicating key psychometric and chronometric patterns.
- The model flexibly balances speed-accuracy trade-offs under varying cost-benefit ratios, highlighting its potential for broader decision-making applications.
Analyzing Sequential Sampling through Model-Free Reinforcement Learning
The paper presents a novel approach to understanding perceptual decision-making under uncertainty using a model-free reinforcement learning (RL) framework. It challenges traditional evidence-accumulation models by proposing an RL-based algorithm that eliminates the need to specify decision thresholds, relying instead on a simpler, more flexible learning mechanism.
Introduction and Background
Sequential sampling models, particularly the drift-diffusion model (DDM), have been instrumental in elucidating the relationship between reaction time (RT) and accuracy in various decision-making scenarios, including perceptual, value-based, and moral decisions. These models typically operate by continuously accumulating noisy evidence over time until a predefined decision boundary is reached. Despite their success, conventional models face significant challenges: they often disregard the learning process involved in setting these boundaries and struggle to adapt decision criteria dynamically.
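For concreteness, the sketch below simulates one accumulation-to-bound trial of this classic kind; the drift rate, noise scale, boundary height, and time step are illustrative defaults rather than values taken from the paper.

```python
import numpy as np

def ddm_trial(drift=0.1, noise=1.0, bound=1.0, dt=0.01, max_t=5.0, rng=None):
    """Accumulate noisy evidence until a fixed boundary is crossed (classic DDM trial).

    Returns (choice, reaction_time): choice is +1 or -1, or 0 if no bound is reached in time.
    """
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while t < max_t:
        x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if abs(x) >= bound:
            return (1 if x > 0 else -1), t
    return 0, t  # no decision within the time limit
```

Stronger evidence (a larger drift rate) reaches the boundary sooner and more reliably, which is exactly the coupling between accuracy and reaction time that the RL framework must reproduce without any `bound` parameter.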
Previous work attempting to explain how DDM parameters evolve during learning has shown that decision bounds decrease and drift rates increase with practice, but it does not specify the mechanisms underlying these adaptations. Additionally, model-based RL approaches, although theoretically flexible, demand complex computations and detailed knowledge of the environment's structure, which may not be feasible for all animals or practical systems.
Methods and Approach
The proposed model-free RL algorithm circumvents the traditional boundaries of evidence accumulation by introducing a simplified state-action-reward framework. Key distinctions of this model include:
- A state variable that evolves based on noisy sensory evidence.
- An action set that includes a "Wait" option, permitting ongoing sampling of evidence at a cost.
- Q-learning to update action values based on received rewards, bypassing the explicit comparison to a decision boundary.
The algorithm starts with all Q-values initialized to zero and lets the agent develop its decision criterion through trial and error. Importantly, the model does not require accumulating momentary evidence; it still performs adequately when an extrema-detection rule is used in place of accumulation.
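A minimal sketch of this state-action-reward loop appears below, assuming a discretized evidence state, a symmetric two-alternative task, and illustrative values for the sampling cost, rewards, learning rate, and exploration rate; these specifics, along with the function and constant names, are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

N_STATES = 21                        # discretized evidence states (illustrative assumption)
ACTIONS = ("left", "right", "wait")  # "wait" keeps sampling evidence at a cost
WAIT_COST = -0.01                    # cost per additional sample (assumption)
REWARD_CORRECT, REWARD_ERROR = 1.0, 0.0
ALPHA, EPSILON = 0.1, 0.1            # learning rate and exploration rate (assumptions)

def evidence_to_state(x, scale=2.0):
    """Map continuous evidence onto a bounded discrete state index."""
    return int(round((np.tanh(x / scale) + 1) / 2 * (N_STATES - 1)))

def run_trial(Q, coherence, rng, max_steps=200):
    """One trial: the agent samples noisy evidence until it commits to a choice."""
    direction = rng.choice([-1, 1])  # hidden correct answer for this trial
    x = 0.0                          # evidence state (accumulated here; extrema detection also works)
    s = evidence_to_state(x)
    for _ in range(max_steps):
        # Epsilon-greedy action selection over Q-values; no explicit decision boundary anywhere.
        a = rng.integers(len(ACTIONS)) if rng.random() < EPSILON else int(np.argmax(Q[s]))
        if ACTIONS[a] == "wait":
            # Pay the sampling cost and draw another noisy evidence sample.
            x += direction * coherence + rng.standard_normal()
            s_next = evidence_to_state(x)
            Q[s, a] += ALPHA * (WAIT_COST + np.max(Q[s_next]) - Q[s, a])  # Q-learning, discount = 1
            s = s_next
        else:
            choice = -1 if ACTIONS[a] == "left" else 1
            r = REWARD_CORRECT if choice == direction else REWARD_ERROR
            Q[s, a] += ALPHA * (r - Q[s, a])  # terminal update: no successor state
            return choice == direction
    return False  # timed out without committing

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))  # zero-initialized Q-values, as in the paper
for _ in range(5000):
    run_trial(Q, coherence=0.2, rng=rng)
```

The stopping rule emerges entirely from the learned preference for "wait" relative to committing, so no boundary parameter appears anywhere in the loop.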
Results
The model reproduced several canonical features of perceptual decision-making:
- Psychometric Curves: choice accuracy increased with stimulus coherence.
- Chronometric Curves: reaction times decreased at higher coherence levels.
By modulating its terminal state values, the model mimicked the dynamic adaptation observed in empirical studies and traded speed against accuracy flexibly depending on the cost-benefit ratio (CBR) of its actions. Higher CBRs produced greater accuracy at the expense of longer reaction times, reflecting adaptability to different payoff regimes.
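One way such psychometric and chronometric profiles could be read out from a trained agent is sketched below. It reuses the hypothetical `Q`, `ACTIONS`, and `evidence_to_state` from the earlier sketch and an arbitrary coherence grid, so it illustrates the idea rather than reproducing the paper's analysis.

```python
import numpy as np

def evaluate(Q, coherences, n_trials=2000, max_steps=200, seed=1):
    """Greedy read-out of a trained Q-table: accuracy and mean sampling steps per coherence."""
    rng = np.random.default_rng(seed)
    curves = {}
    for c in coherences:
        n_correct, decision_times = 0, []
        for _ in range(n_trials):
            direction = rng.choice([-1, 1])
            x, choice = 0.0, None
            for t in range(max_steps):
                s = evidence_to_state(x)  # helper from the earlier sketch
                a = int(np.argmax(Q[s]))  # greedy: no exploration at evaluation time
                if ACTIONS[a] != "wait":
                    choice = -1 if ACTIONS[a] == "left" else 1
                    break
                x += direction * c + rng.standard_normal()
            if choice is not None:
                n_correct += int(choice == direction)
                decision_times.append(t)
        if decision_times:
            curves[c] = (n_correct / len(decision_times), float(np.mean(decision_times)))
    return curves

# Psychometric (accuracy vs. coherence) and chronometric (decision time vs. coherence) profiles
for c, (acc, rt) in evaluate(Q, coherences=[0.05, 0.1, 0.2, 0.4]).items():
    print(f"coherence {c:.2f}: accuracy {acc:.2f}, mean decision time {rt:.1f} samples")
```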
Further, the model was benchmarked against an optimal expected-reward framework and achieved approximately optimal performance in many cases, despite being model-free and simpler in its computations.
Implications
This advancement unifies learning and decision-making within a single framework without requiring explicit decision boundaries. It opens new avenues for analyzing data from training periods that perceptual decision-making studies have typically discarded. The model's ability to adapt dynamically to varying CBRs suggests potential applications in environments where decision criteria must remain flexible and responsive to changing conditions.
Future Directions
Several promising research directions emerge from this paper:
- Integrated Drift Rate Learning: Extending the model to learn drift rates concurrently with decision policies could further bridge the gap between model-based and model-free approaches.
- Neurobiological Correlates: Investigating neural implementations of the proposed RL framework to identify how such decision-making processes might manifest in the brain.
- Broader Applications: Applying the model to more complex decision-making scenarios, including multi-option choices and hierarchical decisions, could test its robustness and versatility further.
Conclusion
This paper marks a significant shift in perceptual decision-making research, moving from boundary-based models to a model-free RL paradigm. The resulting framework retains psychological plausibility while simplifying the decision process, maintaining adaptability, and achieving near-optimal performance. This work has the potential to transform our understanding of decision dynamics and learning mechanisms in uncertain environments.