Attribution of assistance differences in search-enabled o4-mini-deep-research

Determine whether the lower predicted actionability and information access scores of the search-enabled OpenAI o4-mini-deep-research, relative to the standard OpenAI o4-mini, are caused by its integrated web search capabilities or by additional safety measures implemented specifically in the search-enabled variant.

Background

The study compares assistance scores between OpenAI o4-mini and its search-enabled variant, o4-mini-deep-research, under benign decomposition prompts. The search-enabled variant yields lower predicted scores on both actionability (1.94 vs. 2.08) and information access (2.38 vs. 2.66), but the authors state that they cannot isolate whether this is due to integrated web search or to extra safety guardrails embedded in that variant.
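To make the size of the reported difference concrete, the short Python sketch below computes the per-metric gap between the two models from the scores quoted in the source paper. The dictionary layout and the helper name `score_gaps` are illustrative assumptions, not part of the paper's framework. The gaps only describe how large the difference is; they say nothing about whether search or safety measures cause it.

```python
# Predicted scores under benign decomposition, as quoted from the source paper.
# The data layout and the helper name below are hypothetical, chosen for illustration.
scores = {
    "o4-mini": {"actionability": 2.08, "information_access": 2.66},
    "o4-mini-deep-research": {"actionability": 1.94, "information_access": 2.38},
}


def score_gaps(baseline, variant):
    """Return baseline minus variant for each metric, rounded to two decimals."""
    return {metric: round(baseline[metric] - variant[metric], 2) for metric in baseline}


gaps = score_gaps(scores["o4-mini"], scores["o4-mini-deep-research"])
print(gaps)
```

On these figures the information access gap is about twice the actionability gap, but both models differ in more than one respect, so neither gap can be assigned to a single cause without further controlled comparisons.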

Resolving this attribution would clarify the causal role of integrated search versus safety tuning in shaping model assistance for fraud and cybercrime tasks, informing both capability evaluations and safety design.

References

On web search, the search-enabled variant (o4-mini-deep-research) produced lower predicted scores (actionability: 1.94, information access: 2.38 under benign decomposition) compared to standard o4-mini (2.08, 2.66). However, we cannot definitively attribute this difference to search capabilities versus additional safety measures implemented in the search-enabled variant.

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios (2602.21831 - Mai et al., 25 Feb 2026) in Results, subsection 'Impact of Reasoning and Search'