Causes of valid-format output failures in non-reasoning multimodal models
Determine why non-reasoning multimodal models (e.g., GPT-4o, Llama 4 Scout, Qwen 2.5 VL 72B) often fail to produce any validly formatted JSON answer grids for ConceptARC tasks, particularly in the visual modality.
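To make "validly formatted" concrete, here is a minimal sketch of a format check one could run on raw model output. It is not the paper's evaluation code. The criteria it uses are assumptions made for illustration: the grid is a non-empty rectangular JSON array of arrays, and every cell is an integer from 0 to 9, as in ARC-style color grids. The helper names `is_valid_grid` and `extract_grid` are hypothetical.

```python
import json
import re


def is_valid_grid(obj):
    """Check the assumed grid format: non-empty, rectangular, integer cells in 0..9."""
    if not isinstance(obj, list) or not obj:
        return False
    if not all(isinstance(row, list) and row for row in obj):
        return False
    width = len(obj[0])
    for row in obj:
        if len(row) != width:
            return False
        for cell in row:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(cell, bool) or not isinstance(cell, int):
                return False
            if not 0 <= cell <= 9:
                return False
    return True


def extract_grid(text):
    """Return the first valid grid embedded in free-form model output, or None."""
    decoder = json.JSONDecoder()
    # Try to decode a JSON value starting at every "[" in the text, so that
    # grids wrapped in prose or code fences are still found.
    for match in re.finditer(r"\[", text):
        try:
            candidate, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if is_valid_grid(candidate):
            return candidate
    return None
```

Under these assumptions, `extract_grid("Answer: [[1, 2], [3, 4]]")` returns the grid, while a ragged grid, a truncated array, or a purely verbal description returns `None`. Tallying which of these failure types occurs most often in each modality would be one way to start investigating the question.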
References
It is a topic for future research to determine why these models had difficulty generating answers in any valid format.
— Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
(2510.02125 - Beger et al., 2 Oct 2025) in Appendix: Output Accuracy for Non-Reasoning Models