Causes of valid-format output failures in non-reasoning multimodal models

Determine why non-reasoning multimodal models (e.g., GPT-4o, Llama 4 Scout, Qwen 2.5 VL 72B) often fail to produce any validly formatted JSON answer grids for ConceptARC tasks, particularly in the visual modality.

Background

In the paper's evaluation, several non-reasoning multimodal models achieve very low accuracy on ConceptARC tasks, and in the visual setting they frequently return no answer in the requested JSON format at all.
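To make "validly formatted" concrete, the following is a minimal sketch of a format check that separates format failures from wrong answers. It is not the authors' scoring code. It assumes the usual ARC convention that an answer grid is a rectangular JSON array of arrays of integers from 0 to 9. The function names `extract_grid` and `is_valid_grid` are hypothetical.

```python
import json
import re
from typing import List, Optional

Grid = List[List[int]]


def is_valid_grid(obj: object) -> bool:
    """Check the assumed grid shape: non-empty, rectangular, cells in 0..9."""
    if not isinstance(obj, list) or not obj:
        return False
    if not all(isinstance(row, list) and row for row in obj):
        return False
    width = len(obj[0])
    for row in obj:
        if len(row) != width:
            return False  # ragged rows are a format failure
        for cell in row:
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(cell, bool) or not isinstance(cell, int):
                return False
            if not 0 <= cell <= 9:
                return False
    return True


def extract_grid(response: str) -> Optional[Grid]:
    """Return the first well-formed grid in a model response, else None."""
    # Candidate spans run from a '[[' to the nearest ']]'.
    pattern = r"\[\s*\[.*?\]\s*\]"
    for match in re.finditer(pattern, response, flags=re.DOTALL):
        try:
            candidate = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not parseable JSON, try the next candidate
        if is_valid_grid(candidate):
            return candidate
    return None
```

With a check like this, a response for which `extract_grid` returns `None` counts as a format failure, which is distinct from a parseable grid that happens to be incorrect. Tallying the two cases separately per model and modality is one way to begin investigating the question posed here.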

The authors explicitly flag understanding the root causes of these failures as a topic for future research.

References

It is a topic for future research to determine why these models had difficulty generating answers in any valid format.

— Do AI Models Perform Human-like Abstract Reasoning Across Modalities? (2510.02125 - Beger et al., 2 Oct 2025) in Appendix: Output Accuracy for Non-Reasoning Models
