- The paper introduces MSRL, a multimodal structured reinforcement learning method that combines textual and visual rewards to break the SFT performance plateau in chart-to-code generation.
- It employs a two-stage curriculum training strategy, initially optimizing for textual accuracy before introducing visual rewards to refine chart structure.
- MSRL achieves significant benchmark improvements, with performance gains of 6.2% and 9.9% on ChartMimic and ReachQA respectively, surpassing previous methods.
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Introduction
The paper addresses a notable challenge in chart-to-code generation, an application requiring models to convert complex visual charts into structured code. While Supervised Fine-Tuning (SFT) has been the standard approach in this domain, its gains saturate at a performance plateau. The authors introduce Multimodal Structured Reinforcement Learning (MSRL) to move past this limitation. By integrating both textual and visual feedback within a reinforcement learning framework, MSRL delivers significant gains on standard benchmarks such as ChartMimic and ReachQA.
Problem Identification and Dataset Construction
A key finding is that SFT alone hits a performance bottleneck: beyond a certain scale, additional training data yields diminishing returns. To study this limitation, the paper constructs the largest chart-to-code dataset to date, comprising 3 million chart-code pairs sourced from real-world tables in arXiv papers. This dataset enables a detailed examination of SFT's performance at various data scales.
Figure 1: The SFT plateau and RL performance gain from our experiments.
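As a rough illustration of how a chart-code pair can be derived from a table, the sketch below renders a matplotlib chart and stores it alongside the code that produced it. The CSV input, bar-chart template, and file layout are assumptions for illustration, not the paper's actual data-generation pipeline.

```python
# Illustrative sketch of building one chart-code pair from a tabular source.
# The CSV input, bar-chart template, and output layout are assumptions;
# they are not the paper's actual data-generation pipeline.
import json
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def make_pair(table_csv: str, out_prefix: str) -> dict:
    df = pd.read_csv(table_csv)
    x_col, y_col = df.columns[0], df.columns[1]

    # Code string that will serve as the ground-truth target for the model.
    code = (
        "import pandas as pd\n"
        "import matplotlib.pyplot as plt\n"
        f"df = pd.read_csv({table_csv!r})\n"
        f"plt.bar(df[{x_col!r}], df[{y_col!r}])\n"
        f"plt.xlabel({x_col!r}); plt.ylabel({y_col!r})\n"
        f"plt.savefig({out_prefix + '.png'!r}, dpi=150)\n"
    )
    exec(code, {})      # render the ground-truth chart from the code itself
    plt.close("all")

    pair = {"image": out_prefix + ".png", "code": code}
    with open(out_prefix + ".json", "w") as f:
        json.dump(pair, f)
    return pair
```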
Proposed Methodology: MSRL
Textual and Visual Reward Systems
MSRL employs a multi-granularity structured reward system to optimize model outputs. Textual rewards are derived from a rule-based evaluation of code fidelity across multiple perspectives, such as data accuracy and formatting consistency. To complement this, visual rewards are implemented via a model-based evaluation that quantifies the structural accuracy of generated images against their ground-truth counterparts. This dual approach ensures comprehensive feedback that accounts for both specific content details and the overall chart structure.
Figure 2: The data generation pipeline and our proposed MSRL framework.
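To make the reward design above concrete, here is a minimal sketch of a combined textual and visual reward. The sub-scores, mixing weights, and the `visual_scorer` callable are illustrative stand-ins, not the paper's exact rules.

```python
# Hedged sketch of a multi-granularity reward; the sub-scores and weights are
# illustrative stand-ins, not the paper's exact formulation.
import ast
import re

def score_data_accuracy(pred_code: str, ref_code: str) -> float:
    # Toy rule-based check: overlap of numeric literals with the reference code.
    nums = lambda s: set(re.findall(r"-?\d+(?:\.\d+)?", s))
    ref = nums(ref_code)
    return len(nums(pred_code) & ref) / max(len(ref), 1)

def score_formatting(pred_code: str, ref_code: str) -> float:
    # Toy rule-based check: do both scripts call the same plotting/labeling
    # methods (bar, xlabel, legend, ...)? Unparseable predictions score zero.
    def calls(src):
        try:
            tree = ast.parse(src)
        except SyntaxError:
            return set()
        return {n.func.attr for n in ast.walk(tree)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)}
    ref = calls(ref_code)
    return len(calls(pred_code) & ref) / max(len(ref), 1)

def structured_reward(pred_code, ref_code, pred_img, ref_img, visual_scorer,
                      w_text=0.7, w_vis=0.3):
    # Textual reward: rule-based scores computed over the generated code.
    r_text = 0.5 * score_data_accuracy(pred_code, ref_code) + \
             0.5 * score_formatting(pred_code, ref_code)
    # Visual reward: a model-based judge comparing the rendered chart with the
    # ground-truth image, assumed to return a score in [0, 1].
    r_vis = visual_scorer(pred_img, ref_img)
    return w_text * r_text + w_vis * r_vis
```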
Two-Stage Curriculum Training
To maintain training stability, MSRL adopts a two-stage curriculum training strategy. Initially, the model is trained exclusively on textual rewards to refine the generation of high-fidelity code. Subsequently, the training incorporates visual rewards to further align rendered chart structures with ground-truth images, thereby enhancing the model’s global context understanding and visual precision.
Figure 3: Comparison of textual reward and execution rate changes between baseline and SFT models during the RL stage.
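A minimal sketch of how such a curriculum could be scheduled is shown below. The step threshold and the stage-two weights are assumptions; only the ordering (textual rewards first, then textual plus visual) follows the paper's description.

```python
# Hedged sketch of the two-stage reward curriculum. The step threshold and the
# stage-two mixing weights are illustrative assumptions; only the ordering
# (textual rewards first, visual rewards added later) follows the paper.
def reward_weights(step: int, stage1_steps: int = 1000) -> dict:
    if step < stage1_steps:
        # Stage 1: optimize code fidelity only; the visual reward is withheld.
        return {"text": 1.0, "visual": 0.0}
    # Stage 2: keep the textual reward but add the visual reward to align the
    # rendered chart's structure with the ground-truth image.
    return {"text": 0.7, "visual": 0.3}

# Usage inside an RL loop (schematic): scale each reward term by the current weights.
# w = reward_weights(global_step)
# reward = w["text"] * r_text + w["visual"] * r_visual
```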
Experimental Results
The MSRL framework achieves state-of-the-art results on the ChartMimic and ReachQA benchmarks, improving high-level metrics by 6.2% and 9.9%, respectively, over previous approaches. These gains stem from MSRL's ability to exploit both the large-scale dataset and the structured reinforcement learning rewards.
Figure 4: Comparison of reward gains during RL training with various reward settings.
Implementation and Evaluation
MSRL was applied to the Qwen2.5-VL-7B-Instruct base model, which was first supervised fine-tuned and then optimized with the MSRL method. Training proceeded in stages on GPU infrastructure, with substantial hardware resources devoted to the reinforcement learning optimization. Evaluation against open-source and proprietary models showed that MSRL not only rivals but occasionally surpasses larger proprietary systems such as GPT-4V and GPT-4o on chart-specific tasks.
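For reference, a minimal sketch of loading the base model with Hugging Face transformers follows; it assumes a transformers release with Qwen2.5-VL support, and the SFT and MSRL training loops themselves are not part of the snippet.

```python
# Minimal sketch: load the Qwen2.5-VL-7B-Instruct base model with Hugging Face
# transformers (requires a recent release with Qwen2.5-VL support). The SFT and
# MSRL training stages described above are not shown here.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```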
Implications and Future Work
The introduction of MSRL highlights the potential of multimodal structured reinforcement learning in overcoming SFT limitations in chart-to-code applications. Beyond the immediate gains in performance, MSRL sets a precedent for future research to incorporate multimodal feedback systems in other domains demanding structured output generation from complex inputs. Further exploration into expanding visual reward capabilities and optimizing RL algorithms could drive continued innovation in this space.
Conclusion
The paper successfully demonstrates that MSRL can significantly enhance chart-to-code generation tasks by effectively leveraging multi-granularity feedback. This achievement underscores the limitations of traditional SFT approaches and presents MSRL as a robust alternative that is both scalable and adaptable to various chart types and complexities. Through comprehensive dataset curation and innovative reward implementation, the authors have set a new benchmark in this rapidly evolving field.
Figure 5: Showcasing charts generated by MSRL compared to proprietary and open-source MLLMs. The charts produced by MSRL align well with their ground-truth ones.