ChartSimRL: Reinforcement Learning for Charts
- ChartSimRL is an RL framework that addresses both human-like scanpath simulation for chart reading and chart-to-code generation with multimodal rewards.
- It employs a hierarchical control system where an LLM decomposes analytical tasks and deep RL policies guide precise visual fixations.
- In chart-to-code generation, ChartSimRL optimizes for visual-semantic alignment using CNN-based similarity metrics and attribute matching to improve output fidelity.
ChartSimRL denotes two distinct reinforcement learning frameworks developed for two chart-related domains: (1) simulating human scanpaths for task-driven chart reading, and (2) optimizing multimodal LLMs (MLLMs) for chart-to-code generation by rewarding visual-semantic chart similarity. Both rely on policy-gradient RL and domain-specific multimodal rewards, but they differ in control architecture, data requirements, and application.
1. ChartSimRL for Task-Driven Scanpath Simulation
ChartSimRL, originating from the Chartist framework, addresses the problem of predicting human-like eye movement patterns (scanpaths) on data visualizations under explicit analytical tasks such as value retrieval, filtering, and finding extremes (Shi et al., 5 Feb 2025). Traditional computational gaze models capture task dependence poorly, so controlled design analyses still require costly eye-tracking studies. ChartSimRL provides a scalable alternative: an end-to-end RL system that reproduces the sequence and logic of task-driven chart reading.
ChartSimRL adopts a two-level hierarchical control system:
- High-Level Controller: An LLM-driven cognitive reasoner, tasked with decomposing tasks into discrete chart-analysis subgoals (e.g., "search_label," "locate_mark," "read_value," "answer"). The high-level policy is formulated as a partially observable Markov decision process (POMDP) with observations comprising the memory of all previously fixated chart regions, and actions as subgoal selections.
- Low-Level Controllers: Deep RL-trained oculomotor policies, one per subgoal, dictating spatial fixation maneuvers over the chart for subgoal completion. Inputs fuse foveal views, saliency maps, fixation history, and context.
Rewards are structured at both levels: the high-level controller is incentivized by task success and brevity, while low-level controllers receive positive signals for subgoal accomplishment and negative signals for saccade length, supporting efficient, human-like exploration.
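The two reward levels described above can be illustrated with a minimal sketch. The function names, the coefficient values (`success_bonus`, `saccade_cost`, `task_bonus`, `step_cost`), and the use of normalized chart coordinates are assumptions made for illustration, not values reported for the framework.

```python
import math


def saccade_length(prev_fix, next_fix):
    """Euclidean distance between two fixations given as (x, y) points."""
    return math.hypot(next_fix[0] - prev_fix[0], next_fix[1] - prev_fix[1])


def low_level_reward(prev_fix, next_fix, subgoal_done,
                     success_bonus=1.0, saccade_cost=0.1):
    """Hypothetical low-level reward: a cost proportional to saccade
    length, plus a bonus when the current subgoal is accomplished."""
    reward = -saccade_cost * saccade_length(prev_fix, next_fix)
    if subgoal_done:
        reward += success_bonus
    return reward


def high_level_reward(answer_correct, num_subgoals,
                      task_bonus=1.0, step_cost=0.05):
    """Hypothetical high-level reward: a bonus for a correct answer,
    minus a per-subgoal penalty that favors brevity."""
    reward = -step_cost * num_subgoals
    if answer_correct:
        reward += task_bonus
    return reward
```

Under these assumed coefficients, a long saccade that does not complete the subgoal is penalized, while a short saccade that completes it is rewarded, which is the trade-off the text attributes to efficient, human-like exploration.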
2. Reinforcement Learning Formalism and Optimization
The hierarchical ChartSimRL system is formalized as a composition of two POMDPs:
- High-Level POMDP: States are (hidden) chart-task tuples; observations are memory summaries; actions are subgoals; transitions update memory with outputs from the low-level controller; rewards reflect task correctness and step penalties.
- Low-Level Subgoal POMDPs: States are latent; observations are rich multi-channel arrays; actions are discretized gaze shifts; rewards are shaped for timely subgoal achievement versus movement cost.
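The interaction between the two POMDPs can be sketched as a simple control loop. Everything below is a toy stand-in: `high_level_policy` replaces the LLM planner with a fixed subgoal order, `low_level_policy` replaces a trained oculomotor policy with a lookup, and the `chart` dictionary mapping subgoals to target locations is a hypothetical simplification of the real observation space.

```python
from dataclasses import dataclass, field

SUBGOALS = ["search_label", "locate_mark", "read_value", "answer"]


@dataclass
class Memory:
    """High-level observation: outcomes of the subgoals executed so far."""
    entries: list = field(default_factory=list)


def high_level_policy(memory):
    """Stand-in for the LLM planner: walks the subgoals in a fixed order."""
    return SUBGOALS[min(len(memory.entries), len(SUBGOALS) - 1)]


def low_level_policy(subgoal, chart):
    """Stand-in for a trained oculomotor policy: fixates the target
    location stored for this subgoal, if the chart provides one."""
    target = chart.get(subgoal)
    fixations = [target] if target is not None else []
    return fixations, {"subgoal": subgoal, "found": target is not None}


def simulate_scanpath(chart, max_steps=10):
    """Alternate between subgoal selection and fixation generation,
    writing each low-level outcome back into the high-level memory."""
    memory, scanpath = Memory(), []
    for _ in range(max_steps):
        subgoal = high_level_policy(memory)
        if subgoal == "answer":
            break
        fixations, outcome = low_level_policy(subgoal, chart)
        scanpath.extend(fixations)
        memory.entries.append(outcome)  # high-level transition
    return scanpath
```

For example, `simulate_scanpath({"search_label": (0.2, 0.9), "locate_mark": (0.2, 0.5), "read_value": (0.05, 0.5)})` visits the label, the mark, and the axis in turn before the planner selects "answer" and the episode ends.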
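Low-level agents are trained by Proximal Policy Optimization (PPO), with the objective (in standard notation):
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$
where $r_t$ encodes subgoal-specific immediate rewards. The PPO objective uses clipped importance sampling:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
with $\hat{A}_t$ as the advantage estimate and $\rho_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\text{old}}}(a_t \mid o_t)$ the policy-likelihood ratio.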
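As an illustration of the clipped surrogate, the following sketch evaluates the standard PPO objective on a batch of log-probabilities and advantage estimates. The function name and the default clipping range `epsilon=0.2` are assumptions (0.2 is a common default, not a value reported for ChartSimRL), and a practical implementation would compute this on tensors with automatic differentiation.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of (observation, action) samples.

    new_logps / old_logps: log-probabilities of the taken actions under
    the current and the data-collecting policy; advantages: estimates A_t.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # policy-likelihood ratio
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - epsilon, 1 + epsilon] in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage, the contribution is capped at 1.2 times the advantage; with a ratio of 0.5 and a negative advantage, the clipped term (0.8 times the advantage) is the smaller one and is used.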
3. Datasets and Training Procedures
Training and evaluation of ChartSimRL for scanpath simulation employ a combination of:
- Real-world charts: Over 200 manually annotated bar charts, with areas of interest (AOIs) marked for axis ticks, labels, and graphical marks.
- Synthetic charts: 500+ generated using the Vega-Lite grammar, facilitating controlled variation in design and attribute.
- Task Templates: For each chart, templates instantiate retrieval, filtering, and extreme-finding tasks, generating a diverse set of stepwise analytical challenges.
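Template-based task instantiation of this kind might look like the sketch below. The template wording, the `instantiate_tasks` helper, and the representation of a bar chart as a label-to-value dictionary are all hypothetical; the actual templates used for training are not given here.

```python
TEMPLATES = {
    "retrieve_value": "What is the value of {label}?",
    "filter": "Which categories have a value greater than {threshold}?",
    "find_extreme": "Which category has the {extreme} value?",
}


def instantiate_tasks(chart_data, threshold):
    """Build (question, ground-truth answer) pairs for one bar chart.

    chart_data maps each category label to its numeric value.
    """
    tasks = []
    # One retrieval task per bar.
    for label, value in chart_data.items():
        tasks.append((TEMPLATES["retrieve_value"].format(label=label), value))
    # One filtering task for the given threshold.
    passing = [label for label, value in chart_data.items() if value > threshold]
    tasks.append((TEMPLATES["filter"].format(threshold=threshold), passing))
    # Two extreme-finding tasks (highest and lowest bar).
    tasks.append((TEMPLATES["find_extreme"].format(extreme="highest"),
                  max(chart_data, key=chart_data.get)))
    tasks.append((TEMPLATES["find_extreme"].format(extreme="lowest"),
                  min(chart_data, key=chart_data.get)))
    return tasks
```

Because the ground-truth answer is derived from the chart data itself, each instantiated task can supply the task-success signal for the high-level reward without manual labeling.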
Low-level policies are trained episodically on chart/task pairs, with reward shaping that encourages early subgoal success. High-level LLM planners operate via few-shot prompting; no specialist scanpath data are required for RL training. Evaluation utilizes a 24-chart, 12-task held-out set with 183 human scanpaths acquired via high-resolution eye-tracking.
4. Empirical Evaluation and Behavioral Analysis
ChartSimRL is benchmarked against state-of-the-art baselines: leave-one-out human scanpaths, VQA-scanpath models, UMSS, and DeepGaze III. Three principal metrics are reported:
- Sequence Score on AOI sequences (normalized [0,1]): ChartSimRL achieves mean ≈ 0.41 (best 0.47), compared to human means of 0.49/0.64.
- Levenshtein (LEV) distance: ChartSimRL matches human performance more closely than baselines.
- Dynamic Time Warping (DTW) distance: ChartSimRL ranks second only to humans.
Aggregate statistics indicate that ChartSimRL's number of fixations (≈47/task) approximates human counts (≈90), and AOI hit ratios reflect correct task-related ordering. Qualitatively, scanpaths produced by ChartSimRL capture characteristic human zig-zag gaze patterns traversing labels, bars, and axes, which free-viewing models fail to emulate.
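The two distance metrics named above can be sketched in their textbook forms: Levenshtein distance over AOI label sequences and dynamic time warping over fixation coordinates. Any normalization or preprocessing applied in the reported evaluation is not specified here, so these functions should be read as generic definitions rather than the exact evaluation code.

```python
import math


def levenshtein(seq_a, seq_b):
    """Edit distance between two AOI sequences (insert/delete/substitute)."""
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dtw(path_a, path_b):
    """Dynamic time warping distance between two fixation paths,
    each a non-empty list of (x, y) points, using Euclidean point cost."""
    inf = float("inf")
    n, m = len(path_a), len(path_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Levenshtein distance compares the order in which AOIs are visited, whereas DTW compares spatial trajectories and tolerates differences in the number of fixations, which is why the two metrics can rank models differently.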
5. ChartSimRL for Chart-to-Code Generation
A distinct instantiation of ChartSimRL is proposed in the context of chart-to-code translation within the ChartMaster framework (Tan et al., 25 Aug 2025). Here, the RL objective is to maximize a chart-similarity reward that directly quantifies the visual and attribute alignment between generated charts and references. The key mechanism is as follows:
- Action: Candidate Python matplotlib code snippets sampled from a multimodal policy (Qwen2.5-VL-7B) conditioned on image-prompt pairs.
- Reward (R): $R = \lambda_{\text{attr}} R_{\text{attr}} + \lambda_{\text{vis}} R_{\text{vis}}$, with both components normalized to [0,1], and the weights $\lambda_{\text{attr}}$, $\lambda_{\text{vis}}$ tunable.
- $R_{\text{attr}}$: Jaccard similarity of semantic chart attributes, allowing small numeric mismatches within a tolerance.
- $R_{\text{vis}}$: Averaged (over 4 ResNet-18 layers) cosine similarity between feature maps of original and generated chart images.
- Optimization: Group-Relative Policy Optimization (GRPO), a PPO variant that normalizes advantages within groups of $G$ sampled candidates per prompt, with KL-penalized updates toward an SFT reference policy.
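The reward components and the group-relative normalization listed above can be sketched as follows. This is a simplified illustration under several assumptions: attributes are represented as plain string sets (the numeric tolerance is omitted), per-layer features are given as flat vectors rather than extracted from ResNet-18, the equal weights `w_attr` and `w_vis` are placeholders, and the use of the population standard deviation in the advantage normalization is one common choice rather than a documented detail.

```python
import math
from statistics import mean, pstdev


def attribute_similarity(pred_attrs, ref_attrs):
    """Jaccard similarity between predicted and reference attribute sets."""
    pred, ref = set(pred_attrs), set(ref_attrs)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def visual_similarity(pred_feats, ref_feats):
    """Cosine similarity averaged over per-layer feature vectors."""
    return mean(cosine_similarity(p, r) for p, r in zip(pred_feats, ref_feats))


def chart_reward(attr_sim, vis_sim, w_attr=0.5, w_vis=0.5):
    """Weighted sum of the two similarity components (weights assumed)."""
    return w_attr * attr_sim + w_vis * vis_sim


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a training loop, each sampled code candidate would be executed to render a chart, scored with `chart_reward`, and then assigned an advantage relative to the other candidates drawn for the same prompt. Candidates that score above the group mean receive positive advantages and are reinforced.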
A two-stage training pipeline is described: initial supervised fine-tuning on the ReChartPrompt-240K dataset, followed by ChartSimRL RL with controlled batch size and learning rate schedules.
6. Practical Impact and Limitations
ChartSimRL for scanpath simulation enables cost-effective, scalable prediction of task-driven visual attention, offering integration points for explainable AI, visualization optimization, and adaptive user modeling in AR/XR systems. In the chart-to-code domain, the ChartSimRL framework yields gains in execution rate and in semantic and visual fidelity: on ChartMimic, it improves on SFT + ReChartPrompt by +2.7% in execution rate and by +4.5/+4.0 in low-/high-level similarity (Tan et al., 25 Aug 2025).
However, scope limitations persist. In scanpath modeling, fixation durations and fine-grained spatial reasoning are not captured; LLM-based high-level policies may lack spatial disambiguation. For chart-to-code, optimality is contingent on the expressivity of attribute/visual rewards and the diversity within ReChartPrompt-240K. Real-world generalizability, especially for dense or exotic chart types, is recognized as an ongoing challenge.
7. Table: ChartSimRL's Two Domains
| Domain | Model Architecture | Reward Signal |
|---|---|---|
| Scanpath Simulation | Two-level (LLM + RL) | Task success, subgoal/ocular efficiency |
| Chart-to-Code Gen. | Single-level (MLLM + RL) | Attribute + visual similarity (CNN-based) |
While both employ RL, the scanpath variant uses hierarchical controllers and episodic subgoal composition, whereas the code generation variant directly optimizes for multimodal output fidelity by rewarding attribute and visual alignment.
8. Extensions and Research Directions
Potential extensions for ChartSimRL include:
- Incorporation into interactive visualization design tools for automated scanpath evaluation and chart layout optimization (Shi et al., 5 Feb 2025).
- Explainable chart QA by combining gaze reasoners with neural chart comprehension.
- Adaptation to advanced visualization types, denser graphical encodings, and peripheral-trend perception.
- Enhanced chart-to-code models leveraging richer multimodal feature extractors or dynamic attribute grammars.
A plausible implication is that continued integration of RL-optimized, multimodal similarity-based feedback in both domains will enhance the alignment between automated models and human or production-grade chart reproduction standards.