RL-Text2Vis: RL for Text2Vis Generation

Updated 9 January 2026
  • The paper introduces RL-Text2Vis as the first formal RL framework for generating concise textual answers and executable visualization code from tabular data.
  • It employs a multi-objective reward using GRPO to jointly optimize textual accuracy, code validity, and visualization quality without a critic network.
  • Evaluations reveal significant improvements with up to +10 percentage points in code execution and +22% enhancement in visualization clarity over inference-only methods.

RL-Text2Vis denotes a reinforcement learning (RL) framework for aligning LLM outputs in the Text2Vis task: generating concise answers and executable visualizations from natural language queries over tabular data. Unlike prior approaches using only supervised fine-tuning (SFT) or inference-time agentic refinement, RL-Text2Vis provides a fully end-to-end RL training solution. Leveraging Group Relative Policy Optimization (GRPO) and a multi-objective, post-execution reward that incorporates textual accuracy, code validity, and visualization quality, RL-Text2Vis achieves significant improvements in chart clarity and code executability in both in-domain and out-of-domain evaluations. This framework represents a departure from agentic or actor–critic analogies and constitutes the first formal RL approach for Text2Vis generation, with all pipeline and evaluation code publicly released (Rahman et al., 8 Jan 2026).

1. Formal Problem Definition and Policy Structure

RL-Text2Vis addresses the Text2Vis problem, which requires transforming a pair $(q, T)$, consisting of a natural language query $q$ and a tabular dataset $T$, into a composite output $y = (a, c)$: a short answer $a$ and visualization code $c$. The model’s behavior is formalized as sampling $y$ from a policy $\pi_\theta(y \mid x)$, where $x = (q, T) \in \mathcal{D}$. Framed as a Markov Decision Process (MDP):

  • State ($s_t$): Sequence of output tokens produced so far, comprising both the answer and code fields.
  • Action ($a_t$): Emission of the next output token $y_t$.
  • Transition: Deterministic, with $s_{t+1} = s_t \cup \{y_t\}$.
  • Episodic interaction: Generation concludes with a special end token or upon reaching the maximum sequence length.
  • Reward ($R(x, y)$): Sparse, observed only at the end of each episode and computed via post-execution feedback across multiple modalities.

The RL objective is to maximize expected episodic reward:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \bigl[\, R(x, y) \,\bigr].$$

This explicit RL formulation distinctively contrasts with inference-only, agentic, or pseudo–actor–critic methods (Rahman et al., 26 Jul 2025), which lack an MDP, learning episodes, or reward-driven parameter updates.
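
To make the episodic framing concrete, the sketch below shows a rollout in which tokens are emitted one at a time and the sparse reward is observed only once the output is complete. The policy sampler and terminal-reward function are hypothetical stubs for illustration, not the released pipeline.

```python
import random

# Hypothetical placeholders standing in for the actual LLM policy and judges.
EOS = "<eos>"
VOCAB = ["{", "}", '"answer"', '"code"', ":", "plt.show()", EOS]
MAX_LEN = 64

def sample_next_token(state):
    """Stub for pi_theta(y_t | x, y_<t): samples the next output token."""
    return random.choice(VOCAB)

def terminal_reward(x, y):
    """Stub for R(x, y): sparse, computed only after the episode ends."""
    return 1.0 if EOS in y else 0.0

def rollout(x):
    """One episode with deterministic transitions s_{t+1} = s_t + [y_t]."""
    state = []                              # s_0: empty output sequence
    while len(state) < MAX_LEN:
        token = sample_next_token(state)    # action a_t
        state.append(token)                 # deterministic transition
        if token == EOS:                    # episode ends on the special token
            break
    return state, terminal_reward(x, state)

if __name__ == "__main__":
    x = ("Which region had the highest sales?", "sales_table.csv")  # (q, T)
    y, r = rollout(x)
    print(len(y), r)
```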

2. Model Backbone and Architectural Components

RL-Text2Vis builds on the Qwen2.5-Instruct LLM family (7B and 14B parameters) without architectural modifications. The model is trained to produce JSON-formatted responses with "answer" and "code" fields:

  • Answer: A concise textual response grounded in the tabular data.
  • Code: Valid Python/Matplotlib script responsible for rendering the requested visualization, terminated with plt.show().
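
A hypothetical instance of this response format is shown below; the query, answer text, and chart values are illustrative rather than taken from the benchmark.

```python
import json

# Illustrative only: a response of the form the model is trained to emit,
# with an "answer" field and a "code" field ending in plt.show().
response = {
    "answer": "The North region had the highest total sales (1.2M).",
    "code": (
        "import matplotlib.pyplot as plt\n"
        "regions = ['North', 'South', 'East', 'West']\n"
        "sales = [1.2, 0.8, 0.9, 0.7]\n"
        "plt.bar(regions, sales)\n"
        "plt.ylabel('Sales (millions)')\n"
        "plt.title('Total sales by region')\n"
        "plt.show()\n"
    ),
}
print(json.dumps(response, indent=2))
```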

Multi-modal reward signals are generated via three external judge modules:

  • Textual judge: Assigns semantic correctness to the answer, using LLM-based alignment.
  • Code judge: Evaluates code via sandboxed execution with intent matching.
  • Vision judge: Scores chart readability and correctness via VLM (Qwen2.5-VL).

No value network or critic is learned by the model during RL training. All reward signals are consumed in a sample-efficient, episodic training paradigm.
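
As an illustration of what the code judge's execution check might look like, here is a minimal sketch assuming subprocess isolation with a non-interactive Matplotlib backend. It covers only the execution signal, not the LLM-based intent matching, and is not the paper's implementation.

```python
import os
import subprocess
import sys
import tempfile

def run_code_sandboxed(code: str, timeout: int = 30) -> bool:
    """Execute generated plotting code in a separate process and report success
    (the execution signal only; intent matching is not shown here).
    The non-interactive Agg backend is forced so plt.show() does not block."""
    harness = "import matplotlib\nmatplotlib.use('Agg')\n" + code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0   # did the script run without error?
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```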

3. Multi-Objective Reward Formulation

The post-execution reward $R(x, y)$ for candidate outputs is structured as a weighted sum of three components, computed only after the complete response is available:

  • Textual Correctness $R_{\text{text}} \in [0,1]$: LLM-based semantic match to the ground-truth answer.
  • Code Validity $R_{\text{code}} \in \{0, 0.5, 1\}$: Combination of code execution success ($I_{\text{exec}}$) and intent alignment ($I_{\text{intent}}$).
  • Visualization Quality $R_{\text{vis}} \in [0,1]$: Mean of the readability and correctness scores produced by the visual judge.

The full reward is:

$$R(x, y) = \alpha\, R_{\text{text}} + \beta\, R_{\text{code}} + \gamma\, R_{\text{vis}},$$

with $\alpha = 0.5$, $\beta = 0.25$, $\gamma = 0.25$. Candidates not passing the JSON format check receive $R = 0$. This composition ensures the model must optimize across all three axes to maximize reward.
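
A minimal sketch of this composition with the stated weights follows. How $I_{\text{exec}}$ and $I_{\text{intent}}$ combine into $R_{\text{code}}$ is one plausible reading of the $\{0, 0.5, 1\}$ scale, and the judge scores are assumed inputs supplied by the external judge modules.

```python
ALPHA, BETA, GAMMA = 0.5, 0.25, 0.25  # weights for text, code, and visualization

def compute_reward(r_text: float, i_exec: bool, i_intent: bool,
                   readability: float, correctness: float,
                   json_ok: bool) -> float:
    """Combine judge outputs into the scalar episodic reward R(x, y)."""
    if not json_ok:                  # malformed JSON responses receive zero reward
        return 0.0
    # Assumed combination consistent with the {0, 0.5, 1} range for R_code.
    r_code = 0.5 * float(i_exec) + 0.5 * float(i_intent)
    r_vis = (readability + correctness) / 2.0   # mean of two [0, 1] scores
    return ALPHA * r_text + BETA * r_code + GAMMA * r_vis

# Example: correct answer, executable code matching intent, decent chart quality.
print(compute_reward(1.0, True, True, 0.8, 0.9, json_ok=True))  # 0.9625
```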

4. Group Relative Policy Optimization: Objective and Implementation

The RL-Text2Vis framework adopts Group Relative Policy Optimization (GRPO), a critic-free policy gradient method that forgoes value function learning:

  1. Grouping: For each input, sample $G$ output candidates from $\pi_\theta$ and compute rewards for all of them.
  2. Ranking and Normalization: Compute the groupwise mean $\bar{r}$ and standard deviation $\sigma_r$; derive relative advantages $\hat{A}_i = (r_i - \bar{r})/\sigma_r$.
  3. Surrogate Loss Accumulation: For each completion $y_i$, compute a PPO-style clipped surrogate objective over all tokens, using a KL penalty to constrain drift from a reference (SFT-initialized) policy:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \min\Bigl( i_t(\theta)\,\hat{A}_i,\; \operatorname{clip}\bigl(i_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\,\hat{A}_i \Bigr) \right] - \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr),$$

where $i_t(\theta)$ is the per-token policy ratio, $\varepsilon = 0.1$ is the clipping threshold, and $\beta = 0.04$ controls the strength of the KL regularization.

GRPO leverages relative comparisons within each sampled group to drive improvement while reducing variance and avoiding the instability associated with value function approximation.
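
The objective above can be sketched in PyTorch as follows. Per-token log-probabilities, rewards, and padding masks are assumed to be provided by the surrounding training loop, and the KL term uses a simple log-ratio estimate; this mirrors the stated objective rather than any released code.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, mask,
              eps: float = 0.1, kl_coef: float = 0.04) -> torch.Tensor:
    """logp, logp_old, logp_ref: [G, T] per-token log-probs for G completions
    (logp_old and logp_ref are computed without gradients).
    rewards: [G] episodic rewards; mask: [G, T] with 1 for real tokens, 0 for padding."""
    # Group-relative advantages: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    # PPO-style clipped surrogate on the per-token policy ratio i_t(theta).
    ratio = torch.exp(logp - logp_old)                          # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(unclipped, clipped)

    # Length-normalized, group-averaged policy term.
    tokens_per_seq = mask.sum(dim=1).clamp(min=1)
    policy_term = ((surrogate * mask).sum(dim=1) / tokens_per_seq).mean()

    # Simple log-ratio KL estimate toward the SFT-initialized reference policy.
    kl = ((logp - logp_ref) * mask).sum(dim=1) / tokens_per_seq

    # Negate so that minimizing the loss maximizes the GRPO objective.
    return -(policy_term - kl_coef * kl.mean())
```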

5. Training Protocol and Hyperparameters

Training on the Text2Vis benchmark begins with SFT-converged Qwen2.5 model weights, without mixing SFT and RL objectives. Notable hyperparameters and training choices include:

  • Group size: $G = 8$ completions per prompt.
  • Effective batch: Gradient accumulation across 8 batches, yielding 384 groups per update.
  • Optimization: AdamW with learning rate $1 \times 10^{-5}$, weight decay $0.1$, PPO clip $0.1$, gradient norm clip $0.1$.
  • Precision: bf16 mixed precision, gradient checkpointing to manage memory.
  • Hardware and schedule: 7B model on 4×A100-80GB GPUs for ≈25 hours; 14B model on 6×H100-80GB for ≈50 hours, performing 2 epochs over 1,749 samples.

No reward model or critic network is trained; all feedback for the RL objective is derived from the aforementioned external judge modules.
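
For reference, the reported hyperparameters can be collected into a configuration sketch; the field names are illustrative rather than taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class GRPOTrainConfig:
    # Values as reported above; field names are hypothetical.
    group_size: int = 8                  # completions sampled per prompt
    gradient_accumulation_steps: int = 8
    learning_rate: float = 1e-5          # AdamW
    weight_decay: float = 0.1
    ppo_clip: float = 0.1                # epsilon in the clipped surrogate
    max_grad_norm: float = 0.1
    kl_coef: float = 0.04                # KL penalty toward the SFT reference
    num_epochs: int = 2
    num_train_samples: int = 1749
    bf16: bool = True
    gradient_checkpointing: bool = True

print(GRPOTrainConfig())
```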

6. Empirical Evaluation and Comparative Performance

RL-Text2Vis is evaluated in both in-domain (Text2Vis test2) and out-of-domain settings (VIS-Eval, NVBench), with the following summary of results:

| Model | Code Exec | Answer Match | Readability | Chart Correctness | Pass Rate |
|---|---|---|---|---|---|
| GPT-4o (zero-shot) | 87% | 39% | 3.32 | 3.30 | 30% |
| RL-Text2Vis-14B | 97% | 35% | 4.10 | 4.03 | 29% |

Relative to GPT-4o, RL-Text2Vis-14B significantly increases code execution success (+10 percentage points) and chart quality (relative +22%). Textual answer alignment remains similar. For out-of-domain datasets:

  • VIS-Eval (Qwen2.5-7B): Code exec increases from 57% (zero-shot) to 72%; readability from 1.50 to 2.50; correctness from 0.69 to 1.37.
  • NVBench (Qwen2.5-7B): Code exec from 75% to 93%; readability from 2.64 to 3.47; correctness from 2.34 to 3.28.

These results indicate robust generalization of RL-Text2Vis to unseen datasets, schemas, and query types, with gains concentrated in executable and readable visualizations.

7. Distinctions from Inference-Time Agentic and Actor–Critic Methods

A crucial distinction exists between RL-Text2Vis and the “agentic” actor–critic loop described in (Rahman et al., 26 Jul 2025). The latter operates solely at inference time, with an actor (LLM) generating initial outputs, a critic module providing feedback, and the actor refining its outputs in a single pass. It lacks the core RL machinery: there is no policy training, no definition of MDP elements such as states, actions, transitions, or rewards, and no policy optimization objective. All model “learning” derives from offline pretraining or SFT.

In contrast, RL-Text2Vis employs genuine RL optimization, explicitly framing Text2Vis as an MDP and carrying out reward-based policy learning over episodes, with all improvements arising from online policy adaptation instead of inference-only refinement.

8. Implications, Limitations, and Prospects

The RL-Text2Vis methodology demonstrates that post-execution, multi-objective RL is an effective strategy for aligning LLMs and vision-LLMs in complex, multimodal reasoning tasks. A plausible implication is that reward-driven optimization substantially closes the gap between open-source and closed-source systems for Text2Vis, especially in generating high-quality, executable visualizations. However, the quality of the reward signal is contingent on the robustness and calibration of the judge models, and RL training requires considerably greater computational resources than SFT or inference-time agentic loops. Future directions include dynamic planning, tool use, and online reward model learning for finer-grained multimodal control (Rahman et al., 8 Jan 2026, Rahman et al., 26 Jul 2025).
