
XtraGPT: Dual AI Systems

Updated 4 December 2025
  • XtraGPT names two unrelated systems: a suite of open-source LLMs for academic paper revision and an Excel add-in for automated formula generation.
  • The revision models use a controllable, section-level approach with 4-tuple inputs and robust quality protocols to improve paper quality.
  • The Excel add-in leverages ChatGPT to build, explain, and test formulas, enforcing human-in-the-loop verification for practical use.

XtraGPT denotes two unrelated systems introduced under the same name in distinct domains: (1) a suite of open-source LLMs specialized for controllable academic paper revision (Chen et al., 16 May 2025), and (2) an open-source Excel add-in that leverages ChatGPT for automated formula generation and verification (O'Beirne, 2023). Both are designed to augment human expertise by integrating LLM-based reasoning into complex, practice-driven workflows.

1. Human–AI Collaboration for Academic Paper Revision

XtraGPT (Chen et al., 16 May 2025) addresses the substantial limitations of general-purpose LLMs in high-fidelity scientific writing. Standard LLMs exhibit significant deficiencies, including a tendency to focus on surface-level linguistic improvements while neglecting logical structure, conceptual coherence, and cross-sectional argumentation. They also lack mechanisms for managing the multi-round, iterative feedback processes central to academic drafting, with each prompt treated in isolation and no memory of prior edits.

To overcome these constraints, XtraGPT implements a controllable, section-level paper revision paradigm. The system formulates each revision task as a 4-tuple $(q, T, p, \hat p)$, where $q$ is a high-level user instruction, $T$ the full paper context, $p$ the target paragraph, and $\hat p$ the revised paragraph. Revision control is operationalized via a canonical set of 20 section-level criteria $C$, covering all major paper sections (title, abstract, introduction, background, evaluation, conclusion) and encompassing priorities such as “Strength and Clarity of Motivation” and “Experimental Setup Clarity and Reproducibility.”
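For concreteness, the task structure can be sketched as below; the field names and criterion mapping are illustrative, not taken from the released code:

```python
from dataclasses import dataclass

# Illustrative slice of the 20 section-level criteria; the two strings
# quoted above are from the paper, but this mapping is an assumption.
CRITERIA = {
    "introduction": ["Strength and Clarity of Motivation"],
    "evaluation": ["Experimental Setup Clarity and Reproducibility"],
}

@dataclass
class RevisionTask:
    q: str            # high-level user instruction
    T: str            # full paper context
    p: str            # target paragraph to revise
    p_hat: str = ""   # revised paragraph (supervision target / model output)
```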

2. Dataset Construction and Quality Protocols

The XtraQA corpus comprises 7,040 ICLR 2024 submissions (filtered to 6,994 after deduplication and length capping), yielding 140,800 instruction–revision pairs. For each instance, a section, criterion, and paragraph are selected, an edit instruction $q$ is generated, and the paragraph is revised by GPT-4o-Mini to produce $\hat p$. A held-out test set (5% of papers; 7,000 pairs) supports benchmarking.
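A schematic of the pair-construction loop, reusing the RevisionTask sketch above; the selection logic is simplified and `llm_revise` is a stand-in for the GPT-4o-Mini call (20 pairs per paper is consistent with 140,800 pairs from 7,040 papers):

```python
import random

def build_pairs(papers, criteria, llm_revise, n_per_paper=20):
    """XtraQA-style pair construction: pick a section and criterion,
    sample a paragraph, form an edit instruction q, then call the
    reviser (GPT-4o-Mini in the paper) to produce p_hat."""
    pairs = []
    for T in papers:
        paragraphs = [s for s in T.split("\n\n") if s.strip()]
        for _ in range(n_per_paper):
            section = random.choice(list(criteria))
            c = random.choice(criteria[section])
            p = random.choice(paragraphs)
            q = f"Revise this paragraph to improve: {c}."
            pairs.append(RevisionTask(q=q, T=T, p=p, p_hat=llm_revise(q, T, p)))
    return pairs
```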

Quality control involved human annotation by three PhD-level evaluators across four axes: Instruction Following, Criteria Alignment, In-Context Reference, and Revision Acceptance (1–5 scale). Aggregated scores for GPT-4o-Mini edits were all $\geq 3.0$, indicating sufficient reliability for downstream model development.

3. Model Architecture, Objectives, and Inference

The XtraGPT model family consists of decoder-only transformers adapted from the following open-source LLMs:

| Model | Backbone | Parameter Count |
|---|---|---|
| XtraGPT-1.5B | Qwen-2.5-1.5B-Instruct | 1.5B |
| XtraGPT-3.8B | phi3.5-3.8B | 3.8B |
| XtraGPT-7B | Qwen-2.5-7B-Instruct | 7B |
| XtraGPT-14B | phi4-14B | 14B |

All utilize a single-stream [q; T; p] input, with standard transformer block designs (rotary embeddings, multi-head attention) and up to 16K token contexts. There is no encoder; context and instruction are directly prepended.
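A minimal sketch of that single-stream assembly (the delimiters are assumptions; the released chat template may differ):

```python
def build_input(task: RevisionTask) -> str:
    # Decoder-only, single stream: instruction, paper context, and target
    # paragraph are simply concatenated ahead of the generation slot.
    return (
        f"Instruction:\n{task.q}\n\n"
        f"Paper context:\n{task.T}\n\n"
        f"Paragraph to revise:\n{task.p}\n\n"
        f"Revised paragraph:\n"
    )
```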

Post-training, termed Controllable Post-Training (CPT), employs a standard maximum-likelihood objective over $(q, T, p, \hat p)$ demonstration pairs:

$$\mathcal{L}_{\mathrm{CPT}}(\theta) = -\,\mathbb{E}_{(q, T, p, \hat p) \sim \mathcal{D}_{\mathrm{CPT}}}\left[\log P_\theta(\hat p \mid q, T, p)\right]$$

Although $q$ and $\hat p$ are implicitly tied to a criterion $c \in C$, no auxiliary losses (e.g., explicit coherence regularization) are used.
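In implementation terms this is ordinary causal-LM cross-entropy with the loss masked to the revised-paragraph tokens; a sketch, assuming a Hugging Face-style model and tokenizer interface:

```python
import torch
import torch.nn.functional as F

def cpt_loss(model, tokenizer, task):
    """-log P_theta(p_hat | q, T, p), scored only on the target tokens."""
    prompt_ids = tokenizer(build_input(task), return_tensors="pt").input_ids
    target_ids = tokenizer(task.p_hat, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the conditioning span
    logits = model(input_ids).logits
    # shift so that position t predicts token t+1
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```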

Inference supports both criterion-guided and free-form revision instructions, with system-level constraints enforcing scope (revise only the user-selected paragraph) and a maximum output length of 512 tokens. Iterative, user-driven submission cycles allow multi-round revision, but each invocation is contextually isolated (no persistent state).
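A sketch of a single stateless inference call under those constraints (again assuming a Hugging Face-style interface; the decoding settings are assumptions):

```python
def revise(model, tokenizer, task):
    """One revision round: rewrite only the selected paragraph,
    capped at 512 new tokens; no state persists across calls."""
    inputs = tokenizer(build_input(task), return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=512,  # system-level output-length limit
        do_sample=False,     # assumption: greedy decoding
    )
    new_tokens = out[0, inputs.input_ids.shape[1]:]  # drop the prompt
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```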

4. Evaluation Strategies and Benchmarks

Automated evaluation relies on the Length-Controlled Win Rate (LC-Win Rate), using a modified LLM-as-judge protocol (alpaca_eval_gpt4_turbo_fn) that corrects for response-verbosity bias. The win rate is formalized as:

$$q_{\theta, \phi, \psi}(m \succ M) = \operatorname{logistic}\left[(\theta_m - \theta_M) + \phi_{M, b}\,\tanh\left(\frac{\operatorname{len}(z_m) - \operatorname{len}(z_M)}{\operatorname{std}(\ldots)}\right)\right]$$

$$\operatorname{winrate}^{\mathrm{LC}}(m, M) = 100 \cdot \mathbb{E}_x\left[q_{\theta, \phi, \psi}(m \succ M \mid x)\right]$$
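Numerically, the per-instance term is a logistic of an ability gap plus a length-bias correction; a sketch with placeholder parameters (the fitted $\theta$, $\phi$, and length statistics come from the AlpacaEval-style regression):

```python
import math

def lc_win_prob(theta_m, theta_M, phi_Mb, len_m, len_M, len_std):
    """Length-controlled probability that model m beats baseline M."""
    length_term = phi_Mb * math.tanh((len_m - len_M) / len_std)
    return 1.0 / (1.0 + math.exp(-((theta_m - theta_M) + length_term)))

def lc_win_rate(per_instance_probs):
    """100 x mean per-instance win probability over the evaluation set."""
    return 100.0 * sum(per_instance_probs) / len(per_instance_probs)
```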

Comparisons encompass GPT-4o-Mini, GPT-3.5-Turbo, Qwen-2.5 models at matched scales, Phi-4, DeepSeek-v3, and Llama-3.

Section-wise LC-Win Rates demonstrate that XtraGPT-7B outperforms all same-scale baselines and surpasses GPT-4o-Mini in the Abstract, Evaluation, and Conclusion sections.

Human evaluation on a held-out set (300 paragraphs) via pairwise expert rating yields mean scores of approximately 3.7/5 for Instruction Following and Criteria Alignment, 3.4/5 for In-Context Reference, and 3.2/5 for Revision Acceptance, all above the neutral threshold. Case studies further illustrate XtraGPT’s strength in targeted, context-consistent edits.

5. Quantitative Results and Impact

XtraGPT-7B achieves a 55.5% overall LC-Win Rate, compared to GPT-4o-Mini’s 51.8% and Qwen-2.5-7B’s 40.8%. At the 14B-parameter scale, XtraGPT maintains this lead, with similar advantages over proprietary systems. For each backbone scale (1.5B to 14B), XtraGPT outperforms its base model by 10–30 points on LC-Win Rate.

“AI-Scientist” full-paper quality reassessment on 54 real ICLR submissions confirms that post-revision drafts exhibit marked improvements: Contribution (+0.23/4, +7.9%), Presentation (+0.28/4, +12.5%), Soundness (+0.19/4, +6.4%), and Overall rating (+0.65/10, +10.8%), all with $p < 0.01$.

6. Limitations and Prospects

The primary limitations of XtraGPT (Chen et al., 16 May 2025) are domain specificity (exclusive training on AI/ML papers), generator bias due to the use of GPT-4o-Mini for revision pair synthesis, and contextual isolation—no persistent revision state or full-paper consistency enforcement. Evaluation protocols depend on LLM-based judges and AI-Scientist scoring, which remain only partial proxies for actual peer review.

Future work targets multi-round dialogue interfaces with explicit change tracking, coverage expansion to further scientific domains, introduction of auxiliary objectives (e.g., global coherence, citation consistency), and improved automated metrics for holistic paper quality.

7. XtraGPT for Excel: Formula Generation Add-in

A separate instantiation of “XtraGPT” is documented as an open-source Excel extension enabling users to prompt ChatGPT for formula construction, explanation, and test-case suggestion directly within the worksheet environment (O'Beirne, 2023). The architecture integrates a ribbon button, an Office.js-based taskpane interface, and interaction with the OpenAI API. The add-in leverages system and user prompt engineering to elicit precise, testable responses, automatically injects formulas and explanations, and catches and logs all errors.

Workflow best practices explicitly adopt O’Beirne’s “trust but verify” paradigm: every ChatGPT-offered formula is unit-tested, checked for edge cases, and logged for reproducibility. The tool targets rapid prototyping but enforces systematic error-handling and human-in-the-loop verification, particularly for multi-criteria or nontrivial calculations.
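The add-in itself is written against Office.js; purely to illustrate the prompt pattern and the log-every-failure discipline, here is a Python sketch of the request step (the system prompt and model name are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative system prompt; the add-in's actual prompt differs.
SYSTEM_PROMPT = (
    "You are an Excel formula assistant. Return a single formula, "
    "a brief explanation, and suggested test cases."
)

def request_formula(task_description: str) -> str:
    """Fetch a candidate formula; the user still unit-tests it on
    edge cases before relying on it ("trust but verify")."""
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any chat model works here
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task_description},
            ],
        )
        return resp.choices[0].message.content
    except Exception as exc:
        print(f"formula request failed: {exc}")  # log, then surface
        raise
```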

Deployment requires Office 365 with support for sideloaded add-ins and Node.js for local hosting. The system is distributable via manifest.xml and can be centrally managed through administrative catalogs. Maintenance recommendations emphasize version-pinning of system prompts and thorough testing in controlled environments before production use.
