LeetCode-Style Q Benchmark
- LeetCode-Style Q Benchmark is a systematic framework that adapts traditional coding challenges to assess model performance in the Q programming language used in quantitative finance.
- It is paired with a multi-stage LLM adaptation pipeline comprising domain-adaptive pretraining, supervised fine-tuning, and reinforcement learning, with models scored against canonical test cases using the standard pass@k metric.
- Empirical results show that the adapted Qwen-2.5 models, down to the 1.5B-parameter variant, outperform GPT-4.1 on the Q benchmark, highlighting the framework’s effectiveness for domain-specific language modeling.
A LeetCode-Style Q Benchmark refers to a systematic evaluation framework modeled after LeetCode’s coding challenges, repurposed to assess model performance on tasks formulated in the Q programming language. This approach originated from the need to benchmark LLMs in underrepresented, domain-specific languages such as Q, which is widely used in quantitative finance but seldom appears in mainstream code corpora. The methodology for a LeetCode-Style Q Benchmark was formalized in recent work that introduces a fully open-source pipeline for dataset generation, model adaptation, and rigorous evaluation (Hogan et al., 9 Aug 2025).
1. Dataset Construction and Task Design
The benchmark dataset is bootstrapped from curated LeetCode problems initially designed in Python. Each problem entry includes (i) a descriptive prompt, (ii) a reference canonical Python solution, (iii) canonical test cases with reference outputs, and (iv) example usages. To create Q-language analogs:
- Solutions are programmatically translated from Python to Q using a strong instruction-tuned LLM.
- Test harnesses are independently generated from Python test cases, isolating solution and evaluation logic to prevent reward hacking.
- Rejection sampling and automated verification ensure that only Q solutions passing all canonical tests are accepted (a minimal verification sketch follows this list).
- Problem descriptions are embedded, and semantically similar previously solved problems are retrieved as additional context to encourage idiomatic Q code generation.
- Iterative rounds of bootstrapping, fine-tuning, and manual review expand the dataset, culminating in a validated collection (678 total: 542 train/136 test).
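The verification step of the rejection-sampling loop can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes a local `q` interpreter on the PATH, and the helper names (`run_q_script`, `passes_all_tests`, `rejection_sample`) and the exact execution/sandboxing details are hypothetical.

```python
import subprocess
import tempfile
from typing import List, Optional

def run_q_script(q_source: str, timeout: int = 30) -> str:
    """Execute Q source with a local `q` interpreter and return stdout.
    Simplified stand-in for a sandboxed execution environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".q", delete=False) as f:
        f.write(q_source + "\nexit 0\n")   # terminate the interpreter when done
        path = f.name
    result = subprocess.run(["q", path], capture_output=True, text=True,
                            timeout=timeout)
    return result.stdout.strip()

def passes_all_tests(candidate_q: str, harness_q: str,
                     reference_outputs: List[str]) -> bool:
    """Accept a candidate Q translation only if the independently generated
    test harness reproduces every canonical (Python-derived) output."""
    try:
        actual = run_q_script(candidate_q + "\n" + harness_q).splitlines()
    except (subprocess.TimeoutExpired, OSError):
        return False
    return actual == reference_outputs

def rejection_sample(candidates: List[str], harness_q: str,
                     reference_outputs: List[str]) -> Optional[str]:
    """Keep the first verified sample; reject the problem if none pass."""
    for q_code in candidates:
        if passes_all_tests(q_code, harness_q, reference_outputs):
            return q_code   # verified solution enters the next training round
    return None
```

Keeping the harness and the candidate solution as separate artifacts, as above, is what prevents a model from "solving" a problem by emitting its own test logic.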
2. Model Adaptation Pipeline
Effective benchmarking in Q necessitates multi-stage LLM adaptation:
- Domain-Adaptive Pretraining: Collected Q code (from GitHub repositories and Kdb+ documentation) is filtered for code quality, chunked into 4096-token blocks, and used for next-token prediction training, with early stopping tuned for each model size.
- Supervised Fine-Tuning (SFT): Models are fine-tuned on the curated LeetCode-Q dataset, covering problem description–to–Q, Python–to–Q, Q–to–Python, and test harness translation. Ablations on learning rates, training schedules, and adapter methods (LoRA vs. full-model) inform hyperparameter selection.
- Reinforcement Learning (RL): Final model alignment employs Group Relative Policy Optimization (GRPO) using programmatic rewards (fraction of passed test cases, with perfect solution bonuses). RL experiments further test reasoning-prompting, temperature sampling, and variant reward functions.
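The programmatic reward and group-relative scoring used in the RL stage can be sketched as follows. The perfect-solution bonus magnitude and the normalization details are illustrative assumptions rather than the paper's exact choices.

```python
from typing import List

def programmatic_reward(test_results: List[bool], perfect_bonus: float = 0.5) -> float:
    """Reward for one sampled Q completion: fraction of canonical test cases
    passed, plus a bonus when every test passes. The bonus value here is an
    illustrative assumption."""
    if not test_results:
        return 0.0
    frac = sum(test_results) / len(test_results)
    return frac + (perfect_bonus if all(test_results) else 0.0)

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO scores each completion relative to the group sampled for the same
    prompt: subtract the group mean reward and normalize by its std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0          # all completions tied
    return [(r - mean) / std for r in rewards]

# Example: four completions for one problem, each graded on 10 test cases.
rewards = [programmatic_reward([True] * k + [False] * (10 - k)) for k in (10, 7, 3, 0)]
advantages = group_relative_advantages(rewards)
```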
3. Evaluation Metrics and Methodology
Performance is measured using the standard pass@k accuracy, defined as

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],$$

with $n$ indicating the number of sampled completions per problem and $c$ the count of correct solutions among them. Evaluation is parallelized (e.g., 100 workers) for high throughput (up to 136 problems × 40 completions in 12 minutes). Both direct code generation and reasoning+code generation are scored.
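The per-problem estimator can be computed directly from the sample counts; the passing-sample count in the worked example below is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn without replacement from n samples is correct,
    given that c of the n samples passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Worked example matching the benchmark's 40 completions per problem:
print(pass_at_k(40, 10, 1))    # 0.25
print(pass_at_k(40, 10, 10))   # ≈ 0.965
```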
The test split contains only unseen problems—guaranteed by partitioning after dataset freeze and manual curation—to ensure statistical validity.
4. Empirical Results
Table: Summary of model performance on the LeetCode-Q benchmark (paraphrased from Table 3 of Hogan et al., 9 Aug 2025)
| Model | Pass@1 (%) | Improvement vs. Claude Opus-4 (pp) | Outperforms GPT-4.1 |
|---|---|---|---|
| Qwen-2.5-32B-Reasoning+RL | 59 | +29.5 | Yes |
| Claude Opus-4 | 29.5 | Baseline | No |
| GPT-4.1 | < Qwen-1.5B | N/A | No |
All Qwen-2.5 family models, including the smallest 1.5B-parameter model, outperform GPT-4.1 on this Q-specific benchmark. The strict separation of solution and test-harness generation, together with model-in-the-loop dataset growth, reduces reward hacking and spurious overfitting to test cases.
5. Technical Advancements and Methodological Innovations
Prominent advancements introduced include:
- Strict separation between solution and test harness generation, mitigating overfitting via test case leakage.
- Retrieval augmentation to enhance generation of idiomatic Q, using semantic-similarity embeddings of problem descriptions (see the retrieval sketch after this list).
- Automated rejection sampling to enforce validity, executing generated Q in interpreter environments and comparing outputs to canonical Python solutions.
- Iterative bootstrapping—each model improvement cycle directly feeds new verified Q samples into the next fine-tuning phase.
- End-to-end stack: The pipeline encompasses domain-adaptive pretraining, supervised fine-tuning, and RL-based alignment, with continuous evaluation to identify overfitting (e.g., using loss curves).
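A sketch of the retrieval-augmentation step under stated assumptions: description embeddings are precomputed by an unspecified sentence encoder, similarity is cosine, and the function name `retrieve_similar` is hypothetical.

```python
import numpy as np

def retrieve_similar(query_vec: np.ndarray, solved_vecs: np.ndarray,
                     solved_q_solutions: list[str], top_k: int = 3) -> list[str]:
    """Return the Q solutions of the top-k previously solved problems whose
    description embeddings are most similar (cosine) to the query problem.
    The retrieved solutions are prepended to the generation prompt so the
    model sees examples of idiomatic Q before attempting the new problem."""
    sims = solved_vecs @ query_vec / (
        np.linalg.norm(solved_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    top = np.argsort(-sims)[:top_k]
    return [solved_q_solutions[i] for i in top]
```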
6. Broader Applicability and Limitations
While this Q benchmark is domain-specific, the methodology generalizes to any niche or low-resource language with available canonical reference code and verifiable test harnesses. Adaptation to domains where evaluation depends on soft or subjective signals is plausible via RL with engineered reward signals. The strict separation of solution and evaluation artifacts, together with iterative data-centric bootstrapping, forms a blueprint for future benchmarks in other specialized settings.
This suggests that targeted domain adaptation—backed by dataset curation, controlled verification, and iterative fine-tuning—can produce LLMs that substantially outperform general-purpose models in low-resource, non-mainstream programming contexts. However, applicability is currently constrained to tasks with deterministic, automatable ground-truth evaluation, limiting immediate extension to subjectively graded domains.
7. Resource Availability and Community Impact
The referenced paper provides all models, code, data, and a detailed methodological blueprint as open-source resources (Hogan et al., 9 Aug 2025). This transparency facilitates adoption, extension, and replication within both academic and industry contexts seeking rigorous Q-language benchmarking or analogous settings requiring customized, contamination-free code evaluation. The benchmark serves as a canonical testbed for Q software agents, driving advances in LLM adaptation and evaluation within quantitative finance and beyond.