Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges (2509.15283v1)

Published 18 Sep 2025 in cs.SE, cs.AI, cs.LG, and cs.PL

Abstract: This study examines the performance of today's open-source, locally hosted large language models (LLMs) in handling complex competitive programming tasks with extended problem descriptions and contexts. Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline through the Ollama runtime, collapsing FACE's sprawling per-problem directory tree into a handful of consolidated JSON files, and adding robust checkpointing so multi-day runs can resume after failures. The enhanced framework generates, submits, and records solutions for the full Kattis corpus of 3,589 problems across eight code-oriented models ranging from 6.7 to 9 billion parameters. The submission results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models, Gemini 1.5 and ChatGPT-4. These findings expose a persistent gap between private, cost-controlled LLM deployments and state-of-the-art proprietary services, yet also highlight the rapid progress of open models and the practical benefits of an evaluation workflow that organizations can replicate on in-house hardware.

Summary

  • The paper evaluates local LLMs on 3,589 Kattis problems, identifying significant performance gaps compared to proprietary cloud models.
  • It extends the FACE framework with JSON-based data handling and robust checkpointing, reducing file overhead by 99.9% and ensuring fault tolerance.
  • Analysis reveals that local LLMs trade generation speed for accuracy, with the top models achieving 5.7% and 5.4% acceptance rates versus roughly 10.7-10.9% for their cloud counterparts.

Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges

Introduction

This paper presents a comprehensive evaluation of open-source, locally hosted LLMs on the Kattis competitive programming corpus, comprising 3,589 problems of varying complexity. The authors extend the FACE (Framework for AI-driven Code Generation Evaluation) pipeline to operate entirely offline via the Ollama runtime, introducing significant architectural and operational improvements. The primary objective is to benchmark the capabilities of local LLMs in code generation tasks, contrasting their performance with state-of-the-art proprietary models such as Gemini 1.5 and ChatGPT-4. The analysis focuses on solution correctness, generation efficiency, and failure modes, providing insights into the practical trade-offs of local versus cloud-based LLM deployments.

FACE Framework Extension and System Design

The original FACE architecture automates the end-to-end process of problem collection, solution generation, and submission for evaluation. The authors retrofit FACE to support local inference, replacing the per-problem directory tree with a consolidated JSON-based data organization. This change reduces file system overhead by 99.9%, streamlining maintainability and scalability for large-scale experiments (Figure 1).

Figure 1: Original FACE architecture, with numbers indicating each step in the process.

The integration of locally hosted models is achieved via a standardized interface, enabling the generator component to process JSON payloads containing problem statements and test cases. The submission process is enhanced with atomic file operations and robust checkpointing, ensuring fault tolerance and recoverability during multi-day runs. These modifications facilitate the evaluation of thousands of problems across multiple models without incurring the cost, latency, or privacy risks associated with cloud APIs.
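
As a concrete illustration of the atomic-write and checkpointing scheme described above, here is a minimal Python sketch. The file name, record fields, and helper functions are assumptions for illustration, not the authors' actual implementation.

```python
import json
import os
import tempfile

RESULTS_PATH = "results_qwen2.5-coder.json"  # hypothetical consolidated results file


def atomic_write_json(path, data):
    """Write JSON to a temp file, flush and fsync it, then rename over the target.

    The rename is atomic on POSIX filesystems, so a crash mid-run leaves either
    the old checkpoint or the new one on disk, never a truncated file.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename commits the checkpoint
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Resume from the last committed results file, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}  # keyed by problem ID, e.g. {"hello": {"verdict": "Accepted", ...}}


# Usage: skip problems already recorded, commit after each new result.
results = load_checkpoint(RESULTS_PATH)
for problem_id, statement in [("hello", "..."), ("different", "...")]:
    if problem_id in results:
        continue  # already processed in a previous run
    # ... generate a solution and submit it to Kattis here ...
    results[problem_id] = {"verdict": "Wrong Answer", "generation_time_s": 12.3}
    atomic_write_json(RESULTS_PATH, results)
```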

Experimental Setup and Model Selection

Eight local LLMs, ranging from 6.7B to 9B parameters, are selected based on their availability on the Ollama platform and their ability to run on commodity hardware (8GB VRAM). The models include CodeLlama, CodeQwen, DeepSeek-Coder, DolphinCoder, Granite-Code, Llama 3.1, Qwen2.5-Coder, and Yi-Coder. Their pre-training regimes vary from code-centric datasets to multilingual corpora, with context window sizes spanning 16K to 128K tokens.

The experiments are conducted on a dual NVIDIA L4 GPU server, with solution generation and submission spanning over three weeks. The FACE pipeline orchestrates model loading, solution generation, and Kattis submission, recording detailed outcome statistics for each problem-model pair.
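
For illustration, the sketch below shows how the generator component might query a locally hosted model through Ollama's REST endpoint (`/api/generate` on the default port 11434). The prompt wording and model tag are assumptions; the paper's exact prompts are not reproduced here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def generate_solution(model, problem_statement):
    """Ask a locally hosted model for a Python solution to one Kattis problem."""
    prompt = (  # illustrative prompt, not the authors' exact template
        "Write a complete Python 3 program that reads from standard input and "
        "writes to standard output.\n\n" + problem_statement
    )
    payload = json.dumps({
        "model": model,    # e.g. "qwen2.5-coder:7b" as published on the Ollama registry
        "prompt": prompt,
        "stream": False,   # return the full completion as a single JSON response
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["response"]  # generated code, possibly wrapped in markdown fences


# Example usage (assumes `ollama serve` is running and the model has been pulled):
# code = generate_solution("qwen2.5-coder:7b", "Read two integers and print their sum.")
```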

Performance Analysis: Generation Time and Efficiency

Generation time is a critical metric for practical deployment. The paper reports mean, median, and standard deviation of response times for each model, with outlier exclusion based on the 1.5×IQR rule. Notably, Granite-Code and Llama 3.1 exhibit the lowest median generation times, while Qwen2.5-Coder and Yi-Coder, despite slower generation, achieve the highest acceptance rates (Figures 2 and 3).

Figure 2: Histograms of generation times (log-scale frequency) for each model.

Figure 3: CDF comparison of generation time across models.

The CDF analysis reveals that models with higher acceptance rates (Qwen2.5-Coder, Yi-Coder) tend to generate solutions more slowly, indicating a trade-off between computational efficiency and solution quality. This has direct implications for deployment scenarios where throughput and latency are critical.
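
As a concrete example of the reported statistics, the sketch below applies the conventional 1.5×IQR box-plot rule before computing mean, median, and standard deviation; the timing values are made up.

```python
import statistics


def filter_iqr_outliers(times):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the conventional box-plot rule."""
    q1, _, q3 = statistics.quantiles(times, n=4)  # quartile cut points of the sample
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in times if low <= t <= high]


# Hypothetical per-problem generation times (seconds) for one model
times = [4.2, 5.1, 6.3, 5.8, 7.0, 45.9, 5.5, 6.1, 120.4, 4.9]
kept = filter_iqr_outliers(times)  # the two extreme values are discarded

print(f"mean   = {statistics.mean(kept):.2f} s")
print(f"median = {statistics.median(kept):.2f} s")
print(f"stdev  = {statistics.stdev(kept):.2f} s")
```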

Correctness and Failure Modes

Correctness is evaluated via the pass@1 metric, representing the probability that a single generated solution is accepted by Kattis. The breakdown by problem difficulty (Easy, Medium, Hard) shows that even for Easy problems, the majority of submissions result in Wrong Answers or Run Time Errors. Qwen2.5-Coder and Yi-Coder outperform other local models, but their acceptance rates remain modest (5.7% and 5.4%, respectively).
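
For reference, the standard pass@k estimator used in code-generation benchmarks is shown below, where n solutions are sampled per problem and c of them are correct; with a single attempt per problem, as in this study, pass@1 reduces to the fraction of problems whose one submission is accepted.

```latex
\operatorname{pass@}k
  = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
\qquad
\operatorname{pass@}1
  = \frac{\#\{\text{problems with an accepted submission}\}}{\#\{\text{problems attempted}\}}
```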

For Medium and Hard problems, acceptance rates drop precipitously, with most models failing to produce any correct solutions for the hardest tier. Compile errors, wrong answers, and run-time errors dominate the failure landscape, highlighting the limitations of current local LLMs in handling complex, context-rich programming challenges.

Comparative Evaluation: Local vs. Cloud-Based Models

The paper benchmarks local models against proprietary cloud-based LLMs using previously published results. Gemini 1.5 and ChatGPT-4 achieve pass@1 rates of ~10.9% and ~10.7% on a subset of Kattis problems, approximately double the best local model rates. The performance gap is consistent across difficulty levels, with cloud models maintaining higher accuracy, especially on easier problems.

However, local models offer unrestricted usage, circumventing token quotas and cost barriers inherent to cloud APIs. This enables large-scale experimentation and deployment in privacy-sensitive or resource-constrained environments, albeit at the expense of solution quality.

Practical and Theoretical Implications

The findings underscore the rapid progress of open-source LLMs but also delineate a persistent gap in code generation competence relative to proprietary models. The JSON-based FACE extension and checkpointing mechanisms provide a replicable workflow for organizations seeking to benchmark or deploy local LLMs at scale.

From a practical perspective, local LLMs can support educational tools such as explain-as-you-grade autograders and in-IDE tutoring agents, provided that solution correctness is not mission-critical. The results suggest that hybrid workflows—combining local inference with cloud fallback, fine-tuning on domain-specific datasets, and advanced prompt engineering—may be necessary to bridge the performance gap.
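
A minimal sketch of the hybrid local-first, cloud-fallback workflow suggested above; `generate_local`, `passes_sample_tests`, and `generate_cloud` are hypothetical callables standing in for an Ollama call, a local sample-test harness, and a proprietary API client.

```python
def solve_with_fallback(problem, generate_local, passes_sample_tests, generate_cloud):
    """Try the cheap, private local model first; escalate to a cloud model only
    when the local attempt fails the problem's visible sample tests."""
    local_code = generate_local(problem)
    if passes_sample_tests(local_code, problem["sample_tests"]):
        return local_code, "local"        # data stays on-premise, no API cost
    cloud_code = generate_cloud(problem)  # tokens are spent only on hard cases
    return cloud_code, "cloud"
```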

Theoretically, the paper highlights the challenges of generalization in LLMs trained on public code datasets, especially for problems with limited online solution availability. The persistent failure modes suggest that further research into model architecture, training data diversity, and alignment strategies is warranted.

Future Directions

The authors propose several avenues for future work: fine-tuning local models on competitive programming datasets, integrating debugging and self-correction mechanisms, and exploring hybrid inference strategies. As open-source models continue to evolve, large-scale, transparent benchmarks will be essential for guiding model development and informing deployment decisions in both academic and industrial contexts.

Conclusion

This paper provides a rigorous evaluation of locally hosted LLMs on a large and diverse set of programming challenges, revealing both their potential and current limitations. While local models offer cost and privacy advantages, their code generation accuracy lags behind proprietary cloud-based solutions. The extended FACE framework and experimental results serve as a foundation for future research into scalable, private, and effective AI-driven code generation systems.

Explain it Like I'm 14

What is this paper about?

This paper tests how well “local” AI coding tools can solve lots of tough programming puzzles. Local means the AI runs on your own computer or server, not in the cloud. The authors build a reliable, offline testing system and use it to compare eight open‑source models on 3,589 problems from Kattis, a site known for serious coding challenges.

In short: they ask, “If you don’t want to send your code to the cloud (for privacy, speed, or cost reasons), how good are today’s open models at solving real, complex programming problems?”

What questions did the researchers try to answer?

  • How well do open-source, locally run AI coding models solve a very large and difficult set of problems?
  • How do they compare to top cloud models like ChatGPT‑4 and Gemini 1.5?
  • Can we build a simple, fully offline evaluation setup that others can copy and use on their own hardware?

How did they run the study?

Think of their testing setup like a well-organized “robot” that: 1) collects many problems, 2) asks an AI to write code, 3) submits that code to a judge, 4) saves the results carefully so nothing is lost.

Here’s what they did, in everyday terms:

  • The problem set: They used Kattis, which has over 3,500 programming problems of all levels. Unlike sites like LeetCode (where solutions are often public), Kattis discourages sharing answers. That makes it a tougher, more honest test because the models are less likely to have memorized solutions from the internet.
  • Local models: They picked eight open-source models that are small enough to run on normal GPUs (about 6.7–9 billion parameters). Running them locally helps with privacy, avoids cloud fees, and removes API limits.
  • Upgrading the testing system: They improved a tool called FACE so it runs fully offline using Ollama (a platform for running local LLMs). They also:
    • Swapped thousands of tiny folders and files for a handful of big JSON files. This is like replacing a messy filing cabinet with a few neat binders—faster, cleaner, and easier to share.
    • Added “checkpointing,” which is like saving your game. If the computer crashes during a multi-day run, it can pick up right where it left off.
    • Used careful “atomic” saving so results don’t get corrupted—like writing your final answer in pencil on scratch paper and only copying it to the official sheet once you’re sure it’s correct.
  • The process: For each problem, the system fed the description and examples to a local model, got a code answer, submitted it to Kattis, and recorded the result (Accepted, Wrong Answer, Run Time Error, etc.). This took more than three weeks to run for all models.
  • How success was measured: The main score was “pass@1,” which is simply the chance that the model’s first try solves the problem.

What did they find?

  • Accuracy is modest for local models: The best two local models solved about 5% to 6% of problems on the first try:
    • Qwen2.5‑Coder: 5.7%
    • Yi‑Coder: 5.4%
    • Most others were lower.
  • Cloud models are still ahead: In earlier work on a large subset of Kattis problems, Gemini 1.5 and ChatGPT‑4 each solved around 11% on the first try. So the best local models are about half as good—for now.
  • Difficulty matters a lot:
    • Easy problems: Some got solved, but still many wrong answers or errors.
    • Medium problems: Accepted solutions dropped sharply.
    • Hard problems: Basically none were solved. Only two local models produced even a single accepted solution in the “Hard” set.
  • Common failure types: Most misses were “Wrong Answer” or “Run Time Error.” That means the models often produced code that looked right but didn’t pass all tests, or crashed when running.
  • Speed vs accuracy trade-off: Some of the more accurate local models were slower to generate code. In other words, taking more time didn’t guarantee better answers, but the best performers tended to be a bit slower.

Why does this matter?

  • Clear trade-offs: Local models protect privacy and avoid cloud costs and rate limits. You can run as many tests as you want. But today, they still lag behind top cloud AIs in accuracy on tough problems.
  • A reusable, practical setup: The upgraded, offline evaluation system is something schools, companies, and researchers can copy to test models on their own machines. It’s simpler (fewer files), safer (good saving and checkpoints), and can run for days without losing progress.
  • A tougher, fairer benchmark: Because Kattis solutions are harder to find online, this test better reflects true problem-solving ability—not just memorization.
  • Paths to improvement: The authors suggest ways to boost local models:
    • Fine-tune them on the right kind of problems,
    • Combine local and cloud models in a hybrid setup,
    • Use better prompts and built‑in debugging steps.
  • Uses in education: Even with modest accuracy, local models can help teachers build safer, privacy‑friendly tools—like “explain-as-you-grade” autograders that give step-by-step feedback, or gentle coding helpers inside students’ IDEs.

The simple takeaway

Local, open AI coding models are getting better, but they don’t yet match top cloud AIs on hard programming challenges. Still, they offer big advantages—privacy, control, and no usage fees—and now there’s a clean, fully offline way to test them at large scale. With smart fine-tuning and better workflows, local models could become much more capable, helping schools and organizations use AI safely and affordably.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper—each item is framed to guide actionable future research.

  • Exact prompts, input formatting, and system instructions used for generation are not disclosed; share the full prompt templates and context construction to ensure reproducibility and enable prompt ablations.
  • Inference hyperparameters (e.g., temperature, top‑p, top‑k, max tokens, stop sequences, repetition penalty) are unspecified; report and systematically vary them to quantify their impact on pass@1, speed, and failure modes.
  • Only 6.7–9B models were tested; evaluate larger locally deployable models (e.g., 13B–70B with quantization) to determine the accuracy/latency/VRAM trade‑off ceiling for “local” use.
  • Python was the sole target language; compare performance across languages (C++/Java/Go) to assess effects on Time Limit/Memory Limit errors and alignment with Kattis’ runtime constraints.
  • Single‑attempt (pass@1) evaluation only; measure pass@k with multi‑sample strategies (self‑consistency, diverse decoding) and quantify improvements versus cost/time.
  • No iterative repair loop was used; evaluate closed‑loop workflows (compile/run, capture error, re‑prompt/fix) and automated debugging agents on acceptance rates and failure reduction.
  • Limited error forensics; categorize compile errors (syntax/indentation/imports), wrong answers (I/O formatting vs algorithmic), and runtime errors (exceptions vs resource limits) to target corrective interventions.
  • Stochasticity is unaddressed; control and vary random seeds, repeat trials, and report variance/confidence intervals to understand performance stability.
  • Cloud vs local comparison used different problem counts; perform matched‑set, controlled comparisons on identical Kattis subsets to eliminate dataset confounds.
  • Context truncation is not analyzed; measure statement length distribution, check truncation events per model, and quantify correctness degradation versus context window size.
  • Training data contamination is assumed but not measured; estimate Kattis problem presence in model pretraining (via near‑duplicate detection) and correlate with acceptance to assess “long‑tail” effects.
  • Sampling strategy effects are unexplored; run ablations across decoding regimes (greedy, nucleus, beam, temperature sweeps) to map accuracy/speed trade‑offs.
  • Energy/cost metrics for local inference are missing; log GPU/CPU utilization, energy consumption, and throughput to compare local vs cloud cost‑effectiveness.
  • Generation time was reported but not normalized by tokens; collect tokens generated and compute tokens/sec to disentangle model speed from problem length.
  • No formal correlation analysis between generation time and acceptance; compute correlations and partial correlations controlling for problem length and difficulty.
  • Outliers were removed via the 1.5×IQR rule without further analysis; characterize the outlier cases to identify pathological prompts, models, or Kattis tasks that cause extreme latencies.
  • Artifacts are not publicly released; publish the FACE extensions, Ollama configs, JSON schemas, and per‑problem submission logs to enable replication and meta‑analysis.
  • Prompt engineering strategies were not tested; evaluate targeted I/O guidance, algorithmic templates, constraint reminders, and reasoning directives (e.g., “optimize for O(n log n)”) on correctness.
  • Tool‑use and test‑generation were omitted; integrate unit‑test synthesis, local sample‑test execution, and verification strategies (e.g., property‑based tests) before submission.
  • No error‑aware re‑prompting; design structured re‑prompt policies conditioned on failure type (compile vs wrong answer vs TLE) and measure iterative gains.
  • Retrieval or summarization for long inputs is absent; test RAG or statement summarization to fit within context limits while preserving constraints and corner cases.
  • Language‑specific runtime limits were not analyzed; quantify Time/Memory Limit Exceeded distribution by language choice and problem category to inform language selection.
  • Difficulty stratification is coarse (Easy/Medium/Hard) with unclear mapping; align with Kattis’ 1.0–10.0 difficulty ratings and report per‑bin acceptance for finer‑grained insights.
  • Untracked Kattis statuses were filtered out; list and analyze these statuses to determine whether additional failure modes inform remediation (e.g., “Judging errors,” “Presentation errors”).
  • Per‑category algorithmic performance is missing; break down acceptance by topic (graphs, DP, geometry, greedy, string processing) to identify domain‑specific weaknesses.
  • Fine‑tuning was proposed but not evaluated; run task‑specific fine‑tunes (e.g., Kattis‑style I/O + algorithmic patterns) and report gains versus base models.
  • Concurrency and throughput are not studied; profile multi‑model/multi‑GPU scheduling, batch sizes, and server configurations to optimize wall‑clock time for large runs.
  • Similarity/plagiarism risks are not assessed; compute code similarity (AST, token n‑grams) across models and problems to ensure uniqueness and avoid Kattis plagiarism flags.
  • Kattis environment specifics (Python version, permitted libraries, time/memory policy) are not documented; standardize and disclose environment assumptions to reduce avoidable failures.
  • Problem‑level outcomes are not released; provide per‑problem acceptance/failure details to facilitate targeted follow‑up studies and reproducibility checks.
  • Sample test usage in prompts is not examined; test whether emphasizing samples biases solutions that overfit visible cases and fail hidden tests, and devise countermeasures.
  • Chain‑of‑thought or plan‑then‑code prompting is not evaluated; compare direct code generation vs structured reasoning pipelines on acceptance and failure types.
  • Warm‑up/caching effects on generation time are unknown; measure first‑call latencies vs steady‑state to ensure runtime statistics reflect typical usage.

Glossary

  • Atomic file operations: File-system writes performed in an all-or-nothing manner to prevent partial or corrupted outputs during crashes or interruptions. "The system uses atomic file operations: The temporary results file is flushed and synced to disk before being renamed to its final JSON output."
  • Box plot rule: A conventional outlier detection heuristic labeling values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] as outliers. "we applied the conventional box plot rule and discarded any outliers."
  • Checkpointing: Periodically saving progress so long-running processes can resume from the last save point after a failure. "adding robust checkpointing so multi-day runs can resume after failures."
  • Context window: The maximum number of tokens a model can attend to in a single prompt/response. "context window size (in thousands of tokens)."
  • Cumulative distribution function (CDF): A function giving the probability (or fraction) that a random variable is less than or equal to a given value; used to summarize latency distributions. "the corresponding cumulative distribution functions (CDF)"
  • Deduplication: Removing duplicate data points from a dataset to improve training quality and reduce bias or leakage. "with extensive filtering and deduplication."
  • Direct preference optimization: An alignment method that trains models directly on pairwise preference signals to better match desired outputs. "aligned through supervised fine-tuning and direct preference optimization."
  • Flash Attention: An exact yet IO-aware, memory-efficient attention algorithm that speeds up Transformer attention computations. "optimized for code generation with Flash Attention"
  • Infilling objective: A training objective where the model learns to fill in missing spans of text/code within a sequence. "using an infilling objective."
  • Interquartile range (IQR): The spread between the 25th and 75th percentiles (Q3 − Q1), commonly used for robust dispersion and outlier detection. "to form the interquartile range:"
  • Kattis: An online judge platform with a large, plagiarism-checked set of programming problems for benchmarking. "Kattis is a publicly available platform with more than 3,500 coding problems of varying difficulty and is widely used to evaluate programming proficiency."
  • Long-tail information: Rare or underrepresented knowledge that is sparsely reflected in training data and thus harder for models to learn. "struggle with long-tail information"
  • Ollama: A local LLM inference platform for running and managing models on on-premise hardware. "through the Ollama platform"
  • Pass@k metrics: The probability that at least one of k generated attempts solves a problem; pass@1 is the single-try success rate. "The pass@k metrics represent the probability that at least one of k solution attempts will succeed."
  • Rate limits: Provider-imposed caps on the number of API requests over a time window that constrain large-scale experimentation. "Token quotas and rate limits hinder large-scale experiments"
  • Supervised fine-tuning: Further training a pre-trained model on labeled examples to specialize behavior for a target task. "aligned through supervised fine-tuning and direct preference optimization."
  • Token quotas: Limits on the total number of input/output tokens an API user can consume, affecting throughput and cost. "Token quotas and rate limits hinder large-scale experiments"
  • VRAM: On-GPU memory used to hold model parameters and intermediate activations during inference/training. "consumer-grade GPUs (8 GB VRAM)"