
BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Published 19 Mar 2025 in cs.CL, cs.AI, and cs.CC | (2503.15242v2)

Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative LLMs in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art LLMs on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.

Summary

  • The paper introduces BigO(Bench), a systematic framework that integrates dynamic profiling and regression to evaluate whether LLMs can generate code with controlled time and space complexity.
  • Experimental results show that despite strong general code generation abilities, current LLMs significantly struggle with algorithmic complexity constraints, particularly when required to generate intentionally suboptimal solutions.
  • Findings imply that integrating dynamic profiling and complexity evaluation tools into LLM pipelines is crucial for generating efficient code in resource-sensitive environments, highlighting a gap in current models' inherent complexity reasoning.

Overview

The paper “BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?” (2503.15242) develops a systematic framework to evaluate whether LLMs can generate correct code while also satisfying strict computational complexity constraints. The study bridges the gap between pure code synthesis and complexity-aware code generation by leveraging both profiling techniques and non-negative least squares regression to map execution metrics to theoretical complexity classes.

Methodological Framework

The methodology integrates a dynamic complexity inference framework into a benchmark that encompasses 3,105 diverse coding problems and over 1.19 million solutions, primarily sourced from competitive programming repositories. The framework operates in three phases:

  • Complexity Prediction: LLMs receive a problem description and a candidate solution and must predict its time or space complexity. Ground-truth labels come from the dynamic framework, which profiles each solution across a range of input sizes and estimates its complexity class from the measurements using regression and ensemble techniques.
  • Complexity Generation: The benchmark challenges the LLMs to generate code that is not only functionally correct but also constrained to satisfy user-specified time or space complexity bounds. This tests the models’ ability to synthesize algorithms that intentionally follow less-optimized paths, an area where memorized code patterns may fall short.
  • Coefficient Ranking: After generation, solutions are ranked against a pool of human-authored code according to inferred complexity coefficients. This ranking quantitatively evaluates how well optimized each generated solution is within its designated complexity class.
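The ranking step can be illustrated with a minimal percentile-style sketch. The function name, the lower-coefficient-is-better convention, and the fraction-beaten metric are assumptions for illustration; the paper's exact ranking formula may differ:

```python
def coefficient_rank(candidate_coef, human_coefs):
    """Rank a solution's fitted complexity coefficient against human
    solutions in the same complexity class: the fraction of human
    solutions it beats (a lower coefficient means a smaller constant
    factor, i.e., a better-optimized solution within the class).
    Illustrative only; not the paper's exact metric."""
    beaten = sum(1 for c in human_coefs if candidate_coef < c)
    return beaten / len(human_coefs)

# A candidate with coefficient 0.8 beats 3 of 4 human solutions.
print(coefficient_rank(0.8, [0.5, 1.0, 1.5, 2.0]))  # -> 0.75
```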

The profiling methodology employs integrated tools such as Bubblewrap for sandboxing, cProfile for runtime measurements, and tracemalloc for monitoring memory usage. By varying input sizes and capturing execution metrics, the framework fits curves to the measurements to identify the predominant complexity class, reaching 92% accuracy for time and 84% for space against human theoretical annotations.
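The curve-fitting idea can be sketched as follows. This is an illustrative single-coefficient least-squares fit over a handful of candidate classes, not the benchmark's actual non-negative regression and ensemble machinery, and the class set shown here is an assumption:

```python
import math

# A few candidate complexity classes and their growth functions;
# the real framework considers a richer set of classes.
CLASSES = {
    "O(1)": lambda n: 1.0,
    "O(log n)": lambda n: math.log2(n),
    "O(n)": lambda n: float(n),
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: float(n) ** 2,
}

def infer_class(sizes, runtimes):
    """Pick the class whose curve c * f(n) best fits the measurements,
    using a closed-form least-squares fit of a single coefficient c,
    clamped to be non-negative."""
    best_name, best_err = None, float("inf")
    for name, f in CLASSES.items():
        xs = [f(n) for n in sizes]
        c = max(sum(t * x for t, x in zip(runtimes, xs)) / sum(x * x for x in xs), 0.0)
        err = sum((t - c * x) ** 2 for t, x in zip(runtimes, xs))
        if err < best_err:
            best_name, best_err = name, err
    return best_name

sizes = [10, 100, 1000, 10000]
measured = [1e-6 * n * n for n in sizes]  # synthetic quadratic "runtimes"
print(infer_class(sizes, measured))       # -> O(n^2)
```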

Experimental Evaluation

The paper rigorously evaluates 12 contemporary LLMs including several variants of Llama 3.1 (ranging from 8B to 405B parameters), Codestral 22B, GPT-4o, and multiple iterations within the DeepSeek family. The evaluation was performed in a zero-shot setting unless explicitly noted and involved standard metrics such as Pass@k, Best@k, and All@k. Noteworthy points include:

  • Complexity Understanding vs. Code Generation: While token-space reasoning models exhibit high performance on typical code generation tasks, they consistently lag in tasks involving algorithmic complexity constraints. For instance, solutions generated by the Llama 3.1 405B variant achieved superior performance on space complexity prediction tasks relative to their pure code synthesis metrics.
  • Marginal Gains Through Fine-Tuning: Even when fine-tuned on specific complexity-centric subsets of the benchmark (e.g., for Llama 3.1 70B), the incremental performance improvement was modest. This suggests an inherent difficulty for current models to internalize complexity reasoning when such conditions are not naturally induced during training.
  • Underperformance on Non-Optimal Complexity Classes: The models are proficient at recalling optimized algorithm snippets, but they underperform when required to generate code belonging to intentionally sub-optimal complexity classes. This points to an overfitting of LLMs to highly optimized, commonly encountered patterns rather than a more generalized complexity reasoning capability.

These experimental results solidify the claim that current LLMs, although effective at code generation, exhibit significant gaps in understanding and controlling computational complexity, a gap with direct implications for generating code in real-world, resource-sensitive environments.

Practical Implications and Deployment Considerations

For practical deployment in scenarios where controlled computational complexity is critical, the findings suggest several considerations:

  • Profiling Integration: Incorporate dynamic profiling tools within the code generation pipeline to ensure that candidate solutions adhere to desired complexity bounds. This involves integrating sandboxed execution contexts and runtime monitors (e.g., cProfile and tracemalloc).
  • Fine-Tuning Strategies: While fine-tuning on complexity-specific objectives yields only marginal gains, a targeted approach that includes adversarial examples (solutions that intentionally deviate from optimal complexity patterns) may improve performance.
  • Model Selection: Given the disparity between token-based code synthesis and complexity reasoning, it may be more effective to employ a hybrid approach: using a state-of-the-art LLM for code generation and then subjecting its outputs to a post-processing module that evaluates and optimizes computational complexity.
  • Benchmark Integration: BigO(Bench) could serve as a regular evaluation metric within CI/CD pipelines for code generation systems, ensuring that any generated code meets predefined computational constraints, especially in embedded systems or cloud resources where performance and cost are tightly coupled.
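A minimal measurement harness along the lines of the Profiling Integration point might look like the sketch below. It is illustrative only: it uses `time.perf_counter` and `tracemalloc` from the standard library, omits the Bubblewrap sandboxing the paper describes, and does not repeat runs to average out noise:

```python
import random
import time
import tracemalloc

def profile(fn, sizes, make_input):
    """Measure wall time and peak traced memory of fn on inputs of
    growing size. A real pipeline would sandbox execution, repeat and
    average the measurements, and enforce time/memory limits."""
    results = []
    for n in sizes:
        arg = make_input(n)            # build input outside the traced region
        tracemalloc.start()
        t0 = time.perf_counter()
        fn(arg)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append((n, elapsed, peak))
    return results

# Example: profile a built-in sort on random lists of two sizes.
rows = profile(sorted, [1_000, 10_000],
               lambda n: [random.random() for _ in range(n)])
for n, t, mem in rows:
    print(f"n={n:>6}  time={t:.4f}s  peak_mem={mem / 1024:.1f} KiB")
```

The resulting (size, time, memory) triples are exactly the kind of data a curve-fitting step can consume to assign a complexity class.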

Conclusion

The BigO(Bench) paper introduces a robust and technically detailed benchmark that highlights the limitations of contemporary LLMs in generating code constrained by specific time and space complexity requirements. The integration of dynamic profiling, regression-based complexity inference, and a comprehensive set of annotated coding problems provides a concrete pathway for evaluating and subsequently improving LLMs in contexts where computational efficiency is as critical as correctness. This work underlines the necessity for further research in complexity-aware model training. It also reinforces that high functional coverage in code generation must be balanced with adherence to computational constraints, a balance that will become increasingly important in both resource-limited and performance-critical applications.
