Task Complexity Predictor

Updated 13 November 2025
  • A Task Complexity Predictor (TCP) is a machine learning system that estimates or classifies the cognitive, algorithmic, or computational complexity of programming tasks.
  • TCPs employ varied methodologies, such as transformer-based models, in-context learning, and feature-based classifiers, to predict difficulty and support resource-efficient code routing.
  • TCPs integrate into workflows for expert task assignment, grading automation, and static analysis, ultimately enhancing code generation and system performance.

A Task Complexity Predictor (TCP) is a class of machine learning systems designed to estimate or classify the complexity—cognitive, algorithmic, or computational—of programming tasks, code snippets, or problem statements. TCPs are built on diverse methodologies, including supervised and semi-supervised learning, transformer-based architectures, feature-based models, and symbolic program analysis. The objectives and design space of TCPs span assignment of tasks to experts, resource-efficient code routing, automatic grading, and time-complexity estimation for static analysis workflows. Recent advances have leveraged large pretraining, in-context learning, data augmentation, and co-training to adapt TCPs to low-resource and heterogeneous data settings. Performance is measured primarily by classification metrics (accuracy, (macro-)precision, recall, F1-score), but also by their impact on overall system efficiency and correctness in downstream applications such as LLM-assisted code generation.

1. Task Complexity Problem Formulations

TCPs address several distinct but related prediction settings:

  • Difficulty Classification from Natural-Language Task Descriptions Tasks are classified into discrete categories (e.g., Easy, Medium, Hard) or continuous scales (e.g., 1–9.7). Inputs are natural-language statements, often accompanied by sample inputs/outputs and metadata. Example: (Rasheed et al., 30 Sep 2024)
  • Model-aware Complexity Labeling for Code Generation Routing Complexity is defined as the minimal model size required for successful task completion (e.g., code-gen correctness by LLMs of increasing power). Example: (Bae et al., 2023)
  • Time Complexity Class Assignment for Code The TCP predicts the asymptotic time-complexity class ($O(1)$, $O(\log n)$, $O(n)$, etc.) from source code, via both static features and code embeddings. Example: (Sikka et al., 2019, Hahn et al., 10 Feb 2025)

This diversity of problem formulation dictates data requirements, labeling procedures, and integration points within broader software engineering or ML pipelines.

2. Dataset Construction and Labeling Strategies

High-quality labeled datasets are critical for TCP development:

| Paper (arXiv) | Domain/Input | Label Space | Labeling Protocol |
|---|---|---|---|
| (Rasheed et al., 30 Sep 2024) | Task statement | Easy/Medium/Hard; 1–9.7 scale | Scraped 4,112 tasks from Kattis, LeetCode, HackerRank, Topcoder; platform/guideline-based score mapping |
| (Bae et al., 2023) | NL task (MBPP) | Discrete levels 1–5 | Empirical LLM performance: c=1 ↔ smallest model reliably solves; c=5 ↔ only GPT-4 solves |
| (Sikka et al., 2019; Hahn et al., 10 Feb 2025) | Source code | Big-$O$ classes | Manual expert annotation (CoRCoD); symbolic/LLM cross-validation (TCProF) |

  • (Rasheed et al., 30 Sep 2024) standardizes metadata via HTML scraping, with heuristic mapping of platform-provided scores.
  • (Bae et al., 2023) yields a model-referenced labeling: for each sample, $c_i$ is determined by systematic trials across LLMs of increasing capacity, using deterministic thresholds on pass counts for each (a minimal sketch of this protocol follows this list).
  • (Sikka et al., 2019) constructs CoRCoD (932 Java code samples) with expert Big-O annotation; (Hahn et al., 10 Feb 2025) extends this to the multilingual CodeComplex benchmark (≈5,000 programs per language, 7 classes).
  • (Hahn et al., 10 Feb 2025) introduces semi-automated symbolic rules (the $\mathrm{Sym}$ module) for robust label assignment when annotation or model confidence is insufficient.
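
A minimal sketch of this model-referenced labeling protocol, assuming a hypothetical `run_and_check` helper that executes a model's generated code against the task's test assertions; the model names, trial count, and pass threshold are illustrative rather than the exact settings of (Bae et al., 2023):

```python
# Hypothetical labeling sketch: c is the index of the smallest model that
# reliably solves the task; names and thresholds are illustrative.
MODELS = ["model-small", "model-medium", "model-large", "model-xl", "gpt-4"]

def solves(model: str, task: str, trials: int = 5, min_passes: int = 4) -> bool:
    """Deterministic pass-count rule: the model 'reliably solves' the task if
    at least `min_passes` of `trials` generations pass the test assertions."""
    passes = sum(run_and_check(model, task) for _ in range(trials))  # assumed helper
    return passes >= min_passes

def label_complexity(task: str) -> int:
    """Assign c in 1..5: the smallest reliably-solving model's index, or the
    maximum label if no model succeeds."""
    for c, model in enumerate(MODELS, start=1):
        if solves(model, task):
            return c
    return len(MODELS)
```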

3. Model Architectures and Learning Paradigms

TCPs leverage a variety of machine learning architectures and learning paradigms:

  • Fine-Tuned Transformer Models:
    • FLAN-T5-Small: Encoder-decoder (77M params) fine-tuned for discrete classification; the input concatenates task metadata with a preamble prompt and is tokenized and truncated at 1,300 tokens.
    • DaVinci-002 (GPT-3.5-class): Decoder-only (≈175B params) fine-tuned on model-aware labels via multiclass cross-entropy (Bae et al., 2023).
  • In-Context Learning (ICL):
    • GPT-4o-mini: Uses few-shot prompting (e.g., three exemplars, one per class) for test-time classification (Rasheed et al., 30 Sep 2024). Prompt includes labeled task demonstrations and a one-word output constraint.
    • Zero- and few-shot LLM baselines: For time-complexity prediction, off-the-shelf LLMs underperform filtering/SSL pipelines (Hahn et al., 10 Feb 2025).
  • Feature- and Graph-Embedding Based Classifiers:
    • Classical ML: Random Forest, SVM, shallow MLP on hand-crafted static features (nested-loop depth, use of sorts, recursion, etc.) (Sikka et al., 2019).
    • Graph2Vec: Weisfeiler–Lehman rooted subgraph embeddings (1024-d vectors), input to SVM/MLP classifiers.
  • Semi-supervised and SSL Architectures (TCProF):
    • Peer co-training: Two separate models (one on original data, one on augmented data) iteratively cross-exchange high-confidence pseudo-labeled samples, bootstrapped with symbolic pseudo-labeling when confidence falls below $\theta$ (Hahn et al., 10 Feb 2025); see the sketch after this list.
    • Data augmentation: Complexity-preserving transformations via back-translation and loop-conversion expand the labeled set in low-resource settings.
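
A minimal sketch of the peer co-training step described above, assuming classifier objects that expose a `predict_with_confidence` method and a rule-based `symbolic_label` fallback; the interface names and threshold are illustrative, not the exact TCProF implementation (Hahn et al., 10 Feb 2025):

```python
# Illustrative co-training step: each peer pseudo-labels unlabeled samples
# for the other, keeping only confident labels and falling back to a
# symbolic rule module below the confidence threshold.
THETA = 0.7  # confidence threshold for accepting pseudo-labels

def cotrain_step(model_orig, model_aug, data_orig, data_aug, unlabeled):
    new_for_aug, new_for_orig = [], []
    for x in unlabeled:
        for src, sink in ((model_orig, new_for_aug), (model_aug, new_for_orig)):
            label, conf = src.predict_with_confidence(x)  # assumed interface
            if conf < THETA:
                label = symbolic_label(x)                 # assumed rule-based fallback
            if label is not None:
                sink.append((x, label))
    # Each peer is retrained on its own labeled set plus the pseudo-labels
    # contributed by the other peer.
    model_orig.fit(data_orig + new_for_orig)
    model_aug.fit(data_aug + new_for_aug)
    return model_orig, model_aug
```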

4. Training Protocols and Evaluation Metrics

Training protocols are dictated by data and paradigm:

  • Supervised Fine-Tuning:
    • Cross-entropy loss, 80/20 splits (or k-fold CV); optimizer and LR per architecture.
    • FLAN-T5-Small: AdamW, LR $5 \times 10^{-4}$, batch size 8, 3–5 epochs, macro-F1 as target metric.
  • In-Context Learning:
    • Prompt engineering: consistent label phrasing, clear task structure, controlled sampling/decoding settings (temperature ≈ 0.7).
  • SSL/Co-Training:
    • Peer model updates on original/augmented data.
    • Pseudo-label inclusion based on confidence threshold ($\theta = 0.7$ typical).
    • Ramp-up of consistency loss via the schedule below (a short numerical sketch follows the evaluation metrics):

    $$\lambda(t) = \lambda_{\max} \cdot \exp\left[-5\,(1 - t/T)^2\right]$$

  • Evaluation Metrics:

| Metric | Formula (per class $k$; macro-averaged over classes) |
|---|---|
| Accuracy | $\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$ |
| Precision | $\frac{\mathrm{TP}_k}{\mathrm{TP}_k+\mathrm{FP}_k}$ |
| Recall | $\frac{\mathrm{TP}_k}{\mathrm{TP}_k+\mathrm{FN}_k}$ |
| $\mathrm{F}_1$ | $2 \cdot \frac{\mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}$ |

Weighted versions are standard for imbalanced classes (Sikka et al., 2019). Macro-averaging is common for fair comparison across heterogeneous classes (Rasheed et al., 30 Sep 2024).
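
A brief numerical sketch of the consistency-loss ramp-up schedule from above; the computation follows the formula $\lambda(t) = \lambda_{\max} \exp[-5 (1 - t/T)^2]$, while the step count and $\lambda_{\max}$ value used here are illustrative:

```python
import math

def consistency_weight(t: int, T: int, lambda_max: float = 1.0) -> float:
    """Ramp-up schedule: near zero at t = 0, reaching lambda_max at t = T."""
    return lambda_max * math.exp(-5.0 * (1.0 - t / T) ** 2)

# Illustrative values over a 100-step ramp-up:
# t = 0 -> ~0.0067, t = 50 -> ~0.29, t = 100 -> 1.0
for t in (0, 50, 100):
    print(t, round(consistency_weight(t, T=100), 4))
```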

5. Empirical Performance and Comparative Analysis

| Model / Benchmark | Accuracy | Metric Notes | Source |
|---|---|---|---|
| FLAN-T5-Small | 52.24 % | Macro F1: 47.17 % | (Rasheed et al., 30 Sep 2024) |
| GPT-4o-mini (3-shot ICL) | 57.00 % | Macro F1: 53.99 % | (Rasheed et al., 30 Sep 2024) |
| DaVinci-002 (non-finetuned) | 34 % | on MBPP | (Bae et al., 2023) |
| DaVinci-002 (finetuned) | 79 % | on MBPP | (Bae et al., 2023) |
| CoRCoD (Feature-RF, 5-way) | 74.3 % | Weighted F1 ≈ 73 % | (Sikka et al., 2019) |
| CoRCoD (Graph2Vec, SVM, 5-way) | 73.9 % | Weighted Precision: 74 % | (Sikka et al., 2019) |
| TCProF (UniXcoder, 10-shot, Python) | 70.3 % | Exceeds GPT-4 (53 %) | (Hahn et al., 10 Feb 2025) |

Key observations:

  • Few-shot ICL with strong LLMs (GPT-4o, GPT-4) outperforms fine-tuned small models by 4–6 pp on natural-language benchmarks, but still underperforms code-complexity SSL approaches with augmentation in low-resource code labeling (Rasheed et al., 30 Sep 2024, Hahn et al., 10 Feb 2025).

  • Peer co-training (TCProF) with data augmentation yields a > 64 % relative gain over deep self-training or LLM baselines in few-shot settings, attaining 70.3 % on 10-shot splits (Hahn et al., 10 Feb 2025).

  • Feature-based models remain competitive for classical time-complexity labeling if high-quality code metadata is available, and feature ablation pinpoints nested-loop depth and sort-call counts as highest-value features (Sikka et al., 2019).

6. Integration Mechanisms and Practical Use Cases

TCPs are integrated into a variety of practical workflows:

  • Expert Task Assignment:

TCPs are used to triage new programming problems and route them to subject-matter experts or automated solvers of appropriate skill and power (Rasheed et al., 30 Sep 2024).

  • Model Routing for Code Generation:

TCP output directly determines which LLM is used for code generation, optimizing for both accuracy and computational cost. For example, the routing logic of (Bae et al., 2023) can be sketched as follows:

```python
def route_and_generate(x):
    # Predicted complexity level c in 1…5 selects the cheapest sufficient model.
    c = TCP.predict(x)
    if c <= 2:
        return CodeLlama7B.generate(x)   # small model suffices for easy tasks
    elif c <= 4:
        return GPT3_5.generate(x)        # mid-size model for moderate tasks
    else:
        return GPT4.generate(x)          # largest model only when required
```
This strategy achieves a roughly 90 % reduction in compute cost with only a 13.3 pp reduction in code-generation accuracy as measured by test assertions.

  • IDE Integration, Grading, and Static Analysis:

    • Plugins parse code on-save, predict complexity, and visualize estimated Big-O in the editor (Sikka et al., 2019).
    • TCPs in grading pipelines compare predicted complexity to instructor thresholds, automating flagging or feedback generation (a minimal sketch follows this list).
    • Static analysis tools use TCPs to signal high-complexity code in linters (e.g., SonarQube integration).
  • Low-Resource Time Complexity Prediction:

TCProF’s SSL strategy enables meaningful classification even with 5–10 labeled examples per class via co-training and symbolic module fallback (Hahn et al., 10 Feb 2025). This is especially relevant when annotation cost or data scarcity impedes full supervision.
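
As referenced in the grading bullet above, a minimal sketch of a threshold-based grading check; the `predict_big_o` classifier, the class ordering, and the flagging rule are illustrative assumptions, not a specific tool's API:

```python
# Illustrative grading check: flag a submission whose predicted complexity
# class is worse than the instructor's threshold for the assignment.
CLASS_ORDER = ["O(1)", "O(log n)", "O(n)", "O(n log n)", "O(n^2)", "O(n^3)", "O(2^n)"]

def flag_submission(code: str, instructor_threshold: str) -> bool:
    predicted = predict_big_o(code)  # assumed TCP returning one of CLASS_ORDER
    return CLASS_ORDER.index(predicted) > CLASS_ORDER.index(instructor_threshold)

# Example: flag any submission predicted worse than O(n log n)
# flagged = flag_submission(student_code, "O(n log n)")
```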

7. Challenges, Limitations, and Open Problems

TCP research is constrained by several fundamental and empirical limitations:

  • Theoretical Undecidability:

Time-complexity prediction for arbitrary code is undecidable (by reduction from the Halting Problem). All TCPs are therefore heuristic or approximate predictors (Sikka et al., 2019, Hahn et al., 10 Feb 2025).

  • Labeling Ambiguity and Granularity:

Discrete classes cannot perfectly encode either cognitive task difficulty or algorithmic complexity, and program annotations require careful protocol to handle multi-input or composite behaviors (Sikka et al., 2019).

  • Class Imbalance and Edge Cases:

Hard/complex tasks are underrepresented, leading to the lowest per-class recall, and fine distinctions (e.g., $O(n)$ vs. $O(n^2)$ when an inner loop does not depend on input size) remain poorly handled (Rasheed et al., 30 Sep 2024, Hahn et al., 10 Feb 2025).

  • Generalization, Language, and Codebase Diversity:

Label mapping may be nonportable between codebases or across languages without careful field/feature normalization (Rasheed et al., 30 Sep 2024, Hahn et al., 10 Feb 2025).

  • Semi-supervised and Zero-shot Settings:

Most SSL designs, including TCProF, are tailored for few-shot settings. Zero-shot prediction with only rule-based or bootstrapped labels remains an open challenge (Hahn et al., 10 Feb 2025).

A plausible implication is that continued improvements in data augmentation, symbolic analysis, multi-model ensembling, and integration of dynamic code semantics may be necessary to achieve robust, domain- and language-adaptive TCPs.


Task Complexity Predictors constitute an active intersection of code intelligence, resource-aware ML systems, and software classification, with applications from curriculum design and automated grading to the deployment of efficient hybrid LLM systems. Advances in this domain are predicated on the careful interplay of scalable datasets, sophisticated modeling, and practical deployment strategies under constraints of labeling cost, operational efficiency, and theoretical computability.
