CFSC Dataset: A Large-Scale Code Resource
- The CFSC (COFO) dataset is a large-scale corpus of 369,102 accepted Codeforces solutions spanning 809 problem classes.
- It was collected through a multi-stage web-scraping pipeline that captures detailed metadata, ensuring uniform class distribution and high data quality.
- The resource supports ML tasks such as program classification, code tagging, and resource prediction using structured test cases and specifications.
The COFO dataset (Codeforces Source Code, abbreviated as the CFSC dataset in technical contexts) is a large-scale corpus designed for machine learning research in program classification, recognition, tagging, code comprehension, and related software engineering tasks. Derived from the Codeforces competitive programming platform, COFO comprises 369,102 accepted source code solutions across 809 problem classes, with associated metadata, problem specifications, test cases, code tags, and detailed organizational structure. This resource enables researchers to model and analyze code at scale and supports a variety of ML-for-code tasks, surpassing the scope of previous datasets such as POJ-104 (Gautam et al., 24 Mar 2025).
1. Collection, Curation, and Filtering Methodology
The COFO dataset was constructed with a multi-stage web-scraping pipeline tailored to the Codeforces environment. Metadata, including problem indices, contest identifiers, and code tags, was acquired using the Codeforces public API. For each problem, comprehensive natural-language statements, input/output (I/O) specifications, and example test cases were extracted via BeautifulSoup from problem web pages. Program submissions and additional problem test cases were captured using a Selenium-driven headless browser, strategically loading submissions pages (up to 50 accepted solutions per load) and, via interactive navigation, obtaining full test-case suites associated with the first accepted solution.
Each collected source code submission was retained only if it was marked "accepted" by the Codeforces grading system—guaranteeing compilability and correctness relative to the collected test cases. For each language/problem pair, a minimum of 10 and a maximum of 750 accepted solutions was enforced, balancing data abundance with class distribution uniformity and mitigating long-tail effects.
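The retention rule above (accepted verdicts only, with a per-language/problem floor of 10 and cap of 750) can be sketched as follows. This is a minimal illustration, not code from the released scraper; the function name and submission-record shape are hypothetical.

```python
# Sketch of the per-(language, problem) retention rule described above:
# keep only "accepted" submissions, cap each group at MAX_PER_CLASS to
# limit long-tail imbalance, and drop groups below MIN_PER_CLASS.
from collections import defaultdict

MIN_PER_CLASS = 10   # minimum accepted solutions per language/problem pair
MAX_PER_CLASS = 750  # maximum accepted solutions per language/problem pair

def filter_submissions(submissions):
    """submissions: iterable of dicts with 'problemID', 'language', 'verdict'."""
    groups = defaultdict(list)
    for sub in submissions:
        if sub["verdict"] != "accepted":          # drop non-accepted verdicts
            continue
        key = (sub["problemID"], sub["language"])
        if len(groups[key]) < MAX_PER_CLASS:      # enforce the upper cap
            groups[key].append(sub)
    # enforce the lower bound: drop under-represented language/problem pairs
    return {k: v for k, v in groups.items() if len(v) >= MIN_PER_CLASS}
```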
The final dataset encompasses:
- Problems (classes): 809
- Accepted code samples: 369,102
- Supported languages (sample counts):
- C 11: 26,449
- C++ 11: 92,015
- C++ 14: 76,873
- C++ 17: 97,926
- Java 8: 33,919
- Java 11: 14,876
- Python 3: 27,044
Scraper failures and partial retrievals were logged, and collection for a problem halted after exhausting its available submissions or reaching the per-problem cap.
2. Data Architecture and Representation
COFO's on-disk organization mirrors its hierarchical semantics. The root directory contains one subdirectory per problem (problemID), where each problem is characterized by the following files:
- specifications.txt: Contains the full problem statement, input and output format descriptions, and declared constraints (e.g., time and memory limits).
- testcases.txt: Lists newline-separated input/output pairs extracted from Codeforces for validation.
- tags.txt: Encodes ground-truth code tags, space-separated.
- submissions/: Houses language-specific subdirectories (e.g., C11/, C++11/, C++14/, C++17/, Java8/, Java11/, Python3/), each containing source-code files named by their submission IDs.
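A reader of this layout can be sketched directly from the directory conventions above. This is an illustrative walk over one problem directory, assuming the file names listed in this section; it is not part of the released toolchain.

```python
# Sketch: index one problemID directory of the on-disk layout described above.
from pathlib import Path

def index_problem(problem_dir: Path) -> dict:
    """Build a per-problem record from the COFO directory conventions."""
    record = {
        "problemID": problem_dir.name,
        "tags": (problem_dir / "tags.txt").read_text().split(),  # space-separated
        "specification": (problem_dir / "specifications.txt").read_text(),
        "submissions": [],
    }
    # submissions/ holds one subdirectory per language, files named by submission ID
    for src in sorted((problem_dir / "submissions").rglob("*")):
        if src.is_file():
            record["submissions"].append({
                "language": src.parent.name,               # e.g. "C++17"
                "submissionID": src.stem,                  # file named by its ID
                "code_filepath": str(src.relative_to(problem_dir)),
            })
    return record
```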
The logical metadata schema per problem is formalized as follows:
| Field | Type | Example / Notes |
|---|---|---|
| problemID | String | "1354B" |
| contestID | Integer | 1354 |
| index | String | "B" |
| tags | List<String> | "greedy", "implementation" |
| specification | String | Full textual description |
| input_output | Dict | { time_limit_ms, memory_limit_mb, input_format, output_format, constraints } |
| test_cases | List<Dict> | { "input", "output" } pairs |
| submissions | List<Dict> | { submissionID, language, code_filepath } |
A canonical instance:
{
"problemID": "1354B",
"contestID": 1354,
"index": "B",
"tags": ["greedy", "implementation"],
"specification": "You are given N and a sequence of integers…",
"input_output": {
"time_limit_ms": 1000,
"memory_limit_mb": 256,
"input_format": "First line contains T…",
"output_format": "For each test case output…",
"constraints": "1 ≤ T ≤ 1e4; sum of all N ≤ 2e5"
},
"test_cases": [
{ "input": "3\n3\n1 2 3\n", "output": "2\n" },
{ "input": "1\n5\n5 4 3 2 1\n", "output": "0\n" }
],
"submissions": [
{ "submissionID": 85932123, "language": "C++17", "code_filepath": "submissions/C++17/85932123.cpp" },
{ "submissionID": 85933045, "language": "Python3", "code_filepath": "submissions/Python3/85933045.py" }
]
}
3. Statistical Profile and Quantitative Attributes
Class and Language Distribution
Let N denote the total number of code samples, with n_c representing the count for problem class c. The class probability for each problem is computed as p_c = n_c / N. For instance, a class with 456 samples has p_c = 456 / 369,102 ≈ 0.00124 (about 0.12% of the corpus).
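The class-probability computation can be stated directly in code. A minimal sketch using the published corpus total; the function name is illustrative.

```python
# Class probability p_c = n_c / N, using the published dataset total.
N_TOTAL = 369_102  # total accepted code samples in COFO

def class_probability(n_c: int, n_total: int = N_TOTAL) -> float:
    """Fraction of the corpus belonging to a class with n_c samples."""
    return n_c / n_total

p = class_probability(456)  # a 456-sample class covers roughly 0.12% of the corpus
```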
COFO's language breakdown (by sample percentage) is:
| Language | Sample Count | Percentage |
|---|---|---|
| C++17 | 97,926 | 26.5% |
| C++11 | 92,015 | 24.9% |
| C++14 | 76,873 | 20.8% |
| Java 8 | 33,919 | 9.2% |
| Python 3 | 27,044 | 7.3% |
| C 11 | 26,449 | 7.2% |
| Java 11 | 14,876 | 4.0% |
Code Length
The dataset contains 17.6 million lines of code (LOC), giving an average code length per sample of roughly 47.7 LOC (17,600,000 / 369,102 ≈ 47.7).
Let ℓ_i denote the length of sample i and N the total number of samples. The mean and variance are given by mean = (1/N) Σ_i ℓ_i and σ² = (1/N) Σ_i (ℓ_i − mean)².
It follows that code-length variance and class/sample imbalance analyses are directly supported by the available metadata and these formulas.
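The mean/variance definitions translate to a few lines of code. A minimal sketch; the function name is illustrative.

```python
# Mean and (population) variance of per-sample code length, as defined above:
# mean = (1/N) * sum(l_i), var = (1/N) * sum((l_i - mean)**2)
def loc_stats(lengths):
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((l - mean) ** 2 for l in lengths) / n
    return mean, var
```

Applied at corpus scale, the same mean formula reproduces the headline figure: 17,600,000 total LOC over 369,102 samples gives about 47.7 LOC per sample.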
4. Metadata Enrichment and Tagging Schema
COFO encodes comprehensive metadata per problem, supporting sophisticated downstream tasks. Code tags are derived from Codeforces and represent functional or structural attributes of problems. There are 1955 total tag occurrences, with 35 unique tags (e.g., "greedy", "dp", "graphs", "math", "sortings", "strings", "bruteforce"). Tag cardinality per problem ranges from 0 to 8, with the mode at 2. All tags for a problem are stored in its tags.txt.
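For multi-label tasks, the space-separated contents of tags.txt map naturally onto a multi-hot vector over the tag vocabulary. A minimal sketch; the vocabulary shown is a small illustrative subset of the 35 unique tags.

```python
# Multi-hot encoding of a problem's tags over a fixed tag vocabulary.
def encode_tags(tags_line: str, vocab: list) -> list:
    """tags_line: space-separated contents of a problem's tags.txt."""
    present = set(tags_line.split())
    return [1 if tag in present else 0 for tag in vocab]

# Illustrative subset of the tag vocabulary:
VOCAB = ["greedy", "dp", "graphs", "math", "sortings", "strings"]
```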
Problem statement texts, structured I/O descriptions, and constraints are consolidated in specifications.txt, standardizing human- and machine-readability. The storage of full test case sets in testcases.txt and time/memory limits in the input/output metadata provides granular ground-truth for benchmarking code behavior and resource predictions.
5. Machine Learning Applications and Baseline Evaluation Strategies
COFO is engineered for ML-for-code tasks requiring scale, semantic annotations, and code diversity. Standard application paradigms include:
- Program classification/recognition: Assignment of a source code solution to one of the 809 task classes; facilitates clustering, supervised or few-shot learning.
- Code tagging: Prediction of the appropriate set of code tags (multi-label classification) for unseen code, using tags from tags.txt as ground truth.
- Predicting program properties: Inference of resource bounds (e.g., time/memory consumption) via code analysis, leveraging the explicit I/O constraints per problem.
- Code comprehension and summarization: Model training on problem description ↔ code pairs, supporting tasks like code search, code synthesis, and natural language-to-code mapping.
Protocols are not mandated; typical practice is an 80/10/10 stratified split into training, validation, and test sets at the class level. Benchmarking leverages accuracy and top-k metrics, with POJ-104 baselines (e.g., tree-based CNNs, RNNs) as comparators. A key distinction is COFO's scale: it is roughly seven times larger in both classes and solution count than POJ-104 (104 tasks, 52,000 programs).
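The typical 80/10/10 class-stratified split can be sketched as follows. This is one reasonable implementation, not a mandated protocol; the function name and sample representation are illustrative.

```python
# Sketch of an 80/10/10 split stratified at the problem-class level.
import random
from collections import defaultdict

def stratified_split(samples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """samples: list of (problemID, sample) pairs. Returns (train, val, test)."""
    by_class = defaultdict(list)
    for pid, sample in samples:
        by_class[pid].append(sample)
    rng = random.Random(seed)          # fixed seed for reproducible splits
    train, val, test = [], [], []
    for pid, items in by_class.items():
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]   # remainder goes to test
    return train, val, test
```

Splitting within each class (rather than globally) keeps every problem represented in all three partitions, which matters for the 809-way classification setting.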
The scraping toolchain and dataset are available at https://github.com/kgautam01/CodeForces-Scraper, enabling further replication and extension.
6. Comparative Context and Research Impact
COFO (CFSC dataset) is situated within the broader movement towards Big Code datasets for empirical code analysis, ML-based code classification, and program synthesis research. Its structural richness, explicit linking of code to rich problem metadata, code tags, and granular I/O constraints aligns it with contemporary demands for benchmark diversity, scale, and annotation density.
A notable implication is the improved granularity for evaluation and generalization in code classification research relative to prior benchmarks. The inclusion of normalized directory structures and machine-parseable metadata facilitates reproducible experiments and downstream transfer learning.
Future work may focus on augmenting COFO with additional languages, problem domains, or richer static/dynamic code analyses. The dataset's design is modular, supporting domain adaptation and benchmarking for varied ML-for-code paradigms (Gautam et al., 24 Mar 2025).