CFSC Dataset: A Large-Scale Code Resource
- The CFSC (COFO) dataset is a large-scale corpus of 369,102 accepted Codeforces solutions spanning 809 problem classes.
- It was collected through a multi-stage web-scraping pipeline that captures detailed metadata, ensuring uniform class distribution and high data quality.
- The resource supports ML tasks such as program classification, code tagging, and resource prediction using structured test cases and specifications.
The COFO dataset (Codeforces Source Code, abbreviated as the CFSC dataset in technical contexts) is a large-scale corpus designed for machine learning research in program classification, recognition, tagging, code comprehension, and related software engineering tasks. Derived from the Codeforces competitive programming platform, COFO comprises 369,102 accepted source code solutions across 809 problem classes, with associated metadata, problem specifications, test cases, code tags, and detailed organizational structure. This resource enables researchers to model and analyze code at scale and supports a variety of ML-for-code tasks, surpassing the scope of previous datasets such as POJ-104 (Gautam et al., 24 Mar 2025).
1. Collection, Curation, and Filtering Methodology
The COFO dataset was constructed with a multi-stage web-scraping pipeline tailored to the Codeforces environment. Metadata, including problem indices, contest identifiers, and code tags, was acquired using the Codeforces public API. For each problem, comprehensive natural-language statements, input/output (I/O) specifications, and example test cases were extracted via BeautifulSoup from problem web pages. Program submissions and additional problem test cases were captured using a Selenium-driven headless browser, strategically loading submissions pages (up to 50 accepted solutions per load) and, via interactive navigation, obtaining full test-case suites associated with the first accepted solution.
Each collected source code submission was retained only if it was marked "accepted" by the Codeforces grading system—guaranteeing compilability and correctness relative to the collected test cases. For each language/problem pair, a minimum of 10 and a maximum of 750 accepted solutions was enforced, balancing data abundance with class distribution uniformity and mitigating long-tail effects.
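The retention rule above (accepted verdicts only, with a per-language/problem floor of 10 and cap of 750) can be sketched as follows. This is a minimal illustration, not code from the released scraper; the function name and submission-record shape are hypothetical.

```python
# Sketch of the per-(language, problem) retention rule described above:
# keep only "accepted" submissions, cap each group at MAX_PER_CLASS to
# limit long-tail imbalance, and drop groups below MIN_PER_CLASS.
from collections import defaultdict

MIN_PER_CLASS = 10   # minimum accepted solutions per language/problem pair
MAX_PER_CLASS = 750  # maximum accepted solutions per language/problem pair

def filter_submissions(submissions):
    """submissions: iterable of dicts with 'problemID', 'language', 'verdict'."""
    groups = defaultdict(list)
    for sub in submissions:
        if sub["verdict"] != "accepted":          # drop non-accepted verdicts
            continue
        key = (sub["problemID"], sub["language"])
        if len(groups[key]) < MAX_PER_CLASS:      # enforce the upper cap
            groups[key].append(sub)
    # enforce the lower bound: drop under-represented language/problem pairs
    return {k: v for k, v in groups.items() if len(v) >= MIN_PER_CLASS}
```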
The final dataset encompasses:
- Problems (classes): 809
- Accepted code samples: 369,102
- Supported languages (sample counts):
- C 11: 26,449
- C++ 11: 92,015
- C++ 14: 76,873
- C++ 17: 97,926
- Java 8: 33,919
- Java 11: 14,876
- Python 3: 27,044
Scraper failures and partial retrievals were logged, and collection for a problem halted after exhausting its available submissions or reaching the per-problem cap.
2. Data Architecture and Representation
COFO's on-disk organization mirrors its hierarchical semantics. The root directory contains one subdirectory per problem (problemID), where each problem is characterized by the following files:
- specifications.txt: Contains the full problem statement, input and output format descriptions, and declared constraints (e.g., time and memory limits).
- testcases.txt: Lists newline-separated input/output pairs extracted from Codeforces for validation.
- tags.txt: Encodes ground-truth code tags, space-separated.
- submissions/: Houses language-specific subdirectories (e.g., C11/, C++11/, C++14/, C++17/, Java8/, Java11/, Python3/), each containing source-code files named by their submission IDs.
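A reader of this layout can be sketched directly from the directory conventions above. This is an illustrative walk over one problem directory, assuming the file names listed in this section; it is not part of the released toolchain.

```python
# Sketch: index one problemID directory of the on-disk layout described above.
from pathlib import Path

def index_problem(problem_dir: Path) -> dict:
    """Build a per-problem record from the COFO directory conventions."""
    record = {
        "problemID": problem_dir.name,
        "tags": (problem_dir / "tags.txt").read_text().split(),  # space-separated
        "specification": (problem_dir / "specifications.txt").read_text(),
        "submissions": [],
    }
    # submissions/ holds one subdirectory per language, files named by submission ID
    for src in sorted((problem_dir / "submissions").rglob("*")):
        if src.is_file():
            record["submissions"].append({
                "language": src.parent.name,               # e.g. "C++17"
                "submissionID": src.stem,                  # file named by its ID
                "code_filepath": str(src.relative_to(problem_dir)),
            })
    return record
```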
The logical metadata schema per problem is formalized as follows:
| Field | Type | Example / Notes |
|---|---|---|
| problemID | String | "1354B" |
| contestID | Integer | 1354 |
| index | String | "B" |
| tags | List<String> | "greedy", "implementation" |
| specification | String | Full textual description |
| input_output | Dict | { time_limit_ms, memory_limit_mb, input_format, output_format, constraints } |
| test_cases | List<Dict> | { "input", "output" } pairs |
| submissions | List<Dict> | { submissionID, language, code_filepath } |
A canonical instance:
{
"problemID": "1354B",
"contestID": 1354,
"index": "B",
"tags": ["greedy", "implementation"],
"specification": "You are given N and a sequence of integers…",
"input_output": {
"time_limit_ms": 1000,
"memory_limit_mb": 256,
"input_format": "First line contains T…",
"output_format": "For each test case output…",
"constraints": "1 ≤ T ≤ 1e4; sum of all N ≤ 2e5"
},
"test_cases": [
{ "input": "3\n3\n1 2 3\n", "output": "2\n" },
{ "input": "1\n5\n5 4 3 2 1\n", "output": "0\n" }
],
"submissions": [
{ "submissionID": 85932123, "language": "C++17", "code_filepath": "submissions/C++17/85932123.cpp" },
{ "submissionID": 85933045, "language": "Python3", "code_filepath": "submissions/Python3/85933045.py" }
]
}
3. Statistical Profile and Quantitative Attributes
Class and Language Distribution
Let N denote the total number of code samples, with n_c representing the count for problem class c. The class probability for each problem is computed as p_c = n_c / N. For instance, a class with 456 samples has p_c = 456 / 369,102 ≈ 0.00124 (about 0.12% of the corpus).
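The class-probability computation can be stated directly in code. A minimal sketch using the published corpus total; the function name is illustrative.

```python
# Class probability p_c = n_c / N, using the published dataset total.
N_TOTAL = 369_102  # total accepted code samples in COFO

def class_probability(n_c: int, n_total: int = N_TOTAL) -> float:
    """Fraction of the corpus belonging to a class with n_c samples."""
    return n_c / n_total

p = class_probability(456)  # a 456-sample class covers roughly 0.12% of the corpus
```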
COFO's language breakdown (by sample percentage) is:
| Language | Sample Count | Percentage |
|---|---|---|
| C++17 | 97,926 | 26.5% |
| C++11 | 92,015 | 24.9% |
| C++14 | 76,873 | 20.8% |
| Java 8 | 33,919 | 9.2% |
| Python 3 | 27,044 | 7.3% |
| C 11 | 26,449 | 7.2% |
| Java 11 | 14,876 | 4.0% |
Code Length
The dataset contains 17.6 million lines of code (LOC), giving an average code length per sample of roughly 47.7 LOC (17,600,000 / 369,102 ≈ 47.7).
Let ℓ_i denote the length of sample i and N the total number of samples. The mean and variance are given by mean = (1/N) Σ_i ℓ_i and σ² = (1/N) Σ_i (ℓ_i − mean)².
It follows that code-length variance and class/sample imbalance analyses are directly supported by the available metadata and these formulas.
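The mean/variance definitions translate to a few lines of code. A minimal sketch; the function name is illustrative.

```python
# Mean and (population) variance of per-sample code length, as defined above:
# mean = (1/N) * sum(l_i), var = (1/N) * sum((l_i - mean)**2)
def loc_stats(lengths):
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((l - mean) ** 2 for l in lengths) / n
    return mean, var
```

Applied at corpus scale, the same mean formula reproduces the headline figure: 17,600,000 total LOC over 369,102 samples gives about 47.7 LOC per sample.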
4. Metadata Enrichment and Tagging Schema
COFO encodes comprehensive metadata per problem, supporting sophisticated downstream tasks. Code tags are derived from Codeforces and represent functional or structural attributes of problems. There are 1955 total tag occurrences, with 35 unique tags (e.g., "greedy", "dp", "graphs", "math", "sortings", "strings", "bruteforce"). Tag cardinality per problem ranges from 0 to 8, with the mode at 2. All tags for a problem are stored in its tags.txt.
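For multi-label tasks, the space-separated contents of tags.txt map naturally onto a multi-hot vector over the tag vocabulary. A minimal sketch; the vocabulary shown is a small illustrative subset of the 35 unique tags.

```python
# Multi-hot encoding of a problem's tags over a fixed tag vocabulary.
def encode_tags(tags_line: str, vocab: list) -> list:
    """tags_line: space-separated contents of a problem's tags.txt."""
    present = set(tags_line.split())
    return [1 if tag in present else 0 for tag in vocab]

# Illustrative subset of the tag vocabulary:
VOCAB = ["greedy", "dp", "graphs", "math", "sortings", "strings"]
```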
Problem statement texts, structured I/O descriptions, and constraints are consolidated in specifications.txt, standardizing human- and machine-readability. The storage of full test case sets in testcases.txt and time/memory limits in the input/output metadata provides granular ground-truth for benchmarking code behavior and resource predictions.
5. Machine Learning Applications and Baseline Evaluation Strategies
COFO is engineered for ML-for-code tasks requiring scale, semantic annotations, and code diversity. Standard application paradigms include:
- Program classification/recognition: Assignment of a source code solution to one of the 809 task classes; facilitates clustering, supervised or few-shot learning.
- Code tagging: Prediction of the appropriate set of code tags (multi-label classification) for unseen code, using tags from tags.txt as ground truth.
- Predicting program properties: Inference of resource bounds (e.g., time/memory consumption) via code analysis, leveraging the explicit I/O constraints per problem.
- Code comprehension and summarization: Model training on problem description ↔ code pairs, supporting tasks like code search, code synthesis, and natural language-to-code mapping.
Protocols are not mandated; typical practice is an 80/10/10 stratified split into training, validation, and test sets at the class level. Benchmarking leverages accuracy and top-k metrics, with POJ-104 baselines (e.g., tree-based CNNs, RNNs) as comparators. A key distinction is COFO's scale: it is roughly seven times larger in both classes and solution count than POJ-104 (104 tasks, 52,000 programs).
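The typical 80/10/10 class-stratified split can be sketched as follows. This is one reasonable implementation, not a mandated protocol; the function name and sample representation are illustrative.

```python
# Sketch of an 80/10/10 split stratified at the problem-class level.
import random
from collections import defaultdict

def stratified_split(samples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """samples: list of (problemID, sample) pairs. Returns (train, val, test)."""
    by_class = defaultdict(list)
    for pid, sample in samples:
        by_class[pid].append(sample)
    rng = random.Random(seed)          # fixed seed for reproducible splits
    train, val, test = [], [], []
    for pid, items in by_class.items():
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]   # remainder goes to test
    return train, val, test
```

Splitting within each class (rather than globally) keeps every problem represented in all three partitions, which matters for the 809-way classification setting.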
The scraping toolchain and dataset are available at https://github.com/kgautam01/CodeForces-Scraper, enabling further replication and extension.
6. Comparative Context and Research Impact
COFO (CFSC dataset) is situated within the broader movement towards Big Code datasets for empirical code analysis, ML-based code classification, and program synthesis research. Its structural richness, explicit linking of code to rich problem metadata, code tags, and granular I/O constraints aligns it with contemporary demands for benchmark diversity, scale, and annotation density.
A notable implication is the improved granularity for evaluation and generalization in code classification research relative to prior benchmarks. The inclusion of normalized directory structures and machine-parseable metadata facilitates reproducible experiments and downstream transfer learning.
Future work may focus on augmenting COFO with additional languages, problem domains, or richer static/dynamic code analyses. The dataset's design is modular, supporting domain adaptation and benchmarking for varied ML-for-code paradigms (Gautam et al., 24 Mar 2025).