SWE-Sharp-Bench: C# & Sweden Benchmarks

Updated 10 November 2025
  • SWE-Sharp-Bench is a dual benchmark covering real-world C# software tasks and Sweden-specific factual recall, targeting gaps in .NET and Scandinavian QA evaluation.
  • The C# benchmark compiles 150 tasks from top repositories and pairs detailed patch-complexity metrics with reproducible containerization and automated validation.
  • The Sweden factual benchmark aggregates 1,293 bilingual QA items, emphasizing cross-lingual consistency, catastrophic forgetting, and diagnostic evaluation.

SWE-Sharp-Bench refers to two distinct academic benchmarks: one for C# software engineering tasks and one for Sweden-related factual knowledge. Both are recent, open-source efforts that address critical gaps in their respective areas—software engineering for the .NET ecosystem and factual QA for Swedish-specific topics—by adapting rigorous methodologies previously established for Python-centric and international knowledge benchmarks.

1. Definition and Scope

SWE-Sharp-Bench denotes (a) a repository-level benchmark for evaluating AI coding agents on real C# tasks (Mhatre et al., 4 Nov 2025), and (b) a manually curated diagnostic benchmark for Sweden-related factual recall, with QA items in Swedish and aligned English translations (Kunz, 24 Oct 2025). The software engineering benchmark adapts the SWE-Bench paradigm to the .NET environment, comprising 150 tasks sourced from 17 active, high-profile repositories. The diagnostic knowledge benchmark targets a different domain, collecting 1,293 closed-book QA items on Swedish personalities and events.

2. SWE-Sharp-Bench for C#: Motivation and Dataset Construction

SWE-Sharp-Bench (C#) responds to the absence of C# in existing SWE benchmarks, despite its rank as the fifth most popular programming language per TIOBE. C# projects present domain-specific challenges: .sln solution files, multiple .csproj subprojects, cross-version targeting, NuGet dependency resolution, and heterogeneous test frameworks (xUnit, NUnit, MSTest). Existing agents and benchmarks focused on Python (SWE-Bench Verified), Java and C/C++ (Multi-SWE-Bench), and JavaScript/TypeScript (SWE-PolyBench) provide no coverage of real-world .NET development complexity.
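To make this project-file heterogeneity concrete, here is a minimal sketch (not part of the benchmark's tooling) that scans a repository for .csproj files and infers which test frameworks each one references from its PackageReference entries. The NuGet package IDs are real; the function and constant names are illustrative assumptions.

```python
# Illustrative sketch: infer test frameworks used by a C# repository from its
# .csproj files. Handles both SDK-style and legacy (namespaced) project files.
from pathlib import Path
import xml.etree.ElementTree as ET

# Lowercase NuGet package-ID prefixes mapped to the framework they indicate.
TEST_PACKAGES = {
    "xunit": "xUnit",
    "nunit": "NUnit",
    "mstest.testframework": "MSTest",
}

def detect_test_frameworks(repo_root: str) -> dict:
    """Map each .csproj under repo_root to the test frameworks it references."""
    found = {}
    for csproj in Path(repo_root).rglob("*.csproj"):
        frameworks = set()
        try:
            root = ET.parse(csproj).getroot()
        except ET.ParseError:
            continue  # skip malformed project files
        for elem in root.iter():
            if not elem.tag.endswith("PackageReference"):
                continue
            package_id = (elem.get("Include") or "").lower()
            for prefix, name in TEST_PACKAGES.items():
                if package_id.startswith(prefix):
                    frameworks.add(name)
        if frameworks:
            found[str(csproj)] = frameworks
    return found

if __name__ == "__main__":
    for project, frameworks in detect_test_frameworks(".").items():
        print(f"{project}: {', '.join(sorted(frameworks))}")
```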

The pipeline for task curation incorporates five principal steps:

  1. Repository Selection: Filtering top-starred repositories with recent maintenance and working build/test suites.
  2. PR Scraping: Retaining pull requests which reference issues, modify test files, and are merged.
  3. Automated Environment Determination: Parsing workflows and project files to generate Docker environments via a Go-based generator, capturing all build dependencies.
  4. Execution-Based Filtering: Validating "pass → fail → pass" behavior across base, test-patch, and fix-patch runs to identify non-flaky tasks (see the code sketch below).
  5. Manual Verification: Dual-author review for underspecification and test quality, adhering to SWE-Bench Verified standards.

This process yields a fully reproducible benchmark, with all containerization and curation scripts open-sourced.
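
The execution-based filtering in step 4 can be read as a three-run check per candidate task. The sketch below assumes a hypothetical run_tests helper that builds the task's container, applies the given patches, runs the relevant tests, and returns True if they all pass; it illustrates the filter's logic rather than reproducing the benchmark's actual code.

```python
# Hedged sketch of the "pass -> fail -> pass" filter. `run_tests` is a
# hypothetical container-backed helper: it checks out `base_commit` in `repo`,
# applies `patches`, runs the tests, and returns True iff they all pass.

def is_valid_task(repo, base_commit, test_patch, fix_patch, run_tests) -> bool:
    # Run 1 (pass): the existing suite passes on the unmodified base commit.
    if not run_tests(repo, base_commit, patches=[]):
        return False
    # Run 2 (fail): with only the PR's test patch applied, the new/updated
    # tests must fail, showing they actually reproduce the reported issue.
    if run_tests(repo, base_commit, patches=[test_patch]):
        return False
    # Run 3 (pass): the gold fix patch makes those same tests pass.
    if not run_tests(repo, base_commit, patches=[test_patch, fix_patch]):
        return False
    # In practice the runs are repeated to rule out flaky tests.
    return True
```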

3. Benchmark Structure, Evaluation Metrics, and Agent Performance

The C# benchmark is composed of 91 bug fixes, 47 feature requests, and 12 other tasks (refactoring, documentation, build). Complexity metrics—files modified per patch (mean 4.88), hunks per patch (mean 10.0), lines added or removed (mean 131.1)—exceed those of comparable Python and Java tasks, implying a more challenging distribution for LLM agents.

The standard resolution rate is defined as the percentage of agent-generated patches passing all tests. Multi-shot evaluation uses the pass@k metric, pass@k = 1 - (1 - c/n)^k, where c is the number of correct generations out of n attempts.
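
A minimal sketch of both metrics, using the pass@k formula exactly as stated above; the function names are illustrative.

```python
def resolution_rate(num_resolved: int, num_tasks: int) -> float:
    """Single-shot metric: fraction of tasks whose agent patch passes all tests."""
    return num_resolved / num_tasks

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts is correct, given c correct
    generations out of n attempts, per the formula in the text."""
    return 1.0 - (1.0 - c / n) ** k

# Example: 3 of 10 attempts resolved a task.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")  # 0.300
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")  # 0.832
```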

The strongest results on the 150-instance benchmark demonstrate a notable performance gap:

Agent + Model           Python    Java     C#
SWE-Agent + Claude 4    66.6%     18.8%    44.7%
OpenHands + GPT-5       59.0%     21.0%    47.3%

Bug-fix tasks show higher success rates (~45%) than feature requests (~30%), and lower patch complexity correlates with higher resolution rates. Even with fully reproducible, containerized environments, the best C# configuration (OpenHands + GPT-5, 47.3%) does not match the ~70% the same agents reach on Python and remains below Java's ~50%. A logistic regression controlling for patch complexity (hunks, lines, files) confirms Python's relative ease, with Java and C# presenting comparable difficulty.
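
To illustrate the kind of regression described, here is a hedged sketch that models task resolution on patch-complexity covariates plus a language indicator; the DataFrame layout and column names are assumptions for illustration, not the paper's actual analysis code.

```python
# Illustrative only: logistic regression of resolution (0/1) on patch
# complexity and language. Assumed columns: resolved, hunks, lines_changed,
# files_modified, language ("python", "java", "csharp").
import pandas as pd
import statsmodels.formula.api as smf

def fit_resolution_model(df: pd.DataFrame):
    model = smf.logit(
        "resolved ~ hunks + lines_changed + files_modified + C(language)",
        data=df,
    )
    result = model.fit(disp=False)
    # Language coefficients (relative to the reference category) indicate the
    # residual difficulty of each language once patch complexity is held fixed.
    return result
```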

4. SWE-Sharp-Bench for Sweden-Related Factual Knowledge: Dataset and Evaluation

The Sweden-focused benchmark assembled 1,293 QA items—1,190 on "Sommar i P1" radio hosts, 102 on sporting events—each paired with human English translations. Annotation involved dual-student authorship, editing for clarity and answer minimality, and source verification against Wikipedia and official sites.

Evaluation comprises:

  • Factual recall (closed-book QA in both Swedish and English)
  • Cross-lingual consistency: probabilities of correct answers in one language given correct answers in the other
  • Core metrics: Exact Match (EM), token-level F1, Recall (R); a minimal sketch of these metrics follows this list
  • Forgetting ratio after continued pre-training (CPT) on Scandinavian corpora
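
A minimal sketch of the core QA metrics and the cross-lingual consistency idea, under assumed (not the paper's exact) normalization; the forgetting-ratio definition shown is likewise an assumed, common formulation.

```python
# Illustrative QA metrics: Exact Match, token-level F1, cross-lingual
# consistency P(correct in Swedish | correct in English), and one common
# (assumed) definition of a forgetting ratio.
from collections import Counter

def normalize(text: str) -> list:
    return text.lower().strip().split()

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def conditional_consistency(correct_sv: list, correct_en: list) -> float:
    """P(correct in Swedish | correct in English) over paired items."""
    sv_given_en = [sv for sv, en in zip(correct_sv, correct_en) if en]
    return sum(sv_given_en) / len(sv_given_en) if sv_given_en else 0.0

def forgetting_ratio(acc_before: float, acc_after: float) -> float:
    """Assumed definition: relative accuracy drop on previously-answered items
    after continued pre-training (CPT)."""
    return (acc_before - acc_after) / acc_before if acc_before else 0.0
```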

Empirical findings indicate that no model exceeds 25% recall, though performance improves with model size. Multilingual models typically perform better when prompted in English, except where Swedish coverage has been specifically tuned. Continued pre-training on Swedish increases Sweden-specific factual knowledge while causing substantial forgetting of prior information (35.8% for AI Sweden LLaMA-3 8B).

5. Reliability, Shortcomings, and Benchmark Integrity

SWE-Sharp-Bench (software engineering) shares its lineage with SWE-Bench Verified, entailing specific validation, filtering, and evaluation protocols (Sonwane et al., 22 Oct 2025). Patch-based validation using only PR-modified test files can inflate leaderboards: 7.8% of plausible patches are incorrect under complete test suite execution, while PatchDiff differential testing reveals that 29.6% of plausible patches induce behavioral divergence from ground-truth fixes (Wang et al., 19 Mar 2025).
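
To make the two validation regimes concrete, here is a hedged sketch contrasting plausibility (running only the PR-modified test files) with confirmation against the complete test suite; run_tests is the same kind of hypothetical container-backed helper as in the earlier sketch, and the scope argument is an assumption.

```python
# Illustrative only: a patch that passes the PR's fail-to-pass tests is merely
# "plausible"; the complete suite catches a further slice of incorrect patches.

def validate_patch(repo, base_commit, test_patch, agent_patch, run_tests) -> str:
    patches = [test_patch, agent_patch]
    # Leaderboard-style check: only the PR-modified test files.
    if not run_tests(repo, base_commit, patches=patches, scope="pr_tests"):
        return "rejected"
    # Stricter check: the repository's complete test suite.
    if not run_tests(repo, base_commit, patches=patches, scope="all_tests"):
        return "plausible_but_incorrect"
    return "resolved"
```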

PatchDiff generates synthetic tests that distinguish agent-generated patches from human fixes, identifying (i) divergent implementations of the same semantic change (46.8%), (ii) supplementary changes, (iii) absent semantic changes, and (iv) total misalignment. Manual inspection reveals 28.6% of behaviorally divergent patches are certainly incorrect, resulting in an average 6.2 percentage point inflation of reported resolution rates. Integrating PatchDiff with future benchmarks is recommended for more robust semantic evaluation.

6. Open-Source Artifacts, Extensions, and Future Directions

All SWE-Sharp-Bench data, curation pipelines, Dockerfile generation scripts, and evaluation harnesses are open-sourced on HuggingFace and GitHub. Planned extensions include increasing task volume, adding Visual Basic instances for full .NET coverage, enriching annotation for difficulty/taxonomy, and exploring synthetic augmentation strategies (e.g., SWE-Smith-style scaling) and multimodal tasks (UI screenshots).

For the Sweden-related factual knowledge benchmark, future work lies in deeper cross-lingual probing, balancing knowledge gain versus catastrophic forgetting, and refining evaluation for language-mismatched answers.

7. Significance and Impact

SWE-Sharp-Bench for C# establishes the first containerized, scalable, and reproducible yardstick for LLM and agent performance on enterprise-grade .NET codebases. It quantifies domains where Python-centric benchmarks may overstate agent capacity, identifying language-specific barriers and complexity-induced performance gaps. The Sweden-centric factual QA benchmark demonstrates the limitations of translated, non-native datasets for factual recall, revealing model size, language adaptation, and cross-lingual inconsistency dynamics that guide Scandinavian NLP research.

Collectively, these benchmarks define new standards for rigor, reproducibility, and diagnostic coverage in their domains. They expose shortcomings of current validation practices and contribute open scaffolding for transparent evaluation and future methodological innovation.
