Performance beyond SWE-bench

Determine the performance of the SERA-32B coding agent on coding benchmarks and tasks other than SWE-bench Verified, and characterize its broader generalization and potential failure modes in those settings.

Background

All evaluations in the paper are conducted on SWE-bench Verified; while the model appears effective in internal use, some scaffold-specific behaviors from training remain.

The authors explicitly state they have not validated the model on other coding benchmarks or tasks and do not know its broader performance.

References

"While this suggests our results may generalize to some degree, we have not validated our model on other coding benchmarks or tasks, and we do not know how well it performs more broadly."

SERA: Soft-Verified Efficient Repository Agents (2601.20789 - Shen et al., 28 Jan 2026), Section 9 (Limitations), "Evaluation only on SWE-bench"