Performance beyond SWE-bench
Evaluate the SERA-32B coding agent on coding benchmarks and tasks other than SWE-bench Verified, and characterize how well it generalizes and where it fails in those settings.
References
While this suggests our results may generalize to some degree, we have not validated our model on other coding benchmarks or tasks, and we do not know how well it performs more broadly.
— SERA: Soft-Verified Efficient Repository Agents
(2601.20789 - Shen et al., 28 Jan 2026) in Section 9 (Limitations), Evaluation only on SWE-bench