Evaluate AI agents on real-world bug fixing.
SWE-bench is a benchmark for evaluating how well AI agents fix real bugs in open-source repositories. Each task instance pairs a GitHub issue with the repository snapshot it was filed against, and an agent is scored on whether its generated patch makes the issue's tests pass. SWE-bench was introduced in 2023; when its human-filtered SWE-bench Verified subset launched in August 2024, the leading model resolved only 33% of its issues, but within a year top models were consistently scoring above 70%.

SWE-bench offers several datasets tailored to different evaluation needs: SWE-bench Verified, a 500-instance human-filtered subset; SWE-bench Lite, a 300-instance subset for cost-effective evaluation; SWE-bench Bash Only, which runs the Verified dataset in the minimal mini-SWE-agent environment; and SWE-bench Multimodal, 517 instances whose issues include visual elements such as screenshots. The project continues to evolve: recent releases include SWE-bench Multimodal and SWE-smith, a toolkit for training custom software-engineering models.

Leaderboards report a % Resolved metric: the percentage of a dataset's task instances for which the submitted patch passes evaluation.
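The datasets above are published on the Hugging Face Hub under the princeton-nlp organization, so an evaluation run can start from a plain `datasets` load. A minimal sketch follows; the field names match the dataset's published schema, and the example values in the comments are illustrative:

```python
from datasets import load_dataset

# Load the human-filtered Verified subset from the Hugging Face Hub.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))  # 500 task instances

task = verified[0]
# Each instance pairs a GitHub issue with the repo state it was filed against.
print(task["instance_id"])              # e.g. "astropy__astropy-12907"
print(task["repo"])                     # source repository, e.g. "astropy/astropy"
print(task["problem_statement"][:200])  # the issue text the agent must resolve
```

Swapping in "princeton-nlp/SWE-bench_Lite" or "princeton-nlp/SWE-bench_Multimodal" selects the other subsets.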
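The % Resolved figure itself is just resolved instances over total attempted. The sketch below computes it from a per-instance report file; the official harness (invoked as `python -m swebench.harness.run_evaluation` against a predictions file) produces its own report format, so treat the JSON shape here as an assumption for illustration:

```python
import json

def percent_resolved(report_path: str) -> float:
    """% Resolved: share of attempted task instances whose patch passed."""
    with open(report_path) as f:
        # Assumed shape: {"<instance_id>": {"resolved": bool}, ...} --
        # a hypothetical report layout, not the harness's exact schema.
        report = json.load(f)
    total = len(report)
    resolved = sum(1 for result in report.values() if result.get("resolved"))
    return 100.0 * resolved / total if total else 0.0

print(f"{percent_resolved('run_report.json'):.1f}% resolved")
```

Because the denominator differs per dataset (500 for Verified, 300 for Lite, and so on), % Resolved scores are only comparable within the same dataset.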