This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Citations: 0
Authors: 39
Year: 2025
Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
Authors
- Jingxuan Xu
- Ken Deng
- Weihao Li
- Shiying Yu
- Haiyan Tang
- Haoyang Huang
- Zhongping Lai
- Zizheng Zhan
- Yanan Wu
- Chenchen Zhang
- Kepeng Lei
- Yifan Yao
- Xinping Lei
- Wenqiang Zhu
- Zongxian Feng
- H. Li
- Jingyuan Xiong
- Dailin Li
- Zhi Gao
- Kun Wu
- Xiang Wen
- Zehui Zhan
- Yuanxing Zhang
- Wuxuan Gong
- Zhancheng Gao
- Guanxiang Wang
- Yunjing Xue
- Mengtong Li
- Mengfei Xie
- Xiaojiang Zhang
- Jinghui Wang
- Wenhao Zhuang
- Zheng Lin
- Huiming Wang
- Zhaoxiang Zhang
- Yuqun Zhang
- Haotian Zhang
- Bin Chen
- Jiaheng Liu