This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Holistic Evaluation of LLM-Based Code Generation
Citations: 0
Authors: 2
Year: 2025
Abstract
The rapid development of large language models (LLMs) in recent years has fundamentally changed software development. In particular, the ability of modern language models to generate source code not only enables faster prototyping but also promises to relieve developers of routine tasks. At the same time, however, there is growing concern that automatically generated code does not meet the requirements of professional software projects in terms of efficiency, maintainability, and reliability. Previous studies have focused primarily on singular metrics addressing the functional correctness or runtime efficiency of generated solutions; a holistic view that accounts not only for correctness but also for algorithmic runtime efficiency and code quality is still lacking. Additionally, previous work has mainly compared LLMs that were explicitly designed for coding tasks, whereas many software developers use easily accessible LLMs that are not dedicated to coding. In this work, four state-of-the-art large language models (Llama3-70b-versatile, Deepseek-V3, GPT-4.1-nano, and Llama4-maverick-instruct-basic) are compared with respect to their Python code generation capabilities on four benchmarks (HumanEval, MBPP, APPS, and Mercury). Functional correctness (pass@k), algorithmic runtime efficiency (Mercury Beyond score), and maintainability (Cyclomatic Complexity and Maintainability Index) are evaluated. The results are then compared with the findings of previous studies that focus on coding LLMs, revealing that the performance difference from models dedicated to coding strongly depends on the chosen benchmark. The new Mercury benchmark proved particularly helpful for revealing differences in functional correctness and runtime efficiency. Our findings also underline the necessity of broadening the scope of evaluation to include maintainability: for example, the code generated by GPT-4.1-nano shows good runtime efficiency compared to the others, yet lacks maintainability.
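The pass@k metric named in the abstract is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): given n generated samples of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch (the function name is illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k samples drawn (without replacement) from n generations,
    c of which are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 5 correct -> pass@1 = 0.5
print(pass_at_k(10, 5, 1))
```

With k = 1 the estimator reduces to the plain success rate c/n; larger k rewards models that produce at least one correct solution among several attempts, which is why benchmarks typically report pass@1, pass@5, and pass@10 together.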