This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Holistic Evaluation of LLM-Based Code Generation
Citations: 0
Authors: 2
Year: 2025
Abstract
The rapid development of large language models (LLMs) in recent years has fundamentally changed software development. In particular, the ability of modern language models to generate source code not only enables faster prototyping but also promises to relieve developers of routine tasks. At the same time, however, there is growing concern that automatically generated code does not meet the requirements of professional software projects in terms of efficiency, maintainability, and reliability. Previous studies have focused primarily on singular metrics addressing the functional correctness or runtime efficiency of generated solutions; a holistic view that accounts not only for correctness but also for algorithmic runtime efficiency and code quality is still lacking. Additionally, previous work has mainly compared LLMs that were explicitly designed for coding tasks, whereas many software developers use easily accessible LLMs that are not dedicated to coding. In this work, four state-of-the-art large language models (Llama3-70b-versatile, Deepseek-V3, GPT-4.1-nano, and Llama4-maverick-instruct-basic) are compared with respect to their Python code generation capabilities on four benchmarks (HumanEval, MBPP, APPS, and Mercury). Functional correctness (pass@k), algorithmic runtime efficiency (Mercury Beyond score), and maintainability (Cyclomatic Complexity and Maintainability Index) are evaluated. The results are then compared with the findings of previous studies that focus on coding LLMs, revealing that the performance difference from models dedicated to coding strongly depends on the chosen benchmark. The new Mercury benchmark proved particularly helpful for revealing differences in functional correctness and runtime efficiency. Our findings also underline the necessity of broadening the scope of evaluation to include maintainability: for example, the code generated by GPT-4.1-nano shows good runtime efficiency compared to the others, yet lacks maintainability.
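The pass@k metric named in the abstract is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): given n generated samples of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch (the function name is illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k samples drawn (without replacement) from n generations,
    c of which are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 5 correct -> pass@1 = 0.5
print(pass_at_k(10, 5, 1))
```

With k = 1 the estimator reduces to the plain success rate c/n; larger k rewards models that produce at least one correct solution among several attempts, which is why benchmarks typically report pass@1, pass@5, and pass@10 together.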