This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
Citations: 0
Authors: 8
Year: 2026
Abstract
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks from code generation to program repair, producing a massive volume of software artifacts. This surge in automated creation has exposed a critical bottleneck: the lack of scalable and reliable methods to evaluate the quality of these outputs. Human evaluation, while effective, is costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm has emerged: it leverages the advanced reasoning and coding capabilities of LLMs themselves to perform automated evaluations, offering a compelling path toward combining the nuance of human insight with the scalability of automated systems. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs still needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, and shares potential research paths toward this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To realize this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap for the future development of LLM-as-a-Judge in software engineering. While not intended as a definitive guide, our work aims to foster further research on and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation.