This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course
Citations: 31
Authors: 3
Year: 2024
Abstract
This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against work authored solely by students and a mixed category containing both student and GPT-4 contributions, in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, each marked blindly by three independent markers, we amassed n = 300 data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10⁻¹⁰). Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10⁻⁴) and GPT-3.5 (p = 4.967 × 10⁻⁹). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They identified authorship accurately: 92.1% of the work categorized as 'Definitely Human' was human-authored. Simplifying this to a binary 'AI' or 'Human' categorization yielded an average accuracy of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
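The abstract's statistics lend themselves to a short illustration. The sketch below (in Python, the course's own language) shows how the reported quantities could be computed: group means with standard errors, a two-sample p-value, and the collapse of the four-point Likert authorship scale into a binary AI/Human accuracy. All data in it are synthetic placeholders, and the choice of Welch's t-test is an assumption; the abstract does not name the test actually used.

```python
# Hedged sketch with synthetic data: how the abstract's summary statistics
# could be reproduced. The paper does not state which significance test it
# used; Welch's two-sample t-test (unequal variances) is assumed here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mark distributions (percent), NOT the study's raw data.
# n = 150 per group (50 submissions x 3 markers); the standard deviations
# are chosen so the standard errors roughly match the abstract's SE values.
student_marks = rng.normal(91.9, 5.0, size=150)   # students: 91.9% (SE: 0.4)
gpt4_pe_marks = rng.normal(81.1, 10.0, size=150)  # GPT-4 + prompt eng.: 81.1% (SE: 0.8)

for name, marks in [("Students", student_marks), ("GPT-4 + PE", gpt4_pe_marks)]:
    se = marks.std(ddof=1) / np.sqrt(len(marks))  # standard error of the mean
    print(f"{name}: mean = {marks.mean():.1f}%, SE = {se:.1f}")

# Significance of the student vs. GPT-4-with-prompt-engineering gap.
t_stat, p_val = stats.ttest_ind(student_marks, gpt4_pe_marks, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_val:.3g}")

# Collapsing the four-point authorship Likert scale to a binary AI/Human call.
likert_to_binary = {
    "Definitely AI": "AI", "Probably AI": "AI",
    "Probably Human": "Human", "Definitely Human": "Human",
}
guesses = ["Definitely Human", "Probably AI", "Probably Human"]  # hypothetical
actual = ["Human", "AI", "AI"]
hits = sum(likert_to_binary[g] == a for g, a in zip(guesses, actual))
print(f"Binary identification accuracy: {hits / len(actual):.1%}")
```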
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations