This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course
Citations: 31
Authors: 3
Year: 2024
Abstract
This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against work authored solely by students and a mixed category containing both student and GPT-4 contributions, in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, each marked blindly by three independent markers, we amassed n = 300 data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = 2.482 × 10⁻¹⁰). Prompt engineering significantly improved scores for both GPT-4 (p = 1.661 × 10⁻⁴) and GPT-3.5 (p = 4.967 × 10⁻⁹). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They identified authorship accurately: 92.1% of the work categorized as 'Definitely Human' was human-authored. Simplifying this to a binary 'AI' or 'Human' categorization yielded an average accuracy of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
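The abstract's statistics lend themselves to a short illustration. The sketch below (in Python, the course's own language) shows how the reported quantities could be computed: group means with standard errors, a two-sample p-value, and the collapse of the four-point Likert authorship scale into a binary AI/Human accuracy. All data in it are synthetic placeholders, and the choice of Welch's t-test is an assumption; the abstract does not name the test actually used.

```python
# Hedged sketch with synthetic data: how the abstract's summary statistics
# could be reproduced. The paper does not state which significance test it
# used; Welch's two-sample t-test (unequal variances) is assumed here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mark distributions (percent), NOT the study's raw data.
# n = 150 per group (50 submissions x 3 markers); the standard deviations
# are chosen so the standard errors roughly match the abstract's SE values.
student_marks = rng.normal(91.9, 5.0, size=150)   # students: 91.9% (SE: 0.4)
gpt4_pe_marks = rng.normal(81.1, 10.0, size=150)  # GPT-4 + prompt eng.: 81.1% (SE: 0.8)

for name, marks in [("Students", student_marks), ("GPT-4 + PE", gpt4_pe_marks)]:
    se = marks.std(ddof=1) / np.sqrt(len(marks))  # standard error of the mean
    print(f"{name}: mean = {marks.mean():.1f}%, SE = {se:.1f}")

# Significance of the student vs. GPT-4-with-prompt-engineering gap.
t_stat, p_val = stats.ttest_ind(student_marks, gpt4_pe_marks, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_val:.3g}")

# Collapsing the four-point authorship Likert scale to a binary AI/Human call.
likert_to_binary = {
    "Definitely AI": "AI", "Probably AI": "AI",
    "Probably Human": "Human", "Definitely Human": "Human",
}
guesses = ["Definitely Human", "Probably AI", "Probably Human"]  # hypothetical
actual = ["Human", "AI", "AI"]
hits = sum(likert_to_binary[g] == a for g, a in zip(guesses, actual))
print(f"Binary identification accuracy: {hits / len(actual):.1%}")
```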
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations