This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Comparative evaluation of large language model–based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity
Citations: 1
Authors: 3
Year: 2025
Abstract
OBJECTIVE: This study aimed to comparatively evaluate the clinical knowledge generation performance of 3 widely used large language model (LLM)-based chatbots (ChatGPT, Claude, and Perplexity) in the context of septic arthritis.

METHODS: This cross-sectional comparative study was based on 24 scenario-based clinical questions developed in accordance with the SANJO guideline (Management of Septic Arthritis in Native Joints) of the European Bone and Joint Infection Society. Responses generated by ChatGPT (OpenAI GPT-4), Claude 2 (Anthropic), and Perplexity AI were independently assessed by 2 senior experts: 1 in orthopedic surgery and the other in infectious diseases. Each response was evaluated across 6 domains: scientific accuracy, content depth, terminological consistency, clinical applicability, brevity, and reference support, using a 5-point Likert scale.

RESULTS: All 3 LLM-based chatbots achieved perfect scores in accuracy and terminological consistency (P = 1.000), and no significant difference was observed in clinical applicability (P = .912). Perplexity scored significantly lower in content depth compared to both ChatGPT (P = .001) and Claude (P = .041), whereas ChatGPT and Claude did not differ significantly (P = .807). ChatGPT produced significantly more unnecessary elaboration than Claude (P = .009) and Perplexity (P < .001), while Claude and Perplexity were comparable (P = .115). For reference support, Perplexity scored significantly higher than both ChatGPT (P < .001) and Claude (P < .001), with no difference between the latter 2 (P = 1.000). Overall, Perplexity achieved the highest total score (P < .001), followed by ChatGPT and Claude. Interrater agreement was substantial (κ = 0.72).

CONCLUSION: The LLM-based chat platforms demonstrated overall high performance, but their strengths differed across evaluation domains. While ChatGPT and Claude provided more comprehensive and detailed responses, Perplexity offered stronger reference support.
These findings suggest that context-specific selection of LLMs is essential, as the optimal choice may vary depending on whether detailed explanation or robust referencing is prioritized. Cite this article as: Bayrak HC, Karagöz B, Bayrak Ö. Comparative evaluation of large language model-based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity. Acta Orthop Traumatol Turc., 2025;59(6):415-420.
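The reported interrater agreement of κ = 0.72 refers to Cohen's kappa, which corrects observed agreement between two raters for the agreement expected by chance. As an illustration only (not the authors' code, and assuming the unweighted variant; the study may have used a weighted kappa for its ordinal Likert scores), a minimal sketch for two raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters scoring 10 responses on a 5-point scale.
a = [5, 4, 5, 3, 4, 5, 2, 4, 5, 3]
b = [5, 4, 4, 3, 4, 5, 2, 5, 5, 3]
print(round(cohens_kappa(a, b), 2))  # → 0.71
```

A kappa of 0.72 falls in the 0.61–0.80 band conventionally described as "substantial" agreement (Landis and Koch).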
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,578 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,470 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,984 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,814 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations