This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Comparative evaluation of large language model–based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity
Citations: 1
Authors: 3
Year: 2025
Abstract
OBJECTIVE: This study aimed to comparatively evaluate the clinical knowledge generation performance of 3 widely used large language model (LLM)-based chatbots (ChatGPT, Claude, and Perplexity) in the context of septic arthritis.

METHODS: This cross-sectional comparative study was based on 24 scenario-based clinical questions developed in accordance with the SANJO guideline (Management of Septic Arthritis in Native Joints) of the European Bone and Joint Infection Society. Responses generated by ChatGPT (OpenAI GPT-4), Claude 2 (Anthropic), and Perplexity AI were independently assessed by 2 senior experts: 1 in orthopedic surgery and the other in infectious diseases. Each response was evaluated across 6 domains: scientific accuracy, content depth, terminological consistency, clinical applicability, brevity, and reference support, using a 5-point Likert scale.

RESULTS: All 3 LLM-based chatbots achieved perfect scores in accuracy and terminological consistency (P = 1.000), and no significant difference was observed in clinical applicability (P = .912). Perplexity scored significantly lower in content depth compared to both ChatGPT (P = .001) and Claude (P = .041), whereas ChatGPT and Claude did not differ significantly (P = .807). ChatGPT produced significantly more unnecessary elaboration than Claude (P = .009) and Perplexity (P < .001), while Claude and Perplexity were comparable (P = .115). For reference support, Perplexity scored significantly higher than both ChatGPT (P < .001) and Claude (P < .001), with no difference between the latter 2 (P = 1.000). Overall, Perplexity achieved the highest total score (P < .001), followed by ChatGPT and Claude. Interrater agreement was substantial (κ = 0.72).

CONCLUSION: The LLM-based chat platforms demonstrated overall high performance, but their strengths differed across evaluation domains. While ChatGPT and Claude provided more comprehensive and detailed responses, Perplexity offered stronger reference support.
These findings suggest that context-specific selection of LLMs is essential, as the optimal choice may vary depending on whether detailed explanation or robust referencing is prioritized. Cite this article as: Bayrak HC, Karagöz B, Bayrak Ö. Comparative evaluation of large language model-based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity. Acta Orthop Traumatol Turc., 2025;59(6):415-420.
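The reported interrater agreement of κ = 0.72 refers to Cohen's kappa, which corrects observed agreement between two raters for the agreement expected by chance. As an illustration only (not the authors' code, and assuming the unweighted variant; the study may have used a weighted kappa for its ordinal Likert scores), a minimal sketch for two raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters scoring 10 responses on a 5-point scale.
a = [5, 4, 5, 3, 4, 5, 2, 4, 5, 3]
b = [5, 4, 4, 3, 4, 5, 2, 5, 5, 3]
print(round(cohens_kappa(a, b), 2))  # → 0.71
```

A kappa of 0.72 falls in the 0.61–0.80 band conventionally described as "substantial" agreement (Landis and Koch).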
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,578 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,470 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,984 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,814 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations