
Response to the letter to the editor: “Beyond readability scores: Reconsidering the evaluation of large language models in prosthodontic patient education”

2025 · 0 citations · 7 authors · Journal of Prosthodontics
Abstract

We appreciate Dr. Ardila's thoughtful and constructive reflections on our recent publication, “Assessing readability and accuracy of content produced by the American College of Prosthodontists and large language models for patient education in prosthodontics.”1 We are encouraged that our work stimulated discourse on how best to evaluate the educational and clinical potential of large language models (LLMs) in prosthodontics. His comments offer valuable perspectives that complement our original intent: to initiate empirical inquiry into the clarity, consistency, and reliability of artificial intelligence (AI)-generated patient information.

Dr. Ardila rightly highlights that readability indices such as the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL) quantify linguistic complexity rather than actual comprehension. Indeed, readability is only one dimension of patient understanding. Our study deliberately focused on readability and factual accuracy as measurable entry points for characterizing LLM performance with standardized, reproducible metrics. As Ardila notes, complementary tools such as the Patient Education Materials Assessment Tool (PEMAT) and the Suitability Assessment of Materials (SAM)2,3 could deepen future analyses by incorporating understandability and actionability constructs. We fully agree that subsequent studies should integrate these multidimensional assessments to bridge syntactic simplicity and real-world interpretability.

We also concur with the suggestion that patient interaction with LLMs is iterative and dialogic rather than static. Our approach evaluated single-response outputs under controlled conditions to ensure comparability across models. However, simulation-based designs that replicate conversational dynamics, such as patient–LLM exchanges, represent an important next step.
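For readers less familiar with the indices discussed above, the standard published FRES and FKGL formulas can be sketched as follows. This is an illustrative sketch only: the word, sentence, and syllable counts are assumed to come from an upstream text tokenizer, which is not shown here, and the example counts are hypothetical.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # FRES: higher scores indicate easier text (roughly 60-70 = plain English).
    # Built from two ratios: average sentence length and syllables per word.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # FKGL: maps the same two ratios onto an approximate U.S. school grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
# This scores roughly FRES ≈ 59.6 (fairly plain) and FKGL ≈ 9.9 (about
# 10th grade), illustrating that both indices react only to sentence
# length and syllable density, never to whether a patient understood.
fres = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
```

Note that both formulas are purely surface-level measures of text structure, which is precisely why instruments such as PEMAT and SAM, which score understandability and actionability, are complementary rather than redundant.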
Similar frameworks have been tested successfully in medical education to assess LLM-facilitated history-taking and feedback loops.4 These approaches could reveal not only linguistic but also behavioral aspects of AI-mediated communication in dental contexts.

We acknowledge the inherent asymmetry between institutional resources (e.g., the ACP FAQs) and generative model outputs. Our rationale for using the ACP website was its accessibility, credibility, and relevance to the intended patient audience. As Ardila notes, future studies might incorporate additional curated repositories (e.g., ADA or NIH resources) to diversify reference materials and mitigate potential content imbalance. The comparison nevertheless provided a necessary benchmark for the initial calibration of model performance against a trusted standard within organized prosthodontics.

The comment regarding rubric transparency is appreciated. We used consensus-based criteria to define “major” versus “minor” inaccuracies and reported interrater reliability to ensure internal consistency. While our manuscript summarized the process, future reports will include explicit operational examples and error taxonomies. The emerging Minimum Reporting Items for LLM Evaluation in Healthcare (MI-CLEAR-LLM) guidelines emphasize version control, prompt documentation, and rubric transparency,5 which we fully support and plan to adopt in ongoing work.

Ardila's observation that a greater word count does not necessarily reflect better educational quality is well taken. Our correlation analysis identified associations but did not infer causation.
Evaluating “relevance density,” as suggested, could refine assessment frameworks to focus on informational efficiency rather than verbosity.6 Excessive detail may overwhelm readers, underscoring the importance of balancing depth against cognitive-load principles in patient education.7

We agree that version tracking and prompt archival are critical for reproducibility given the rapid evolution of LLMs. Our data were generated in October 2024 using the then-current model versions (ChatGPT, Copilot, and Gemini). Future research should indeed incorporate time-stamped logs, parameter metadata, and open repositories, in line with the reproducibility practices outlined by Park et al.5 and Lekadir et al.8

We appreciate the emphasis on factual provenance and transparency. Our study assessed content accuracy but did not evaluate citation traceability, because current LLMs lack consistent source-attribution mechanisms. Integrating “source auditability” and explainability metrics, however, would strengthen validity and ethical oversight in AI-driven patient communication.9 In addition, safety parameters, such as a model's ability to flag uncertainty or refer users to professional consultation, merit systematic evaluation to ensure patient protection and compliance with medico-legal standards.10

We thank Dr. Ardila for advancing this scholarly dialogue. His letter underscores the need for broader, multidimensional frameworks that integrate linguistic, behavioral, ethical, and technical analyses of AI in patient education. We share his vision that prosthodontic researchers must move beyond readability scores to understand how LLMs can safely augment patient learning and engagement. Ongoing interdisciplinary collaboration among clinicians, informaticians, and behavioral scientists will be essential to transform these technologies into trustworthy, equitable tools for prosthodontic care.
