This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
In Reply to Spadafore and Monrad
0
Citations
3
Authors
2019
Year
Abstract
We thank the authors for expressing their concerns regarding potential bias in the application of machine learning (ML) to the scoring of the United States Medical Licensing Examination Step 2 Clinical Skills (CS) patient notes (PN), and appreciate this opportunity to address these concerns ahead of an upcoming technical report submitted for publication. We have worked with experts in educational measurement and medical informatics to minimize threats to validity, including implicit bias. Rather than using a black-box ML approach, we incorporated transparency into the system design, eliminated the use of individual human ratings in ML training sets, and built in fail-safe steps to ensure the accuracy of pass/fail classifications.

In the cited example of bias in the article “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings,”1 word representations were learned from news articles in which, for centuries, women have been associated with household activities. Therefore, the language model trained on such data captured those associations. Our training sets are compiled from cross-validated judgments of multiple independent groups of experts (rather than individual human ratings), which allows us to monitor, detect, and minimize potential bias.

The authors’ skepticism about an ML system that is “more objective than the humans it emulates” is valid. Typically, ML and natural language processing systems are evaluated relative to a standard set by human judgment and so, theoretically, cannot be superior to human ratings. In practice, however, human raters suffer from fatigue and other challenges associated with quality control. Therefore, while an automated scoring system cannot be superior to the ideal human rater, it can adhere more closely to predefined scoring rubrics than the typical human rater does in practice.

We agree with the authors that producing reliable and valid scores for the Step 2 CS exam is critical.
For this reason, we included a number of verification steps in our system design. Notes flagged by the system as containing unusual features, notes containing terminology that cannot be recognized by the system, and all notes from examinees whose performance falls near or below the level of minimum competence are reviewed by human raters. In our Invited Commentary, we proposed a hybrid scoring system whereby no examinee receives a failing score without an additional independent review by human raters. We are confident that this will allow PN scoring to benefit from the strengths of ML while maintaining the psychometric rigor of the Step 2 CS exam.

Jessica Salt, MD, MBE
Assistant vice president of assessment, Clinical Skills Evaluation Collaboration, Philadelphia, Pennsylvania; [email protected]

Polina Harik, PhD
Senior measurement scientist, National Board of Medical Examiners, Philadelphia, Pennsylvania.

Michael A. Barone, MD, MPH
Vice president of licensure programs, National Board of Medical Examiners, Philadelphia, Pennsylvania.
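The embedding-bias mechanism the letter cites can be illustrated with a minimal toy sketch. The vectors below are fabricated for demonstration only (real embeddings such as those studied in the cited paper are learned from large news corpora), and `analogy` is a hypothetical helper implementing the standard "b minus a plus c" analogy arithmetic; none of this is code from the cited work.

```python
import math

# Toy 2D word vectors, hand-crafted so that the first coordinate encodes a
# gender association, mimicking the pattern the cited paper found in
# embeddings learned from news text. Fabricated for illustration only.
vectors = {
    "man":        [1.0, 0.0],
    "woman":      [-1.0, 0.0],
    "programmer": [1.0, 1.0],
    "homemaker":  [-1.0, 1.0],
    "engineer":   [1.0, 0.9],
    "nurse":      [-1.0, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c, vocab):
    """Solve 'a is to b as c is to ?' via the vector offset b - a + c,
    returning the nearest remaining word by cosine similarity."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# The biased association emerges directly from the training data encoded
# in the vectors: "man is to programmer as woman is to ?"
print(analogy("man", "programmer", "woman", vectors))  # prints "homemaker"
```

The point of the sketch is the one the letter makes: the analogy function is neutral arithmetic, and the bias lives entirely in the vectors it is given, which is why the authors emphasize curating training sets from cross-validated expert judgments rather than individual ratings.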
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,100 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,466 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations