This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluation of artificial intelligence on a reference standard based on subjective interpretation
Citations: 49
Authors: 3
Year: 2021
Abstract
Rapid progress has been made in artificial intelligence (AI) models for medical applications, especially over the past 5 years, with substantial efforts focusing on diagnosis from medical images. An essential aspect of evaluating the performance of AI models and their potential clinical utility is the rigor of the reference standard.[1] A reference standard is “the best available method for establishing the presence or absence of the target condition”,[2] and is thus equivalent to what is commonly referred to as the ground truth in AI literature. Determination of what constitutes a reference standard is established by “opinion and practice within the medical, laboratory, and regulatory community”.[3] The reference standard can either be a widely agreed-upon gold standard[2] or, in its absence, a proxy that is highly correlated with the clinical outcome. Although a non-reference standard can also be used, correctness claims such as accuracy, sensitivity, and specificity should then be dropped in favour of agreement with a comparative method.[3]

In this Comment, we focus on tasks in which the so-called ground truth is subjective (ie, involving human interpretation). When the quality of this ground truth exceeds a task-specific quality requirement, it might be considered a reference standard. For instance, an outcome such as mortality involves interpretation but is rarely ambiguous, whereas grading a retinal photograph for diabetic retinopathy can be subject to greater variability. As an example, we will consider the Gleason grading of prostate biopsy samples, a subjective task that involves substantial intra-rater and inter-rater variability and that has been the focus of multiple AI development efforts.[4-7] Although an ideal reference standard might be based on an outcome such as cancer-specific mortality[8] or PSA recurrence, these outcomes not only require years to develop but are also confounded by the original diagnosis and treatment in retrospective studies. As such, arguably the best available ground truth for validating an AI Gleason grading model is Gleason grading by expert urological pathologists, which also reflects the best available standard of care.

Variability in the ground truth can substantially affect the measured performance of the AI model. As an illustrative example, we present a simplified binary classification problem (eg, is a tumour present, or is the tumour's Gleason score at least 4 + 3 = 7) in the figure. If 30% of the ground-truth grades change depending on which grader reviewed each case, then the measured concordance between the AI and the grades from different graders can also change by as much as 30 percentage points; a reported 70% concordance might become 40% when compared with the grades from another grader. Such a caveat is clearly unacceptable, although such an occurrence should fortunately be rare, because it assumes the unlikely scenario that the 30% of cases with changed grades are exactly those for which the AI agreed with the first grader. The change can also be positive; the 70% result can range from 40% to 100% (figure A).
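To make this arithmetic concrete, the short Python sketch below is a hypothetical illustration of our own (not code from any of the cited studies): it computes the worst-case and best-case concordance against a second grader and simulates a typical random shift, using the same 70% and 30% figures as the example above.

```python
import random

def concordance_bounds(measured, fraction_changed):
    """Worst- and best-case concordance with a second grader whose grades
    differ from the first grader's on `fraction_changed` of cases."""
    lower = max(0.0, measured - fraction_changed)
    upper = min(1.0, measured + fraction_changed)
    return lower, upper

def simulate_shift(n_cases=1000, measured=0.70, fraction_changed=0.30, seed=0):
    """Randomly flip a fraction of grader-1 labels to obtain grader-2 labels
    and return the AI's concordance with each grader (binary task)."""
    rng = random.Random(seed)
    grader1 = [rng.randint(0, 1) for _ in range(n_cases)]
    # The AI agrees with grader 1 on roughly `measured` of the cases.
    ai = [g if rng.random() < measured else 1 - g for g in grader1]
    # Grader 2 grades roughly `fraction_changed` of the cases differently.
    grader2 = [1 - g if rng.random() < fraction_changed else g for g in grader1]

    def conc(a, b):
        return sum(x == y for x, y in zip(a, b)) / n_cases

    return conc(ai, grader1), conc(ai, grader2)

if __name__ == "__main__":
    lo, hi = concordance_bounds(0.70, 0.30)
    print(f"Possible concordance with a second grader: {lo:.2f} to {hi:.2f}")  # 0.40 to 1.00
    print("One random draw (concordance with grader 1, grader 2):", simulate_shift())
```

In random draws the shift is usually far smaller than the worst case, consistent with the point above that the extreme scenario requires the changed grades to coincide exactly with the cases on which the AI agreed with the first grader.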
How might the rigor of the ground truth for a subjective task such as Gleason grading be improved? On the basis of the example in the figure, the variability in the ground truth needs to be reduced. Three potential solutions are recruiting experienced or specialised graders, increasing the number of graders, and applying an unbiased resolution process when graders disagree (table). The first method is straightforward: graders should have relevant specialised expertise. For Gleason grading, urological subspecialists' grades show lower inter-rater variability and have higher prognostic value than those of generalists. The second aspect, increasing the number of graders, can help to spot accidental errors (eg, from fatigue) or subtle aspects missed by chance, and improves the representativeness of the graders reviewing each case. The last aspect, resolving disagreements, is more nuanced. Graders should review cases independently to reduce the chances of groupthink or dominant voices biasing the grades. Although discussion between graders can help to resolve disagreements, some disagreements happen because of inherent ambiguity in the clinical guidelines. As such, the resolution process needs to be unbiased, either by having separate arbiters review the dissenting opinions and determine the final grade, or through a systematic voting process.[9]

Table: Three synergistic approaches to improve the reproducibility of the ground-truthing process and thus increase the rigor of the ground truth

Increase the expertise of graders
- Benefits: experienced experts are more likely to confidently and correctly grade difficult cases; including experts from multiple institutions can provide diversity in opinions that draws from a wider breadth of clinical experience and patient populations.
- Challenges: limited numbers and availability of experts.
- Recommendation: involve experts with sufficient experience to review the cases; the experts can be experienced subspecialists in the same institutions or world-class experts.

Increase the number of graders for each case
- Benefits: easiest way to obtain additional opinions; can reduce accidental errors (eg, from fatigue or missed subtle findings).
- Challenges: time costs (ie, limited by the slowest grader); difficulty in managing schedules; limited number of graders available; the complexity of communication increases with more graders; potential difficulty reconciling differences in opinions between graders.
- Recommendation: ensure that more than one grader reviews every case to reduce the chances of grading errors; if a large pool of graders is available, parallelise reviews to speed up the process.

Ensure the disagreement resolution process is unbiased
- Benefits: reduces bias by reducing the effect of a dominating opinion (whereby some graders' opinions disproportionately affect the resolution process).
- Challenges: if the resolution process involves discussion, scheduling can be difficult.
- Recommendation: ensure initial reviews are performed independently to avoid groupthink; anonymise opinions during discussion if possible; if agreement cannot be reached via discussion alone, consider a systematic voting process or involving a separate arbiter with access to the previous anonymised reviews.

To evaluate the rigor of a so-called ground-truthing process, we propose measuring the reproducibility of the process itself. For example, if the majority vote of a panel is used, how often does the same panel converge on the same grade after a washout period (intra-panel concordance), and how often do two panels agree (inter-panel concordance)? A highly reproducible process will decrease the uncertainty in the measured performance. For example, if a process results in 90% reproducibility, then the uncertainty around the same 70% result as in the previous example can vary only from 60% to 80% (figure B, C).
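To illustrate how this reproducibility measurement and the associated bound could be computed in practice, the sketch below is a hypothetical example under our own assumptions; the grade arrays, function names, and toy numbers are invented for illustration and are not taken from the studies cited.

```python
from typing import List, Sequence, Tuple

def concordance(x: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of cases on which two sets of final grades agree."""
    assert len(x) == len(y) and len(x) > 0
    return sum(a == b for a, b in zip(x, y)) / len(x)

def majority_vote(grades_per_case: Sequence[Sequence[int]]) -> List[int]:
    """Resolve each case from independent binary grades by simple majority,
    one example of a systematic, grader-blind resolution process."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in grades_per_case]

def performance_bounds(measured: float, reproducibility: float) -> Tuple[float, float]:
    """If the ground-truthing process reproduces its own grades on a
    `reproducibility` fraction of cases, the measured concordance of a fixed
    AI model can shift by at most the remaining fraction."""
    slack = 1.0 - reproducibility
    return max(0.0, measured - slack), min(1.0, measured + slack)

if __name__ == "__main__":
    # Toy data: each inner list holds three graders' independent reads of one case.
    panel_first_read = majority_vote([[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1]])
    panel_second_read = majority_vote([[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
    other_panel_read = majority_vote([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0]])

    print("Intra-panel concordance:", concordance(panel_first_read, panel_second_read))
    print("Inter-panel concordance:", concordance(panel_first_read, other_panel_read))

    # Mirrors the example in the text: 90% reproducibility bounds a 70% result to 60-80%.
    lo, hi = performance_bounds(0.70, 0.90)
    print(f"Bounds on a measured 70% concordance at 90% reproducibility: {lo:.2f} to {hi:.2f}")
```

In a real study, the inner lists would hold the graders' independent reads of the same cases before and after a washout period (for intra-panel concordance) or from a second, separately convened panel (for inter-panel concordance).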
Presumably, reviews by large panels of experienced experts will result in the highest reproducibility. However, the number and experience of experts that can be involved in a study might be reduced by practical considerations; the consensus of the world's ten foremost experts might be desirable, but infeasible. This trade-off between the quality of the ground-truthing process and the resources needed could be calibrated on the basis of the stage of the work: perhaps fewer readers for proof-of-concept studies, and a larger panel of experienced readers as the algorithms mature towards real-use cases.

Perhaps in recognition of this trade-off, a US Government guidance currently reads: “agencies must avoid a precautionary approach that holds AI systems to such an impossibly high standard that society cannot enjoy their benefits”.[10] If a panel of experts has non-ideal reproducibility, this might indicate a fundamental ambiguity in the grading system or difficulty in disease characterisation. More importantly, the cases in which experts disagree might be precisely the difficult cases for which AI could provide consistent opinions with the greatest value, and they must not be reflexively excluded from analysis. As such, practical considerations and inherent problem complexity should be balanced with the need for a rigorous ground-truthing process for the purposes of accurately estimating AI performance.

We have discussed several factors involved in establishing the ground truth for subjective diagnostic tasks, and how to reduce the uncertainty in the study conclusions: involving experienced or specialised experts, leveraging multiple graders, and using an unbiased disagreement resolution process. Measuring the reproducibility of the process can help to indicate the precision of the study conclusions. Whether the process then enables the final ground truth to be considered a reference standard will require further debate, and will probably depend on the specific clinical scenario. However, we anticipate that measuring reproducibility can help with that determination.

P-HCC, CHM, and YL are employees of Google LLC, own Alphabet stock, and report several patents granted or pending on machine learning models for medical images related to prostate cancer, but outside the content of this Comment.

References
1. Liu Y, Chen P-HC, Krause J, Peng L. How to read articles that use machine learning: users' guides to the medical literature. JAMA. 2019; 322: 1806-1816.
2. Cohen JF, Korevaar DA, Altman DG, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016; 6: e012799.
3. Center for Devices and Radiological Health. Statistical guidance on reporting results from studies evaluating diagnostic tests: guidance for industry and FDA staff. March 13, 2017. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/statistical-guidance-reporting-results-studies-evaluating-diagnostic-tests-guidance-industry-and-fda (accessed May 17, 2020).
4. Bulten W, Pinckaers H, van Boven H, et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 2020; 21: 233-241.
5. Ström P, Kartasalo K, Olsson H, et al. Artificial intelligence for diagnosis and grading of prostate cancer in biopsies: a population-based, diagnostic study. Lancet Oncol. 2020; 21: 222-232.
6. Nagpal K, Foote D, Tan F, et al. Development and validation of a deep learning algorithm for Gleason grading of prostate cancer from biopsy specimens. JAMA Oncol. 2020; 6: 1372.
7. Pantanowitz L, Quiroga-Garza GM, Bien L, et al. An artificial intelligence algorithm for prostate cancer diagnosis in whole slide images of core needle biopsies: a blinded clinical validation and deployment study. Lancet Digit Health. 2020; 2: e407-e416.
8. Wulczyn E, Nagpal K, Symonds M, et al. Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading. Commun Med. 2021; published online June 30. https://doi.org/10.1038/s43856-021-00005-3.
9. Bertens LCM, Broekhuizen BDL, Naaktgeboren CA, et al. Use of expert panels to define the reference standard in diagnostic research: a systematic review of published methods and reporting. PLoS Med. 2013; 10: e1001531.
10. Vought RT. Guidance for regulation of artificial intelligence applications. https://www.whitehouse.gov/wp-content/uploads/2020/01/Draft-OMB-Memo-on-Regulation-of-AI-1-7-19.pdf (accessed September 14, 2021).
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations