This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Fairness metrics for health AI: we have a long way to go
Citations: 15
Authors: 4
Year: 2023
Abstract
The use of Artificial Intelligence (AI) is on track to revolutionize healthcare, with performance in medical tasks such as clinical diagnosis often comparable to expert-level accuracy, at least in the laboratory. AI can play a significant role in healthcare, enabling clinicians to make more accurate and timely diagnoses and to devise effective treatment plans. However, the amplification of pre-existing healthcare inequity through the use of AI models is a legitimate concern. Recent works have shown that medical imaging AI models can easily encode and learn patient-sensitive characteristics [1] and cause significant performance disparities between patient subgroups [2]. It is therefore encouraging to see more attempts, such as that of Glocker et al. [3], to evaluate methods for assessing how sensitive patient information, such as ethnicity and sex, is encoded and possibly used in model predictions. Unfortunately, for many diagnostic and prognostic clinical applications, the "ground truth" used for fairness assessment metrics may already be embedded with biases and laced with suboptimal outcomes that are not explained by clinical features. As such, the medical AI community needs to go beyond solely evaluating the clinical readiness of AI models with metrics that are predicated on potentially biased and constantly shifting clinical ground truth.

A critical step towards addressing AI model bias and subgroup disparities is the establishment of common principles, guidelines, and standards that model developers adhere to. These standards would need to emphasize the importance of fairness and transparency in the design and deployment of AI systems. Proper documentation of model performance across patient subgroups is a minimum requirement; a minimal sketch of such per-subgroup reporting follows this paragraph. Depending on the clinical use case, models should be designed and evaluated with additional impact metrics that consider existing health inequities and possible harm to disadvantaged subgroups. MEDFAIR [4], a recent benchmark for building and evaluating fair medical imaging models, is a contribution towards this. An ideal guideline would need to cover requirements for appropriate debiasing techniques and evaluation metrics for different sources of bias. These include, but are not limited to, bias arising from dataset composition, model feature encoding, the use of learned demographic features (also known as shortcut features), and bias in ground truth labels.
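To make the documentation requirement above concrete, here is a minimal sketch (not from the article) of per-subgroup performance reporting for a binary classifier; the function and column names and the default threshold are illustrative assumptions.

```python
# Minimal sketch of per-subgroup performance documentation for a binary
# classifier. Assumes y_true (0/1 labels), y_score (model probabilities),
# and group (e.g., self-reported race or sex) are equal-length sequences.
# All names and the default threshold are illustrative.
import pandas as pd

def subgroup_report(y_true, y_score, group, threshold=0.5):
    df = pd.DataFrame({"y": y_true, "score": y_score, "g": group})
    df["pred"] = (df["score"] >= threshold).astype(int)
    rows = []
    for g, sub in df.groupby("g"):
        pos = sub[sub["y"] == 1]
        neg = sub[sub["y"] == 0]
        rows.append({
            "group": g,
            "n": len(sub),
            # True-positive rate: low values flag underdiagnosis risk [2].
            "tpr": (pos["pred"] == 1).mean() if len(pos) else float("nan"),
            # False-positive rate: overdiagnosis/overtreatment risk.
            "fpr": (neg["pred"] == 1).mean() if len(neg) else float("nan"),
        })
    report = pd.DataFrame(rows)
    # The gap to the best-served subgroup is the headline disparity number.
    report["tpr_gap"] = report["tpr"].max() - report["tpr"]
    return report
```

Reporting such a table alongside aggregate accuracy is the kind of minimum-requirement documentation the text calls for; which gap matters most (e.g., TPR for screening, FPR for invasive follow-up) depends on the clinical use case.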
Datasets can encode bias, for example through underrepresentation of already disadvantaged subgroups. Clinician bias can also be reflected in data and learned by AI. In medical images, bias may even be introduced by differences in access to scanners of varying quality. These biases in the data should be documented, e.g., by using "datasheets for datasets" [5]. Federated learning methods can also aid in training and tuning models on more varied databases, from different parts of the world and/or from underrepresented subgroups. Moreover, dataset bias mitigation strategies may be helpful, including dataset preprocessing, e.g., reweighing examples so that unintended features are statistically independent of the target/outcome label [6]; a sketch of this weighting scheme follows this paragraph. However, it is unclear how well these methods work for medical images.
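As an illustration of the preprocessing idea just cited, the following sketch implements the kind of reweighing proposed by Kamiran and Calders [6]: each training example with sensitive-attribute value a and label y receives weight P(a)·P(y)/P(a, y), which makes the attribute and the label statistically independent under the weighted distribution. Variable names are illustrative, and whether such tabular-data weights help an image model is exactly the open question noted above.

```python
# Sketch of Kamiran-Calders-style reweighing [6]. An example with
# sensitive-attribute value a and outcome label y gets weight
#     w(a, y) = P(a) * P(y) / P(a, y),
# so a and y are statistically independent in the reweighted data.
# The weights would then be fed to a weighted training loss.
from collections import Counter

def reweighing_weights(groups, labels):
    n = len(labels)
    count_a = Counter(groups)                # marginal counts of each a
    count_y = Counter(labels)                # marginal counts of each y
    count_ay = Counter(zip(groups, labels))  # joint counts of each (a, y)
    return [
        (count_a[a] / n) * (count_y[y] / n) / (count_ay[(a, y)] / n)
        for a, y in zip(groups, labels)
    ]

# Hypothetical usage:
# weights = reweighing_weights(["F", "F", "M", "M"], [1, 0, 0, 1])
```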
Model feature encoding is another source of bias. AI models can identify race and sex from medical images across modalities and use these characteristics to detect diseases, even when such characteristics are not associated with the diagnosis [1]. Even after removing sensitive information from datasets, which may not even be possible for medical images, models can still encode and use other correlated features for prediction. The "fairness through awareness" framework [7] shows why we cannot assume sensitive information has been expunged from a dataset; it also offers a metric-based approach for ensuring that a model's labeling of similar individuals is indeed similar. Furthermore, models can inherit disparities from medical data by learning to depend on correlations between clinically unrelated input features (e.g., nonbinary gender, immigration status) and the predicted outcomes. Glocker et al. [3] highlighted difficulties in detecting what information is used in model predictions, despite trying a range of methods, from transfer learning and multitask learning to unsupervised exploration of feature representations. Beyond these methods, work on algorithmic transparency, explainability, and interpretability [8,9] focuses instead on understanding how encoded input features are used for model prediction. Without an in-depth understanding of what features AI models use in making predictions, the promise of AI may not be realized.

Few have explored metrics that quantify the effect of training on potentially biased ground truth labels. The closest work in the fairness literature involves social welfare functions [6] that aim to capture the underlying social phenomena and inequities when the model learns from data. More work is needed to develop metrics that are not completely reliant on ground truth labels for assessing the readiness of medical imaging AI tools. Short of such metrics, intra- and post-processing debiasing techniques may help reduce subgroup performance disparities; one example is recent work on neural network pruning and fine-tuning for chest X-ray classifiers [10]. A generic sketch of a post-processing step follows this paragraph.
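To illustrate the post-processing family only (this is not the pruning-and-fine-tuning method of Marcinkevics et al. [10], but a generic, commonly used alternative), the sketch below chooses a separate decision threshold per subgroup on held-out validation data so that each subgroup reaches roughly the same true-positive rate; the target value and all names are hypothetical.

```python
# Generic post-processing sketch: per-subgroup decision thresholds chosen
# on validation data so each subgroup's true-positive rate is roughly a
# common target. Illustrative only; assumes every subgroup has positive
# cases in the validation set.
import numpy as np

def per_group_thresholds(y_true, y_score, group, target_tpr=0.85):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    group = np.asarray(group)
    thresholds = {}
    for g in np.unique(group):
        pos_scores = y_score[(group == g) & (y_true == 1)]
        # The (1 - target_tpr) quantile of positive scores is roughly the
        # largest threshold that still detects target_tpr of positives.
        thresholds[g] = np.quantile(pos_scores, 1 - target_tpr)
    return thresholds

def predict_with_thresholds(y_score, group, thresholds):
    return [int(s >= thresholds[g]) for s, g in zip(y_score, group)]
```

Equalizing one error rate this way can worsen another (e.g., false-positive rates) for some subgroups, which is why the choice of which disparity to reduce has to be justified per clinical use case.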
AI in healthcare is intended to improve access to quality healthcare, especially for those who are marginalized. It is worrisome to find evidence across many works that these models utilize non-clinical demographic attributes and are likely to propagate existing disparities. Current attempts to understand how imaging models encode and use non-clinical demographic information for prediction are encouraging, but still limited. More interdisciplinary communication and collaboration among AI researchers, healthcare providers, social scientists, and the public will be needed to advance the fairness, transparency, and accountability of medical imaging models.

Contributors
All authors participated in the outline development, writing, and editing of the manuscript.

Declaration of interests
L.A.C. received support for attending meetings and/or travel from the Massachusetts Institute of Technology, and cloud credits from Amazon, Google, and Oracle. The other authors have no conflicts of interest to declare.

References
1. Gichoya JW, Banerjee I, Bhimireddy AR, et al. AI recognition of patient race in medical imaging: a modeling study. Lancet Digit Health. 2022;4:e406-e414.
2. Seyyed-Kalantari L, Zhang H, McDermott MB, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27:2176-2182.
3. Glocker B, Jones C, Bernhardt M, Winzeck S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. eBioMedicine. 2023;89:104467. https://doi.org/10.1016/j.ebiom.2023.104467
4. Zong Y, Yang Y, Hospedales T. MEDFAIR: benchmarking fairness for medical imaging. In: Proceedings of the Eleventh International Conference on Learning Representations (ICLR); 2023.
5. Gebru T, Morgenstern J, Vecchione B, et al. Datasheets for datasets. Commun ACM. 2021;64:86-92.
6. Kamiran F, Calders T. Data preprocessing techniques for classification without discrimination. Knowl Inf Syst. 2012;33:1-33.
7. Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R. Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference; 2012:214-226.
8. Salahuddin Z, Woodruff HC, Chatterjee A, Lambin P. Transparency of deep neural networks for medical image analysis: a review of interpretability methods. Comput Biol Med. 2022;140:105111.
9. Jungmann F, Ziegelmayer S, Lohoefer FK, et al. Algorithmic transparency and interpretability measures improve radiologists' performance in BI-RADS 4 classification. Eur Radiol. 2023;33:1844-1851.
10. Marcinkevics R, Ozkan E, Vogt JE. Debiasing deep chest X-ray classifiers using intra- and post-processing methods. In: Machine Learning for Healthcare Conference. PMLR; 2022:504-536.

Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations