This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of the bias of artificial intelligence generated images and large language models on their depiction of a surgeon

2023 · 33 citations · ANZ Journal of Surgery · Open Access

Abstract

The increasing integration of artificial intelligence (AI) into domains such as medicine, surgery, and research has brought unparalleled advancements and changed how doctors, particularly surgeons, are perceived [1-9]. The advent of AI-generated images using generative adversarial networks (GANs) and the increased reliance on large language models (LLMs) have raised concerns regarding bias. Bias in AI models can be gender-based, racial, or cultural: a systematic and unfair preference for, or discrimination against, certain demographics in the behaviour and outcomes of AI systems. For instance, AI-generated images may disproportionately represent surgeons of a specific gender or race [10]. These biases often originate from the models' training data [10]. The consequences of such biases include skewed public perception, discrimination, and potential loss of trust in healthcare professionals and AI systems. Certain demographics may also find it harder to enter certain medical specialities or to attain promotion because of unconscious bias against them.

Investigating biases in AI-generated images and descriptions of surgeons provides insight into potential inaccuracies and emphasizes the ethical implications of technology's role in shaping public opinion. Understanding these biases is critical for developing more equitable and accurate AI models for educational, clinical, and public use. This paper systematically analyses AI-generated images and descriptions from various GANs and LLMs to identify and characterize biases in the representation of surgeons. We discuss the underlying factors contributing to these biases, examine their impact on public perception, and propose mitigation strategies.

Four generative AI tools, comprising two LLMs (ChatGPT-3.5 and BARD) and two GANs (DALL-E2 and Midjourney), were prompted to describe and illustrate characteristics of eight types of surgeons.
Twenty-four descriptions and 64 images were extracted from these tools and independently analysed by three reviewers (J.C., I.S., and B.L.) for presumed skin tone (Massey Martin NIS Skin Scale Score), age, gender, and Body Silhouette Scale Score [11, 12]. Surgeons were classified as light-skinned for Massey scores of 1-2 and as dark-skinned for scores of 3-10. Any discrepancies were discussed by all authors until consensus was reached. No ethics approval was required, as all data were generated by the AI tools.

DALL-E2 generated 71.9% male and 28.1% female representations. Its age representation was balanced: 43.8% of images depicted surgeons under 50 years of age, while 56.2% depicted surgeons over 50. DALL-E2 also had a balanced skin-tone distribution, with 50% of surgeons light-skinned and 50% dark-skinned. On the Body Silhouette Scale, 73.1% scored between 1 and 5, whereas 26.9% scored 6 or above (Table 1; Figs. 1-8).

Midjourney displayed more biased results, producing 87.5% male and 12.5% female images. On the Massey Martin NIS Skin Scale, 100% of its surgeons scored 1 or 2 and were therefore categorized as light-skinned. Most surgeons appeared to be over 50 years of age (71.9%), a higher proportion than in DALL-E2's output. Moreover, 96.4% had smaller body silhouettes, scoring between 1 and 5 on the Body Silhouette Scale (Table 1; Figs. 1-8).

BARD stresses that no single trait defines a surgeon, while acknowledging a trend towards white male surgeons in their 50s who often possess sufficient strength to endure the physically demanding work. It then discusses exceptions, especially women of colour across different ages with varying personalities. BARD also underscores qualities such as intellect, intrinsic motivation, and compassion as typically desirable attributes. While it occasionally provides detailed surgeon exemplars, it concludes its responses by reiterating the diversity of surgeons beyond stereotypes (Figs. 1-8).
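
The scoring rule described in the methods (Massey scores of 1-2 treated as light, 3-10 as dark, with reviewer disagreements sent to consensus discussion) can be illustrated with a minimal Python sketch. The function names and the example scores are hypothetical and are not taken from the study; the sketch only shows how such a rule and the resulting percentages could be computed.

```python
from __future__ import annotations

from collections import Counter


def skin_tone_category(massey_score: int) -> str:
    """Bin a Massey score (1-10) with the cut-off stated in the text: 1-2 light, 3-10 dark."""
    if not 1 <= massey_score <= 10:
        raise ValueError(f"Massey score must be between 1 and 10, got {massey_score}")
    return "light" if massey_score <= 2 else "dark"


def needs_consensus(ratings: list[str]) -> bool:
    """Return True when the reviewers' ratings for one image are not unanimous."""
    return len(set(ratings)) > 1


def percentage_breakdown(labels: list[str]) -> dict[str, float]:
    """Share of each label in percent, rounded to one decimal place."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Hypothetical consensus scores for eight images (illustrative only, not study data).
example_scores = [1, 2, 2, 3, 5, 1, 7, 2]
categories = [skin_tone_category(score) for score in example_scores]
print(percentage_breakdown(categories))

# Hypothetical ratings from three reviewers for a single image.
print(needs_consensus(["male", "male", "female"]))
```

Keeping the binning rule in one small function makes the cut-off explicit, so a different threshold could be tested without touching the tabulation step.
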
ChatGPT-3.5 repeatedly highlights that physical attributes are inconsequential to a surgeon's proficiency, stressing competence and compassion instead. Three of its replies allude to the typically older age range of surgeons, owing to the long medical training they undertake. ChatGPT-3.5 occasionally describes the characteristics of a surgeon, narrating their behaviours and attitudes within and outside the professional milieu (Figs. 1-8).

The LLMs demonstrated a nuanced understanding of surgeons' diverse backgrounds without significant bias, indicating high-quality design and equitable training. In contrast, the AI-generated images of surgeons from the GANs demonstrated notable gender and skin-tone biases. While current data indicate a male-dominated surgical population in Australia, this raises the question of whether AI should reflect these disparities or present a more equal representation [13]. Failure to address the underlying inequalities could perpetuate bias, and conflating the current proportions with aspirational goals of diversity and inclusivity can complicate the issue. It is crucial to recognize this difference, assess these proportions, and work towards a fairer and more equitable representation of surgeons that reflects both current and desired realities. Ultimately, we argue that AI models should depict medical personnel such as surgeons in a more equitable manner.

DALL-E2 evenly represented light and dark skin tones, whereas Midjourney exclusively depicted lighter skin tones. This suggests that DALL-E2's training data or data processing maintains diversity, while Midjourney's outputs reveal a clear bias. Such a pattern in Midjourney raises questions about its training data and possible post-training adjustments.

The marked underrepresentation of surgeons presumed to be female, in the results from both GANs, stands out as a significant issue. This disparity may not be a random occurrence but may instead be rooted in historical bias.
As a result, the GANs' outputs, presenting more male surgeons, may reflect past imbalances in gender demographics within medical education and practice. The reason GANs might produce more male surgeons when prompted to create a ‘surgeon’ image ties back to their foundational reliance on training data. These systems learn from vast datasets of images, and if the majority of ‘surgeon’ images in their training data are of men, the GANs will learn the bias that a ‘typical’ surgeon is male. This learning process lacks an innate moral compass, so it mirrors and perpetuates existing biases in the data. For instance, if 80% of surgeons over the past few decades were male, the dataset will likely contain more images of male surgeons, teaching the GANs that ‘surgeon’ more often equates to a male figure.

This problem underscores the importance of curating diverse and balanced training datasets and continuously updating them to reflect current realities and aspirations for equality. It also highlights the need for interventions in the training of these AI models, such as introducing algorithms to detect and mitigate bias or employing fairness criteria, so that the outputs do not perpetuate historical biases but instead represent a more equitable vision of society.

The consequences of such biases are multiple [14]. Misrepresentation in AI outputs can reinforce stereotypes, skew perceptions, delay promotions, lead to poorer evaluations, and even influence decision-making processes in real-world clinical settings [15, 16]. For instance, if an AI system associates surgical expertise with a specific gender or ethnicity, it may inadvertently influence hiring decisions or patient trust. These biases can erode trust in healthcare systems, as patients may develop preferences for certain surgeons based on AI-generated information, which can affect a surgeon's reputation and career development.
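
To make the data-imbalance argument above concrete, the following Python sketch measures how far a labelled dataset departs from a target composition and derives sampling weights that would rebalance it. This is an illustration of one common mitigation technique (inverse-frequency reweighting), not the authors' method and not a description of how DALL-E2 or Midjourney are trained; the function names, the 50/50 target, and the 80/20 dataset (taken from the hypothetical figure in the text) are all assumptions.

```python
from collections import Counter


def representation_gap(labels, target_share):
    """Observed share minus target share per group (positive means over-represented)."""
    counts = Counter(labels)
    total = len(labels)
    return {group: counts.get(group, 0) / total - share
            for group, share in target_share.items()}


def balancing_weights(labels, target_share):
    """Per-example sampling weights so each group's total weight matches its target share."""
    counts = Counter(labels)
    total = len(labels)
    return [target_share[label] * total / counts[label] for label in labels]


# Hypothetical dataset mirroring the text's illustration: 80% male, 20% female.
labels = ["male"] * 80 + ["female"] * 20
target = {"male": 0.5, "female": 0.5}  # assumed aspirational composition

gaps = representation_gap(labels, target)
weights = balancing_weights(labels, target)

print(gaps)
print("weight per male example:", weights[0])
print("weight per female example:", weights[-1])
```

In this toy setting each under-represented example receives a proportionally larger weight, so that both groups contribute equally when examples are sampled by weight. Whether the target should mirror current demographics or an aspirational distribution is the normative question the text raises, and the code does not settle it.
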
For decades, certain professions, including surgery, have been associated predominantly with specific genders or backgrounds [17, 18]. Age bias can skew the representation of certain age groups, whilst body silhouette bias can perpetuate detrimental beauty standards. In this study, Midjourney mainly depicted surgeons as older individuals with narrower body silhouettes. DALL-E2 showed less bias in its depictions than Midjourney, but such bias was still present. Bias can manifest at various stages of AI model development, often stemming from training data that reflect real-world prejudices, which the model then inadvertently perpetuates in its outputs [19]. For example, an AI image generator whose dataset predominantly comprises lighter-skinned individuals might underrepresent darker skin tones.

Understanding biases in AI is essential. AI GANs and LLMs are trained on pre-existing online data. If these data underrepresent a certain demographic, the output will likely be skewed [20-24]. Research shows that gender biases persist even when unbiased algorithms are used [25, 26]. Another concern is ‘Programmer Bias’, where non-representative developer demographics might introduce biases into software [27]. The ‘Black Box’ issue further complicates matters, as the inner workings of AI algorithms are often hard to interpret [20, 28]. Some AI tools lack real-time internet connectivity, risking outdated references and potential reporting bias, especially in fields like healthcare. Machine biases often echo those of their human developers. However, without clear information about the datasets, current findings on AI bias are mostly speculative and based on observed outputs.

Several strategies can help address these biases.
Firstly, using diverse and representative sample data that incorporate images and descriptions of surgeons of various genders, ethnicities, and backgrounds ensures a holistic view of the profession [29, 30]. Another strategy is adopting a continuous feedback loop, allowing users and experts to flag potential biases and identify nuances that might have been overlooked during the model's development phase. Moreover, incorporating expert reviews can provide a depth of analysis that general feedback might not capture [29]. Iterative model improvements are crucial. In the fast-evolving world of AI, a model that remains static will inevitably become obsolete or problematic. By incorporating feedback and continuously refining the model, developers help ensure that the AI system remains relevant and accurate and does not perpetuate harmful stereotypes. In addition to these strategies, transparency in model development, methodologies, and data sources can also foster trust [29]. When users understand the mechanisms behind the AI outputs, they can engage more critically and constructively, further enhancing the model's credibility and performance.

The primary constraint of this study stems from its dependence on a limited group of plastic surgery residents and plastic surgeons to assess the biases inherent in the GANs. This narrow scope may limit the broader applicability of the findings and may introduce subjectivity and individual biases into the results. However, this research represents, to the authors' understanding, an initial endeavour in exploring the biased representations of different surgical specialties within GAN outputs. Subsequent studies would benefit from extending this scrutiny to biases present in other AI systems, thereby offering a more holistic understanding of these pervasive limitations.

This study sheds light on the biases present in some of the latest popular AI models.
As AI models continue to permeate the medical field, it becomes imperative to assess these biases rigorously. Only through collective and informed action can we ensure that AI serves as an equitable, effective, and reliable resource in advancing global healthcare. The broader scientific community should engage in ongoing discourse on defining acceptable bias thresholds and establishing standardized bias evaluation metrics.

Open access publishing facilitated by Monash University, as part of the Wiley - Monash University agreement via the Council of Australian University Librarians.

Jevan Cevik: Conceptualization; writing – original draft; writing – review and editing. Bryan Lim: Conceptualization; writing – original draft; writing – review and editing. Ishith Seth: Conceptualization; writing – original draft; writing – review and editing. Foti Sofiadellis: Conceptualization; supervision. Richard J. Ross: Conceptualization; supervision. Roberto Cuomo: Supervision. Warren M. Rozen: Conceptualization; supervision.

Topics

Artificial Intelligence in Healthcare and Education · Empathy and Medical Education · Diversity and Career in Medicine