OpenAlex · Updated hourly · Last updated: 15.03.2026, 09:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Using Large Language Models to Audit Model Healthcare Biases

2025 · 0 citations · Open Access
Open full text at the publisher

Citations: 0 · Authors: 5 · Year: 2025

Abstract

Large language models (LLMs) can potentially mitigate pain points in healthcare tasks such as decision support, text summarization, and question answering. However, LLMs exhibit bias related to race, gender identity, sexual orientation, and other demographics, posing a major concern. Although human review helps reduce bias, the sheer volume of data renders thorough evaluation impractical and onerous at scale. This motivates the use of LLMs to audit models for bias. This study uses the Stanford Healthcare red-teaming dataset, which contains prompts, outputs, and expert-level bias labels, to examine how model size and prompting techniques affect bias detection with GPT-3.5-turbo, GPT-4o, Llama 3.3, and o1-mini. Our results show that the best model for bias detection depends on the chosen metric. Smaller, cost-effective models like o1-mini outperformed GPT-4o in precision and F1 scores, with up to 53.11% higher precision and 10.32% higher F1. This suggests that smaller models may be preferable when precision or F1 is a priority. Additionally, self-critiquing capabilities in larger models do not significantly improve bias detection over smaller models (χ², p = 0.597). Moreover, the use of prompting techniques, particularly Thread of Thought, significantly enhanced bias detection across all models (χ², p < 0.001). Our findings suggest that, depending on the metric of concern to the auditor, smaller models can offer a cost-effective alternative to larger models.
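
The abstract reports precision, F1, and chi-square (χ²) comparisons. As a rough illustration of how such an audit can be scored, the Python sketch below computes precision and F1 against expert bias labels and runs a χ² test on a 2×2 contingency table of detection outcomes under two prompting conditions. All variable names, label vectors, and counts here are illustrative placeholders, not data or code from the paper.

# Illustrative sketch only: scoring an LLM bias auditor against expert labels
# and testing whether a prompting technique shifts detection outcomes.
from sklearn.metrics import precision_score, f1_score
from scipy.stats import chi2_contingency

# 1 = "biased", 0 = "not biased". In a setup like the paper's, reference
# labels come from expert annotations and predictions from the auditing LLM.
expert_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical values
model_preds   = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical values

print("precision:", precision_score(expert_labels, model_preds))
print("F1:", f1_score(expert_labels, model_preds))

# 2x2 table of (correct, incorrect) detections under a baseline prompt
# versus a Thread of Thought prompt; counts are made up for demonstration.
table = [[72, 28],   # baseline prompt
         [88, 12]]   # Thread of Thought prompt
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")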

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling