This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text
Citations: 0
Authors: 5
Year: 2025
Abstract
Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
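As a rough illustration of the dataset construction described above, the sketch below builds labeled prompts from the cross-product of product categories, fairness attributes, and gendered adjectives. The attribute values, demographic stand-ins, and prompt template are illustrative assumptions, not the paper's actual lists or wording.

    from itertools import product

    # Illustrative values only; the paper's actual attribute lists and
    # template are not reproduced here.
    categories = ["backpack", "wristwatch"]
    demographics = ["a young woman", "an elderly man"]   # stand-ins for fairness attributes
    adjectives = ["delicate", "rugged"]                  # adjectives with gendered connotations
    features = "water-resistant, lightweight, durable"

    # Hypothetical template matching the use case: generate a plain-text
    # product description from a list of product features.
    template = ("Generate a plain-text product description of a {adj} {cat} "
                "for {demo}, given these features: {feats}.")

    # One labeled prompt per combination, so evaluation results can be
    # sliced by category, demographic, and adjective.
    prompts = [
        {"prompt": template.format(adj=a, cat=c, demo=d, feats=features),
         "labels": {"category": c, "demographic": d, "adjective": a}}
        for c, d, a in product(categories, demographics, adjectives)
    ]

    print(len(prompts))            # 2 * 2 * 2 = 8 labeled prompts
    print(prompts[0]["prompt"])

Because every prompt carries its generating parameters as labels, model outputs can be grouped by those labels to surface quality, veracity, safety, or fairness gaps between otherwise matched prompts.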
Similar Works
The global landscape of AI ethics guidelines
2019 · 4,582 citations
The Limitations of Deep Learning in Adversarial Settings
2016 · 3,868 citations
Trust in Automation: Designing for Appropriate Reliance
2004 · 3,417 citations
Fairness through awareness
2012 · 3,279 citations
Mind over Machine: The Power of Human Intuition and Expertise in the Era of the Computer
1987 · 3,183 citations