OpenAlex · Updated hourly · Last updated: 21.03.2026, 16:12

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Algorithmic red teaming approaches to secure LLMs

2025 · 0 citations · Machine Learning with Applications · Open Access
Open full text at the publisher

0 citations · 1 author · 2025

Abstract

Algorithmic red teaming for Large Language Models (LLMs) is a crucial practice for proactively ensuring their safety and robustness. This process uses an LLM as an adversary to probe the vulnerabilities of a target LLM, which is essential for identifying and mitigating potential security risks before the model is deployed. Automated methodologies, which surpass the constraints of human creativity, employ a triad of models: an attacker, a target, and a judge. This primer provides a concise summary and comparison of several state-of-the-art algorithmic red-teaming approaches, including TAP, PAIR, Crescendo, and AutoDAN-Turbo. These techniques, which rely on attacks such as prompt injection and jailbreaking, aim to push LLMs beyond their intended safe behavior. Critically, the non-deterministic nature of LLMs presents a key challenge when they are used as judges, potentially rendering evaluations unreliable. The paper stresses that red teaming is not a one-time exercise and is particularly vital for AI agents that use LLMs as components, since a single failure can lead to significant public scrutiny.

• The position paper establishes red teaming as a mandatory practice to proactively identify and mitigate vulnerabilities in Large Language Models (LLMs) before they fail in real-world applications.
• It outlines the distinction between manual red teaming, which is constrained by human creativity, and automated red teaming, which leverages LLMs themselves to find weaknesses in a “fire meets fire” approach.
• It explains the typical architecture for automated red teaming, which involves a triad of models: an attacker to generate adversarial prompts, a target model to be tested, and a judge model to evaluate the success of each attack.
• It provides a concise summary and comparison of several state-of-the-art algorithmic red-teaming approaches, including TAP, PAIR, Crescendo, and AutoDAN-Turbo.
• The manuscript highlights a key challenge in the field: the non-deterministic nature of LLMs makes them potentially unreliable judges. It proposes solutions such as aggregating results across models with different architectures (or models sharing an architecture but differing in token-optimization and parameterization techniques), rule alignment, and human feedback.
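
The attacker-target-judge triad and the judge-aggregation idea sketched in the abstract can be illustrated in a few lines of Python. This is a minimal, hypothetical sketch, not the paper's implementation or any named framework's API: `attacker_generate`, `target_generate`, `judge_vote`, the turn limit, and the majority-vote threshold are all illustrative stand-ins.

```python
# Minimal sketch of an automated red-teaming loop: an attacker model crafts
# adversarial prompts, a target model responds, and several judges vote on
# whether the attack succeeded. All model calls below are stubs.

import random
from collections import Counter


def attacker_generate(goal: str, history: list[str]) -> str:
    """Stub attacker: would normally prompt an attacker LLM to craft an
    adversarial prompt pursuing `goal`, refined using prior attempts."""
    return f"[adversarial prompt #{len(history) + 1} pursuing: {goal}]"


def target_generate(prompt: str) -> str:
    """Stub target: would normally return the target LLM's response."""
    return f"[target response to: {prompt}]"


def judge_vote(prompt: str, response: str) -> bool:
    """Stub judge: would normally ask a judge LLM whether the attack
    succeeded. Randomness here stands in for LLM non-determinism."""
    return random.random() < 0.3


def red_team(goal: str, max_turns: int = 5, n_judges: int = 3) -> bool:
    """Iterate attacker -> target -> judges, aggregating judge verdicts by
    majority vote to dampen the non-determinism of any single judge, in the
    spirit of the aggregation solutions the abstract proposes."""
    history: list[str] = []
    for _ in range(max_turns):
        prompt = attacker_generate(goal, history)
        response = target_generate(prompt)
        # Poll several (ideally architecturally diverse) judges and take
        # the majority verdict rather than trusting one stochastic judge.
        votes = Counter(judge_vote(prompt, response) for _ in range(n_judges))
        if votes[True] > n_judges // 2:
            return True  # vulnerability found before deployment
        history.append(prompt)
    return False


if __name__ == "__main__":
    print("attack succeeded:", red_team("elicit disallowed content"))
```

In a real pipeline each stub would wrap a call to a distinct LLM, and the judge pool would mix different architectures, or same-architecture models with varying sampling parameters, as the abstract suggests.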

Similar works

Authors

Institutions

Topics

Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI