This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Algorithmic red teaming approaches to secure LLMs
0
Citations
1
Author
2025
Year
Abstract
Algorithmic red teaming for Large Language Models (LLMs) is a crucial practice for proactively ensuring their safety and robustness. The process uses an LLM as an adversary to probe a target LLM for vulnerabilities, which is essential for identifying and mitigating potential security risks before the model is deployed. Automated methodologies, which surpass the constraints of human creativity, rely on a triad of models: an attacker, a target, and a judge. This primer provides a concise summary and comparison of several state-of-the-art algorithmic red-teaming approaches, including TAP, PAIR, Crescendo, and AutoDAN-Turbo. These techniques, such as prompt injection and jailbreaking, aim to push LLMs beyond their intended safe behavior. Critically, the non-deterministic nature of LLMs presents a key challenge when they are used as judges, potentially rendering evaluations unreliable. The paper stresses that red teaming is not a one-time exercise and is particularly vital for AI agents that use LLMs as components, since a single failure can attract significant public scrutiny.
• The position paper establishes red teaming as a mandatory practice to proactively identify and mitigate vulnerabilities in Large Language Models (LLMs) before they fail in real-world applications.
• It distinguishes manual red teaming, which is constrained by human creativity, from automated red teaming, which leverages LLMs themselves to find weaknesses in a “fire meets fire” approach.
• It explains the typical architecture for automated red teaming: a triad of models consisting of an attacker that generates adversarial prompts, a target model under test, and a judge model that evaluates the success of each attack.
• It provides a concise summary and comparison of several state-of-the-art algorithmic red-teaming approaches, including TAP, PAIR, Crescendo, and AutoDAN-Turbo.
• The manuscript highlights a key challenge in the field: the non-deterministic nature of LLMs makes them potentially unreliable judges. It proposes mitigations such as aggregating verdicts from judge models with different architectures, or even models sharing an architecture but differing in token-optimization and parameterization techniques, along with rule alignment and human feedback.
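The attacker–target–judge triad described above, combined with verdict aggregation across multiple judges, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three `*_model` functions are hypothetical stand-ins (not from the paper or any real API) that would each wrap an LLM call in practice, and the judge here is simulated with random verdicts purely so the loop is runnable.

```python
import random
from collections import Counter

# Hypothetical stand-ins for real model endpoints; in a real system each
# would wrap an LLM API call (attacker, target, and several judge models).
def attacker_model(goal, history):
    # Generates the next adversarial prompt toward the red-teaming goal.
    return f"Attempt {len(history) + 1}: {goal}"

def target_model(prompt):
    # The system under test; returns its (possibly refusing) response.
    return f"Response to: {prompt}"

def judge_models(goal, response):
    # Several judges score the same exchange; aggregating their verdicts
    # mitigates the unreliability of any single non-deterministic judge.
    # Simulated here with random booleans for illustration only.
    return [random.random() < 0.1 for _ in range(3)]

def red_team_loop(goal, max_turns=5):
    """One automated red-teaming episode: attacker vs. target, judged by vote."""
    history = []
    for _ in range(max_turns):
        prompt = attacker_model(goal, history)
        response = target_model(prompt)
        votes = judge_models(goal, response)
        jailbroken = Counter(votes)[True] > len(votes) // 2  # majority vote
        history.append((prompt, response, jailbroken))
        if jailbroken:
            return True, history   # attack succeeded
    return False, history          # target held within the turn budget
```

The majority vote over `judge_models` is one concrete form of the aggregation the paper proposes; swapping in judges of different architectures or parameterizations only changes what that function calls, not the loop itself.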