Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
0
Zitationen
1
Autoren
2024
Jahr
Abstract
Large Language Models (LLMs), such as ChatGPT, encounter `jailbreak' challenges, wherein safeguards are circumvented to generate ethically harmful prompts. This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts, addressing the significant complexity and computational costs associated with conventional methods. Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM, predicated on the hypothesis that LLMs can autonomously generate expressions that evade safeguards. Through experiments conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions and proved robust against model updates. The jailbreak prompts generated were not only naturally-worded and succinct but also challenging to defend against. These findings suggest that the creation of effective jailbreak prompts is less complex than previously believed, underscoring the heightened risk posed by black-box jailbreak attacks.
Ähnliche Arbeiten
Rethinking the Inception Architecture for Computer Vision
2016 · 30.327 Zit.
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24.399 Zit.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21.297 Zit.
CBAM: Convolutional Block Attention Module
2018 · 21.274 Zit.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
2015 · 18.492 Zit.