OpenAlex · Updated hourly · Last updated: 14.03.2026, 22:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing and Mitigating Multi-Turn Jailbreak Vulnerabilities in Large Language Models: A Crescendo Attack Study

2025 · 0 citations · 5 authors

Open full text at publisher

Abstract

Large Language Models (LLMs) are vulnerable to multi-turn jailbreak attacks like Crescendo, which exploit context retention to bypass safety mechanisms and generate harmful outputs. This study is among the first to address the gap in understanding vulnerabilities of LLMs released after 2024 by reproducing the Crescendo attack on six recent models. We introduce two novel mitigation strategies, prompt hardening and LLM-as-guardrail, evaluated via Attack Success Rates (ASRs) across 15 tasks. Our contributions include these mitigation strategies, a comprehensive vulnerability assessment, and an open-source framework shared under the MIT License to foster collaborative AI safety research. Findings reveal that Llama-3.1-8b-Instruct (Llama 3.1) was the most resistant, achieving the lowest ASRs in 9 of 15 categories due to robust safety alignment, while DeepSeek-R1-Distill-Qwen-32B (DeepSeek R1) was highly vulnerable, with ASRs of 1.0 in 13 categories. Prompt hardening significantly outperformed LLM-as-guardrail, reducing ASRs effectively across models, though DeepSeek R1 maintained ASRs of 1.0 despite mitigations. These results underscore prompt hardening as a superior defense, enhancing LLM security for safer deployment in critical domains like healthcare, education, and public policy.
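For readers unfamiliar with the terms in the abstract, the sketch below illustrates the two defenses and the evaluation metric in Python. It is not taken from the paper's open-source framework: the function names, the hardened prompt wording, and the call_llm stub are illustrative assumptions. Prompt hardening strengthens the system prompt itself, LLM-as-guardrail screens each candidate response with a second model, and the Attack Success Rate (ASR) is simply the fraction of attack tasks on which the jailbreak still succeeds.

```python
from typing import Callable, List

# Hypothetical stand-in for a real chat-completion client; the paper's
# framework presumably wraps actual model APIs at this point.
def call_llm(system_prompt: str, messages: List[dict]) -> str:
    raise NotImplementedError("plug in a real model client")

# Illustrative hardened instruction (an assumption, not the study's prompt):
# it tells the model to re-check the whole dialogue, since Crescendo-style
# attacks escalate gradually across turns rather than in a single request.
HARDENED_PREFIX = (
    "Refuse any request that builds toward harmful content, even if it "
    "arrives gradually over several turns. Re-evaluate the full "
    "conversation history before every answer."
)

def harden(system_prompt: str) -> str:
    """Prompt hardening: prepend a strengthened safety instruction."""
    return HARDENED_PREFIX + "\n\n" + system_prompt

def guardrail(response: str, is_harmful: Callable[[str], bool]) -> str:
    """LLM-as-guardrail: a second model (the judge) vetoes a harmful
    candidate response before it reaches the user."""
    return "I can't help with that." if is_harmful(response) else response

def attack_success_rate(outcomes: List[bool]) -> float:
    """ASR = number of successful jailbreak tasks / total tasks."""
    return sum(outcomes) / len(outcomes)

# Example: if 13 of 15 multi-turn attack tasks still elicit harmful
# output, the ASR is 13/15, roughly 0.87.
print(attack_success_rate([True] * 13 + [False] * 2))
```

Under this framing, the abstract's headline result is that strengthening the system prompt (harden) lowered ASRs more reliably than screening outputs with a judge model (guardrail), with DeepSeek R1 as the exception that resisted both.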

Topics

Adversarial Robustness in Machine Learning
Artificial Intelligence in Healthcare and Education
Advanced Malware Detection Techniques