A unified framework for generating, categorizing, and continuously evaluating adversarial jailbreak prompts — grounded in cybersecurity taxonomy and the Optimus two-dimensional metric.
REAL-TIME TOKEN GENERATION — strategy × seed → LLM → token stream
Strategies are composed with harmful seeds → six LLMs categorize the prompts in parallel while the optimal strategies are ranked → Optimus scores each prompt → data tiering → instruction fine-tuning.
Jailbreak attacks — adversarial prompts that bypass LLM alignment through purely linguistic manipulation — pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically.
This paper addresses that gap with three tightly integrated contributions: a 114,000-prompt cybersecurity-grounded compositional dataset, automated jailbreak generators via instruction fine-tuning, and Optimus — a two-dimensional, training-free metric J(S, H) that jointly captures semantic similarity and harmfulness probability.
Our generators achieve perplexities of 24–39, versus 40–140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29–0.51 against LlamaPromptGuard-2-86M. Optimus exposes a continuous "stealth-optimal" regime, (S* ≈ 0.57, H* ≈ 0.43), that a binary ASR collapses entirely.
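For readers less familiar with the fluency number quoted above, perplexity is the exponential of the mean per-token negative log-likelihood under a reference language model. The sketch below shows only that arithmetic. The log-probabilities are made-up illustrative values, and the choice of reference model is not stated here, so both are assumptions rather than the paper's setup.

```python
import math

def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical per-token log-probabilities (not taken from the paper).
fluent_prompt = [-3.0, -3.2, -3.4, -3.2]    # reads like natural language
garbled_prompt = [-4.5, -4.9, -4.6, -4.8]   # reads like a token-level suffix

print(f"fluent:  {perplexity(fluent_prompt):.1f}")
print(f"garbled: {perplexity(garbled_prompt):.1f}")
```

Lower values mean the reference model finds the text more predictable, which is why fluent natural-language prompts score lower than token-level adversarial suffixes.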
An animated walkthrough of how the system transforms raw jailbreak strategies and harmful seeds into fine-tuned generator models.
Our compositional fine-tuned models outperform token-level baselines across fluency, safety-evasion, and adversarial quality while operating in the stealth-optimal Optimus regime.
| Method | Model | ASR ↑ | PPL ↓ | StrongReject ↑ | HarmBench (%) | LlamaPromptGuard-2-86M |
|---|---|---|---|---|---|---|
| OURS | Llama-3 | 0.84 | 24.3 | 0.22 | 40.3% | 0.51 Mal |
| OURS | Tulu-3 | 0.98 | 38.7 | 0.21 | 46.4% | 0.90 Mal |
| AutoDAN | Vicuna-7B | 0.99 | 104.6 | 0.15 | 43.3% | 0.98 Mal |
| AutoDAN | Llama-2 | 0.38 | 141.5 | 0.12 | 29.3% | 0.92 Mal |
| AmpleGCG | Vicuna-7B | 0.15 | 43.0 | 0.11 | 13.7% | 1.00 Mal |
| PAIR | — | — | 43.6 | — | — | 0.64 Ben |
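As a quick sanity check on the perplexity column, the rows above can be grouped by method and reduced to a minimum and maximum. This is a minimal sketch: the values are copied from the table, while the function name `ppl_range_by_method` is purely illustrative.

```python
# (method, model, perplexity) copied from the results table; PAIR lists no model.
ROWS = [
    ("OURS", "Llama-3", 24.3),
    ("OURS", "Tulu-3", 38.7),
    ("AutoDAN", "Vicuna-7B", 104.6),
    ("AutoDAN", "Llama-2", 141.5),
    ("AmpleGCG", "Vicuna-7B", 43.0),
    ("PAIR", None, 43.6),
]

def ppl_range_by_method(rows):
    """Map each method to the (min, max) perplexity over its rows."""
    ranges = {}
    for method, _model, ppl in rows:
        lo, hi = ranges.get(method, (ppl, ppl))
        ranges[method] = (min(lo, ppl), max(hi, ppl))
    return ranges

ranges = ppl_range_by_method(ROWS)
for method, (lo, hi) in ranges.items():
    print(f"{method}: {lo} to {hi}")
```

In this table, both rows for our generators sit below every baseline row on perplexity, including PAIR.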
Every prompt in our dataset maps to one of 14 categories grounded in established cybersecurity taxonomies, including MITRE ATT&CK techniques T1059, T1068, T1110, T1566, and T1041.
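For reference, identifiers of the form T#### are MITRE ATT&CK technique IDs, and each technique belongs to one or more tactics. The lookup below is a small sketch using the technique names and primary tactics from the ATT&CK Enterprise matrix. How the dataset's 14 categories map onto these IDs is not specified here, so no such mapping is assumed, and the helper `describe` is hypothetical.

```python
# Technique ID -> (technique name, primary tactic), per MITRE ATT&CK Enterprise.
ATTACK_TECHNIQUES = {
    "T1059": ("Command and Scripting Interpreter", "Execution"),
    "T1068": ("Exploitation for Privilege Escalation", "Privilege Escalation"),
    "T1110": ("Brute Force", "Credential Access"),
    "T1566": ("Phishing", "Initial Access"),
    "T1041": ("Exfiltration Over C2 Channel", "Exfiltration"),
}

def describe(technique_id):
    """Return a one-line, human-readable label for a known technique ID."""
    if technique_id not in ATTACK_TECHNIQUES:
        raise KeyError(f"unknown technique ID: {technique_id}")
    name, tactic = ATTACK_TECHNIQUES[technique_id]
    return f"{technique_id}: {name} (tactic: {tactic})"

for tid in ATTACK_TECHNIQUES:
    print(describe(tid))
```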