The Art of the Jailbreak:
Formulating Attacks Beyond Binary Scoring

A unified framework for generating, categorizing, and continuously evaluating adversarial jailbreak prompts — grounded in cybersecurity taxonomy and the Optimus two-dimensional metric.

114,000 Adversarial Prompts

912 Jailbreak Strategies

14 Attack Categories

6 LLM Labelers

Adversarial prompt composition: strategies × harmful seeds → LLM → composed jailbreak prompts

Abstract

Jailbreak attacks — adversarial prompts that bypass LLM alignment through purely linguistic manipulation — pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically.

This paper addresses that gap with three tightly integrated contributions: a 114,000-prompt cybersecurity-grounded compositional dataset, automated jailbreak generators via instruction fine-tuning, and Optimus — a two-dimensional, training-free metric J(S, H) that jointly captures semantic similarity and harmfulness probability.

Our generators achieve a perplexity of 24–39, versus 40–140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29–0.51 (LlamaPromptGuard-2-86M). Optimus exposes a continuous "stealth-optimal" regime, at (S* ≈ 0.57, H* ≈ 0.43), that binary ASR collapses into a single success-or-failure outcome.

CONTRIBUTION 01

Large-Scale Compositional Dataset

912 in-the-wild strategies × 125 harmful seeds = 114k adversarial prompts, each labeled to one of 14 cybersecurity attack categories via six-model majority vote.
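The composition and labeling arithmetic above can be sketched as follows. This is a minimal illustration, assuming placeholder strategy and seed identifiers in place of the real texts. The function names `compose` and `majority_label` are hypothetical, and the tie handling is an assumption, since the source does not say how a split vote is resolved.

```python
from collections import Counter
from itertools import product

N_STRATEGIES = 912  # in-the-wild strategies (from the text)
N_SEEDS = 125       # harmful seeds (from the text)


def compose(strategies, seeds):
    """Pair every strategy with every seed (Cartesian product)."""
    return list(product(strategies, seeds))


def majority_label(votes):
    """Return the most frequent label, or None if the vote is empty or tied.

    Returning None on a tie is an assumption made for this sketch.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Placeholder identifiers stand in for the real strategy and seed texts.
strategies = [f"strategy_{i}" for i in range(N_STRATEGIES)]
seeds = [f"seed_{j}" for j in range(N_SEEDS)]
pairs = compose(strategies, seeds)
print(len(pairs))  # 114000

# Six hypothetical labeler votes for one composed prompt.
votes = ["Phishing", "Phishing", "Social Engineering",
         "Phishing", "Malware", "Phishing"]
print(majority_label(votes))  # Phishing
```

With six labelers a 3-3 split is possible, which is why the sketch surfaces ties explicitly instead of silently picking one label.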

CONTRIBUTION 02

Automated Jailbreak Generation

Category-aware generator LLMs instruction-fine-tuned on Optimus-filtered subsets — synthesizing fluent jailbreaks from a simple harmful seed with no templates or gradient search.

CONTRIBUTION 03

Optimus: Two-Dimensional Metric

A continuous metric J(S, H) that requires no fine-tuning, exposes the stealth-optimal regime, and gives defenders per-category prioritization evidence that binary ASR cannot supply.
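To illustrate why a two-dimensional score carries more information than binary ASR, the sketch below places prompts in the (S, H) plane and ranks them by distance to the stealth-optimal point reported in the abstract. The Euclidean distance is a stand-in chosen only for illustration and is not the definition of J(S, H), which this page does not give. The `ScoredPrompt` type and all example scores are hypothetical.

```python
import math
from dataclasses import dataclass

# Stealth-optimal point reported in the abstract (approximate values).
S_STAR, H_STAR = 0.57, 0.43


@dataclass
class ScoredPrompt:
    name: str
    similarity: float   # S: semantic similarity, in [0, 1]
    harmfulness: float  # H: harmfulness probability, in [0, 1]
    succeeded: bool     # binary attack outcome, as ASR would record it


def distance_to_optimum(p):
    """Hypothetical proxy: Euclidean distance to (S*, H*). Not the paper's J."""
    return math.hypot(p.similarity - S_STAR, p.harmfulness - H_STAR)


# Made-up scores: both prompts "succeed", so binary ASR cannot separate them.
prompts = [
    ScoredPrompt("overt", similarity=0.95, harmfulness=0.97, succeeded=True),
    ScoredPrompt("stealthy", similarity=0.60, harmfulness=0.40, succeeded=True),
]

asr = sum(p.succeeded for p in prompts) / len(prompts)
ranked = sorted(prompts, key=distance_to_optimum)
print(asr)             # 1.0
print(ranked[0].name)  # stealthy
```

Both prompts count identically toward ASR, while their positions in the (S, H) plane differ sharply; that difference is the information a binary score discards.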

Results

Evaluation Results

Our compositional fine-tuned models outperform token-level baselines across fluency, safety-evasion, and adversarial quality while operating in the stealth-optimal Optimus regime.

Llama-3 (Ours)
ASR: 0.84
PPL (lower = better): 24.3
LlamaPG-86M Mal.: 0.51

Tulu-3 (Ours)
ASR: 0.98
PPL (lower = better): 38.7
LlamaPG-86M Mal.: 0.90

AutoDAN (Baseline)
ASR: 0.99
PPL (lower = better): 104.6
LlamaPG-86M Mal.: 0.98

Method	Model	ASR ↑	PPL ↓	StrongReject ↑	HarmBench	LlamaPG-86M
Ours	Llama-3	0.84	24.3	0.22	40.3%	0.51 Mal
Ours	Tulu-3	0.98	38.7	0.21	46.4%	0.90 Mal
AutoDAN	Vicuna-7B	0.99	104.6	0.15	43.3%	0.98 Mal
AutoDAN	Llama-2	0.38	141.5	0.12	29.3%	0.92 Mal
AmpleGCG	Vicuna-7B	0.15	43.0	0.11	13.7%	1.00 Mal
PAIR	—	—	43.6	—	—	0.64 Ben
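For readers unfamiliar with the PPL column, the sketch below shows the standard definition of perplexity: the exponential of the mean negative log-likelihood per token. It assumes the per-token natural-log probabilities are already available; which scoring model produced the numbers in the table is not stated here, and the example values are made up.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities, not taken from any model in the table.
fluent = [math.log(0.25)] * 8    # every token assigned probability 0.25
garbled = [math.log(0.01)] * 8   # every token assigned probability 0.01
print(round(perplexity(fluent), 6))   # 4.0
print(round(perplexity(garbled), 6))  # 100.0
```

Lower perplexity means the scoring model finds the text more predictable, which is why fluent natural-language prompts score lower than token-level adversarial suffixes.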

Attack Taxonomy

14 Cybersecurity Attack Categories

Every prompt in our dataset maps to one of 14 categories grounded in established cybersecurity taxonomies, including MITRE ATT&CK techniques T1059, T1068, T1110, T1566, and T1041.

ID	Category	κ	Agreement
A01	Backdoor Implantation	0.80	Substantial
A02	Data Exfiltration	0.58	Moderate
A03	Denial of Service	0.75	Substantial
A04	Exploit Kit Delivery	0.48	Moderate
A05	Fileless Attack	N/A	N/A
A06	Keylogging	0.80	Substantial
A07	Malware	0.37	Fair
A08	Phishing	0.68	Substantial
A09	Social Engineering	0.60	Moderate
A10	Password Cracking	0.80	Substantial
A11	Privilege Escalation	0.60	Moderate
A12	Remote Code Execution	0.65	Substantial
A13	USB Based Attack	0.65	Substantial
A14	Other	0.58	Moderate
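The agreement labels in the category list above are consistent with the widely used Landis and Koch bands for interpreting κ. The sketch below encodes those bands and checks them against the listed values. The source names neither the scale nor the κ variant, so the mapping is an assumption, and the function name `agreement_band` is hypothetical.

```python
def agreement_band(kappa):
    """Map a kappa value to a Landis-Koch style agreement label."""
    if kappa < 0.0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"


# Values and labels copied from the category list; Fileless Attack (N/A) is omitted.
listed = {
    "Backdoor Implantation": (0.80, "Substantial"),
    "Data Exfiltration": (0.58, "Moderate"),
    "Denial of Service": (0.75, "Substantial"),
    "Exploit Kit Delivery": (0.48, "Moderate"),
    "Keylogging": (0.80, "Substantial"),
    "Malware": (0.37, "Fair"),
    "Phishing": (0.68, "Substantial"),
    "Social Engineering": (0.60, "Moderate"),
    "Password Cracking": (0.80, "Substantial"),
    "Privilege Escalation": (0.60, "Moderate"),
    "Remote Code Execution": (0.65, "Substantial"),
    "USB Based Attack": (0.65, "Substantial"),
    "Other": (0.58, "Moderate"),
}

mismatches = [name for name, (k, label) in listed.items()
              if agreement_band(k) != label]
print(mismatches)  # []
```

Categories in the lower bands, such as Malware (κ = 0.37), are the ones where the labelers disagreed most.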
