The Art of the Jailbreak:
Formulating Attacks Beyond Binary Scoring

A unified framework for generating, categorizing, and continuously evaluating adversarial jailbreak prompts — grounded in cybersecurity taxonomy and the Optimus two-dimensional metric.

114,000 Adversarial Prompts
912 Jailbreak Strategies
14 Attack Categories
6 LLM Labelers
Adversarial prompt composition: strategies × harmful seeds → LLM → composed jailbreak prompts


Full Pipeline — End to End

Strategies compose with harmful seeds → 6 LLMs categorize simultaneously while optimal strategies are ranked → Optimus scores each prompt → Data tiering → Instruction fine-tuning.

Abstract

Jailbreak attacks — adversarial prompts that bypass LLM alignment through purely linguistic manipulation — pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically.

This paper addresses that gap with three tightly integrated contributions: a 114,000-prompt cybersecurity-grounded compositional dataset, automated jailbreak generators via instruction fine-tuning, and Optimus — a two-dimensional, training-free metric J(S, H) that jointly captures semantic similarity and harmfulness probability.

Our generators achieve perplexity 24–39 versus 40–140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29–0.51 (LlamaPromptGuard-2-86M). Optimus exposes a continuous "stealth-optimal" regime around (S* ≈ 0.57, H* ≈ 0.43) that binary ASR collapses into a single success-or-failure label.

CONTRIBUTION 01
Large-Scale Compositional Dataset
912 in-the-wild strategies × 125 harmful seeds = 114k adversarial prompts, each labeled to one of 14 cybersecurity attack categories via six-model majority vote.
CONTRIBUTION 02
Automated Jailbreak Generation
Category-aware generator LLMs instruction-fine-tuned on Optimus-filtered subsets — synthesizing fluent jailbreaks from a simple harmful seed with no templates or gradient search.
CONTRIBUTION 03
Optimus: Two-Dimensional Metric
A continuous, training-free metric J(S, H) that exposes the stealth-optimal regime and gives defenders per-category prioritization evidence that binary ASR cannot supply.
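The dataset arithmetic and the labeling step in Contribution 01 can be illustrated with a minimal sketch. The page states only that labels come from a six-model majority vote; the strict-majority rule and the `None` result for unresolved votes below are assumptions made for illustration, as are the function names.

```python
from collections import Counter
from typing import Optional, Sequence

N_STRATEGIES = 912  # in-the-wild strategies, as stated on this page
N_SEEDS = 125       # harmful seeds, as stated on this page


def grid_size(n_strategies: int, n_seeds: int) -> int:
    """Size of the full strategy x seed composition grid."""
    return n_strategies * n_seeds


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the category chosen by more than half of the labelers.

    Assumption: a strict majority is required; anything else
    (for example a 3-3 split among six labelers) yields None.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


if __name__ == "__main__":
    print(grid_size(N_STRATEGIES, N_SEEDS))
    print(majority_label(["A8", "A8", "A8", "A8", "A9", "A14"]))
```

A tie policy matters with an even number of labelers: six votes can split 3-3, so any real pipeline needs some rule for that case, whichever rule the authors actually chose.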
System Architecture

Two-Phase Pipeline

An animated walkthrough of how the system transforms raw jailbreak strategies and harmful seeds into fine-tuned generator models.

Phase 1: Composition & Synthesis

Core database: WildJailbreak, 262K pairs
Database: JailBreakV-28K, 125 seeds
Step 1. Strategy Extraction: 912 strategies extracted (forceful language, deceptive framing, roleplay persona)
Step 2. Automated Composition: N × M grid, 114,000 prompts
Step 3. Seed Categorization (6-LLM majority vote): A1 Backdoor, A2 Data Exfil, A3 DoS, A4 Exploit Kit, A5 Fileless, A6 Keylogging, A7 Malware, A8 Phishing, A9 Social Eng, A10 Pwd Crack, A11 Priv Escal, A12 RCE, A13 USB Attack, A14 Other

Phase 2: Optimus Scoring

Step 4. Optimus: J(S, H) = Base(S,H) · P_S(S) · P_H(H), with the stealth-optimal point at S* ≈ 0.57, H* ≈ 0.43
Step 5. Optimal Strategy Selection: S1 roleplay persona ●●●, S2 forceful language ●●○, S3 deceptive framing ●○○
Step 6. Instruction Fine-Tuning: Llama-3, Tulu-3, Vicuna-7B; LoRA, DoRA; 24k samples
Output: Jailbreak Generator LLM (PPL: 24–39, ASR: 0.84–0.98)

Data Tiering (Optimus score range)

Safe: < 0.212
Weak: 0.212 – 0.283
Moderate: 0.283 – 0.377 ✓
Optimal: 0.377 – 0.471 ✓
24,220 prompts selected for fine-tuning (Moderate + Optimal), split 80% train, 10% validation, 10% test.
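The data-tiering thresholds and the 80/10/10 split from the pipeline diagram can be sketched as follows. The threshold values and the 24,220 figure come from this page; treating each lower bound as inclusive, mapping every score at or above 0.377 to Optimal, and the function names are assumptions for illustration.

```python
from bisect import bisect_right
from typing import Tuple

# Tier boundaries on the Optimus score, as listed in the pipeline diagram.
TIER_BOUNDS = (0.212, 0.283, 0.377)
TIER_NAMES = ("Safe", "Weak", "Moderate", "Optimal")
SELECTED_TIERS = {"Moderate", "Optimal"}  # tiers kept for fine-tuning


def tier(score: float) -> str:
    """Map an Optimus score to its tier (lower bounds assumed inclusive)."""
    return TIER_NAMES[bisect_right(TIER_BOUNDS, score)]


def selected_for_finetuning(score: float) -> bool:
    """True when the score falls in a tier used for fine-tuning."""
    return tier(score) in SELECTED_TIERS


def split_sizes(n: int) -> Tuple[int, int, int]:
    """80/10/10 train/val/test sizes, using integer arithmetic."""
    train = n * 8 // 10
    val = n // 10
    test = n - train - val  # remainder goes to the test split
    return train, val, test


if __name__ == "__main__":
    print(tier(0.30), selected_for_finetuning(0.30))
    print(split_sizes(24220))
```

Integer arithmetic is used for the split so that the three sizes always sum to the input count, with any rounding remainder assigned to the test split.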
Results

Evaluation Results

Our compositional fine-tuned models outperform token-level baselines across fluency, safety-evasion, and adversarial quality while operating in the stealth-optimal Optimus regime.

Llama-3 (Ours)
ASR: 0.84
PPL (lower = better): 24.3
LlamaPG-86M Mal.: 0.51
Tulu-3 (Ours)
ASR: 0.98
PPL (lower = better): 38.7
LlamaPG-86M Mal.: 0.90
AutoDAN (Baseline)
ASR: 0.99
PPL (lower = better): 104.6
LlamaPG-86M Mal.: 0.98
Method     Model       ASR ↑   PPL ↓   StrongReject ↑   HarmBench %   LPG-86M
Ours       Llama-3     0.84    24.3    0.22             40.3%         0.51 Mal
Ours       Tulu-3      0.98    38.7    0.21             46.4%         0.90 Mal
AutoDAN    Vicuna-7B   0.99    104.6   0.15             43.3%         0.98 Mal
AutoDAN    Llama-2     0.38    141.5   0.12             29.3%         0.92 Mal
AmpleGCG   Vicuna-7B   0.15    43.0    0.11             13.7%         1.00 Mal
PAIR       -           -       43.6    -                -             0.64 Ben
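For readers unfamiliar with the PPL column, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition only; the page does not say which reference language model or tokenizer produced the reported values, so the input here is simply a hypothetical list of per-token natural-log probabilities.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower values mean the text
    looks more fluent to the scoring model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token.
    print(perplexity([math.log(0.25)] * 4))
```

Under this definition, a uniform per-token probability p gives a perplexity of 1/p, which is why lower values in the table indicate text that a language model finds more predictable.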
Attack Taxonomy

14 Cybersecurity Attack Categories

Every prompt in our dataset maps to one of 14 categories grounded in established cybersecurity taxonomies, including MITRE ATT&CK techniques T1059, T1068, T1110, T1566, and T1041.

A1
Backdoor Implantation
κ = 0.80 Substantial
A2
Data Exfiltration
κ = 0.58 Moderate
A3
Denial of Service
κ = 0.75 Substantial
A4
Exploit Kit Delivery
κ = 0.48 Moderate
A5
Fileless Attack
κ = N/A
A6
Keylogging
κ = 0.80 Substantial
A7
Malware
κ = 0.37 Fair
A8
Phishing
κ = 0.68 Substantial
A9
Social Engineering
κ = 0.60 Moderate
A10
Password Cracking
κ = 0.80 Substantial
A11
Privilege Escalation
κ = 0.60 Moderate
A12
Remote Code Execution
κ = 0.65 Substantial
A13
USB Based Attack
κ = 0.65 Substantial
A14
Other
κ = 0.58 Moderate
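The verbal labels attached to the κ values above are consistent with the widely used Landis and Koch agreement bands, although the page does not name the scale or the kappa variant it uses. A minimal sketch of that mapping, offered as an assumption about how the labels were derived:

```python
def kappa_band(kappa: float) -> str:
    """Landis and Koch style label for an agreement coefficient."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1.0")
    if kappa < 0.0:
        return "Poor"
    bands = (
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    )
    # Return the first band whose upper bound covers the value.
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "Almost perfect"


if __name__ == "__main__":
    for value in (0.80, 0.58, 0.37):
        print(value, kappa_band(value))
```

Under these bands the per-category values listed above map to the same labels shown on the page, for example 0.37 to Fair and 0.60 to Moderate.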