Jailbreak Taxonomy

We classify the methods based on the following two criteria:

1. Whether the method modifies the original forbidden question in order to bypass the target LLM’s alignment mechanisms.
2. If the question is modified, how the method generates the modified prompts, for example through translation or by adding prefixes and suffixes.

Jailbreak Taxonomy	Jailbreak Method	Require White-Box Access?	Modify the Original Question?	Initial Jailbreak Seeds?
Human-Based	AIM	✗	✓	/
Human-Based	Devmoderanti	✗	✓	/
Human-Based	Devmodev2	✗	✓	/
Obfuscation-Based	Base64	✗	✓	/
Obfuscation-Based	Combination	✗	✓	/
Obfuscation-Based	Zulu	✗	✓	/
Obfuscation-Based	DrAttack	✗	✓	✗
Heuristic-Based	AutoDAN	✓	✓	✓
Heuristic-Based	GPTFuzz	✗	✓	✓
Heuristic-Based	LAA	✗	✓	✓
Feedback-Based	GCG	✓	✓	✗
Feedback-Based	COLD	✓	✓	✗
Feedback-Based	PAIR	✗	✓	✗
Feedback-Based	TAP	✗	✓	✗
Fine-Tuning-Based	Masterkey	✗	✓	✓
Fine-Tuning-Based	AdvPrompter	✗	✓	✗
Generation-Parameter-Based	Generation Exploitation	✓	✗	/
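
To make the table easier to query, it can be encoded as a small set of records. The following is a minimal sketch for illustration only, not code from the benchmark itself: the names `Method`, `METHODS`, `black_box`, and `by_category` are hypothetical, and mapping "/" to `None` assumes that "/" marks a column that does not apply to the method. The rows are copied from the table above, with "|" used as a field separator.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Method:
    category: str                  # taxonomy category from the first column
    name: str                      # jailbreak method name
    white_box: bool                # requires white-box access?
    modifies_question: bool        # modifies the original question?
    initial_seeds: Optional[bool]  # None stands for "/" (assumed: not applicable)


# Rows copied from the table; fields are separated by "|".
ROWS = """\
Human-Based|AIM|✗|✓|/
Human-Based|Devmoderanti|✗|✓|/
Human-Based|Devmodev2|✗|✓|/
Obfuscation-Based|Base64|✗|✓|/
Obfuscation-Based|Combination|✗|✓|/
Obfuscation-Based|Zulu|✗|✓|/
Obfuscation-Based|DrAttack|✗|✓|✗
Heuristic-Based|AutoDAN|✓|✓|✓
Heuristic-Based|GPTFuzz|✗|✓|✓
Heuristic-Based|LAA|✗|✓|✓
Feedback-Based|GCG|✓|✓|✗
Feedback-Based|COLD|✓|✓|✗
Feedback-Based|PAIR|✗|✓|✗
Feedback-Based|TAP|✗|✓|✗
Fine-Tuning-Based|Masterkey|✗|✓|✓
Fine-Tuning-Based|AdvPrompter|✗|✓|✗
Generation-Parameter-Based|Generation Exploitation|✓|✗|/
"""

_FLAGS = {"✓": True, "✗": False, "/": None}


def _parse(line: str) -> Method:
    """Turn one "|"-separated row into a Method record."""
    category, name, white_box, modifies, seeds = line.split("|")
    return Method(category, name, _FLAGS[white_box], _FLAGS[modifies], _FLAGS[seeds])


METHODS: List[Method] = [_parse(line) for line in ROWS.strip().splitlines()]


def black_box(methods: List[Method]) -> List[Method]:
    """Methods that do not require white-box access to the target LLM."""
    return [m for m in methods if not m.white_box]


def by_category(methods: List[Method]) -> Dict[str, List[str]]:
    """Group method names under their taxonomy category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for m in methods:
        groups[m.category].append(m.name)
    return dict(groups)
```

For example, `[m.name for m in METHODS if not m.modifies_question]` selects the methods that leave the original question unchanged, and `by_category(METHODS)` recovers the grouping in the first column.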