Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Published in arXiv 2026, 2026

Vision-language models (VLMs) often exhibit weakened safety alignment when the visual modality is introduced—even adding a blank image can substantially increase jailbreak success rates. This paper challenges the prevailing “safety perception failure” hypothesis by analyzing VLM jailbreaks using explicitly harmful multimodal data. We observe that VLMs can clearly distinguish harmful from benign inputs in their representation space, and that jailbreak samples form a distinct internal state separable from both benign and refusal states. This suggests that jailbreaks do not arise from a failure to recognize harmful intent; instead, the visual modality shifts representations toward a specific jailbreak state where the model fails to trigger refusal despite recognizing the danger. We formalize this as the jailbreak-related representation shift—the component of the image-induced shift along a defined jailbreak direction. Based on this understanding, we propose JRS-Rem (Jailbreak-Related Shift Removal), a training-free defense that removes the jailbreak-related shift at inference time. Experiments across three VLMs and seven datasets show that JRS-Rem significantly enhances safety while preserving utility on benign tasks.
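
To make the defense concrete, below is a minimal sketch of the core operation the abstract describes: projecting the image-induced representation shift onto a jailbreak direction and subtracting that component at inference time. All names here (`remove_jailbreak_shift`, `h_multimodal`, `h_text_only`, `jailbreak_dir`) are illustrative assumptions, and the sketch assumes the jailbreak direction has already been estimated (e.g., from contrasting jailbreak and refusal representations); it is not the paper's actual implementation.

```python
import torch

def remove_jailbreak_shift(h_multimodal: torch.Tensor,
                           h_text_only: torch.Tensor,
                           jailbreak_dir: torch.Tensor) -> torch.Tensor:
    """Remove the jailbreak-related component of the image-induced shift.

    h_multimodal : hidden state for the (image + text) input, shape (d,)
    h_text_only  : hidden state for the same prompt without the image, shape (d,)
    jailbreak_dir: vector pointing toward the jailbreak region of
                   representation space, shape (d,)  [assumed precomputed]
    """
    d = jailbreak_dir / jailbreak_dir.norm()   # normalize the jailbreak direction
    shift = h_multimodal - h_text_only         # image-induced representation shift
    jrs = (shift @ d) * d                      # component of the shift along the jailbreak direction
    return h_multimodal - jrs                  # hidden state with that component removed
```

In this reading, the edit is training-free because it only requires a fixed direction vector and a projection at inference time; the orthogonal part of the image-induced shift, which presumably carries the visual content, is left untouched.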

Recommended citation: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Ruiyang Qin, Qingzhuo Wang, Dongrui Liu, Wen Shen. (2026). "Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift." arXiv 2026.
Download Paper