Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Because preference optimization is effective at aligning MLLMs with human preferences, safety-related preference data for MLLMs is urgently needed. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, a conversational format, and paired responses ranked by human feedback. We also make two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Moreover, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (by 14.5% on MM-SafetyBench and by 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach.
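
For readers unfamiliar with preference optimization, the sketch below shows the standard DPO objective computed on ranked response pairs such as those in MMSafe-PO. This is generic DPO background, not the BPO variant proposed in the paper; all function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on a batch of ranked response pairs.

    Each input is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy vs. reference probabilities.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```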

WARNING: This paper includes images and text that may be considered offensive.

MMSafe-PO Dataset

Table 1: Comparison of MMSafe-PO with datasets on preference optimization and the safety of MLLMs. “-” denotes inapplicable results. MMSafe-PO features multimodal instructions, the conversational format, and paired responses ranked by humans.
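
For orientation, here is a minimal loading sketch, assuming the dataset is released on the Hugging Face Hub. The repository path and the field names in the comments are hypothetical placeholders, not the confirmed schema; consult the dataset files for the actual layout.

```python
from datasets import load_dataset

# Hypothetical Hub path -- replace with the real repository id.
ds = load_dataset("path/to/MMSafe-PO", split="train")

example = ds[0]
print(example["instruction"])  # multimodal instruction text (assumed field)
print(example["chosen"])       # human-preferred, harmless response (assumed field)
print(example["rejected"])     # dispreferred, harmful response (assumed field)
```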

MMSafe-PO Construction

Figure 1: Overall pipeline for MMSafe-PO dataset construction.

We construct the MMSafe-PO dataset through modality interpretation, and an overview of this process is shown in Figure 1.
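
To make "modality interpretation" concrete, below is a minimal sketch of one plausible reading of the idea: a key phrase in a text-only preference instruction is replaced with an image depicting it, so the intent is carried jointly by the two modalities while the ranked responses are kept. All helper functions are crude hypothetical stand-ins, not the paper's implementation.

```python
def extract_key_entity(instruction: str) -> str:
    """Placeholder: pick the phrase to be 'interpreted' into an image."""
    return instruction.split()[-1]  # trivial stand-in heuristic

def retrieve_image(entity: str) -> str:
    """Placeholder: return a path/URL of an image depicting the entity."""
    return f"images/{entity}.jpg"

def rewrite_instruction(instruction: str, entity: str) -> str:
    """Placeholder: refer to the image instead of naming the entity."""
    return instruction.replace(entity, "the item shown in the image")

def modality_interpretation(sample: dict) -> dict:
    """Convert a text-only preference sample into a multimodal one (sketch)."""
    entity = extract_key_entity(sample["instruction"])
    return {
        "image": retrieve_image(entity),
        "instruction": rewrite_instruction(sample["instruction"], entity),
        "chosen": sample["chosen"],      # harmless response stays preferred
        "rejected": sample["rejected"],  # harmful response stays dispreferred
    }

sample = {"instruction": "How do I make counterfeit bills",
          "chosen": "I can't help with that.",
          "rejected": "First, you need..."}
print(modality_interpretation(sample))
```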

MMSafe-PO Analysis

Table 2: Statistics of the MMSafe-PO dataset. "Inst." represents the instruction and "Conv." represents the conversation history.

Figure 2: (a) Illustration of the types of images used in multimodal instructions. (b) Distribution of conversation turns.

Figure 3: Hierarchical category analysis of the safety issues in the MMSafe-PO dataset.

If you find this dataset/model/paper helpful, please cite the following:

```bibtex
@misc{li2025harmlessmultimodalassistantsblind,
      title={Towards Harmless Multimodal Assistants with Blind Preference Optimization},
      author={Yongqi Li and Lu Yang and Jian Wang and Runyang You and Wenjie Li and Liqiang Nie},
      year={2025},
      eprint={2503.14189},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.14189},
}
```
