Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Because preference optimization is effective in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, a conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (by 14.5% on MM-SafetyBench and by 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach.
WARNING: This paper includes images and text that may be considered offensive.
MMSafe-PO Dataset

Table 1: Comparison of MMSafe-PO with datasets on preference optimization and the safety of MLLMs. “-” denotes inapplicable results. MMSafe-PO features multimodal instructions, the conversational format, and paired responses ranked by humans.
MMSafe-PO Construction

Figure 1: Overall pipeline for MMSafe-PO dataset construction.
We construct the MMSafe-PO dataset through modality interpretation, and an overview of this process is shown in Figure 1.
- Text-only preference data collection. Considering quantity, quality, and the harmlessness alignment objective, we select the Anthropic-HH dataset as the candidate preference data. Released by Anthropic, it is one of the few academic datasets that contain genuine human feedback. Anthropic-HH addresses both helpfulness and harmlessness objectives and covers a diverse range of instructions drawn from interactions between large language models and humans. Therefore, we can obtain high-quality multimodal preference data by transforming its text-only instructions into multimodal ones.
- Entity recognition and image matching. Our goal is to match images relevant to the instructions. To achieve this, we first recognize the entities within the instructions and then match images to these entities. Specifically, we utilize a mature named entity recognition model, bert-large-cased-finetuned-conll03-english, to identify all entities and their types (e.g., person, organization, location) within the user instructions. Subsequently, we search for images relevant to the identified entities, using the Wikipedia API to retrieve images of the entities; if this search fails, we fall back to the Google Knowledge Graph API. Ultimately, for each identified entity, we obtain the most relevant image (see the sketch after this list).
- Instruction rephrasing. Given the original textual instruction, the identified entity, and the matched image, we rewrite them into a corresponding multimodal instruction. Since LLMs have been extensively used for multimodal instruction generation, rewriting the textual instruction as the text component of a multimodal instruction is well within their capabilities. Specifically, we use Qwen-VL-Chat to rewrite the textual instructions; a prompting sketch is provided after this list.
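To make the entity-recognition and image-matching step concrete, the following Python sketch pairs the NER checkpoint named above (hosted on the Hugging Face Hub as dbmdz/bert-large-cased-finetuned-conll03-english) with the MediaWiki pageimages endpoint. The exact API parameters and the fallback handling are our assumptions rather than the authors' implementation, and the Google Knowledge Graph fallback is omitted.

```python
import requests
from transformers import pipeline

# Named entity recognition with the checkpoint mentioned in the text.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

def extract_entities(instruction: str):
    """Return (entity text, entity type) pairs, e.g. ('Louvre', 'LOC')."""
    return [(e["word"], e["entity_group"]) for e in ner(instruction)]

def wikipedia_image(entity: str):
    """Fetch the lead image URL of the entity's Wikipedia page, if one exists."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": entity,
            "prop": "pageimages",
            "piprop": "original",
            "redirects": 1,
            "format": "json",
        },
        timeout=10,
    )
    for page in resp.json().get("query", {}).get("pages", {}).values():
        if "original" in page:
            return page["original"]["source"]
    return None  # a full pipeline would fall back to the Google Knowledge Graph API here

for entity, entity_type in extract_entities("What time does the Louvre open?"):
    print(entity, entity_type, wikipedia_image(entity))
```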
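And below is a minimal sketch of the instruction-rephrasing step using Qwen-VL-Chat through its `chat` interface; the prompt wording is our assumption, not the authors' exact rewriting template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

def rephrase(instruction: str, entity: str, image_path: str) -> str:
    """Rewrite a text-only instruction so the entity is referenced via the image."""
    # Hypothetical prompt; the exact template is not released with the paper.
    prompt = (
        f"Rewrite the instruction so that it refers to '{entity}' as the entity "
        f"shown in the attached image instead of naming it explicitly, while "
        f"keeping the original intent unchanged.\nInstruction: {instruction}"
    )
    query = tokenizer.from_list_format([
        {"image": image_path},
        {"text": prompt},
    ])
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response
```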
MMSafe-PO Analysis



- Statistical analysis. MMSafe-PO comprises 5,667 multimodal instructions, each containing a text query and a corresponding image; for each instruction, there is a chosen response and a rejected response (see the illustrative record layout after this list). Approximately 50.49% of the instructions include chat history. Note that we adhere to the original train and test split of the Anthropic-HH dataset rather than creating a new split, which avoids potential data leakage because the LLM backbone of the MLLM may have been trained on the Anthropic-HH training set. The dataset statistics are summarized in Table 2.
- Multimodal instruction analysis. On average, a multimodal instruction contains about 23.51 tokens without the chat history, and the input length increases to 145.13 tokens when the chat history is concatenated. This extended length highlights the challenge for multimodal assistants in understanding the instructions. Additionally, we roughly categorize the identified entities into types such as people, organizations, and locations to illustrate the kinds of images included in the multimodal instructions, as shown in Figure 2.
- Hierarchical category analysis. Since the multimodal instructions pertain to safety issues, it is necessary to analyze and categorize them. Following the work [1], we establish a hierarchical classification system with three levels. The first level includes categories such as “Representation & Toxicity Harms,” “Malicious Use,” “Information & Safety Harms,” “Misinformation Harms,” “Human Autonomy & Integrity Harms,” and “Socioeconomic Harms.” There are approximately 15 categories at the second level and 50 categories at the third level. We visualize the distribution of categories in Figure 3, which shows that MMSafe-PO covers a wide range of safety categories with a diverse distribution.
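For concreteness, the sketch below shows an assumed record layout for a single MMSafe-PO example; the field names are hypothetical and simply mirror the components described in the statistical analysis above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MMSafePOExample:
    """Assumed layout of one MMSafe-PO preference example (field names hypothetical)."""
    image: str         # path or URL of the matched entity image
    instruction: str   # rewritten text component of the multimodal instruction
    chosen: str        # human-preferred (harmless) response
    rejected: str      # dispreferred response
    chat_history: List[str] = field(default_factory=list)  # present for ~50% of examples
```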
If you find this dataset/model/paper helpful, please cite the following:
@misc{li2025harmlessmultimodalassistantsblind,
  title={Towards Harmless Multimodal Assistants with Blind Preference Optimization},
  author={Yongqi Li and Lu Yang and Jian Wang and Runyang You and Wenjie Li and Liqiang Nie},
  year={2025},
  eprint={2503.14189},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.14189},
}