Supercharging MLLMs and LVLMs
Multi-modal Robustness benchmark (MMR) and Text-relevant Visual Token Selection (TVTS) developed for a better, open video AI.
Multimodal Large Language Models (MLLMs) demonstrate impressive abilities in understanding visual content and generating detailed responses. However, they often struggle when faced with misleading or leading questions, which can cause them to provide incorrect answers despite their comprehension. To address this issue, we introduced the Multi-modal Robustness Benchmark (MMR), designed to evaluate MLLMs using paired positive and misleading questions across categories like character, attribute, and context. The results revealed vulnerabilities in MLLMs when handling adversarial prompts, highlighting the need for improved robustness. By developing a new training set with paired positive and negative visual question-answer samples, we significantly enhanced the models' ability to handle misleading questions.
Our MMR benchmark evaluates both the understanding of visual content and the model’s resilience against misleading prompts. Questions were designed to test different aspects of visual comprehension, from character recognition to more complex, context-based inquiries. To measure this, we introduced metrics like the Misleading Rate (MR) and Robustness Accuracy (RA), providing a detailed understanding of model performance. With our new training approach, which focuses on paired positive and negative samples, we improved MLLMs’ robustness, allowing them to maintain accuracy and reliability even when facing challenging adversarial questions.
In addition to enhancing robustness, we tackled the issue of hallucinations in Large Vision-Language Models (LVLMs), where models generate details not present in the visual input. This often stems from attention collapse, where the model focuses disproportionately on a small subset of tokens. To prevent this, we developed the Text-relevant Visual Token Selection strategy, which distributes attention more evenly across relevant visual tokens. This strategy effectively reduced hallucinations by ensuring that models stay focused on actual visual content, leading to more accurate and faithful outputs compared to earlier models like LLaVA-1.5.
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual understanding and reasoning, which are crucial for tasks like autoregressive video generation. However, despite their ability to comprehend visual content, MLLMs often provide incorrect answers when faced with leading or misleading questions. This vulnerability indicates a lack of robustness to adversarial prompts, which can undermine the effectiveness of models in applications that require precise interpretation of visual data. Addressing this challenge is essential for improving the reliability of MLLMs in complex multimodal tasks.
To comprehensively measure MLLMs' understanding capability and robustness to leading questions, we introduce the Multi-modal Robustness benchmark (MMR). MMR contains paired positive and negative questions across 12 categories, meticulously annotated by humans. We evaluate 18 leading MLLMs on the MMR benchmark, revealing that MLLMs are fragile to leading questions despite understanding the visual content. To enhance MLLMs' understanding capability and robustness, we further present a training set with paired positive and negative visual question-answer samples. Experiments verify that MLLMs' robustness can be significantly enhanced by tuning on this new training set.
Benchmark Design. As shown in Figure 7, we design paired positive and negative questions for each sample. Each question has four brief options, with only one correct answer. The positive questions assess the model's ability to understand the visual content correctly, while the negative questions evaluate the model's robustness in answering correctly despite misleading prompts.
To achieve a comprehensive evaluation, we manually construct 300 positive and 300 leading negative questions across three levels: character, attribute, and context. Character-level questions ask the model to identify elements such as characters or numbers, while attribute-level questions focus on properties such as color, texture, and quantity. Context-level questions probe higher-level concepts such as emotions, culture, and common sense. The positive questions evaluate the model's understanding ability, while the misleading ones challenge its resistance to interference: character-level misleading questions alter elements like characters or numbers, attribute-level ones confuse properties, and context-level ones introduce complex concepts, thoroughly testing the model's resilience against misleading information.
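To make the pairing concrete, here is a minimal illustration of how a single MMR item could be represented; the field names and the example question are hypothetical, not the benchmark's actual schema:

```python
# Hypothetical layout of one MMR item: a positive question and its paired
# misleading (negative) question about the same image. The correct answer
# is the same for both; only the phrasing tries to lead the model astray.
mmr_item = {
    "image": "images/0001.jpg",
    "level": "attribute",  # one of: "character", "attribute", "context"
    "positive": {
        "question": "What color is the ball in the image?",
        "options": ["Red", "Blue", "Green", "Yellow"],
        "answer": "Blue",
    },
    "negative": {
        # Leading question that presupposes a wrong attribute.
        "question": "Since the ball in the image is red, which option matches its color?",
        "options": ["Red", "Blue", "Green", "Yellow"],
        "answer": "Blue",
    },
}
```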
Evaluation metrics. To quantitatively evaluate whether an MLLM comprehends visual content and how robust it is against misdirection, we developed metrics grounded in Bayesian conditional probabilities. Specifically, we categorize the results into four groups: 1) Understanding and Robustness (UR), 2) Understanding but Fragility (UF), 3) not Understanding and Rigorous (NR), and 4) not Understanding and Fragility (NF). To quantify the effect of negative questions in misleading the model despite its comprehension of visual content, we introduce the Misleading Rate (MR). Additionally, we introduce the Robustness Accuracy (RA) to evaluate the actual understanding capability of MLLMs. The formulation is as follows:

MR = N_UF / (N_UR + N_UF),    RA = N_UR / (N_UR + N_UF + N_NR + N_NF),

where N_i represents the number of samples in group i. By combining these two evaluation metrics, our benchmark not only reflects the model's basic understanding capabilities but also reveals its robustness to misleading prompts.
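As a sanity check on the definitions above, a small sketch of how the two metrics could be computed from the four per-sample outcomes (the helper name is ours; the counts are illustrative):

```python
def mmr_metrics(n_ur: int, n_uf: int, n_nr: int, n_nf: int):
    """Compute Misleading Rate (MR) and Robustness Accuracy (RA).

    n_ur: understands and robust (positive and negative both answered correctly)
    n_uf: understands but fragile (positive correct, negative wrong)
    n_nr: not understanding but rigorous
    n_nf: not understanding and fragile
    """
    total = n_ur + n_uf + n_nr + n_nf
    # MR: among samples the model understands, the fraction it is misled on.
    mr = n_uf / (n_ur + n_uf) if (n_ur + n_uf) else 0.0
    # RA: fraction of all samples the model both understands and resists.
    ra = n_ur / total if total else 0.0
    return mr, ra


print(mmr_metrics(180, 60, 40, 20))  # illustrative counts -> (0.25, 0.6)
```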
To enhance MLLMs' understanding capability and robustness, we propose a data construction pipeline that utilizes GPT-4V to generate paired positive and negative samples for instruction tuning, as shown in Figure 8. To construct more accurate data, we first implicitly extract the information within the image during the prompt design and then generate paired positive and negative samples.
Information extraction. Previous approaches that generate instruction-tuning data for multimodal conversations with GPT-4V primarily fall into two categories: direct generation and annotation-driven generation. However, these methods fail to fully utilize the fine-grained details in the image. In contrast, our data construction pipeline adopts a different approach by implicitly and comprehensively extracting this detailed information directly from the image. This includes 1) text or numbers (if present), 2) objects and people, 3) object attributes (e.g., colors, textures, locations) and human characteristics (e.g., postures, dress), 4) interrelationships between people and objects, 5) events or activities, and 6) the overall feeling and understanding evoked by the image. The detailed prompts are in the Appendix.
Instruction tuning data generation. We generate positive samples using extracted information and construct negative samples that directly contradict the positive ones. This ensures a strong contrast for more effective model training, providing richer context and paired positive and negative samples. Detailed prompts are in the Appendix.
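A hedged sketch of the two-stage pipeline shape; `query_gpt4v` is a placeholder for whatever GPT-4V client is used, and the prompt wording is illustrative (the real prompts are in the Appendix):

```python
# Stage 1: implicitly extract fine-grained information from the image.
# Stage 2: generate a positive sample, then a negative sample that directly
# contradicts it. `query_gpt4v(image_path, prompt) -> str` is assumed.
EXTRACTION_PROMPT = (
    "Describe: 1) any text or numbers, 2) objects and people, 3) their "
    "attributes (colors, textures, locations, postures, dress), "
    "4) relationships between people and objects, 5) events or activities, "
    "6) the overall feeling the image evokes."
)


def build_paired_samples(image_path, query_gpt4v):
    info = query_gpt4v(image_path, EXTRACTION_PROMPT)
    positive = query_gpt4v(
        image_path,
        "Using this information:\n" + info +
        "\nWrite a question with four brief options and mark the correct answer.",
    )
    negative = query_gpt4v(
        image_path,
        "Using this information:\n" + info +
        "\nRewrite the question so its wording presupposes something that "
        "contradicts the correct answer, keeping the same options:\n" + positive,
    )
    return positive, negative
```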
Framework. The structure consists of a visual encoder, a visual-language connector, and a language model. Specifically, we employ SigLIP [71] as the visual encoder, Phi-2 (2.7B) [42] as the language model, and a two-layer MLP as the connector. We build the model based on Bunny [19].
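For orientation, a minimal sketch of the connector that sits between the two pre-trained parts; the hidden sizes are illustrative (roughly SigLIP and Phi-2 scale), not the exact Bunny configuration:

```python
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Two-layer MLP that maps visual features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2560):
        # 1152 ~ SigLIP feature width, 2560 ~ Phi-2 hidden size (illustrative).
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features):
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)
```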
Training. Given an image and text, the image is processed through the visual encoder to obtain visual features. These features are then projected by the connector to match the dimensions of the language model. The text is tokenized to produce textual features. The visual and textual features are concatenated and fed into the language model to generate responses. During training, each sample comprises an instruction and a response. The instruction tokens are masked so that the loss is computed only between the model's output and the response tokens.
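A sketch of the instruction masking described above, assuming a Hugging Face style causal LM where label positions set to -100 are ignored by the loss:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label positions with this value do not contribute to the loss


def build_labels(input_ids: torch.Tensor, instruction_len: int) -> torch.Tensor:
    """Mask the instruction tokens so only the response is supervised."""
    labels = input_ids.clone()
    labels[:, :instruction_len] = IGNORE_INDEX
    return labels


def lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Standard next-token prediction: position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```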
Evaluation benchmark. We evaluate the performance of different MLLMs on the MMR benchmark and on general benchmarks. We use the commonly adopted general benchmarks: MME perception and MME cognition [13], MMBench test and dev splits [39], SEED-Bench-1 (SEED) [32], MMMU validation and test splits [70], VQA-v2 test-dev split [15], GQA test-dev-balanced split [24], and the average F1-score across the random, popular, and adversarial categories on the MSCOCO validation set (POPE) [34]. This comprehensive validation ensures robust evaluation across diverse metrics and scenarios.
We compare the generated dataset with previous GPT-4V [44] generated instruction-tuning datasets, such as LLaVA 158k, SVIT 158k, and LRV, on the proposed MMR benchmark and the general benchmarks (as shown in Table 4). The results demonstrate that our dataset enables the model to outperform models trained with the previous datasets on both the MMR and general benchmarks.
Our work above explored the intricate challenges MLLMs face in accurately interpreting visual content and generating precise responses to misleading questions. We revealed that MLLMs may provide inaccurate answers despite demonstrating a nuanced understanding of visual content. To quantitatively assess this phenomenon, we introduced MMR, a comprehensive evaluation framework tailored to measure MLLMs' comprehension of visual content and their resilience to negative questions. Additionally, we proposed a data construction pipeline and introduced high-quality fine-tuning data to enhance the robustness and understanding abilities of MLLMs. Our research underscores the critical need for improved evaluation methodologies and data strategies to advance MLLM capabilities in real-world scenarios.
In LVLMs, hallucinations can often be traced to a phenomenon we term attention collapse. This occurs when the attention mechanism focuses excessively on a small subset of tokens, neglecting other relevant tokens. In this context, we define the anchor token as the dominant token that accumulates most of the attention, effectively guiding the model's behavior during the decoding process.
Formally, let A denote the n × n attention weight matrix in an LVLM, where n is the number of tokens in the multimodal input sequence. The element A_{ij} represents the attention weight from token i to token j, computed from the token embeddings e_i and e_j. As the model progresses through the layers, certain tokens (i.e., anchor tokens) begin to dominate the attention. This can be quantified through the eigenspectrum of the matrix A: a concentration of attention around a few tokens results in low variance in the eigenspectrum, indicative of attention collapse.
The anchor token is identified as the token corresponding to the largest eigenvalue in the eigenspectrum of the attention matrix. The small variance in the eigenvalues suggests that most attention is being aggregated toward a small subset of tokens, leading to hallucinations. In multimodal tasks, these anchor tokens often correspond to text tokens rather than visual tokens, exacerbating the disconnect between the generated output and the input image.
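One way to probe a layer's attention matrix for the collapse described above; this is a diagnostic sketch under our own assumptions (the anchor token is read off the leading left eigenvector, one plausible operationalization of "the token corresponding to the largest eigenvalue"):

```python
import numpy as np


def attention_collapse_stats(attn: np.ndarray):
    """Inspect one head's (seq_len x seq_len) row-stochastic attention matrix.

    Returns the variance of the eigenvalue magnitudes (low variance suggests
    the spectrum is dominated by a few directions, i.e. attention collapse)
    and a candidate anchor-token index.
    """
    eigvals, eigvecs = np.linalg.eig(attn.T)  # left eigen-structure of attn
    magnitudes = np.abs(eigvals)
    spectrum_variance = float(np.var(magnitudes))
    lead = int(magnitudes.argmax())
    # The leading left eigenvector describes where attention accumulates;
    # its largest component is taken as the anchor token.
    anchor_token = int(np.abs(eigvecs[:, lead]).argmax())
    return spectrum_variance, anchor_token
```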
To address this, we propose the Text-relevant Visual Token Selection strategy, which aims to prevent attention collapse by suppressing redundant visual tokens early in the decoding process. Let V = {v_1, v_2, ..., v_n} represent the set of visual tokens and T = {t_1, t_2, ..., t_m} represent the set of text tokens. We compute a relevance score R(v_i, t_j) for each visual token v_i with respect to each text token t_j via an attention mechanism over their embeddings, where f(v_i) and g(t_j) are the respective embeddings of the visual and text tokens.
The set of selected visual tokens, V_selected, is then obtained by filtering out visual tokens whose relevance scores fall below a threshold.
This selection process helps maintain attention diversity by ensuring that the attention is distributed across a broader set of visual tokens. By reducing the influence of anchor tokens, we minimize the risk of hallucination during text-image generation.
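A sketch of the selection rule under stated assumptions: f(v_i) and g(t_j) are already projected to a shared dimension, the relevance score is taken as an attention-style scaled dot product normalized over visual tokens, and tau is a tunable threshold (not a value from the paper):

```python
import torch


def select_visual_tokens(visual_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         tau: float = 0.1) -> torch.Tensor:
    """Keep visual tokens whose best text-relevance score reaches tau.

    visual_emb: (n, d) embeddings f(v_i); text_emb: (m, d) embeddings g(t_j).
    Returns the indices of the selected visual tokens.
    """
    d = visual_emb.size(-1)
    # Attention-style relevance R(v_i, t_j): scaled dot product, normalized
    # over visual tokens for each text token (the exact form is an assumption).
    scores = visual_emb @ text_emb.T / d ** 0.5   # (n, m)
    relevance = torch.softmax(scores, dim=0)      # normalize over v_i
    best = relevance.max(dim=1).values            # best score per visual token
    keep = (best >= tau).nonzero(as_tuple=True)[0]
    return keep
```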
To validate our approach, we compare it against the LLaVA-1.5 model using an image of a girl on a tennis court (Figure 9). The description generated by LLaVA-1.5 exhibits slight hallucination, adding details such as a fence and a bench that are not present in the image. In contrast, our model generates a description that is more faithful to the actual content of the image, avoiding the hallucinated details.
Our model's description is more faithful to the actual scene, avoiding hallucinations such as the "fence" and "bench" mentioned in LLaVA-1.5's description. This is due to our token selection strategy, which prevents attention collapse by distributing attention across relevant visual tokens.
The Text-relevant Visual Token Selection strategy effectively mitigates multimodal hallucinations in LVLMs by ensuring that attention is distributed across diverse and relevant visual tokens, rather than collapsing onto a small set of anchor tokens. Through eigenspectrum analysis, we have shown that attention collapse, characterized by low variance in the eigenspectrum, can lead to hallucinations. Our empirical comparison demonstrates that our method generates more accurate descriptions, free of hallucinated details, compared to existing models like LLaVA-1.5.
Conclusion

We build upon the advancements in Multimodal Large Language Models (MLLMs) and address critical challenges, focusing in particular on improving robustness and mitigating hallucinations. Specifically, we introduce the Multi-modal Robustness benchmark (MMR), which evaluates MLLM performance against misleading and leading questions. By enhancing model training with paired positive and negative visual question-answer samples, we significantly improve MLLM resilience to adversarial inputs. Furthermore, we delve into mitigating hallucinations in Large Vision-Language Models (LVLMs), proposing the Text-relevant Visual Token Selection strategy to combat attention collapse. This approach distributes attention across relevant visual tokens, reducing over-reliance on anchor tokens and ensuring that the generated content aligns more closely with the visual input.