SceMQA

A Scientific College Entrance Level Multimodal Question Answering Benchmark

Zhenwen Liang†,1, Kehan Guo1, Gang Liu1, Taicheng Guo1, Yujun Zhou1,
Tianyu Yang1, Jiajun Jiao2, Jipeng Zhang3, Renjie Pi3, Xiangliang Zhang†,1

1University of Notre Dame,
2New York University,
3Hong Kong University of Science and Technology

†Corresponding to: zliang6@nd.edu, xzhang33@nd.edu

🔔News

🔥[2024-01-25]: Our website is now available online! 😆

Introduction

The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level. It addresses a critical educational phase often overlooked in existing benchmarks, spanning high school to pre-college levels. SceMQA focuses on core science subjects, including Mathematics, Physics, Chemistry, and Biology. It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities. Additionally, our benchmark provides the specific knowledge points for each problem and a detailed explanation for each answer. SceMQA also uniquely presents problems with identical contexts but varied questions, facilitating a more thorough and accurate assessment of reasoning capabilities. In our experiments, we evaluate two leading Multimodal Large Language Models (MLLMs), GPT4-V and Gemini Pro, across various experimental settings. The results show that further development of MLLM capabilities is still needed, as even these strong baselines achieve only 50% to 60% accuracy.

SceMQA Benchmark

Overview

Our benchmark is strategically designed to bridge a significant gap in existing multimodal benchmarks, which typically span from elementary to college levels while overlooking the crucial high school and pre-college stages. This educational phase is a pivotal part of the human learning process. Although some existing benchmarks (Zhong et al., 2023; Zhang et al., 2023a) incorporate problems at this level, they predominantly feature text-only questions. Our benchmark stands out as the first college entrance level multimodal benchmark within the research community, offering a more comprehensive assessment tool. Figure 3 demonstrates the intermediate difficulty of our benchmark, positioned between existing benchmarks at the primary and college levels. Following previous studies (Hendrycks et al., 2021a; Lewkowycz et al., 2022), we transform all mathematical expressions into LaTeX code, making them easy for LLMs to process.

Figure: Example problems in SceMQA, covering four scientific subjects (math, physics, chemistry, and biology) in two formats (multiple choice and free response).


Figure: SceMQA contains multiple questions sharing the same context.


Science subjects

Focusing on core science subjects (mathematics, physics, biology, and chemistry), our benchmark aligns with both existing text-only benchmarks, such as SciBench (Wang et al., 2023), and major human exams such as the Gaokao (the Chinese national college entrance exam). To solve these problems, AI models must demonstrate a robust understanding of images, tables, and diagrams, coupled with deep domain knowledge to recall the formulae, theorems, and other elements required for advanced reasoning. This presents a suitable challenge for current AI systems, testing their limits in areas typically reserved for advanced human cognition.

Solution Explanation

We have meticulously annotated every problem in SceMQA. Almost all solutions (>90%) are accompanied by detailed, human-verified explanations, which are crucial for identifying errors in model predictions, as shown in Figure 2. These explanations could also be instrumental for future supervised fine-tuning (SFT) (Ho et al., 2022; Hsieh et al., 2023) and few-shot prompting methodologies (Wei et al., 2022).

Identified Knowledge Category

Additionally, each problem is categorized into specific knowledge components within its subject, also shown in Figure 2. This classification aids in building a knowledge state for the evaluated models, facilitating knowledge tracing and understanding the depth of the model’s capabilities.

Question Variation

Furthermore, our benchmark features multiple questions based on the same image and context, as shown in Figure 4. Solving such question sets has been shown to be challenging for AI models (Liang and Zhang, 2021), which often fail to capture the small variations in the question part (Patel et al., 2021). This setting not only tests the depth of understanding and reasoning capabilities of these models (Patel et al., 2021; Yang et al., 2022) but also has the potential to support advancements in Socratic learning (Shridhar et al., 2022) and interpretable reasoning (Zhang et al., 2021).

In terms of scale, our benchmark comprises a significant volume of problems, ensuring a thorough evaluation across all included subjects. The total number of problems in SceMQA is 1,045, with an average of 261 problems per subject.
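For illustration, the sketch below shows what a single SceMQA-style record could look like once the math is rendered in LaTeX and each problem carries its knowledge point, explanation, and a link to its shared context. The field names (subject, context_id, knowledge_point, etc.) and the sample problem are hypothetical stand-ins, not the released schema.

# A hypothetical SceMQA-style record. Field names and content are illustrative
# only; consult the official release for the actual data format.
example_problem = {
    "subject": "Physics",                 # one of Math / Physics / Chemistry / Biology
    "format": "multiple_choice",          # or "free_response"
    "context_id": "phys-042",             # shared by questions built on the same image/context
    "image": "phys-042.png",              # accompanying diagram, table, or plot
    "question": "A block slides down a frictionless incline of angle $30^\\circ$. "
                "What is the magnitude of its acceleration?",
    "options": ["(A) $4.9\\,\\mathrm{m/s^2}$", "(B) $9.8\\,\\mathrm{m/s^2}$",
                "(C) $19.6\\,\\mathrm{m/s^2}$", "(D) $2.45\\,\\mathrm{m/s^2}$"],
    "answer": "A",
    "knowledge_point": "Newton's second law on an incline",
    "explanation": "Along the incline, $a = g\\sin\\theta = 9.8 \\times 0.5 = 4.9\\,\\mathrm{m/s^2}$, so (A).",
}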

Comparisons with Existing Benchmarks

To underscore the distinctiveness of SceMQA in the landscape of AI benchmarks, we highlight its specific features in the figures below. SceMQA is strategically positioned to address a critical gap in the educational levels targeted by existing benchmarks. While benchmarks like ScienceQA focus on elementary and middle-school levels, and MMMU targets college-level complexities, SceMQA is uniquely designed for the high school, or college entrance, level. This level, epitomized by significant examinations such as the AP tests in the U.S. and China's Gaokao, is crucial in human learning but has been underrepresented in AI benchmarks.

In terms of difficulty level, SceMQA offers a meticulously calibrated challenge that is neither too elementary nor excessively advanced, filling a vital niche in the progression of AI training and evaluation. Moreover, SceMQA stands out in its commitment to explanation quality. Unlike many benchmarks that provide limited or no explanations for their problems, a significant percentage of SceMQA's problems are accompanied by detailed, human-verified explanations. This ensures not only an assessment of the AI's answer accuracy but also its reasoning process, a critical aspect in the advancement of AI understanding and interpretability.

The figures below visually represent these aspects, illustrating the unique positioning of SceMQA in terms of difficulty level and the proportion of problems with explained solutions. This comparative analysis highlights how SceMQA fills a crucial educational and developmental niche, offering a more nuanced and relevant challenge for advancing AI capabilities in multimodal reasoning, particularly in the science domain.

Figure: Comparison between SceMQA and other existing benchmarks. The y-axis is the percentage of problems that have detailed solution explanations.

Figure: A comparative overview of various benchmarks. The first column indicates the problem types in each benchmark, with 'MC' representing multiple choice and 'FR' free response. The second column shows the average number of problems per subject. The third column describes the problem modality, where 'I' stands for image-based and 'T' for text-based problems. The fourth column categorizes benchmarks by whether over 90% of problems are annotated with solution explanations. The final column presents the difficulty level. All superior and unique features of our benchmark are highlighted.


Experiment Results

Leaderboard

In our evaluation, we focus on two of the most representative Multimodal Large Language Models (MLLMs) currently available: GPT4-V and Gemini Pro. We tested these models under three distinct settings: zero-shot, few-shot, and text-only. In the zero-shot setting, the models are given the problem without any prior examples. The few-shot setting provides the models with a small number of example problems and solutions to 'learn' from before attempting the new problem; we use hand-crafted text-only problems as the examples, since inserting multiple images into a single API call is not always supported. The text-only setting is a zero-shot variant in which only the textual content of the problem is provided, without the accompanying image. All the prompts used in our experiments, along with detailed descriptions of each setting, are publicly available for replication in our GitHub repository.
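As a rough sketch of how the three settings differ, the Python snippet below assembles a prompt for each case. The helper query_mllm is a hypothetical stand-in for an actual API call, and the prompt wording is illustrative rather than the exact prompts released in our repository.

from typing import List, Optional

def build_prompt(question: str, options: Optional[List[str]] = None,
                 few_shot_examples: Optional[List[str]] = None) -> str:
    """Assemble the textual part of the prompt (wording is illustrative only)."""
    parts = []
    if few_shot_examples:                 # few-shot: prepend worked text-only examples
        parts.extend(few_shot_examples)
    parts.append("Question: " + question)
    if options:                           # multiple choice: list the options
        parts.append("Options: " + " ".join(options))
    parts.append("Give the option letter (or the final value for free response).")
    return "\n".join(parts)

def query_mllm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical model call; replace with the API of the MLLM being evaluated."""
    raise NotImplementedError

# Zero-shot: problem text + image, no examples.
# Few-shot:  text-only worked examples are prepended, then the problem + image.
# Text-only: the same prompt as zero-shot, but the image is omitted.
prompt = build_prompt("What is the slope of the line shown in the figure?",
                      options=["(A) 1", "(B) 2", "(C) 3", "(D) 4"])
# zero_shot_answer = query_mllm(prompt, image_path="problem_123.png")
# text_only_answer = query_mllm(prompt, image_path=None)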

For the evaluation metric, we use exact-match accuracy, consistent with several prior studies in this domain (Lu et al., 2023; Yue et al., 2023a). This metric is particularly suitable for our benchmark because both the multiple-choice and free-response problems have definitive, singular correct answers. In the multiple-choice format, this means selecting the correct option among the presented choices. In the free-response format, it requires generating an accurate and precise answer, be it a numerical value, a yes/no response, or a specific term for a fill-in-the-blank question. Empirically, we use rule-based answer extraction for multiple-choice questions and GPT4 as the evaluator for free-response questions.
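Below is a minimal sketch of what rule-based extraction plus exact matching could look like for the multiple-choice questions. The regular expressions and fallback rules are assumptions for illustration, not the exact rules used in our evaluation, and the GPT4-based grading of free-response answers is not shown.

import re

def extract_choice(model_output: str) -> str:
    """Pull the final option letter (A-E) out of a model response (illustrative rules)."""
    # Prefer explicit phrasings such as "the answer is (B)" or "Answer: B".
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-E])\)?", model_output, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to the last standalone option letter in the output.
    letters = re.findall(r"\b([A-E])\b", model_output.upper())
    return letters[-1] if letters else ""

def exact_match_accuracy(predictions, references) -> float:
    """Exact-match accuracy over extracted option letters."""
    correct = sum(extract_choice(p) == r.upper() for p, r in zip(predictions, references))
    return correct / len(references)

# Example usage with made-up outputs:
preds = ["Let's compute the slope... so the answer is (B).", "C"]
gold = ["B", "C"]
print(exact_match_accuracy(preds, gold))  # 1.0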

MC = multiple choice, FR = free response; all values are accuracy (%).

Open-sourced models

Model | MC Math | MC Physics | MC Chemistry | MC Biology | MC Overall | FR Math | FR Physics | FR Chemistry | FR Biology | FR Overall
InstructBLIP-7B | 16.98 | 21.86 | 20.30 | 22.75 | 20.48 | 6.00 | 6.00 | 0.00 | 38.00 | 12.50
InstructBLIP-13B | 19.34 | 19.53 | 17.33 | 28.91 | 21.31 | 8.00 | 12.00 | 4.00 | 30.00 | 13.50
MiniGPT4-7B | 18.87 | 20.93 | 25.25 | 22.75 | 21.90 | 4.00 | 0.00 | 2.00 | 20.00 | 6.50
MiniGPT4-13B | 27.39 | 20.93 | 27.23 | 35.55 | 27.74 | 2.00 | 4.00 | 8.00 | 14.00 | 7.00
LLaVA1.5-7B | 25.94 | 25.12 | 21.78 | 36.97 | 27.50 | 10.00 | 4.00 | 2.00 | 26.00 | 10.50
LLaVA1.5-13B | 31.13 | 28.37 | 26.24 | 38.86 | 31.19 | 12.00 | 4.00 | 4.00 | 32.00 | 13.00
Yi-VL-6B | 43.87 | 26.98 | 28.79 | 48.37 | 37.14 | 2.00 | 2.00 | 2.00 | 16.00 | 5.50
Deepseek-VL-Chat-7B | 24.53 | 21.86 | 26.26 | 34.42 | 26.79 | 6.00 | 10.00 | 6.00 | 34.00 | 14.00
InternLM-XComposer2-7B | 29.25 | 26.98 | 31.82 | 33.95 | 30.48 | 8.00 | 4.00 | 10.00 | 30.00 | 13.00
Qwen-VL-chat | 25.47 | 23.72 | 22.22 | 34.42 | 26.55 | 4.00 | 0.00 | 0.00 | 24.00 | 7.00

Closed-sourced models

Model | Setting | MC Math | MC Physics | MC Chemistry | MC Biology | MC Overall | FR Math | FR Physics | FR Chemistry | FR Biology | FR Overall
Google Bard | Text-only | 43.40 | 40.93 | 24.75 | 54.88 | 41.31 | - | - | - | - | -
Gemini Pro | Text-only | 21.70 | 19.53 | 32.51 | 46.51 | 30.06 | 8.00 | 6.00 | 8.00 | 38.00 | 15.00
Gemini Pro | Few-shot | 36.79 | 30.23 | 37.44 | 48.84 | 38.34 | 18.00 | 12.00 | 12.00 | 36.00 | 19.50
Gemini Pro | Zero-shot | 37.26 | 30.70 | 42.36 | 54.42 | 41.18 | 20.00 | 12.00 | 18.00 | 36.00 | 21.50
GPT4-V | Text-only | 35.38 | 47.91 | 58.13 | 63.72 | 51.24 | 12.00 | 24.00 | 28.00 | 22.00 | 21.50
GPT4-V | Few-shot | 54.72 | 53.95 | 58.62 | 67.44 | 58.70 | 30.00 | 24.00 | 30.00 | 48.00 | 33.00
GPT4-V | Zero-shot | 55.19 | 55.81 | 60.10 | 72.09 | 60.83 | 36.00 | 24.00 | 36.00 | 48.00 | 36.00

Accuracy of open-sourced and closed-sourced MLLMs, including GPT4-V and Gemini Pro under different settings, on Multiple Choice and Free Response problems in SceMQA.

Error Analysis

To delve deeper into the shortcomings of state-of-the-art Multimodal Large Language Models (MLLMs), we conducted a comprehensive error analysis. We randomly selected 150 instances of errors made by GPT4-V on the SceMQA dataset and enlisted two human experts for a detailed examination. These experts categorized each error into one of six categories: Image Perception Errors, Reasoning Errors, Lack of Knowledge, Rejection to Answer, Annotation Error, and Answer Extraction Error. The inter-rater reliability, assessed using the Kappa agreement score, was greater than 0.5, indicating a moderate level of agreement between the annotators. We then averaged their annotations to determine the proportion of each error type, as depicted in the pie chart below.
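For reference, agreement of this kind can be computed with Cohen's kappa, e.g., via scikit-learn; the annotator labels below are made-up stand-ins rather than our actual annotations.

from sklearn.metrics import cohen_kappa_score

CATEGORIES = ["perception", "reasoning", "knowledge", "rejection", "annotation", "extraction"]

# Hypothetical labels assigned by the two expert annotators to the same error cases.
expert_1 = ["reasoning", "perception", "reasoning", "knowledge", "reasoning", "rejection"]
expert_2 = ["reasoning", "perception", "knowledge", "knowledge", "reasoning", "rejection"]

kappa = cohen_kappa_score(expert_1, expert_2, labels=CATEGORIES)
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.5 are usually read as moderate agreement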

Figure: Error distribution on GPT4-V across 100 samples.

The top-3 error types are shown and analyzed below.

Reasoning Error

The most prevalent error type is the Reasoning Error. It occurs when the model correctly processes the image-based information but fails to construct an accurate reasoning chain to arrive at the correct answer. Common mistakes include omitting necessary steps or making incorrect calculations. These errors underscore the need for further development of the reasoning abilities of MLLMs. Drawing on insights from studies on LLMs, approaches such as prompt engineering (Wei et al., 2022) or supervised fine-tuning (Yu et al., 2023; Yue et al., 2023b) might prove beneficial.

Image Perception Error

This occurs when the model misinterprets visual information, such as incorrectly reading numbers or coordinates, or failing to differentiate between points in a geometric diagram. The frequency of this error suggests that the image perception capabilities of current MLLMs require significant enhancement in precision and interpretation. Incorporating external tools such as OCR, as suggested in studies like Liu et al. (2023), could improve the model's understanding of visual content.

Lack of Knowledge

Lack of Knowledge errors arise when the model fails to correctly identify or apply relevant knowledge concepts, such as misusing formulas or misinterpreting theorems. These errors are indicative of gaps in the model’s learned knowledge base. Enriching the training datasets of foundation models with diverse and domain-specific knowledge is essential to enhance their expertise in scientific domains.

Rejection to Answer and Annotation Error

Interestingly, a smaller yet significant portion of errors were categorized as Rejection to Answer and Annotation Error. Rejection to Answer occurs when the model refuses to provide an answer, possibly due to uncertainty or inability to comprehend the query. Annotation Error, on the other hand, arises from inaccuracies or inconsistencies in the dataset’s annotations, leading to confusion for the model. These categories highlight the importance of robust dataset design and the need for models to handle ambiguous or complex scenarios effectively.

Through this detailed error analysis, we have identified specific patterns and weaknesses in MLLM performance on scientific problems. These findings provide valuable insights and directions for future research aimed at enhancing the capabilities of MLLMs. Addressing these identified issues could lead to significant improvements in the application of MLLMs in educational and research contexts, particularly in the domain of science.

Error Examples

Figure: Error examples.

BibTeX


      @article{liang2024scemqa,
        title={SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark},
        author={Liang, Zhenwen and Guo, Kehan and Liu, Gang and Guo, Taicheng and Zhou, Yujun and Yang, Tianyu and Jiao, Jiajun and Zhang, Jipeng and Pi, Renjie and Zhang, Xiangliang},
        journal={arXiv preprint},
        year={2024},
      }