Introduction
With the enormous advances taking place in the field of large language models (LLMs), models that can process multimodal input have recently moved to the forefront of the field. These models take both text and images as input, and sometimes other modalities as well, such as video or audio.
Multimodal models present unique evaluation challenges. In this blog post, we will look at some multimodal datasets that can be used to evaluate the performance of such models, focusing primarily on visual question answering (VQA), where a question must be answered using information from an image.
The landscape of multimodal datasets is broad and constantly growing, with benchmarks focusing on different perception and reasoning capabilities, data sources, and applications. The list of datasets here is by no means exhaustive. We will briefly describe the key characteristics of ten multimodal datasets and benchmarks and highlight some key trends in the space.
Multimodal data sets
TextVQA
There are different types of vision-and-language tasks on which a generalist multimodal language model can be evaluated. One such task is optical character recognition (OCR) and answering questions based on the text present in an image. A dataset that evaluates these skills is TextVQA, published in 2019 by Singh et al.
Two TextVQA examples (Singh et al., 2019)
Since the dataset focuses on text present in images, many images are of billboards, whiteboards, or traffic signs. In total, there are 28,408 images from the OpenImages dataset and 45,336 questions associated with them, requiring reading and reasoning about the text in the images. For each question, there are 10 ground-truth answers provided by human annotators.
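Multiple reference answers are typically scored with a soft-accuracy rule: a prediction earns credit in proportion to how many annotators gave the same answer. The sketch below illustrates the commonly used VQA-style formula; the helper name and the string normalization are our own simplifications, not TextVQA's official evaluation code.

```python
# Illustrative sketch of VQA-style soft accuracy against 10 human answers:
# a predicted answer scores min(matches / 3, 1), so agreeing with at least
# three annotators earns full credit. This is a simplification, not the
# official TextVQA evaluation script.
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    prediction = prediction.strip().lower()
    matches = sum(ans.strip().lower() == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Hypothetical annotations: 4 annotators wrote "stop", 2 wrote "stop sign".
answers = ["stop", "stop", "stop sign", "stop", "stop sign",
           "stop", "halt", "stop ahead", "red sign", "octagon"]
print(vqa_soft_accuracy("stop", answers))       # 1.0
print(vqa_soft_accuracy("stop sign", answers))  # 0.666...
```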
DocVQA
Similar to TextVQA, DocVQA is about reasoning over text in an image, but it is more specialized: in DocVQA, the images are documents containing elements such as tables, forms, and lists, and they come from sources such as the chemical and fossil fuel industries. There are 12,767 images from 6,071 documents and 50,000 questions associated with these images. The authors also provide a random split of the data into train (80%), validation (10%), and test (10%) sets.
Examples of DocVQA question and answer pairs (Mathew et al., 2020)
OCRBench
The two datasets mentioned above are far from the only ones available for OCR-related tasks. If a comprehensive evaluation of a model is desired, it can be expensive and time-consuming to run the evaluation on all available test data. For this reason, samples from several related datasets are sometimes combined into a single benchmark that is smaller than the combination of all the individual datasets yet more diverse than any single-source dataset.
For OCR-related tasks, one such benchmark is OCRBench by Liu et al. It consists of 1,000 manually verified question-answer pairs drawn from 18 datasets (including TextVQA and DocVQA described above). The benchmark covers five main tasks: text recognition, scene text-focused VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.
Examples of text recognition tasks (a), handwritten mathematical expression recognition (b), and scene text-focused VQA tasks (c) in OCRBench (Liu et al., 2023)
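As a rough illustration of this kind of curation, the sketch below pools a fixed number of verified examples from several source datasets into one compact benchmark. The dataset names, per-source quotas, and record fields are placeholders for illustration; this is not OCRBench's actual construction procedure.

```python
import random

# Hypothetical sketch: pool a fixed number of verified QA records from several
# source datasets into one compact combined benchmark. Names, quotas, and
# record fields are placeholders, not OCRBench's actual recipe.
def build_combined_benchmark(sources: dict[str, list[dict]],
                             per_source: int,
                             seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    benchmark = []
    for name, examples in sources.items():
        picked = rng.sample(examples, min(per_source, len(examples)))
        benchmark.extend({"source": name, **ex} for ex in picked)
    rng.shuffle(benchmark)  # avoid grouping items by source dataset
    return benchmark

# Toy usage with two placeholder sources of question-answer records.
sources = {
    "TextVQA": [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)],
    "DocVQA":  [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)],
}
combined = build_combined_benchmark(sources, per_source=50)
print(len(combined))  # 100 examples drawn evenly from both sources
```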
MathVista
Compilations of multiple datasets also exist for other specialized task sets. For example, MathVista by Lu et al. focuses on mathematical reasoning. It includes 6,141 examples from 31 multimodal datasets involving mathematical tasks (28 previously existing datasets and 3 newly created ones).
Examples of annotated data sets for MathVista (Lu et al., 2023)
The dataset is divided into two splits: testmini (1,000 examples) for evaluation with limited resources, and test (the remaining 5,141 examples). To combat model overfitting, the answers for the test split are not made public.
LogicVista
Another relatively specialized ability that can be assessed in multimodal LLMs is logical reasoning. One dataset that aims to do this is the recently published LogicVista by Xiao et al. It contains 448 multiple-choice questions covering 5 logical reasoning tasks and 9 capabilities. The examples are collected from authoritative intelligence test sources and annotated. Two examples from the dataset are shown in the image below.
LogicVista Dataset Examples (Xiao et al., 2024)
RealWorldQA
Unlike narrowly defined tasks, such as those involving OCR or mathematics, some datasets cover broader, less constrained goals and domains. For example, RealWorldQA is a dataset of over 700 real-world images, with one question for each image. Although most images were taken from vehicles and depict driving situations, some show more general scenes with multiple objects in them. The questions are of different types: some have multiple-choice options, while others are open-ended, with instructions such as “Please answer directly with a single word or number.”
Examples of image, question, and answer combinations from RealWorldQA
MMBench
In a situation where different models compete for the best scores on fixed benchmarks, overfitting of models to benchmarks becomes a concern. When a model overfits, it shows very good results on a given dataset even though this strong performance does not generalize well to other data. To combat this, there is a recent trend of publishing only the questions of a benchmark, but not the answers. For example, the MMBench dataset is divided into dev and test subsets; the dev subset is published along with the answers, while the test answers are withheld. The dataset consists of 3,217 image-based multiple-choice questions covering 20 fine-grained skills, which the authors group into perception (e.g., object localization, image quality) and reasoning (e.g., future prediction, social relation).
Results of eight vision-language models on the 20 skills defined in MMBench test, as evaluated by Liu et al. (2023)
An interesting feature of the dataset is that, unlike most other datasets where all questions are in English, MMBench is bilingual: the English questions are additionally translated into Chinese (the translations are done automatically using GPT-4 and then verified).
To check the consistency of model performance and reduce the chance of a model answering correctly by accident, the MMBench authors ask the models the same question multiple times with the order of the multiple-choice options reordered.
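A simplified sketch of such a consistency check is shown below: the same question is posed several times with the options reordered, and it only counts as correct if the model picks the right underlying answer every time. The `ask_model` callable is a placeholder for whatever model API is being evaluated, and the random shuffling here stands in for MMBench's own systematic reordering of options.

```python
import random

# Simplified consistency check in the spirit of MMBench's repeated questioning:
# the same question is asked n_rounds times with the options reordered, and it
# only counts as correct if the model picks the correct option every time.
# `ask_model` is a placeholder callable expected to return a letter like "A".
def consistent_correct(ask_model, question: str, options: list[str],
                       correct_option: str, n_rounds: int = 4,
                       seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(n_rounds):
        shuffled = options[:]
        rng.shuffle(shuffled)
        labels = "ABCD"[:len(shuffled)]
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, shuffled)
        )
        choice = ask_model(prompt).strip().upper()[:1]
        # A wrong, inconsistent, or unparseable answer fails the question.
        if choice not in labels or shuffled[labels.index(choice)] != correct_option:
            return False
    return True
```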
MME
Another benchmark for the comprehensive assessment of multimodal skills is MME by Fu et al. This dataset covers 14 subtasks related to perception and cognition abilities. Some images in MME come from existing datasets, while others are novel and were taken manually by the authors. MME differs from most of the datasets described here in the way its questions are posed: all questions require a “yes” or “no” answer. To evaluate the models more rigorously, two questions are designed for each image, such that the answer to one is “yes” and to the other “no”, and a model must answer both correctly to earn a point for that image. The dataset is intended for academic research purposes only.
Examples of the MME benchmark (Fu et al., 2023)
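As a rough illustration of this stricter scoring, the sketch below computes both per-question accuracy and the per-image score that requires both paired questions to be answered correctly. The record layout is made up for the example; it is not MME's evaluation code.

```python
# Rough illustration of MME-style paired scoring: each image has two yes/no
# questions (one whose answer is "yes" and one whose answer is "no"), and an
# image earns a point only if the model answers both correctly.
# The record layout below is invented for the example, not MME's own format.
results = [
    # (image_id, model_answer, ground_truth)
    ("img_001", "yes", "yes"),
    ("img_001", "no",  "no"),
    ("img_002", "yes", "yes"),
    ("img_002", "yes", "no"),
]

per_image: dict[str, list[bool]] = {}
for image_id, predicted, expected in results:
    per_image.setdefault(image_id, []).append(predicted == expected)

question_accuracy = sum(ok for oks in per_image.values() for ok in oks) / len(results)
image_points = sum(all(oks) for oks in per_image.values()) / len(per_image)

print(question_accuracy)  # 0.75: three of the four questions answered correctly
print(image_points)       # 0.5: only img_001 has both of its questions correct
```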
MMMU
While most of the datasets described above evaluate multimodal models on tasks that most humans could perform, some datasets focus on specialized expert knowledge. One such benchmark is MMMU by Yue et al.
Questions in MMMU require university-level subject knowledge and cover 6 main disciplines: art and design, business, science, health and medicine, humanities and social sciences, and technology and engineering. In total, there are more than 11,000 questions drawn from textbooks, quizzes, and university exams. Image types include diagrams, maps, chemical structures, and more.
MMMU examples from two disciplines (Yue et al., 2023)
TVQA
The benchmarks mentioned so far incorporate two types of data: text and images. Although this combination is the most widespread, it is worth noting that more modalities, such as video and audio, are being incorporated into large multimodal models. As an example of a multimodal dataset that includes video, we can look at the TVQA dataset by Lei et al., which was created in 2018. In this dataset, questions are asked about video clips between 60 and 90 seconds long from six popular TV shows. Some questions can be answered using only the subtitles or only the video, while others require using both modalities.
Examples of TVQA (Lei et al., 2018)
Multimodal inputs at Clarifai
With the Clarifai platform, you can easily process multimodal inputs. In this example notebook, you can see how the Gemini Pro Vision model can be used to answer questions based on images from the RealWorldQA benchmark.
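For reference, below is a minimal sketch of such a call with the Clarifai Python SDK. The prompt, image URL, and inference parameters are illustrative placeholders, and the linked notebook remains the authoritative walkthrough; a personal access token is expected in the CLARIFAI_PAT environment variable.

```python
# Minimal sketch of a multimodal prediction with the Clarifai Python SDK
# (pip install clarifai). A personal access token is read from the
# CLARIFAI_PAT environment variable. The prompt and image URL below are
# illustrative stand-ins for a RealWorldQA-style question.
from clarifai.client.input import Inputs
from clarifai.client.model import Model

prompt = "How many lanes does the road have? Please answer directly with a single number."
image_url = "https://example.com/realworldqa_sample.jpg"  # placeholder image URL

# Model URL as listed among Clarifai's community models; verify in the notebook.
model = Model("https://clarifai.com/gcp/generate/models/gemini-pro-vision")
prediction = model.predict(
    inputs=[Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)],
    inference_params={"temperature": 0.2, "max_tokens": 100},
)
print(prediction.outputs[0].data.text.raw)
```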
Key Trends in Multimodal Assessment Benchmarks
We’ve noticed some trends related to multimodal benchmarks:
- While in the era of smaller models specialized for a particular task, a dataset would typically include both training and test data (e.g., TextVQA), with the increasing popularity of generalist models pre-trained on large amounts of data, we see more and more datasets intended solely for model evaluation.
- As the number of available datasets grows and models become increasingly larger and more resource-intensive to evaluate, there is a trend toward creating curated collections of samples from multiple datasets for smaller-scale but more comprehensive evaluation.
- For some datasets, the answers, or in some cases even the questions, are not made public. This is intended to combat overfitting of models to specific benchmarks, where good scores on a benchmark do not necessarily indicate generally strong performance.
Conclusion
In this blog post, we briefly described some datasets that can be used to evaluate the multimodal capabilities of vision-language models. It should be noted that many other existing benchmarks were not mentioned here. The variety of benchmarks is very wide: some datasets focus on a narrow task, such as OCR or mathematics, while others aim to be more comprehensive and reflect the real world; some require general knowledge and others highly specialized expertise; questions may call for yes/no, multiple-choice, or open-ended answers.