MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

Introduction

IQ testing has long served as a foundational methodology for evaluating human cognitive capabilities. It deliberately decouples assessment from linguistic background, language proficiency, and domain-specific knowledge in order to isolate core competencies in abstraction and reasoning. Yet artificial intelligence research currently lacks systematic benchmarks that quantify these cognitive dimensions in multimodal systems. To address this gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms.

Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures perform only marginally better than random chance (27.49% vs. a 25% baseline accuracy). This gap to human performance highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
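To make the comparison with random chance concrete, the following is a minimal sketch rather than part of the MM-IQ evaluation code. It assumes that every item offers four options (consistent with the 25% baseline quoted above) and that uniform random guessing is the reference. The function names `chance_baseline` and `z_score_vs_chance` are hypothetical and introduced only for illustration.

```python
import math


def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def z_score_vs_chance(accuracy: float, n_items: int, num_options: int = 4) -> float:
    """Normal-approximation z-score of an observed accuracy against chance.

    Assumption (not from the source): items are independent and each has
    exactly num_options equally likely options under random guessing.
    """
    p0 = chance_baseline(num_options)
    standard_error = math.sqrt(p0 * (1.0 - p0) / n_items)
    return (accuracy - p0) / standard_error


# Figures quoted in the introduction: 27.49% accuracy on 2,710 items.
baseline = chance_baseline(4)
gap = 0.2749 - baseline
z = z_score_vs_chance(0.2749, 2710)
print(f"baseline={baseline:.2%}, gap={gap * 100:.2f} points, z={z:.2f}")
```

The absolute gap in percentage points is the quantity the introduction refers to; the z-score is only an optional extra that indicates how far the observed accuracy sits from chance given the test-set size.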

Leaderboard on MM-IQ

Accuracy scores on the test set (2,710 problems) of MM-IQ.

#	Model	Type	Source	Date	Mean	LO	Math	2D-G	3D-G	VI	TM	SR	CO
1	Human Performance* 🥇	-	Link	2024-09-04	51.27	61.36	45.03	60.11	47.48	46.67	55.61	36.63	65.79
2	o3 🥈	Proprietary	Link	2025-04-22	33.17	35.11	35.04	30.96	29.39	36.17	31.56	30.50	50.00
3	Claude-3-7-Sonnet-Thinking 🥉	Proprietary	Link	2025-02-19	31.55	32.57	30.87	33.69	29.62	31.91	28.43	35.59	57.89
4	Gemini-2.5-pro	Proprietary	Link	2025-03-25	31.23	33.33	32.15	30.96	28.79	36.17	27.23	34.74	42.10
5	Claude-3.5-sonnet	Proprietary	Link	2024-11-14	27.49	23.41	29.48	26.60	24.37	35.56	25.69	27.72	42.11
6	QVQ-72B-Preview	Open-Source 🖼️	Link	2024-11-22	26.94	28.91	25.59	29.23	26.38	26.67	25.43	22.77	34.21
7	GPT-4o	Proprietary	Link	2024-11-12	26.87	25.52	25.70	28.32	27.64	26.67	25.69	27.72	50.00
8	Gemini-1.5-pro-002	Proprietary	Link	2024-11-09	26.86	19.53	27.43	28.03	25.88	24.44	31.17	25.74	39.47
9	Qwen2-VL-72B-Instruct	Open-Source 🖼️	Link	2024-11-13	26.38	24.74	24.40	28.60	27.39	24.44	26.93	32.67	23.68
10	Qwen2.5-VL-7B-Instruct	Open-Source 🖼️	Link	2025-02-25	25.90	25.95	24.46	23.56	29.14	25.53	27.47	27.96	26.31
11	Deepseek-vl-7b-chat	Open-Source 🖼️	Link	2024-11-14	22.17	19.53	20.30	22.25	27.39	35.56	23.72	24.75	15.79
12	LLaVA-1.6-7B	Open-Source 🖼️	Link	2024-11-03	19.45	24.22	20.34	17.92	15.83	20.00	18.23	17.82	18.42

Human Performance*: Average human performance from annotators who have high school diplomas or above.
Reasoning Paradigm: LO: Logical Operation, 2D-G: 2D-Geometry, 3D-G: 3D-Geometry,
VI: Visual Instruction, TM: Temporal Movement, SR: Spatial Relationship, CO: Concrete Object.
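As a rough illustration of how one row of the table could be computed, the sketch below scores a list of predictions per reasoning paradigm. It is not the official scoring script. The record keys `paradigm`, `prediction`, and `answer` are hypothetical, and the "Mean" column is assumed here to be overall accuracy across all items rather than an unweighted average of the per-paradigm scores.

```python
from collections import defaultdict


def score(records):
    """Return (overall accuracy %, per-paradigm accuracy %) for a list of records.

    Each record is a dict with the hypothetical keys:
      'paradigm'   - paradigm abbreviation such as 'LO' or 'Math'
      'prediction' - the option letter chosen by the model
      'answer'     - the ground-truth option letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        paradigm = record["paradigm"]
        total[paradigm] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[paradigm] += 1
    per_paradigm = {p: 100.0 * correct[p] / total[p] for p in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, per_paradigm


# Toy records for illustration only; these are not MM-IQ data.
sample = [
    {"paradigm": "LO", "prediction": "C", "answer": "C"},
    {"paradigm": "LO", "prediction": "A", "answer": "B"},
    {"paradigm": "Math", "prediction": "d", "answer": "D"},
    {"paradigm": "CO", "prediction": "B", "answer": "D"},
]
mean, per_paradigm = score(sample)
print(mean, per_paradigm)
```

Weighting by item count means that paradigms with more problems contribute more to the overall score; an unweighted average over paradigms would instead treat each paradigm equally.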

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Examples

Prompt: Choose the most appropriate option from the given four choices to fill in the question mark, so that it presents a certain regularity:

(Ground Truth: C)

A visual example of the logical operation paradigm of MM-IQ.

Prompt 1: Choose the most appropriate option from the given four options to present a certain regularity:

Option A: 4; Option B: 5; Option C: 6; Option D: 7

(Ground Truth: D)

Prompt 2: Choose the most appropriate option from the given four choices to fill in the question mark, so that it presents a certain regularity:

(Ground Truth: A)

Two visual examples of the mathematical paradigm of MM-IQ.

Prompt: The option that best fits the given pattern of figures is ( ).

(Ground Truth: A)

A visual example of the 2D-geometry paradigm of MM-IQ.

Prompt: Choose the most appropriate option from the given four options to present a certain regularity:

(Ground Truth: B)

A visual example of the visual instruction paradigm of MM-IQ.

Prompt: Choose the most appropriate option from the given four options to present a certain regularity:

(Ground Truth: D)

A visual example of the spatial relationship paradigm of MM-IQ.

Prompt: Choose the most appropriate option from the given four choices to fill in the question mark, so that it presents a certain regularity:

(Ground Truth: D)

A visual example of the concrete object paradigm of MM-IQ.

Prompt: The one that matches the top view is:

(Ground Truth: C)

A visual example of the 3D-geometry paradigm of MM-IQ.

Prompt: Choose the most appropriate option from the given four choices to fill in the question mark, so that it presents a certain regularity:

(Ground Truth: C)

A visual example of the temporal movement paradigm of MM-IQ.

BibTeX

@article{cai2025mm,
  title={MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models},
  author={Cai, Huanqia and Yang, Yijun and Hu, Winston},
  journal={arXiv preprint arXiv:2502.00698},
  year={2025}
}
