Artificial intelligence (AI) has made significant strides in recent years, successfully handling tasks involving text and images. However, its performance in solving complex problems, especially those involving mathematical reasoning and visual context, has raised questions about its reliability. While claims have been made about AI models like GPT-4 excelling in tasks like SAT math exams, not all of these assertions have held up under scrutiny.
One recent paper was even withdrawn after its claim that GPT-4 could earn a computer science degree at MIT failed to hold up. To gain a deeper understanding of how AI models handle problem-solving, a team of researchers from the University of California, Los Angeles, the University of Washington, and Microsoft Research has developed a new testing benchmark called MathVista. This benchmark focuses on visually oriented challenges to evaluate the capabilities of both large language models (LLMs) and large multimodal models (LMMs).
Introducing MathVista: A benchmark for visual problem solving
The researchers recognized the need to systematically examine the ability of foundation models to perform mathematical reasoning in visual contexts. To address this, they introduced MathVista, a testing benchmark comprising 6,141 examples drawn from 28 existing multimodal datasets plus three new datasets named IQTest, FunctionQA, and PaperQA. MathVista spans several forms of reasoning, including algebraic, arithmetic, geometric, logical, numeric, scientific, and statistical, and covers tasks such as figure question answering, geometry problem solving, math word problems, textbook questions, and visual questions. The researchers argue that such a benchmark is essential for advancing visually grounded mathematical reasoning and for comparing how different AI models perform on these tasks.
The importance of visual problem solving for AI
Visual problem-solving skills in AI are of paramount importance, especially in applications like autonomous vehicles. Demonstrating an AI model’s ability to correctly solve visual problems is critical for ensuring trust and safety in various domains.
AI models tested in MathVista
The research team evaluated twelve foundation models: three large language models (ChatGPT, GPT-4, and Claude-2), two proprietary large multimodal models (GPT-4V and Bard), and seven open-source LMMs. As baselines, they also collected answers from human participants with at least a high school diploma, recruited through Amazon Mechanical Turk, as well as random responses.
AI models outperform random chance
The study yielded some encouraging results for AI practitioners. All the tested LLMs and LMMs performed better than random chance, which is not surprising given that many of the questions in MathVista were multiple-choice rather than yes-or-no questions. OpenAI’s GPT-4V emerged as the top performer, surpassing human performance in specific areas, particularly in questions involving algebraic reasoning and complex visual challenges, such as those related to tables and function plots.
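To make the random-chance baseline concrete, the sketch below simulates uniform guessing on multiple-choice questions. The four-option format is illustrative only, not MathVista's actual answer distribution, which mixes multiple-choice and free-form items.

```python
import random

random.seed(0)

def random_guess_accuracy(num_choices: int, trials: int = 100_000) -> float:
    """Simulate uniform random guessing on questions with `num_choices` options."""
    correct = sum(
        random.randrange(num_choices) == 0  # treat option 0 as the correct answer
        for _ in range(trials)
    )
    return correct / trials

# With four answer options, random guessing lands near 25 percent accuracy;
# free-form numeric answers would put the random baseline near zero.
print(round(random_guess_accuracy(4), 2))
```

Any model scoring meaningfully above this baseline is doing more than guessing, which is why the comparison against random responses is a useful sanity check even before comparing against humans.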
GPT-4V falls short of human performance
Despite GPT-4V’s impressive performance, it still fell short of the human participants who took the same test. The model achieved an overall accuracy of 49.9 percent, while humans scored 60.3 percent. This gap of 10.4 percentage points highlights the need for further improvements in AI models’ ability to handle complex visual and mathematical reasoning tasks.
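The human-versus-model gap is a difference in percentage points rather than a percent change; a quick check makes the distinction explicit:

```python
# Reported overall accuracies, as percentages.
gpt4v_accuracy = 49.9
human_accuracy = 60.3

# Absolute gap: a difference in percentage points.
gap_points = round(human_accuracy - gpt4v_accuracy, 1)

# Relative gap: how much higher the human score is than the model's.
relative_gap = round((human_accuracy - gpt4v_accuracy) / gpt4v_accuracy * 100, 1)

print(gap_points)    # absolute gap in percentage points
print(relative_gap)  # humans score roughly 21 percent higher in relative terms
```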
Implications for AI development
The findings from the MathVista benchmark underscore the potential and limitations of current AI models. While AI has made significant progress in various domains, there is still ample room for improvement in terms of handling complex visual and mathematical challenges. These results emphasize the importance of continued research and development to bridge the gap between AI and human problem-solving capabilities.
MathVista, the newly developed benchmark, sheds light on AI models’ performance in solving visually oriented mathematical challenges. While AI models like GPT-4V have shown promise by surpassing random chance and excelling in specific areas, they still have a significant gap to close before matching human-level performance. This research serves as a reminder of the ongoing quest to enhance AI’s problem-solving abilities, with implications for applications ranging from autonomous vehicles to healthcare and beyond. As AI continues to evolve, benchmarks like MathVista will play a crucial role in assessing and advancing its capabilities in solving complex problems.