Mathematicians have devised new problems to challenge the most advanced AI systems' reasoning capabilities -- and they failed almost every test

By Stephanie Pappas

The researchers tested six state-of-the-art AI models against the new benchmark and the best score registered by a single system was 2%. (Image credit: hh5800/Getty Images)

Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending new math problems.

These problems typically take doctorate-level mathematicians anywhere from hours to days to solve, according to the research institute Epoch AI. But in the new tests, the most advanced AI models on the market answered less than 2% of the problems correctly.

In the past decade, a number of AI tests have been developed to determine whether the answers these models return are actually correct. In many cases, AI models now breeze through these benchmarks.

For example, in the commonly used Measuring Massive Multitask Language Understanding (MMLU) benchmark test, today's AI models answer 98% of math problems correctly.

Most of these benchmarks are geared toward testing AI's ability to do high-school and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted on the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)

The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of mathematics professors, including some winners of the Fields Medal, perhaps the most prestigious prize in math. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.

"These are extremely challenging," 2006 Fields Medal winner Terence Tao, a mathematician at UCLA, wrote in a review of the problems for Epoch AI. "I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages."

The problems were also entirely new, a step taken to ensure that none of them had already appeared in the AI models' training data. When complex reasoning problems are included in the training data, an AI may appear to solve them, but in reality it already has a "cheat sheet," because it has been trained on the answers.

The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002), Anthropic's Claude 3.5 Sonnet, OpenAI's o1-preview, o1-mini and GPT-4o, and xAI's Grok-2 Beta. Gemini and Claude each solved about 2% of the problems, a slightly better showing than the roughly 1% scored by o1-preview, o1-mini and GPT-4o. Grok-2 Beta failed to get any problems right.

However, these rankings are misleading, the researchers cautioned: with success rates this low, a single correct answer can have an outsize impact on a model's overall score, so getting just one more problem right or wrong can separate a 1% result from a 2% result.

"[E]ven when a model obtained the correct answer, this does not mean that its reasoning was correct," the paper authors wrote. "For instance, on one of these problems running a few simple simulations was sufficient to make accurate guesses without any deeper mathematical understanding. However, models' low overall accuracy shows that such guessing strategies do not work on the overwhelming majority of FrontierMath problems."

The findings show that, for now, AI models do not possess research-level mathematical reasoning, the Epoch AI researchers concluded. As AI models advance, however, the benchmark will provide a way to track whether their reasoning abilities are deepening.

"By regularly evaluating state-of-the-art models and collaborating with the AI research community," the team wrote in the statement, "we aim to deepen our understanding of AI's capabilities and limitations."
