Kendra Pierre-Louis: For Scientific American’s Science Quickly, I’m Kendra Pierre-Louis, in for Rachel Feltman.
In 1997, Deep Blue, a supercomputer built by IBM, did the unexpected: it defeated chess giant Garry Kasparov at his own game, leading to a flurry of headlines about whether Deep Blue was truly intelligent and if computers could now outthink humans. The answer, at least then, was mostly no.
But it’s now 2026, and we have a growing number of generative AI models that are once again making us wonder, “Can machines outthink us?” To dig into this question, a group of researchers aren’t turning to chess this time—they’re looking to math.
To learn more about that, I talked to Joe Howlett, a staff reporter here at SciAm covering math. Thanks for joining us today, Joe.
Joe Howlett: Thank you for having me.
Pierre-Louis: So you wrote a piece that’s talking about the challenges of AI and math. Before we kinda get into the meat and potatoes of that piece, I have a—maybe a more basic question for you.
Howlett: Yeah.
Pierre-Louis: For those of us who maybe peaked with high-school algebra, when you’re talking about AI and math problems, what are the kind of math problems we’re really talking about?
Howlett: That’s actually a lot of what this story’s about: the kinds of questions that mathematicians ask and spend their time thinking about don’t really sound like, or have much in common with, the problems that we work on for homework in math class.
Pierre-Louis: Mm-hmm.
Howlett: If you’ve recently taken a math class, you’re used to problems that have answers, right?
Pierre-Louis: Mm-hmm.
Howlett: And the answer is, like, a number …
Pierre-Louis: Yep.
Howlett: Or something. And you hand in your homework, and the teacher can check that number [Laughs], if it’s the right number or the wrong number, and they give you a grade.
But what research mathematicians are doing is trying to prove that statements are either true or false about the mathematical universe. So what does that mean? Like, you know about triangles and squares and basic shapes, but there’s …
Pierre-Louis: I did graduate from kindergarten, yes. [Laughs.]
Howlett: [Laughs.] That’s right, exactly. That’s about as far as I made it, too.
There’s way more complicated shapes that exist in many dimensions and have weird curvatures that you can’t even picture in your mind. But mathematicians are able to say things about them, right? Using equations and using proofs, they’re able to learn about these objects that we can’t actually see or picture.
Pierre-Louis: So now that we kind of know what math is, in [one of your pieces] you note that LLMs have had some mathematical wins, like Google Gemini Deep Think achieved a gold-level score on the International Mathematical Olympiad and that AI has solved multiple “Erdős problems.” Why isn’t that enough to show AI’s math prowess?
Howlett: Yeah, I mean, the thing about most of these so-called “benchmarks,” as they call ’em, is that for a lot of reasons AI companies have fixated on mathematics as, like, the next thing to prove …
Pierre-Louis: Mm-hmm.
Howlett: That LLMs can think, or to take a step towards intelligence. But most of those examples, like you said, they have more in common with the kind of test questions and homework problems that we were just talking about, not really looking like …
Pierre-Louis: Mm-hmm.
Howlett: Research math, right, which is more about proving statements about the world and exploring that world, posing questions that are interesting.
So in a way all of those accomplishments are very impressive. [Laughs.] It’s crazy that a computer can win gold at the IMO …
Pierre-Louis: Mm-hmm.
Howlett: But it doesn’t say much about whether and to what extent a computer can advance mathematics, right, on its own, or even with the help of a human.
Pierre-Louis: Kind of like the difference between a really good calculator and a mathematician.
Howlett: Exactly! Yeah. In the history of mathematics, new tools have been invented time and again that have been useful for mathematicians and have accelerated things. And one of the big questions here [is]: Is this just another one of those tools, or is this gonna fundamentally revolutionize how mathematics is done at a level that we’ve never seen before? And it’s kind of too early to say.
Pierre-Louis: And one of the ways it seems that people are trying to suss out whether AI is kind of just a giant calculator or can really advance math is this First Proof challenge that was put together by a group of 11 mathematicians. Can you explain what this challenge was?
Howlett: Yeah, so these mathematicians who are, like, luminaries in their various fields of mathematics—and they cover a broad range of subfields in mathematics—they wanted to rectify this situation where we don’t really have a good sense of how good AI is at posing and solving real research math problems.
All of them have had this anecdotal experience where LLMs have gotten a lot better in just the last few months at interrogating mathematical questions kind of in the way a mathematician would and at proposing proofs and methods of proof that seem to bear out in some situations. But then they also hallucinate a lot, and they propose a lot of very confident nonsense.
So these mathematicians—who, by the way, don’t work for AI companies, right …
Pierre-Louis: Mm-hmm.
Howlett: They decided to get together and pose actual research questions that they are trying to solve for their own mathematical research, right? So each of them has papers that are coming out with proofs, and each of them took a little section of that. Proofs—the way mathematicians do proofs is they break them up into smaller theorems, right? So if you wanted to prove that seven is bigger than three, you might first prove that seven is bigger than five, and then prove that five is bigger than three, right? And that’s kind of how mathematicians work. And these smaller proofs are called “lemmas.”
What these mathematicians did is they each took from an upcoming paper a lemma that they proved as part of their bigger proof and picked it out of that paper, posed it as a problem for an LLM and did all of this before uploading that paper to any online place so that it’s not in the training data of the LLMs, right?
Pierre-Louis: Mm-hmm.
Howlett: ’Cause any math problem that I could pose an LLM has probably been posed before and probably an answer exists on the Internet. So these are real cutting-edge research questions, and if an LLM can solve them, then it would be, like, substantially able to contribute to the practice of doing math.
Pierre-Louis: So what are the early results from running this kind of challenge?
Howlett: Yeah, so for this first round, different AI companies, using their best models and a lot of mathematicians on staff, tried their hand at the problems, and we can’t really see the process that they put into place. We can’t see, in some cases, their full transcript with the chatbots.
Pierre-Louis: Mm-hmm.
Howlett: We don’t know to what extent they consulted with human mathematicians.
And as one of the First Proof team [members], Lauren Williams, said to me, once there’s humans involved in the process at all, it becomes really hard to say how much the humans are doing and how much the AI is doing. So the team really wanted this originally to just be, like: you ask an AI the question; see if it answers the question.
So before the challenge the team tested the questions on publicly available chatbots. And the chatbots were able to answer two out of these 10 questions, which is impressive, but to some extent it shows that this is a real, difficult challenge that we’re giving to the AIs.
This tiny corner of the Internet that only I pay attention to went really crazy trying to solve these problems. It shows that there’s this growing online community of, like, mathematicians and kind of math enthusiasts, who maybe aren’t research mathematicians, who are trying to use LLMs to do pure mathematics. And this community really tried their hand at these problems and produced a lot of proofs, posted on social media and Discord servers.
The First Proof team posed these questions, uploaded the answers in an encrypted form and told the community that they would decrypt them in one week. So they gave the world a week to try to answer as many of the questions as they could. And this online community went crazy trying to do so and produced a lot of proofs. A lot of them, from my reporting, were instantly, clearly garbage. Mathematicians who I talked to said, “Yeah, most of these proofs are nonsense.” But some of them had some promise.
So OpenAI initially claimed that it had solutions to six of the problems. Pretty quickly a mathematician found a problem with one of those, so it was down to five. The rest of those seem to have held up, so OpenAI seems to have gotten five correct with its unknown process. Google Gemini also released its results and did similarly well: it got six out of 10 correct. And some of those were different ones than OpenAI’s.
The active online community and some research mathematicians who were trying their hand got a couple of questions as well: questions nine and 10, which the researchers said were answerable by AI; other people produced those answers, too.
There’s a few things that were striking to me about these results. One is that there was this huge discrepancy between what people with publicly available models can do and these in-house efforts of these giant companies, right? It’s a big difference between getting one or two correct and getting six correct.
The other thing is that people aren’t using one LLM; they’re using what they call a “scaffold.” So they’ll have an LLM, and then they’ll have a bunch of other LLMs systematically interrogate its answer and go back and forth with it, right? This is allowed—it’s not a human in the loop—but it’s a bunch of AIs all talking to each other in some way. And it seems like this is a way to boost the performance of these LLMs. They do much better at sussing out some of the nonsense and producing a real proof.
Pierre-Louis: There was a quote in [one of the pieces] that I thought was interesting: it said that when the LLMs got the correct answers, they were using almost, like, 19th-century-style math. And I was wondering about that quote and, like, what does 19th-century-style math mean.
Howlett: Yeah, this is a really important point. AI seems to, at least right now, do math a little differently and in a way that’s a little less impressive to at least some of the mathematicians. In many cases the AI will produce a proof that gets to the same conclusion as the mathematician’s proof …
Pierre-Louis: Mm-hmm.
Howlett: That was decrypted that Friday, but the AI does it in a much more circuitous, roundabout way and with a lot of brute force, in a way that isn’t as aesthetically pleasing to mathematicians.
Mathematicians sometimes, when they describe what they’re doing, they sound more like artists than scientists, right? They really like to have what they call a “beautiful” proof, something that when you read it, you really understand why that statement at the end must be the case.
Pierre-Louis: Mm-hmm.
Howlett: And AI tends to produce these proofs where every step makes sense and you get to the end and you see the statement, so you believe it, but you don’t see the whole picture. And maybe the AI never saw the whole picture.
Pierre-Louis: Where do you think it goes from here?
Howlett: One of the researchers, Mohammed Abouzaid, said this thing about 19th-century mathematics because when mathematicians prove something, they’ll often do it by coming up with some new mathematical concept that distills the truth and is easier to work with than anything that existed before.
Pierre-Louis: Mm-hmm.
Howlett: So this is an abstract object, like a tesseract. AIs don’t seem to prefer that approach. They’re very happy to work with existing tools and just assemble them in new MacGyver-y ways, but it’s not clear that that will lead to new discoveries. A lot of times those tools that mathematicians invent along the way to a proof give them a deeper understanding of the mathematical universe and lead to more results. So at this point at least, it’s not clear if AI is capable of that kind of creative style of mathematics.
But there’s counterexamples: there’s at least one other proof, on one of the servers where people are discussing these results, that multiple mathematicians reviewed and said was not only correct but quite beautiful; it accomplished the proof in a way that they never would’ve thought of.
So it’s not clear that this is something that is always gonna be the case about AI. Maybe it just needs to keep getting better.
Pierre-Louis: That’s interesting and a little bit creepy, I think. [Laughs.]
Howlett: [Laughs.] The next round is gonna tell us a lot more. The First Proof team is working with AI companies to establish controls on the way that they do the questions.
Pierre-Louis: Mm-hmm.
Howlett: So whatever answers we get, we won’t have to take with so much of a grain of salt. And that will really tell us where the models are at and whether these in-house systems are actually much better than what’s on the public market. And also, the fact that we now have this system of iterated rounds, we can see the LLMs evolve over time.
So where does this go from here? I don’t know. There’s mathematicians who will tell you that mathematics will never be the same, that AI will be solving some of the biggest problems in mathematics in the next few years. And there’s mathematicians who I talk to who were even convinced …
Pierre-Louis: Mm-hmm.
Howlett: By this First Proof first round that timeline is going faster than they thought prior.
Pierre-Louis: What I’m hearing is that [The] Terminator was a documentary.
Howlett: [Laughs.] Yeah, about the future, I guess. Yeah.
Pierre-Louis: [Laughs.]
Howlett: There’s also plenty of mathematicians who will tell you that AI can never do what humans do in math, which is direct curiosity in new directions, and that the best it can ever be is a tool mathematicians use, just like a calculator.
I have trouble not being bummed out when I imagine a future where AI is solving the big problems in math—like, isn’t part of the excitement that humans solve the problems? But multiple mathematicians have pushed back on that.
Pierre-Louis: Mm-hmm.
Howlett: They’ll say, no, they just wanna know things about the mathematical universe. They don’t care whether an AI tells them or they do.
One mathematician used this example, this thought experiment from a [Jorge Luis] Borges story, “The Library of Babel.” So he’s saying, “Imagine a world where we could just have access to any mathematical truth—we had a giant library that contained all the proofs you could ever have.” And his point was that any mathematician he knows would be ecstatic to be in that library and would get right to work trying to understand things. The point is that the job of a mathematician isn’t going anywhere; it’s maybe an exciting time for mathematicians.
For me it’s hard imagining a future where I won’t have the human side of the story. Definitely, like, reporting on a big math proof …
Pierre-Louis: Mm-hmm.
Howlett: Will be less exciting if I don’t hear about the person who was stuck late at night at her desk, like, struggling through a problem, beating her head against the wall until she had that, like, moment of illumination. And also collaboration, like, the stories of mathematicians meeting up at conferences and having that key discussion over coffee that leads to, like, a fundamental breakthrough. So I hope humans stay in the loop. [Laughs.]
Pierre-Louis: I do, too, for what it’s worth.
Howlett: [Laughs.]
Pierre-Louis: Thank you so much for taking the time to speak with us today.
Howlett: Thanks so much for having me, Kendra.
Pierre-Louis: That’s it for today! See you on Friday, when we explore the science of pain.
Science Quickly is produced by me, Kendra Pierre-Louis, along with Fonda Mwangi, Sushmita Pathak and Jeff DelViscio. This episode was edited by Alex Sugiura. Shayna Posses and Aaron Shattuck fact-check our show. Our theme music was composed by Dominic Smith. Subscribe to Scientific American for more up-to-date and in-depth science news.
For Scientific American, this is Kendra Pierre-Louis. See you next time!