
April 14, 2025
Tanay Wakhare

Evaluating GPT o1-pro with PhD Math

We give several examples of GPT o1-pro solving complex mathematical tasks and demonstrating PhD-level ability.

In this post, we will give a variety of examples of mathematical problems and examine o1-pro's performance on each. I would rate its overall mathematical ability as similar to that of an early-to-mid-stage PhD student. We see that o1-pro still struggles with more abstract or complex problems and makes errors when given no guidance. However, when given a rough proof strategy or feedback on wrong approaches, it is able to perform useful mathematical tasks.

Throughout this post, I would like to again draw a distinction between calculation ability and intuition. Calculation ability refers to the ability of a model to solve a given equation or word problem. Intuition is a much fuzzier concept regarding whether a model can actually find the correct proof strategy, argue that some proof strategies would be easier or more natural than others, and understand which mathematical questions are interesting to study. The hallmark of a research mathematician is that they've built good intuition for problems, after which calculation ability is a secondary concern. 

First, we will tackle calculation ability. Overall, o1-pro's calculations are error-prone, with a tendency toward subtle mistakes that must be carefully double-checked.

Is 9.9 greater than 9.11?

This depends on what you’re comparing:

As decimal numbers: 9.9 is equivalent to 9.90 and 9.11 is equivalent to 9.110. When comparing 9.90 and 9.110 numerically, 9.90 is greater because 0.90 > 0.110. So yes, as pure decimals, 9.9 is larger than 9.11.

As version numbers (or dotted notation): Often in software or other “versioned” contexts, “9.11” is read as “major version 9, minor version 11,” while “9.9” is “major version 9, minor version 9.” In that case, 9.11 would be a later (and thus “greater”) version than 9.9, because 11 > 9 in the minor version.

Short answer: If you are simply looking at them as decimal values, 9.9 is larger than 9.11. If you’re talking about software (or similar) version labels, then version 9.11 typically follows (and is considered “higher than”) version 9.9.

This is a famous prompt that has caused errors across multiple GPT versions. After thinking for over 3 minutes, o1-pro outputs a “maybe” answer, which is still an improvement. If o1-pro were not given sufficient context, there is a high likelihood that it would output that 9.9 is less than 9.11, as happened with previous GPT versions. A simple arithmetic error like this would render any subsequent calculations in a mathematical proof completely incorrect.
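As a small aside, the two readings the model distinguishes can be checked in a couple of lines of Python. This snippet is my own illustration rather than part of the model's output, with floats standing in for the decimal comparison and (major, minor) tuples standing in for the version comparison:

```python
# As decimal numbers: 9.9 > 9.11.
print(9.9 > 9.11)  # True

# As version labels: compare (major, minor) components, so 9.11 comes after 9.9.
print((9, 11) > (9, 9))  # True
```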

However, more complex calculations can sometimes be performed perfectly. We’ve included the chain of thought below, which leads to the correct answer of \( (2.5)^{10} = 9536.743\ldots \)

Roll a biased dice which lands on 1 with probability 0.5 and 2, 3, 4, 5, 6 each with probability 0.1. Then multiply the value of 10 biased dice rolls. What is the expected value?

Assessing dice rolls: I’m working through 10 biased dice rolls, each with P(X=i) values. Assuming they roll independently, I’m calculating the expected product of the 10 rolls. The formula \( E(\prod X_i) = \prod E(X_i) \) applies to independent positive variables.

Confirming formula: OK, let's see. The formula \( E(\prod X_i) = \prod E(X_i) \) works for i.i.d. positive variables. Calculated \( E(X) \) as 2.5 based on the die distribution, then applied \( [E(X)]^{10} \) for the expected product.
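As a quick sanity check on this chain of thought, here is a minimal Python sketch (my own, not from the model's output) that computes \( E(X) = 2.5 \) directly from the biased-die distribution and compares \( E(X)^{10} \) against a Monte Carlo estimate of the expected product:

```python
import math
import random

# Biased die: 1 with probability 0.5, each of 2-6 with probability 0.1.
values = [1, 2, 3, 4, 5, 6]
probs = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

# Expected value of a single roll: 0.5*1 + 0.1*(2+3+4+5+6) = 2.5.
ev_single = sum(v * p for v, p in zip(values, probs))

# By independence, E[prod X_i] = prod E[X_i] = 2.5 ** 10.
ev_product = ev_single ** 10
print(ev_single, ev_product)  # 2.5 9536.7431640625

# Monte Carlo check: the estimate is noisy (single products range up to 6**10),
# but it should land within a percent or two of 9536.74.
rng = random.Random(0)
trials = 200_000
total = sum(math.prod(rng.choices(values, weights=probs, k=10)) for _ in range(trials))
print(total / trials)
```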

Now let’s consider intuition. Many undergraduate students see questions about Fourier and Laplace transforms, which are useful for solving certain differential equations. However, classes rarely teach the underlying intuition for why they are useful, or when you would use a Fourier transform instead of a Laplace transform. This is an area where o1-pro excels: it can provide intuition that many students do not have.

What is the intuition behind Fourier and Laplace transforms? What types of situations should I apply them in? When would I use a Fourier transform instead of a Laplace transform?

Fourier Transform: Main Ideas and Intuition

Decomposing Signals into “Pure Frequencies”

The Fourier transform takes a signal \( x(t) \) (often defined for all real time \( t \in (-\infty, \infty) \)) and decomposes it into a sum (or integral) of sinusoids (sine and cosine waves). Equivalently, you can think of it as representing \( x(t) \) in terms of complex exponentials \( e^{i\omega t} \). In simpler terms:

1. The Fourier transform answers the question: “What frequencies (and with what strengths) does my signal contain?”
2. It is most natural when your signal is stationary or spans (at least in principle) all time.

Laplace Transform: Main Ideas and Intuition

Handling Growth, Decay, and Causality

The Laplace transform also uses exponentials, but crucially they are of the form \( e^{-st} \) with \( s = \sigma + i\omega \), and it usually applies for \( t \geq 0 \). This makes it well suited for causal systems (systems that “turn on” at \( t = 0 \) and evolve thereafter).

1. The Laplace transform answers the question: “How does my function (often a transient or causal signal) decompose into modes of exponential growth or decay and oscillations?”
2. It is geared toward analyzing initial value problems: how systems respond to inputs and initial conditions after time zero.

When to Use Fourier vs. Laplace?

1. Fourier Transform:
   * You want to see the frequency spectrum of a signal over the whole real line.
   * Your primary concern is periodic or stationary behavior, or you are interested in the oscillatory/steady-state aspects.
   * You don’t particularly care about the start-up transient of the signal or initial/boundary conditions in time.
2. Laplace Transform:
   * You have a system that “turns on” at \( t = 0 \), or you are analyzing a causal system with known initial conditions.
   * You want to solve initial value problems for ODEs or PDEs quickly and systematically.
   * You want to analyze the stability of a system (poles in the right half of the s-plane often indicate instability).
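To make the Fourier half of this answer concrete, here is a small NumPy sketch (my own illustration, not part of the model's response) that answers “what frequencies does my signal contain?” for a noisy two-tone signal; the sample rate and tone frequencies are arbitrary choices:

```python
import numpy as np

# Sample 2 seconds of a signal containing 5 Hz and 12 Hz components plus noise.
fs = 200.0                       # sampling rate in Hz (arbitrary choice)
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
x += 0.2 * np.random.default_rng(0).standard_normal(t.size)

# Discrete Fourier transform of the real signal and the matching frequency grid.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)

# The two largest spectral peaks should sit near 5 Hz and 12 Hz.
top = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(top)  # approximately [ 5. 12.]
```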

If we probe o1-pro further, it can expand on any of these points with useful examples and deeper dives. Even after years of working with Fourier transforms, I learned several interesting facts while writing this post.

For many graduate-level problems, o1-pro has built genuine intuition, even when it struggles to perform basic calculations. This can lead NLP practitioners to systematically underestimate its capabilities, since evaluation benchmarks score only the correctness of the final answer and disregard the underlying intuition, which is difficult to quantify.
