We'll explore the capabilities of o1-pro, OpenAI's $200/month offering.
Is $200 a month for OpenAI's premium o1-pro model worth it? For many standard use cases, no. But for mathematics, the answer is a resounding yes. In fact, Sam Altman tweeted that OpenAI is losing money on o1-pro subscriptions: the people who subscribe tend to be power users who use it far more than OpenAI expected. These models are approaching the level of math PhDs and can provide legitimate research assistance.
Throughout, we'll refer to RLMs (Reasoning Language Models). These are large language models (LLMs) that have been specially trained to reason accurately about quantitative problems such as mathematics. Currently, the leading RLMs are OpenAI's o1-pro and DeepSeek-R1.
The first area of mathematics where AI achieved superhuman performance was geometry, via DeepMind's AlphaGeometry model. Geometry starts with a very small, concrete set of rules and builds up more complex geometric figures about which we need to prove things. Thanks to the simplicity of the underlying rule set and the somewhat formulaic nature of many of the proofs, geometry turned out to be a great area in which to apply AI techniques.
Another area where AI seems to be improving rapidly is combinatorics. In general, combinatorics and graph theory consider simple objects that we can count, so the problems are very concrete. For example, if asked to count the number of graphs on 54 vertices which have no isolated vertices, we can draw out examples on a few small vertex counts and try to discern a pattern.
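To illustrate how concrete such questions are: the count of labeled graphs with no isolated vertices has a standard inclusion-exclusion formula (summing over which vertices are forced to be isolated), which is the kind of pattern one hopes to discover from small cases. As a sketch, not something from the discussion above:

```python
from math import comb

def graphs_no_isolated(n: int) -> int:
    """Count labeled simple graphs on n vertices with no isolated vertex.

    Inclusion-exclusion over the set of k vertices forced to be isolated:
    the remaining n - k vertices may be connected freely, giving
    2^C(n-k, 2) graphs for each choice of the k isolated vertices.
    """
    return sum((-1) ** k * comb(n, k) * 2 ** comb(n - k, 2)
               for k in range(n + 1))

# Small cases that could be checked by hand-drawing graphs:
# n = 2: only the single edge works  -> 1
# n = 3: triangle plus the 3 two-edge paths -> 4
print(graphs_no_isolated(2))   # 1
print(graphs_no_isolated(3))   # 4
print(graphs_no_isolated(54))  # a huge but exactly computable integer
```

Drawing small cases by hand and matching them against a formula like this is exactly the concrete, verifiable workflow that makes combinatorics a good testbed for RLMs.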
As a rule of thumb, the more mathematical abstraction and background a problem requires, the worse RLMs perform. It's best to give RLMs problems where all of the necessary definitions can be specified without assuming background knowledge. Combinatorics is an ideal test case: the objects under study are concrete and easy to define, no additional abstract mathematics is needed to understand them, and the questions are usually very easy to state. RLMs also perform better when given equations formatted in LaTeX, the standard markup language for mathematical notation.
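As a hypothetical example of a fully self-contained prompt, the graph-counting question above could be stated in LaTeX with every definition spelled out, so the model needs no outside background:

```latex
Let $G$ be a simple graph on a vertex set $V$ with $|V| = 54$.
A vertex $v \in V$ is \emph{isolated} if it has degree zero,
i.e.\ $\deg(v) = 0$. How many labeled graphs on these $54$
vertices have no isolated vertices?
```

Stating the problem this way, with the definition of "isolated" inlined rather than assumed, plays to the strengths of current RLMs.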
One controversial stance is that the NLP community is systematically underestimating the mathematical capabilities of RLMs. The reason is that most evaluation benchmarks focus on quantitative measures of accuracy: each math problem has a single correct answer, which can be easily verified. However, for a research mathematician, calculation ability matters less than intuition, because much of mathematical research is figuring out which statements to prove and which proof strategy to use. If o1 can supply these proof strategies, leaving a human to fill in the details, it still significantly streamlines the research workflow.
Let's expand on the distinction between calculation ability and intuition in mathematics. Calculation ability is the ability to solve an equation or a mathematical word problem. Intuition, on the other hand, is a much fuzzier concept. It can mean knowledge of overall proof strategies, heuristics that let us rule out proof strategies that won't work, or a sense of which problems are interesting to study. Often, a researcher knows that a calculation will work out before performing it, and that knowledge is intuition.