OpenAI’s “First Proof” Sprint: How Close Are We to AI‑Generated Mathematics That Holds Up to Peer Review?

When I was a kid I used to stare at the back of my high‑school algebra textbook, wondering whether a computer could ever prove a theorem the way a human does—step by step, with a few false starts, a dash of intuition, and the occasional “aha!” moment. Fast‑forward three decades, and the question has stopped being a sci‑fi curiosity and is now landing in the inboxes of mathematicians worldwide.

OpenAI just dropped a hefty PDF (≈ 45 MB) that contains its first full‑blown attempts at the First Proof challenge—a set of ten research‑level math problems designed to test whether a language model can produce checkable proofs in highly specialized domains. The paper is titled “First Proof Attempts” and is openly available at https://cdn.openai.com/pdf/26177a73-3b75-4828-8c91-e8f1cf27aaa0/oai_first_proof.pdf. In this post I’ll walk through what the challenge is, why it matters, what OpenAI actually achieved, and what the community’s first reactions tell us about the road ahead.

TL;DR – OpenAI’s latest model solved at least five of the ten First Proof problems (4, 5, 6, 9, 10) according to early expert feedback, with a few more still under review. The work is a clear step forward from the “gold‑medal IMO” performance we saw in mid‑2025, but the proofs still depend on human verification, and the evaluation pipeline falls well short of peer‑review rigor.


1. First Proof: The “Math‑Olympics” of AI Reasoning

If you’ve ever watched a math‑competition livestream, you know the difference between a quick multiple‑choice problem and a research‑level problem that can take weeks of scribbling. The First Proof contest, run by a consortium of university math departments, sits firmly on the latter side. Each problem is a self‑contained research question, often drawn from active areas where even seasoned experts have spent years without a complete solution.

Why bother with such a heavyweight benchmark? Benchmarks like MATH or GSM8K are great for measuring “can the model get the right answer?” but they hide the process of reasoning. First Proof forces a model to:

  1. Select the right abstractions – e.g., decide whether a topology problem needs homology theory or a more elementary approach.
  2. Chain together long, interdependent arguments – a single misstep can invalidate the whole proof.
  3. Handle ambiguous problem statements – the language of research math is deliberately terse and sometimes under‑specified.
  4. Survive expert scrutiny – the proof must be checkable by a human specialist, not just “looks plausible”.

In short, First Proof is the marathon of AI math, not the 100‑meter dash.


2. The Sprint: How OpenAI Went About It

OpenAI’s internal team (led by James R. Lee, a researcher on the “Reasoning” team) took a fast‑track approach. Over a single weekend they ran a new, not‑yet‑public model through all ten problems, with minimal human supervision. The process, as described in the release, looked something like this:

  1. Prompting – The model received each problem statement plus a short “starter” prompt encouraging rigorous reasoning.
  2. Iterative refinement – After the first draft, the researchers asked the model to expand ambiguous steps or clarify a lemma.
  3. Human‑in‑the‑loop review – A handful of mathematicians read the drafts, flagged gaps, and fed those back into the model for correction.
  4. Cross‑checking – For a few problems, they ran the draft through ChatGPT (the “assistant” model) to catch formatting or typographical errors.
  5. Selection – The team kept the best version of each attempt, based on clarity and perceived correctness.
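The steps above form a draft–review–revise loop. Here is a minimal sketch of that loop in Python; `generate_draft` and `expert_review` are hypothetical stand-ins for the model and the human reviewers (OpenAI has published no such API for this process), so only the control flow reflects the release's description.

```python
# Hypothetical sketch of the draft-review-revise loop described above.
# generate_draft and expert_review are stand-ins, NOT real OpenAI calls.

def generate_draft(problem: str, feedback: list[str]) -> str:
    """Stand-in for the model: returns a proof draft, folding in feedback."""
    return f"proof of {problem} (revisions: {len(feedback)})"

def expert_review(draft: str) -> list[str]:
    """Stand-in for the human reviewers: returns a list of flagged gaps."""
    # Toy rule: accept once the draft has been revised twice.
    return [] if "revisions: 2" in draft else ["expand the key lemma"]

def refine(problem: str, max_rounds: int = 5) -> str:
    """Iterate drafts until reviewers flag no gaps or the budget runs out."""
    feedback: list[str] = []
    draft = generate_draft(problem, feedback)
    for _ in range(max_rounds):
        gaps = expert_review(draft)
        if not gaps:            # reviewers found no remaining issues
            return draft
        feedback.extend(gaps)   # feed flagged gaps into the next draft
        draft = generate_draft(problem, feedback)
    return draft

print(refine("Problem 9"))
```

The point of the sketch is the shape of the workflow, not the stubs: the model is never trusted on its own, and every round of revision is driven by human-flagged gaps.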

Lee summed it up in a tweet‑style quote from the press release:

“It’s pretty incredible to watch a model get tangibly smarter day by day.” – James R. Lee, OpenAI Researcher, Reasoning

The model’s training objective this time emphasized “increasing rigor”: according to the release, training penalized logical jumps and pushed the model to sustain a line of reasoning for hours without losing the thread. It’s a bit like asking a chess engine to grind out a long endgame without a single blunder – the stakes are higher because there’s no forcing “mate‑in‑2” shortcut to fall back on.


3. What Got Solved? (And What Didn’t)

OpenAI’s own post‑mortem (the PDF linked above) lists the problems by number, not by title, but the community has pieced together a rough map:

  • Problem 1 – Algebraic geometry (birational invariants): unclear, still under review
  • Problem 2 – Analytic number theory (L‑functions): incorrect; later commentary showed a fatal flaw
  • Problem 3 – Combinatorial topology (simplicial complexes): unclear, no consensus yet
  • Problem 4 – Functional analysis (Banach space embeddings): likely correct; expert reviewers gave a green light
  • Problem 5 – Probability theory (large deviations): likely correct
  • Problem 6 – Algebraic topology (homotopy groups): likely correct
  • Problem 7 – Operator algebras (C*-algebra classification): unclear
  • Problem 8 – Differential geometry (Ricci flow singularities): unclear
  • Problem 9 – Graph theory (Ramsey numbers): likely correct
  • Problem 10 – Category theory (higher‑dimensional adjunctions): likely correct

So at least five problems (4, 5, 6, 9, 10) have a high chance of being correct, according to the early expert feedback that OpenAI cites. The others are either still being dissected or, in the case of problem 2, have been shown to contain a fatal flaw. The fact that a single model could produce any correct research‑level proof without a human mathematician writing the core ideas is, frankly, a headline‑grabber.

A Quick Look at Problem 9 – The Ramsey One

Ramsey theory is the study of unavoidable order in large, chaotic structures. Problem 9 asked for a new bound on the diagonal Ramsey number (R(k,k)). The model’s proof leveraged a clever probabilistic construction combined with a recent “dependent random choice” lemma that was published in 2024. After a few back‑and‑forth refinements, the final draft presented a bound that matches the best known result and includes a short, self‑contained proof of the lemma—something a human would usually outsource to a citation.
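The draft’s actual construction isn’t reproduced here, but the flavor of probabilistic arguments in Ramsey theory is easy to sketch. The classic Erdős lower bound says that if C(n, k) · 2^(1 − C(k, 2)) < 1, then a uniformly random red/blue coloring of K_n avoids monochromatic K_k with positive probability, so R(k, k) > n. A few lines of Python compute the largest such n; this is an illustration of the technique, not the model’s argument:

```python
from math import comb

def erdos_lower_bound(k: int) -> int:
    """Largest n with C(n, k) * 2**(1 - C(k, 2)) < 1, giving R(k, k) > n.

    Union bound: a uniformly random 2-coloring of K_n contains a
    monochromatic K_k with probability at most C(n, k) * 2**(1 - C(k, 2)).
    If that bound is below 1, some coloring avoids them entirely.
    """
    n = k
    while comb(n + 1, k) * 2 ** (1 - comb(k, 2)) < 1:
        n += 1
    return n

print(erdos_lower_bound(4))   # 6, i.e. R(4, 4) > 6
print(erdos_lower_bound(10))  # 100, i.e. R(10, 10) > 100
```

Modern improvements (including dependent random choice) refine this union-bound skeleton, but the basic shape – bound the failure probability, conclude a good coloring exists – is the same.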

I asked a colleague (a post‑doc in combinatorics at UC Berkeley) to glance at the proof. “It reads like a well‑written paper,” she said, “but I’d still want to check the probabilistic calculations line by line.” In other words, the proof passes the first sanity check but still needs the usual peer‑review polish.


4. Why This Feels Like a Bigger Deal Than an IMO Medal

You might recall OpenAI’s July 2025 announcement that its general‑purpose reasoning model scored 35/42 on the International Mathematical Olympiad (IMO) – a gold‑medal level performance. That was a spectacular achievement, but the IMO is still a competition with well‑defined questions and known solutions. First Proof poses research problems: there can be multiple valid approaches, and the proof itself – not just the answer – must stand up to verification.

Think of the difference like cooking a dish from a recipe (IMO) versus inventing a new sauce from scratch (First Proof). The former tests whether you can follow instructions correctly; the latter tests creativity, intuition, and the ability to explain your creation so a chef can replicate it.

OpenAI’s progress from “I can solve a 6‑point geometry problem” to “I can write a checkable proof in homotopy theory” is akin to moving from playing a video game on easy mode to modding the game engine itself.


5. The Human Factor: Why Expert Review Still Rules

Even with a model that can generate a plausible proof, the verification step remains a bottleneck. In the First Proof sprint, OpenAI leaned on a small group of domain experts to read each draft and flag issues. This is a bit like having a handful of editors proofread a novel before it hits the shelves—if they miss a typo, the book still goes out with the error.

The release is honest about the limitations:

“Our process was not as clean as we would like in a properly controlled evaluation.”

That admission matters. It tells us that the current workflow is more of a proof‑of‑concept than a production‑grade research pipeline. For the field to accept AI‑generated proofs as first‑class contributions, we’ll need:

  1. Standardized verification tools – perhaps a formal proof assistant that can ingest a model’s LaTeX output and automatically check each inference.
  2. Transparent provenance – a clear log of which steps were model‑generated vs. human‑edited.
  3. Community‑driven benchmarks – a public leaderboard where each proof is independently reviewed by multiple experts.
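To make the first of those points concrete: proof assistants already check proofs end to end, with nothing taken on faith beyond the kernel. A toy Lean 4 theorem (unrelated to the First Proof problems; it just shows the shape of a machine-checked statement):

```lean
-- A machine-checked statement: the sum of two even naturals is even.
-- Lean's kernel verifies every rewrite; a gap or hallucinated step
-- would simply fail to compile.
theorem even_add_even (a b m n : Nat)
    (hm : a = 2 * m) (hn : b = 2 * n) :
    a + b = 2 * (m + n) := by
  rw [hm, hn, Nat.mul_add]
```

The open question is whether a language model’s free-form LaTeX can be reliably translated into this kind of formal statement at research scale, where the definitions alone can run to pages.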

OpenAI hints at these next steps: “We look forward to discussions with the First Proof organizers about a more rigorous experiment and evaluation framework for future iterations.”


6. The Bigger Picture: Frontier Challenges as AI’s “Stress Tests”

Benchmarks like MMLU (Massive Multitask Language Understanding) or HumanEval give us a quick snapshot of a model’s breadth. Frontier challenges—First Proof, the AI‑Generated Physics Paper contest, and the Open‑Ended Scientific Discovery track—are the stress tests that reveal where the model’s reasoning pipeline actually breaks.

In the release, OpenAI’s team draws a line from their earlier work:

  • July 2025 – Gold‑medal IMO performance (35/42).
  • Nov 2025 – “Early experiments in accelerating science with GPT‑5,” a set of case studies showing concrete progress in math, physics, and biology.
  • Early 2026 – A physics collaboration where GPT‑5.2 proposed a candidate expression for a gluon‑amplitude formula that was later formally proved by an internal model and verified by the authors.

All of these are stepping stones toward a future where an AI can both suggest a conjecture and produce a proof that passes the scrutiny of a top‑tier journal. The First Proof sprint is the latest rung on that ladder.


7. Skepticism, Not Cynicism: What Could Go Wrong?

I’m not a fan of “AI hype”—the kind that promises to replace PhDs overnight. Here are three realistic concerns that keep me up at night:

7.1. Hallucinated Lemmas

A model might invent a lemma that looks plausible but has no basis in existing literature. In the First Proof PDF, problem 6’s proof includes a “new” combinatorial identity. The authors note they cross‑checked it with a symbolic algebra system, but the verification was manual. If the lemma is subtly wrong, the whole proof collapses.
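The identity in question isn’t quoted here, so as a stand-in, this is what a cheap automated cross-check of a candidate combinatorial identity can look like: brute-force verification of Vandermonde’s identity over small parameters. A single failing case refutes a wrong “lemma” immediately, though passing finitely many cases is, of course, not a proof:

```python
from math import comb

def vandermonde_holds(m: int, n: int, p: int) -> bool:
    """Check sum_k C(m, k) * C(n, p - k) == C(m + n, p) for one triple."""
    lhs = sum(comb(m, k) * comb(n, p - k) for k in range(p + 1))
    return lhs == comb(m + n, p)

# Exhaustive check over small parameters; one failure would falsify
# the candidate identity outright.
ok = all(
    vandermonde_holds(m, n, p)
    for m in range(8) for n in range(8) for p in range(m + n + 1)
)
print("all small cases pass" if ok else "counterexample found")
```

This is the weakest rung of verification – a numeric sanity check – but it is fully automatic, which is exactly what the manual cross-check in the PDF was not.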

7.2. Bias Toward Known Techniques

Models trained on existing papers may gravitate toward “standard” proof techniques, potentially missing novel approaches. That’s why the First Proof problems are deliberately chosen from active research areas where even human experts are still exploring new methods. If a model can’t break out of the “textbook” mindset, its contributions will be incremental at best.

7.3. Evaluation Lag

Even after a proof is posted, it can take months (or years) for the community to fully vet it. In the meantime, the model’s “score” (five correct proofs) can be cited as a benchmark, even if later reviews find a hidden flaw. This lag creates a temporal mismatch between claims and truth.


8. What Should Researchers Take Away?

If you’re a mathematician, a physicist, or even a data scientist who dabbles in formal methods, here are a few practical takeaways:

  1. Start treating AI as a collaborator, not a replacement. In the First Proof sprint, the model generated drafts that humans then shaped into final proofs. That workflow feels a lot like using a powerful search engine plus a drafting assistant.
  2. Invest in tooling for proof verification. Projects like Lean, Coq, and Isabelle are already making strides in formal verification. Integrating them with language models could turn a “draft proof” into a machine‑checked theorem.
  3. Participate in community benchmarks. The First Proof organizers have opened a feedback channel (see the X post linked below). Your expert review can help calibrate future model evaluations.
  4. Stay skeptical but open. The model got five problems right—impressive, but not a guarantee that the next set will be any easier. Keep an eye on how the error rate evolves as the models get larger and the prompts get more sophisticated.

9. Looking Ahead: The Next Frontier

OpenAI says they are already training a new model whose “primary focus is increasing the level of rigor in its thinking.” If the current model can solve a handful of First Proof problems after a weekend sprint, a more rigor‑focused successor could plausibly solve all ten—or at least produce drafts that need only minor human polishing.

What would that mean for the broader research ecosystem?

  • Accelerated discovery: Researchers could offload the “tedious” parts of proof construction (checking edge cases, expanding lemmas) to an AI, freeing mental bandwidth for the creative leaps.
  • Redefined authorship: Papers might credit “OpenAI Model X” as a contributor or even a co‑author, raising the same questions the community has long faced over how to credit major software tools in computer‑science papers.
  • New ethical questions: Who owns a proof that was generated by an AI but verified by a human? How do we credit the model’s contribution versus the human’s oversight?

These are not just technical questions; they’re cultural ones that the math community will have to grapple with, much like the debates over preprints and open data a decade ago.


10. Bottom Line

OpenAI’s First Proof sprint is a milestone, not a destination. It shows that large language models can now write mathematics that survives a first pass by experts—something that was pure speculation a few years ago. Yet the process still leans heavily on human verification, and the evaluation methodology is still being ironed out.

If you’re the type who enjoys watching a model “think out loud,” keep an eye on the First Proof leaderboard (when it goes public). If you’re a mathematician, consider volunteering as a reviewer for the upcoming rounds; your expertise could shape the next generation of AI‑augmented research.

In the words of James R. Lee, “It’s pretty incredible to watch a model get tangibly smarter day by day.” And as any chef will tell you, the taste of a dish only matters after you’ve taken a bite. So let’s get those AI‑generated proofs onto the plate, and then let the community dig in.


Sources

  1. OpenAI Research Blog – First Proof Attempts (PDF). February 14 2026. https://cdn.openai.com/pdf/26177a73-3b75-4828-8c91-e8f1cf27aaa0/oai_first_proof.pdf
  2. X post by Merett M. – “We shared our proof attempts on Saturday, February 14, 2026.” https://x.com/merettm/status/2022517085193277874?s=20
  3. First Proof official site – problem set description. https://1stproof.org
  4. OpenAI blog – Gold‑medal IMO performance (July 2025). https://x.com/OpenAI/status/1946594928945148246
  5. OpenAI blog – Early experiments in accelerating science with GPT‑5 (Nov 2025). https://openai.com/blog/accelerating-science-gpt-5/
  6. OpenAI blog – GPT‑5.2 derives a new result in theoretical physics (Feb 13 2026). https://openai.com/blog/new-result-theoretical-physics/