

In the three years since ChatGPT launched, OpenAI's technology has worked its way into everyday tasks at home, at work, and in classrooms: essentially anywhere there's an open browser or a phone in hand, which is practically everywhere.
Now OpenAI is courting the scientific community. In October, the company announced a new division, OpenAI for Science, dedicated to figuring out how its large language models can help researchers and to adapting its tools to support them better.
The last couple of months have brought a flurry of social media posts and research articles in which mathematicians, physicists, biologists, and others describe how LLMs (especially OpenAI's GPT-5) helped them make breakthroughs or nudged them toward solutions they might otherwise have missed. OpenAI for Science was set up in part to engage with this community.
OpenAI is late to the game, however. Google DeepMind, its rival best known for pioneering scientific models such as AlphaFold and AlphaEvolve, has had an AI-for-science team for years. (When I spoke to Google DeepMind's CEO and cofounder Demis Hassabis about that team in 2023, he told me: "This is the reason I started DeepMind … It's essentially why I've dedicated my whole career to AI.")
So why the interest now? How does the push into science fit with OpenAI's wider ambitions? And what exactly does the company hope to achieve?
I put these questions to Kevin Weil, a vice president at OpenAI who leads the new OpenAI for Science initiative, in an exclusive interview last week.
On mission
Weil is a product person. He joined OpenAI a couple of years ago as chief product officer, after heading product at Twitter and Instagram. But he started out as a scientist. He got two-thirds of the way through a PhD in particle physics at Stanford University before giving up academia for the pull of Silicon Valley. Weil is keen to stress that background: "I thought I would be a physics professor for life," he says. "I still enjoy reading math texts on my vacations."
Asked how OpenAI for Science fits alongside the company's existing lineup of productivity tools or its buzzy video app Sora, Weil recites the company's guiding mission: "OpenAI's goal is to develop artificial general intelligence and ensure it benefits all of humanity."
Think about what this technology could ultimately do for science, he suggests: new drugs, new materials, new devices. "Think about how it could help us understand the nature of reality, pushing us to tackle unsolved problems. Perhaps the biggest, most beneficial impact we'll see from AGI will actually come from its ability to accelerate scientific progress."
He adds: “With GPT-5, we began to see this potential.”
According to Weil, LLMs are now good enough to be useful partners in scientific work. They can propose ideas, suggest new lines of inquiry, and draw helpful connections between today's problems and past solutions buried in obscure, decades-old journals or written in other languages.
That wasn't the case a year or so ago. Since introducing its first so-called reasoning model, a type of LLM that can break a problem into multiple steps and work through them one at a time, in December 2024, OpenAI has been pushing the limits of what the technology can do. Reasoning models have made LLMs far better at math and logic problems than earlier versions. "Think back a few years: we were all astonished when these models scored 800 on the SAT," Weil says.
But soon LLMs were acing math competitions and cracking advanced physics problems. Last year both OpenAI and Google DeepMind announced that their LLMs had achieved gold-medal-level results in the International Mathematical Olympiad, one of the hardest math contests in the world. "These models are no longer just better than 90% of graduate students," says Weil. "They are genuinely at the frontier of human ability."
That is a big claim, and it comes with caveats. Even so, there is no question that GPT-5, which includes a reasoning model, is a marked improvement over GPT-4 at complex problem-solving. On an industry benchmark called GPQA, a set of more than 400 multiple-choice questions testing PhD-level knowledge of biology, physics, and chemistry, GPT-4 scores 39%, well below the human-expert baseline of around 70%. According to OpenAI, GPT-5.2 (the latest version of the model, released in December) scores 92%.
Overhyped
The excitement is real, and perhaps overblown. In October, senior figures at OpenAI, including Weil, claimed on X that GPT-5 had found solutions to a number of unsolved math problems. Mathematicians were quick to point out that what GPT-5 appeared to have done was dig up existing solutions from old research papers, including at least one written in German. That was still useful, but it was not the achievement OpenAI had implied. Weil and his colleagues later deleted their posts.
Weil is now more circumspect. Often it is enough to surface answers that have simply been forgotten, he says: "We all stand on the shoulders of giants, and if LLMs can pull that knowledge together so we don't waste time wrestling with a problem that has already been solved, that's an acceleration in itself."
He plays down the idea that LLMs are about to produce groundbreaking new discoveries. "I don't think the models are there yet," he says. "They may get there. I hope they do."
But that's not the point, he insists: "Our goal is to accelerate scientific progress. And I don't think the bar for accelerating science is an Einstein-level reimagining of an entire field."
For Weil, the question is: "Does science genuinely move faster because scientists and models can work together more effectively and efficiently than scientists alone? I think we're already seeing that."
In November, OpenAI published a set of anecdotal case studies contributed by scientists both inside and outside the company, showing how they used GPT-5 and what it did for them. "Most of the examples were scientists who were already using GPT-5 directly in their research and came to us one way or another saying, 'Look what I can do with these tools,'" says Weil.
The main strengths GPT-5 seems to show are finding references and connections to existing literature that researchers did not know about, sometimes sparking new ideas; helping scientists sketch out mathematical proofs; and suggesting ways to test hypotheses in the lab.
"GPT-5.2 has analyzed nearly every academic paper published in the last 30 years," Weil says. "It doesn't just understand the specific area a researcher is working in; it can pull in analogies from unrelated fields."
"That's incredibly powerful," he continues. "You can always find a human collaborator in a related field, but finding a thousand collaborators across every potentially relevant discipline is harder. And I can work with the model late at night (it doesn't need sleep), and I can ask it about 10 things at once, which gets awkward with a human."
Solving problems
Most of the scientists OpenAI put me in touch with back up Weil's claims.
Robert Scherrer, a professor of physics and astronomy at Vanderbilt University, first played around with ChatGPT for fun ("I had it rewrite the theme song from Gilligan's Island in the style of Beowulf, which it did very well," he says), until his colleague Alex Lupsasca, a physicist now at OpenAI, told him that GPT-5 had helped solve a problem he had been stuck on.
Lupsasca got Scherrer access to GPT-5 Pro, OpenAI's top-tier subscription, which costs $200 a month. "It solved a problem that my graduate student and I had been unable to crack after working on it for several months," Scherrer says.
It isn't perfect, he admits: "GPT-5 still makes silly mistakes. Sure, so do I, but the mistakes GPT-5 makes can be even sillier." Still, it keeps getting better, he says: "If current trends continue (and that's a big if), I predict that all scientists will soon be using LLMs in their work."
Derya Unutmaz, a professor of biology at the Jackson Laboratory, a nonprofit research institute, uses GPT-5 to brainstorm ideas, summarize research papers, and plan experiments for his research on the immune system. In the case study he submitted to OpenAI, Unutmaz used GPT-5 to analyze an old data set his team had already studied. The model came up with new insights and interpretations.
"LLMs have become indispensable for scientists," he says. "When you can get through data-set analyses that used to take months, not using them is no longer a realistic option."
Nikita Zhivotovskiy, a statistician at the University of California, Berkeley, says he has been folding LLMs into his research since the first version of ChatGPT came out.
Like Scherrer, he finds LLMs most useful when they surface unexpected connections between his own work and earlier results he didn't know about. "I see LLMs becoming an essential technical tool for researchers, much as computers and the internet did before," he says. "Those who don't use them will probably be at a long-term disadvantage."
Still, he does not expect LLMs to produce groundbreaking discoveries anytime soon. "I have seen very few truly novel ideas or arguments worth publishing on their own," he says. "So far they mostly seem to combine existing results, sometimes incorrectly, rather than come up with genuinely new approaches."
I also reached out to several scientists not affiliated with OpenAI.
Andy Cooper, a professor of chemistry at the University of Liverpool and director of the Leverhulme Research Centre for Functional Materials Design, is less upbeat. "We have not found that LLMs fundamentally change how we do science," he says. "But our recent results suggest they do have a role."
Cooper leads a project to build an AI scientist that can fully automate parts of the scientific workflow. He says his team doesn't use LLMs to come up with ideas. But the technology is starting to prove useful as part of a wider automated system, in which an LLM can help direct robots, for example.
"I suspect LLMs might find their first uses inside robotic systems, mainly because it isn't clear people are ready to be directed by an LLM," Cooper says. "I'm certainly not."
Making errors
But as useful as LLMs are becoming, caution is still essential. In December, Jonathan Oppenheim, a scientist who studies quantum mechanics, flagged a mistake that had made its way into a scientific journal. "OpenAI leadership are promoting a paper in Physics Letters B where GPT-5 proposed the main idea, potentially the first peer-reviewed paper to have an LLM as its core contributor," Oppenheim tweeted. "One minor issue: GPT-5's idea tests the wrong hypothesis."
He elaborated: “GPT-5 was asked to propose a test that identifies nonlinear theories. Instead, it offered a test that identifies nonlocal theories. They’re somewhat related, but different. It’s akin to requesting a COVID test and receiving a chickenpox test instead.”
It is clear that many researchers are finding new and inventive ways to work with LLMs. It is also clear that the technology can make mistakes subtle enough for even seasoned experts to miss.
Part of the problem is that ChatGPT can lull people into dropping their guard. As Oppenheim put it: "A fundamental concern is that LLMs are being trained to validate the user, while scientific inquiry requires tools that challenge our thinking." In one striking case, a man (not a scientist) was convinced by ChatGPT for months that he had invented a new branch of mathematics.
Weil is, of course, well aware of the hallucination problem. He says newer models hallucinate less. But he also thinks that fixating on hallucination misses the bigger picture.
"One of my colleagues here, a former math professor, said something that stuck with me," Weil says. "He said: 'When I'm doing research, if I'm bouncing ideas off a collaborator, I'm wrong 90% of the time, and that's part of the process. We're both throwing out ideas, trying to find something that works.'"
"That's actually a good place to be," says Weil. "If you throw out enough wrong ideas and somebody hits on a nugget of truth, then the other person can build on it and say, 'Oh, that's not quite right, but what if we ...' You gradually cut a path through the wilderness."
That is how Weil sees OpenAI for Science working. GPT-5 is good, but it is not infallible. The technology's real value, he argues, lies in nudging people down new paths rather than handing them definitive answers.
In fact, one thing OpenAI is now looking at is tuning GPT-5 to temper how confidently it presents its answers. Instead of telling researchers "Here's the answer," it might say, "Here's something worth exploring."
"That's something we're spending a lot of time on," Weil says. "We're trying to make sure the model has a kind of epistemological humility."
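That behavior can be roughly approximated from the outside today just by telling the model how to frame its answers. The snippet below is a minimal prompt-level sketch of the idea, not OpenAI's actual approach, which involves adjusting the model itself; the model name, the prompt wording, and the use of the public OpenAI Python SDK are all assumptions made for illustration.

```python
# A prompt-level approximation of "epistemological humility": ask the model to
# present leads rather than verdicts. Illustration only; OpenAI's own work
# involves tuning the model, not just prompting it. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

HUMBLE_SYSTEM_PROMPT = (
    "You are a research assistant. Do not present conclusions as settled facts. "
    "For each suggestion, say how confident you are and what it rests on, and "
    "frame it as 'here is something worth exploring' rather than 'here is the answer.'"
)

def explore(question: str) -> str:
    """Return a hedged, lead-oriented response to a research question."""
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name for this sketch
        messages=[
            {"role": "system", "content": HUMBLE_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```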
Watching the watchers
Another thing OpenAI is exploring is getting GPT-5 to check its own output. Often, if you feed one of GPT-5's answers back into the model, it will pick it apart and point out the errors.
"You can basically wire the model up to be its own critic," Weil explains. "Then you can build a workflow where the model is reasoning, and then another model reviews it, spots what could be better, and passes that back to the first model, saying, 'Hey, wait, this part was wrong, but this part was interesting. Keep it.' It's almost like a pair of agents working together, and you only see the result once it has been vetted by the critic."
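In outline, that generator-critic loop is simple to sketch with the public API. What follows is a minimal illustration of the pattern Weil describes, not OpenAI's internal workflow; the model name, the prompts, and the fixed number of revision rounds are assumptions made for the example.

```python
# A minimal sketch of a generator-critic loop, built on the public OpenAI
# Python SDK. Model name, prompts, and round count are illustrative assumptions.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
MODEL = "gpt-5"     # placeholder model name for this sketch


def ask(system: str, user: str) -> str:
    """Send one system + user message pair and return the model's reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content


def generate_with_critic(question: str, rounds: int = 2) -> str:
    """Draft an answer, have a critic pass review it, then revise and repeat."""
    draft = ask("You are a careful research assistant.", question)
    for _ in range(rounds):
        critique = ask(
            "You are a skeptical reviewer. Point out errors and weak reasoning, "
            "and say which parts are worth keeping.",
            f"Question:\n{question}\n\nDraft answer:\n{draft}",
        )
        draft = ask(
            "Revise the draft in light of the critique. Keep what the reviewer "
            "flagged as sound and fix what they flagged as wrong.",
            f"Question:\n{question}\n\nDraft:\n{draft}\n\nCritique:\n{critique}",
        )
    # The reader only sees the answer after the critic has had its say.
    return draft
```

In a fuller setup the critic could be a separate model, and the loop would stop once the reviewer finds nothing left to flag.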
What Weil describes also sounds a lot like Google DeepMind's approach with AlphaEvolve, which wrapped its LLM, Gemini, inside a larger system that filtered good responses from bad ones and fed them back in for improvement. Google DeepMind has used AlphaEvolve to solve a number of real-world problems.
OpenAI also faces stiff competition from rival firms whose own LLMs can do most, if not all, of what it claims for its models. So why should researchers pick GPT-5 over Gemini or Anthropic's Claude, rival families of models that keep getting better year after year? In the end, OpenAI for Science may be as much about planting a flag in new territory as anything else. The biggest breakthroughs are yet to come.
"My prediction is that 2026 will be for science what 2025 was for software engineering," Weil says. "At the start of 2025, people using AI for most of their coding were early adopters. A year later, if you're not using AI for most of your coding, you're probably falling behind. We're now seeing the same early sparks in science that we saw in coding."
He adds: "I think that a year from now, if you're a scientist and you're not making heavy use of AI, you'll be missing a chance to improve both the quality and the speed of your thinking."