The hard problem of scientific AI reasoning

For the past decade, the tech and science worlds have been repeating the same mantra: biology is basically just a data science now. Thanks to the plummeting costs of genome sequencing and the rise of massive, biobank-scale datasets, we are absolutely swimming in biological information. But generating data is the easy part. The real bottleneck has shifted from collecting the samples to actually figuring out what they mean.

It turns out that scientific data rarely arrives with a neat set of instructions. When a researcher looks at a genomic dataset, they have to figure out whether a weird pattern is an actual biological breakthrough or just an artifact of a messy experiment. They have to know when to revise their assumptions, when to change their model, and when a result is actually ready to influence a clinical decision. In the field, people call this elusive skill “research taste.” And until now, it’s been incredibly hard to tell whether artificial intelligence actually has any.

That’s the exact problem OpenAI is trying to solve with the quiet but significant release of GeneBench-Pro, a research-level benchmark designed specifically to test how well AI agents handle the chaotic, ambiguous world of computational biology.

If you follow AI, you know that benchmarks are a dime a dozen. We have tests for coding, tests for the bar exam, and tests for writing poetry. But evaluating high-level scientific reasoning is notoriously tricky. Traditional biology benchmarks usually rely on historical datasets. The problem there is that if you give an AI an old dataset, there often isn’t just one “right” path to the answer. One agent might make a defensible choice about where to draw a statistical cutoff, while another makes a different, equally valid choice. When the benchmark grades them, it often ends up reflecting the arbitrary preferences of the person who wrote the test rather than the actual intelligence of the model. Worse, if a test is poorly designed, an AI can make a catastrophic analytical error but still stumble into a passing grade by pure numerical luck.

OpenAI took a different route with GeneBench-Pro. Instead of recycling old, messy data, they built 129 complex problems entirely from synthetic data. Because they control the underlying simulation and know the exact causal structure of the data, they can definitively track whether an AI is doing the math right or just guessing. For each problem, an AI agent is dropped into an isolated workspace with a brief prompt, some raw data files, and a standard bioinformatics toolkit. To earn a passing grade, it has to explore the data, run diagnostics, pivot when things look wrong, and produce a final, actionable answer, such as deciding whether a specific targeted therapy will actually benefit a hypothetical cancer patient based on their structural variants.
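To make the "known causal structure" idea concrete, here is a minimal sketch in Python. It is not OpenAI's generator or grader. Every name and number in it (simulate, grade, TRUE_EFFECT, BATCH_SHIFT, the batch artifact itself) is a hypothetical stand-in chosen for illustration. The simulation plants a true per-allele effect plus a batch artifact, so a grader that knows the planted value can tell an analysis that corrected for the artifact from one that did not.

```python
import random

TRUE_EFFECT = 0.5   # planted per-allele effect (hypothetical value)
BATCH_SHIFT = 1.0   # planted batch artifact (hypothetical value)


def simulate(n=6000, seed=0):
    """Return (batch, genotype, phenotype) rows with a known causal structure."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        batch = rng.randrange(2)
        # The batch shifts allele frequency as well as the phenotype,
        # which makes it a confounder for any pooled analysis.
        allele_freq = 0.6 if batch else 0.2
        genotype = sum(rng.random() < allele_freq for _ in range(2))
        phenotype = TRUE_EFFECT * genotype + BATCH_SHIFT * batch + rng.gauss(0, 1)
        rows.append((batch, genotype, phenotype))
    return rows


def _sums(pairs):
    """Return (sum of cross-products, sum of squares) around the group means."""
    n = len(pairs)
    mean_g = sum(g for g, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in pairs)
    sxx = sum((g - mean_g) ** 2 for g, _ in pairs)
    return sxy, sxx


def naive_slope(rows):
    """Regress phenotype on genotype, ignoring the batch entirely."""
    sxy, sxx = _sums([(g, y) for _, g, y in rows])
    return sxy / sxx


def batch_adjusted_slope(rows):
    """Pool within-batch sums so the batch shift cannot leak into the slope."""
    sxy_total = sxx_total = 0.0
    for b in (0, 1):
        sxy, sxx = _sums([(g, y) for batch, g, y in rows if batch == b])
        sxy_total += sxy
        sxx_total += sxx
    return sxy_total / sxx_total


def grade(reported_effect, tolerance=0.1):
    """Pass only if the reported effect is close to the planted truth."""
    return abs(reported_effect - TRUE_EFFECT) <= tolerance


if __name__ == "__main__":
    rows = simulate()
    print("naive passes:   ", grade(naive_slope(rows)))
    print("adjusted passes:", grade(batch_adjusted_slope(rows)))
```

With real historical data nobody knows the true effect, so a grader has to compare answers against someone's preferred analysis. In this toy version the pooled estimate is biased upward by the batch artifact and fails, while the batch-adjusted estimate recovers the planted value.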

The results are a fascinating snapshot of exactly where AI currently stands in the summer of 2026.

According to OpenAI’s release, their frontier model, GPT-5.6 Sol, managed a pass rate of 31.5% at its highest reasoning setting. On paper, failing nearly 70% of the time might not sound like a slam dunk. But context is everything here. When OpenAI first started building the original GeneBench a while back, their best model at the time, GPT-5, couldn’t even crack a 5% pass rate. The leap to nearly a third of these highly complex problems being solved autonomously suggests that models are rapidly moving beyond regurgitating textbook facts and are starting to grasp systems-level scientific reasoning.

Interestingly, the benchmark also highlights a widening gap in the AI arms race. While GPT-5.6 Sol is clearing the 31% mark, rival models are struggling to keep up with this specific flavor of quantitative uncertainty. According to the data, competitors like Opus 4.8 and Gemini 3.5 Flash hovered around 16% and 8%, respectively, while top-tier open-source models like GLM 5.2 barely cracked 4%. The gap suggests that while open-source models have become fiercely competitive in general coding, the messy, iterative logic required for real scientific research is still largely dominated by the most heavily resourced frontier models.

But the most compelling part of the GeneBench-Pro release isn’t the leaderboard; it’s the economics.

OpenAI sent dozens of these benchmark questions to external domain experts, including professors, postdocs, and industry scientists, to make sure they were actually realistic. The consensus was that a single one of these problems would take a human expert roughly 20 to 40 hours of focused work to complete. If you factor in a conservative rate of $200 an hour for a highly specialized computational biologist, you’re looking at $4,000 to $8,000 of human labor just to reach one analytical conclusion. An AI inference run, even a computationally heavy one that takes its time to reason, costs a few dollars.
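The back-of-the-envelope math is easy to reproduce. The sketch below uses the article's figures for expert hours and hourly rate. The per-run AI cost is a hypothetical placeholder of $5 standing in for "a few dollars", since no exact figure is given, and the function names are made up for illustration.

```python
HOURS_LOW, HOURS_HIGH = 20, 40   # expert time per problem, from the reviewers' estimate
HOURLY_RATE_USD = 200            # the article's "conservative" hourly rate


def human_cost_range(hours_low=HOURS_LOW, hours_high=HOURS_HIGH, rate=HOURLY_RATE_USD):
    """Return the (low, high) cost in USD of one expert-solved problem."""
    return hours_low * rate, hours_high * rate


def cost_ratio(ai_cost_usd):
    """Return how many times more the human costs than one AI run (low, high).

    ai_cost_usd is an assumption supplied by the caller, not a reported figure.
    """
    low, high = human_cost_range()
    return low / ai_cost_usd, high / ai_cost_usd


if __name__ == "__main__":
    print(human_cost_range())   # (4000, 8000)
    print(cost_ratio(5.0))      # (800.0, 1600.0), under the assumed $5 per run
```

The ratio moves with whatever per-run cost you plug in, and it counts a single attempt. It does not account for failed runs or for the human time needed to check the answer.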

We aren’t at the point where AI is replacing human scientists. As the reviewers noted, models still struggle with the kind of deep inferential loops that human experts navigate instinctively—like realizing that a weird anomaly in the data is actually a swapped ancestry label rather than a novel genetic discovery. Novice AIs, much like novice grad students, can spot the anomaly but often fail to integrate it into the bigger picture.

But if these models are improving this fast, the implications for the pharmaceutical and biotech industries are massive. Drug discovery is a notoriously expensive, high-attrition game. Treatments backed by strong genetic evidence are significantly more likely to make it through clinical trials, but unearthing that evidence requires sifting through mountains of data. If AI agents can reliably automate even a fraction of the hypothesis triage and exploratory analysis currently handled by entire teams of human experts, it could radically accelerate how fast we turn raw biobank data into actual medicine.

GeneBench-Pro isn’t just a test to see how smart AI is getting. It’s a very clear signal that the tech industry is no longer satisfied with building models that just write code and draft emails. It is aiming squarely at the scientific method itself.