OpenAI's GPT-Rosalind leads LifeSciBench

OpenAI’s new LifeSciBench doesn’t pull punches. The company’s latest benchmark, released June 17, puts frontier AI models through 750 tasks written by 173 PhD-level scientists who actually do drug discovery for a living. The best model tested, OpenAI’s own specialized GPT-Rosalind, passed just 36.1% of them.

That number lands differently when you understand what these tasks actually ask. They’re not multiple-choice biology questions. They’re the kind of messy, judgment-heavy problems that show up in real research: pressure-testing an FDA meeting package for a gene therapy, reconciling conflicting assay data, designing a CRISPR donor sequence that won’t fail in the lab. The stuff where being “mostly right” still means the experiment fails.

The gap between knowing and doing

For years, AI benchmarks in life sciences have skewed toward the clean and the knowable. Protein folding predictions. Multiple-choice exam questions. Structured tasks with clear right answers. But anyone who’s spent time in a wet lab knows that’s not where the bottlenecks are. The hard part is deciding what experiment to run next when the data is incomplete, the assays disagree, and the timeline is fixed.

LifeSciBench was built to measure that. OpenAI surveyed practicing scientists about their actual workflows, then grouped the responses into seven categories: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication. Each of the 750 tasks sits in one of those workflows and spans one of seven biological domains — genomics, medicinal chemistry, clinical science, and so on.

The tasks come with artifacts: 1,062 of them across the benchmark. Figures, tables, sequence files, chemical structures, PDFs. More than half the tasks require models to actually read and reason over at least one artifact. That’s where things fall apart fast.

The artifact problem

GPT-Rosalind scores 45.1% on text-only tasks. Drop in a figure or a sequence file and that plunges to 28.1%. GPT-5.5 shows the same cliff: 29.9% to 21.9%. The models can synthesize a literature review decently well. Ask them to extract the right number from a Western blot quantification table and they start hallucinating band intensities.
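As a rough consistency check on those figures, the sketch below assumes that GPT-Rosalind’s 36.1% headline is a task-weighted average of its text-only and artifact pass rates. The article does not state this, and the function names are illustrative only. Under that assumption, the three numbers imply what share of tasks involve an artifact, and the same arithmetic gives the relative size of each model’s drop.

```python
def implied_artifact_share(overall: float, text_only: float, with_artifact: float) -> float:
    """Solve overall = (1 - f) * text_only + f * with_artifact for f.

    Assumption (not stated in the article): the overall pass rate is a
    simple task-weighted mix of the two subgroup pass rates.
    """
    if text_only == with_artifact:
        raise ValueError("subgroup rates are identical; share is undetermined")
    return (text_only - overall) / (text_only - with_artifact)


def relative_drop(text_only: float, with_artifact: float) -> float:
    """Fraction of the text-only pass rate lost once artifacts are involved."""
    if text_only <= 0:
        raise ValueError("text-only rate must be positive")
    return (text_only - with_artifact) / text_only


# Pass rates quoted in the article, in percent.
share = implied_artifact_share(overall=36.1, text_only=45.1, with_artifact=28.1)
print(f"implied artifact share: {share:.1%}")
print(f"GPT-Rosalind relative drop: {relative_drop(45.1, 28.1):.1%}")
print(f"GPT-5.5 relative drop: {relative_drop(29.9, 21.9):.1%}")
```

Under the weighted-average assumption, the implied share comes out at about 53%, which fits the statement that more than half the tasks require at least one artifact. In relative terms, GPT-Rosalind loses a bit more than a third of its text-only pass rate and GPT-5.5 a bit more than a quarter.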

This isn’t a minor limitation. Real scientific work is artifact-heavy. You don’t get clean prompt text; you get a messy supplementary PDF with a critical table on page 47. If the model can’t reliably pull structured information from that, it can’t be a research partner. It’s just a very confident summarizer.

How the grading works — and why it matters

Most benchmarks check a final answer against a key. LifeSciBench uses 19,020 rubric criteria — about 25 per task — written by the same experts who authored the tasks. A model gets partial credit for each criterion: a correct calculation here, a proper caveat there, the right statistical test identified. A task only counts as “passed” if the model clears 70% of the available points.
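The scoring described above can be sketched in a few lines. This is a minimal illustration, not OpenAI’s grader: the `Criterion` class, the one-point weights, and the choice to treat exactly 70% as a pass are all assumptions made for the example. Only the partial-credit idea and the 70% threshold come from the benchmark description.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 0.70  # from the article: 70% of available points


@dataclass
class Criterion:
    """One rubric line. Field names and weights are hypothetical."""
    description: str
    points: float   # points available for this criterion
    earned: float   # points the grader awarded (0 to points)


def task_score(criteria: List[Criterion]) -> float:
    """Fraction of available rubric points earned on one task."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("a task needs at least one criterion worth points")
    # Clamp each award so a grading slip cannot exceed the available points.
    earned = sum(min(max(c.earned, 0.0), c.points) for c in criteria)
    return earned / total


def task_passed(criteria: List[Criterion], threshold: float = PASS_THRESHOLD) -> bool:
    """Assumption: a score exactly at the threshold counts as a pass."""
    return task_score(criteria) >= threshold


def pass_rate(tasks: List[List[Criterion]]) -> float:
    """Share of tasks that clear the threshold."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_passed(t) for t in tasks) / len(tasks)


# A 25-criterion task where the model meets 16 criteria: real partial
# credit (64%), but still a fail under the 70% rule.
near_miss = [Criterion("met", 1.0, 1.0)] * 16 + [Criterion("missed", 1.0, 0.0)] * 9
print(f"score={task_score(near_miss):.2f} passed={task_passed(near_miss)}")
```

The near-miss case shows why a pass rate and an average rubric score can tell different stories: a model can collect most of the points on a task and still be counted as failing it.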

That design reflects how science actually gets evaluated. A reviewer doesn’t just check your conclusion; they check whether you used the right control, whether you acknowledged the limitation in your assay, whether your dose-response curve actually supports the IC50 you claimed. The rubric approach captures that. It also reveals something interesting: on roughly 14% of tasks, models earned substantial partial credit but still failed the pass threshold. They got partway there — identified the right evidence, made a plausible argument — but missed a critical constraint or flubbed a calculation.

Where the models are actually improving

The progress is real in specific pockets. Scientific communication, meaning organizing evidence into a coherent expert-facing argument, jumps from a 56.3% pass rate on GPT-5.5 to 71.1% on GPT-Rosalind. Translation, the bench-to-bedside reasoning that connects preclinical data to clinical implications, shows a similar leap: 36.8% to 57.7%. These are the tasks where the output format is stable and the job is judgment and synthesis. Models are getting genuinely useful there.

But design and optimization, the workflow where you’d actually ask the model to design an experiment or optimize a construct, sits at 30.7% for GPT-Rosalind. Analysis is 30.3%. When the task demands exact outputs like a CRISPR donor sequence or an siRNA design, GPT-Rosalind hits 24% and 14.8%, respectively. In those tasks, a small error doesn’t just lower a score; it makes the reagent useless.

The human layer behind the benchmark

The tasks didn’t come from OpenAI researchers sitting in a room. They came from 173 scientists with PhDs and industry experience in biotech and pharma drug discovery programs. Each task went through an average of six automated revision cycles and at least two rounds of expert review, with 90% agreement required among domain reviewers. Then a separate cohort of 453 independent reviewers — 97% with doctorates, averaging 12 years of experience — validated the whole set. Agreement exceeded 96% on relevance, reasoning quality, scientific grounding, and overall usefulness.

That’s an extraordinary amount of expert labor poured into an evaluation. It also means the benchmark carries the blind spots of its creators: industry workflows, mostly. Academic discovery science, early-stage basic research, the weird one-off projects that don’t fit a standard pipeline — those are largely outside the scope.

What this means for the lab today

If you’re a principal investigator wondering whether to hand off literature reviews to an AI, LifeSciBench says: yes, with supervision. The models are strong at synthesis and communication. If you’re hoping they’ll design your next mutagenesis library or troubleshoot your qPCR assay, the data says: not yet. The artifact gap alone makes them unreliable for anything that requires reading your actual data files.

OpenAI frames this as a foundation, not a finish line. The next step, they say, is deployment studies in live research settings: watching what happens when models sit alongside scientists over months rather than in single-turn evaluations. That’s the right question. Benchmarks measure capability. Deployment measures utility. They’re not the same thing.

For now, LifeSciBench does something valuable: it stops the field from pretending that acing a multiple-choice biology exam means the model is ready for the lab. The 36.1% pass rate isn’t a disappointment. It’s an honest baseline. And in a field full of inflated claims, an honest baseline is the most useful thing you can have.