GadgetBond


OpenAI’s GPT-Rosalind leads LifeSciBench — at a 36% pass rate

The best model tested — OpenAI's own GPT-Rosalind — passed just 36.1% of 750 tasks written by scientists who actually run drug discovery programs.

By
Shubham Sawarkar
Editor-in-Chief
Jun 18, 2026, 9:00 AM EDT
Image: LifeSciBench promotional graphic with a bar chart and error bars (credit: OpenAI)

OpenAI’s new LifeSciBench doesn’t pull punches. The company’s latest benchmark, released June 17, puts frontier AI models through 750 tasks written by 173 PhD-level scientists who actually do drug discovery for a living. The best model tested, OpenAI’s own specialized GPT-Rosalind, passed just 36.1% of them.

That number lands differently when you understand what these tasks actually ask. They’re not multiple-choice biology questions. They’re the kind of messy, judgment-heavy problems that show up in real research: pressure-testing an FDA meeting package for a gene therapy, reconciling conflicting assay data, designing a CRISPR donor sequence that won’t fail in the lab. The stuff where being “mostly right” still means the experiment fails.

The gap between knowing and doing

For years, AI benchmarks in life sciences have skewed toward the clean and the knowable. Protein folding predictions. Multiple-choice exam questions. Structured tasks with clear right answers. But anyone who’s spent time in a wet lab knows that’s not where the bottlenecks are. The hard part is deciding what experiment to run next when the data is incomplete, the assays disagree, and the timeline is fixed.

LifeSciBench was built to measure that. OpenAI surveyed practicing scientists about their actual workflows, then grouped the responses into seven workflow categories: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication. Each of the 750 tasks belongs to one of those workflows and to one of seven biological domains, among them genomics, medicinal chemistry, and clinical science.

The tasks come with artifacts: 1,062 of them across the benchmark. Figures, tables, sequence files, chemical structures, PDFs. More than half the tasks require models to actually read and reason over at least one artifact. That’s where things fall apart fast.

The artifact problem

GPT-Rosalind scores 45.1% on text-only tasks. Drop in a figure or a sequence file and that plunges to 28.1%. GPT-5.5 shows the same cliff: 29.9% to 21.9%. The models can synthesize a literature review decently well. Ask them to extract the right number from a Western blot quantification table and they start hallucinating band intensities.
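A quick back-of-the-envelope check ties these numbers together. The sketch below assumes that every task is either text-only or artifact-based, and that GPT-Rosalind's overall 36.1% pass rate is a simple task-weighted average of the two subsets. The article does not confirm either assumption, and the function name is made up for illustration.

```python
def implied_artifact_share(overall: float, text_only: float, with_artifact: float) -> float:
    """Solve overall = (1 - f) * text_only + f * with_artifact for f.

    f is the fraction of tasks that involve an artifact. This assumes the
    overall pass rate is a plain task-weighted average of the two subsets,
    which is an assumption, not something the benchmark report states here.
    """
    if text_only == with_artifact:
        raise ValueError("pass rates must differ to solve for the share")
    return (text_only - overall) / (text_only - with_artifact)


# GPT-Rosalind pass rates quoted in the article, in percent.
share = implied_artifact_share(overall=36.1, text_only=45.1, with_artifact=28.1)
print(f"Implied share of artifact tasks: {share:.1%}")  # prints 52.9%
```

Under those assumptions the split works out to roughly 53% artifact tasks, which lines up with the statement that more than half the tasks require reading at least one artifact.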

This isn’t a minor limitation. Real scientific work is artifact-heavy. You don’t get clean prompt text; you get a messy supplementary PDF with a critical table on page 47. If the model can’t reliably pull structured information from that, it can’t be a research partner. It’s just a very confident summarizer.

How the grading works — and why it matters

Most benchmarks check a final answer against a key. LifeSciBench uses 19,020 rubric criteria — about 25 per task — written by the same experts who authored the tasks. A model gets partial credit for each criterion: a correct calculation here, a proper caveat there, the right statistical test identified. A task only counts as “passed” if the model clears 70% of the available points.
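The pass rule described above can be sketched in a few lines. This is an illustration only, not OpenAI's grader: the `Criterion` class, the point values, and the choice to count exactly 70% as a pass are all assumptions made here.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # from the article: 70% of the available points


@dataclass(frozen=True)
class Criterion:
    """One rubric line. The fields and point weights are hypothetical."""
    description: str
    points: float
    met: bool


def score_task(criteria: list[Criterion]) -> tuple[float, bool]:
    """Return (fraction of points earned, whether the task passes)."""
    available = sum(c.points for c in criteria)
    if available <= 0:
        raise ValueError("a task needs at least one criterion worth points")
    earned = sum(c.points for c in criteria if c.met)
    fraction = earned / available
    # Assumption: reaching the threshold exactly counts as clearing it.
    return fraction, fraction >= PASS_THRESHOLD


# A made-up task: right test and right caveat, but a wrong calculation.
example = [
    Criterion("identifies the right statistical test", 3, True),
    Criterion("calculation is correct", 4, False),
    Criterion("states the assay limitation", 3, True),
]
fraction, passed = score_task(example)
print(f"{fraction:.0%} of points, passed={passed}")  # 60% of points, passed=False
```

The made-up example earns 6 of 10 points, so it collects real partial credit and still fails the task, which is the distinction the rubric design is meant to surface.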

That design reflects how science actually gets evaluated. A reviewer doesn’t just check your conclusion; they check whether you used the right control, whether you acknowledged the limitation in your assay, whether your dose-response curve actually supports the IC50 you claimed. The rubric approach captures that. It also reveals something interesting: on roughly 14% of tasks, models earned substantial partial credit but still failed the pass threshold. They got partway there — identified the right evidence, made a plausible argument — but missed a critical constraint or flubbed a calculation.

Where the models are actually improving

The progress is real in specific pockets. Scientific communication — organizing evidence into a coherent expert-facing argument — jumps from 56.3% pass rate on GPT-5.5 to 71.1% on GPT-Rosalind. Translation, the bench-to-bedside reasoning that connects preclinical data to clinical implications, shows a similar leap: 36.8% to 57.7%. These are the tasks where the output format is stable and the job is judgment and synthesis. Models are getting genuinely useful there.

But design and optimization, the workflow where you’d actually ask the model to design an experiment or optimize a construct, sits at 30.7% for GPT-Rosalind. Analysis is 30.3%. When the task demands exact outputs like a CRISPR donor sequence or an siRNA design, GPT-Rosalind hits 24% and 14.8%, respectively. On those tasks, a small error doesn’t just lower a score; it makes the reagent useless.

The human layer behind the benchmark

The tasks didn’t come from OpenAI researchers sitting in a room. They came from 173 scientists with PhDs and industry experience in biotech and pharma drug discovery programs. Each task went through an average of six automated revision cycles and at least two rounds of expert review, with 90% agreement required among domain reviewers. Then a separate cohort of 453 independent reviewers — 97% with doctorates, averaging 12 years of experience — validated the whole set. Agreement exceeded 96% on relevance, reasoning quality, scientific grounding, and overall usefulness.

That’s an extraordinary amount of expert labor poured into an evaluation. It also means the benchmark carries the blind spots of its creators: industry workflows, mostly. Academic discovery science, early-stage basic research, the weird one-off projects that don’t fit a standard pipeline — those are largely outside the scope.

What this means for the lab today

If you’re a principal investigator wondering whether to hand off literature reviews to an AI, LifeSciBench says: yes, with supervision. The models are strong at synthesis and communication. If you’re hoping they’ll design your next mutagenesis library or troubleshoot your qPCR assay, the data says: not yet. The artifact gap alone makes them unreliable for anything that requires reading your actual data files.

OpenAI frames this as a foundation, not a finish line. The next step, they say, is deployment studies in live research settings: watching what happens when models work alongside scientists over months rather than in single-turn evaluations. That’s the right question. Benchmarks measure capability. Deployment measures utility. They’re not the same thing.

For now, LifeSciBench does something valuable: it stops the field from pretending that acing a multiple-choice biology exam means the model is ready for the lab. The 36.1% pass rate isn’t a disappointment. It’s an honest baseline. And in a field full of inflated claims, an honest baseline is the most useful thing you can have.

