To measure real-world scientific impact, OpenAI designed LifeSciBench, an externally expert-judged benchmark that evaluates models across six workflow areas: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, and translation and communication. Unlike earlier benchmarks that tested isolated capabilities, LifeSciBench assesses full research workflows from literature synthesis to experimental design. GPT-Rosalind leads overall on LifeSciBench, outperforming GPT-5.5, Grok 4.3, and Gemini 3.1 Pro.
OpenAI also published results on three domain-specific sub-benchmarks that compare GPT-Rosalind directly against GPT-5.5. All scores are self-reported by OpenAI, and independent third-party replication has not been confirmed.
| Benchmark | Domain | GPT-Rosalind | GPT-5.5 | Token Savings |
|---|---|---|---|---|
| MedChemBench | Medicinal chemistry: compound structure, activity, toxicity, metabolism | 27.5% | 25.1% | 7.2% fewer tokens |
| GeneBench | Genomics and quantitative biology: long-horizon agent tasks | 21.6% | 20.4% | 31% fewer tokens |
| LabWorkBench | Experimental wet-lab operations and troubleshooting | 63.2% | 55.8% | Not specified |
The largest absolute performance gap is on LabWorkBench, where GPT-Rosalind scores 7.4 percentage points higher than GPT-5.5 (63.2% versus 55.8%). This benchmark covers practical wet-lab operations (troubleshooting protocols, designing reagent workflows, and interpreting experimental outcomes), areas where agentic tool use and domain-specific knowledge directly translate into higher success rates.
The absolute scores are notably modest in medicinal chemistry and genomics (27.5% and 21.6% respectively), which means even the improved model still fails on a majority of tasks in these fields. These are hard, real-world scientific problems, not general-knowledge tests, so low scores reflect genuine difficulty rather than weak models. But the numbers underscore that AI-driven drug discovery remains in early stages.
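The gaps and failure rates discussed above can be checked with a few lines of arithmetic. The following is a minimal Python sketch: the scores are copied from the table, while the `compare` helper and the reading of each score as a task pass rate (which is how the article interprets them) are assumptions made for illustration, not part of any OpenAI tooling.

```python
# Scores copied from the table: (GPT-Rosalind, GPT-5.5), in percent.
scores = {
    "MedChemBench": (27.5, 25.1),
    "GeneBench": (21.6, 20.4),
    "LabWorkBench": (63.2, 55.8),
}


def compare(rosalind, baseline):
    """Hypothetical helper: return (absolute gap in percentage points,
    relative gain in percent, GPT-Rosalind failure rate in percent).

    Assumes each score is the share of tasks passed.
    """
    gap = round(rosalind - baseline, 1)
    relative = round(100 * (rosalind - baseline) / baseline, 1)
    failure = round(100 - rosalind, 1)
    return gap, relative, failure


summary = {name: compare(*pair) for name, pair in scores.items()}

for name, (gap, relative, failure) in summary.items():
    print(f"{name}: +{gap} points, +{relative}% relative, fails {failure}% of tasks")

# Benchmark with the widest absolute gap.
widest = max(summary, key=lambda name: summary[name][0])
print("Widest absolute gap:", widest)
```

Under these assumptions, LabWorkBench shows the widest absolute gap at 7.4 points, and GPT-Rosalind fails 72.5% of MedChemBench tasks and 78.4% of GeneBench tasks. Reporting the relative gain alongside the absolute gap is a deliberate choice: a small absolute difference on a low baseline can still be a meaningful relative improvement.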
The June update also brings life-sciences-specific plugins and viewers on Codex, OpenAI’s coding platform for agentic AI. These plugins allow GPT-Rosalind agents to connect directly to scientific tools (literature databases, sequence analysis utilities, lab-protocol design software) and execute multi-step research workflows rather than merely suggesting them. This marks a shift from a model that reasons about science to one that can actively perform scientific computing tasks.
OpenAI’s framing of the update as a move “from reasoning to executed workflows” signals an ambition to embed GPT-Rosalind deeper into the operational pipelines of drug-discovery labs, not just the brainstorming phase.
GPT-Rosalind initially launched in April 2026 as a U.S.-only enterprise research preview through a restricted trusted-access program. The June update expands that to a global research preview for eligible organizations, still through the same trusted-access deployment structure that requires application and review. Organizations without an existing Enterprise agreement can now apply through a newly created OpenAI managed workspace.
On May 29, 2026, just days before the main update, OpenAI also launched the Rosalind Biodefense program, expanding trusted access specifically for biodefense applications. This suggests OpenAI is positioning the model not just for commercial drug discovery but also for national-security-relevant biological research under controlled access.
OpenAI disclosed several high-profile organizations actively working with the updated GPT-Rosalind, including Thermo Fisher.
It’s worth noting that the specific use cases for each organization have not been detailed publicly, and the list likely includes both paying enterprise customers and collaborative research partners. The inclusion of Thermo Fisher—a major life-sciences equipment and reagents company—hints at potential tool-integration use cases alongside the Codex plugin ecosystem.
The updated GPT-Rosalind represents a clear evolution in OpenAI’s strategy: domain-specific vertical models paired with agentic tool use and controlled access. The combination of specialized scientific reasoning, lower token consumption, and Codex-based execution capabilities positions the model as a practical tool for accelerating early-stage drug discovery—literature synthesis, hypothesis generation, experimental design, and protocol execution.
But the modest absolute benchmark scores serve as a reality check. Even with GPT-5.5 under the hood, the model still struggles on most medicinal-chemistry and genomics tasks. The technology is improving rapidly, but it remains an assistive tool for expert researchers rather than a replacement for scientific judgment.
All performance claims remain self-reported by OpenAI, and until independent third-party evaluations are published, or peer-reviewed studies using the model appear, the benchmark results should be treated as directional rather than definitive.