The karate class was the first clue. Kyra, nine years old at the time, couldn’t get as low in her stances as she used to. She slowed down during soccer practice. She started walking and running on her toes. Her pediatrician couldn’t figure it out. A specialist referral began what would become a nearly twenty-year journey through tests, treatments, and consultations — all without an answer.
Kyra’s story isn’t unique. It’s the reality for roughly half of all families who pursue genomic sequencing for a rare disease. They get tested. They wait. They get an inconclusive result. And then they wait some more, sometimes for decades, wondering if the answer is hiding in data they’ve already paid to generate.
But something is shifting. In June 2026, researchers from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI published a study in NEJM AI that offers a glimpse of what’s possible when you combine the massive knowledge base of a large language model with the painstaking work of clinical geneticists.
They took 376 previously unsolved cases (cases that had already been examined by multiple commercial and institutional pipelines, discussed by multidisciplinary teams, and still came up empty) and ran them through an AI-assisted reanalysis workflow using OpenAI’s o3 Deep Research reasoning model. The result: 18 new diagnoses. A 4.8% diagnostic yield on cases that experts had already given their best shot.
That number might sound modest. But in the world of rare disease diagnostics, where every percentage point represents real families getting real answers after years of searching, it’s meaningful. And it’s part of a broader trend that’s quietly transforming how we think about genomic data.
The diagnostic odyssey is a phrase you hear constantly in rare disease circles. It describes the average five to seven years — sometimes far longer — that patients spend bouncing between specialists, undergoing invasive tests, and collecting misdiagnoses before landing on the right one. For many, genomic sequencing was supposed to be the shortcut. You sequence the genome, you find the variant, you get the diagnosis.
Except it doesn’t always work that way.
Even with exome or genome sequencing, diagnostic yields typically range from 20 to 70 percent, depending on the condition. That means anywhere from 30 to 80 percent of patients walk away without an answer. Their data sits in a database. Their questions remain unanswered.
Here’s the thing those initial reports don’t always emphasize: an inconclusive result isn’t necessarily a permanent one. The patient’s genome hasn’t changed. But the world’s understanding of it has.
Every year, researchers link new genes to diseases. Labs reclassify variants once labeled “of uncertain significance” as pathogenic or benign. Case reports accumulate in databases. Scientific literature expands. Each of those updates can make an old inconclusive case worth revisiting. The problem is scale. As Dr. Catherine Brownstein of the Manton Center puts it: “The bottleneck is time. An expert can devote only so much of their day to any one particular person.”
Alan Beggs, director of the Manton Center, frames it differently: “Researchers like Catherine and me can’t possibly keep 8,000 different diseases in our heads. That’s the power of AI.”
The study published in June 2026 wasn’t the first to show that reanalysis works. A systematic, collaborative effort in Catalonia, Spain, reanalyzed genomic data from 323 families with neurologic rare diseases and found diagnoses for 20.7% of patients. The pan-European Solve-RD project pooled data from 6,447 patients across 12 countries and delivered new diagnoses for 8.4% of the cohort, with an additional 4.1% identified through expert review. These studies relied on systematic bioinformatics pipelines and human expert networks.
What makes the new study different is the reasoning model at its center.
The workflow ran as follows. For each case, researchers assembled a de-identified packet containing standardized phenotype terms (using the Human Phenotype Ontology), clinician notes, metadata such as age and gender, and a filtered variant table capturing each variant’s rarity, predicted protein effect, ClinVar classification, and signal quality across family members. Most cases included trio data: the child and both biological parents.
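To make the shape of that packet concrete, here is a minimal Python sketch of how such a case record might be organized. The study does not publish its schema, so every class name, field name, and value below is a hypothetical stand-in chosen for illustration; only the categories of information come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariantRow:
    """One row of the filtered variant table (field names are hypothetical)."""
    gene: str
    population_frequency: float            # rarity, e.g. an allele frequency
    predicted_protein_effect: str          # e.g. "frameshift", "missense"
    clinvar_classification: Optional[str]  # None when ClinVar has no entry
    call_quality: Dict[str, str]           # signal quality per family member


@dataclass
class CasePacket:
    """De-identified inputs for one case (a sketch, not the study's schema)."""
    case_id: str                           # pseudonymous identifier
    hpo_terms: List[str]                   # Human Phenotype Ontology IDs
    clinician_notes: str
    age_years: Optional[float] = None
    gender: Optional[str] = None
    variants: List[VariantRow] = field(default_factory=list)

    def is_trio(self) -> bool:
        """True if every variant row has calls for the child and both parents."""
        needed = {"proband", "mother", "father"}
        return bool(self.variants) and all(
            needed <= set(v.call_quality) for v in self.variants
        )


# Illustrative values only; nothing here is real patient data.
packet = CasePacket(
    case_id="CASE-0001",
    hpo_terms=["HP:0001250"],              # HPO ID for "Seizure"
    clinician_notes="Free-text summary with identifiers removed.",
    age_years=9,
    variants=[
        VariantRow(
            gene="GENE1",                  # placeholder gene symbol
            population_frequency=0.00001,
            predicted_protein_effect="frameshift",
            clinvar_classification=None,
            call_quality={"proband": "pass", "mother": "pass", "father": "low"},
        )
    ],
)
print(packet.is_trio())
```

Keeping per-member call quality on each variant row is one way to preserve the family-level signal the article describes. A real pipeline would likely carry far more annotation.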
They didn’t just ask the model to rank genes. They asked it to propose the most plausible molecular explanation and, crucially, to show its work. The model was prompted to connect clinical features, inheritance patterns, variant evidence, and scientific literature into a justification that a human reviewer could interrogate. At least two team members reviewed each candidate. Disagreements were resolved by consensus. A model output was never treated as a diagnosis — it was a lead. A diagnosis only counted after qualified experts reviewed the evidence, classified the variant as pathogenic or likely pathogenic under ACMG/AMP guidelines, a CLIA-certified lab confirmed it, and the clinical team returned the result to the family.
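The distinction between a lead and a diagnosis can be expressed as a simple gate. The sketch below is an illustration of the criteria listed above, not the study's software: the `Lead` class, its field names, and the `counts_as_diagnosis` function are all hypothetical.

```python
from dataclasses import dataclass

# Classifications that qualify under ACMG/AMP guidelines, per the article.
ACCEPTED_CLASSES = {"pathogenic", "likely pathogenic"}


@dataclass
class Lead:
    """A model-proposed candidate awaiting human review (hypothetical fields)."""
    gene: str
    reviewer_count: int        # team members who reviewed the candidate
    consensus_reached: bool    # disagreements resolved by consensus
    acmg_class: str            # expert classification of the variant
    clia_confirmed: bool       # confirmed by a CLIA-certified lab
    returned_to_family: bool   # clinical team returned the result


def counts_as_diagnosis(lead: Lead) -> bool:
    """A lead counts only when every human-led step has been completed."""
    return (
        lead.reviewer_count >= 2
        and lead.consensus_reached
        and lead.acmg_class.strip().lower() in ACCEPTED_CLASSES
        and lead.clia_confirmed
        and lead.returned_to_family
    )


confirmed = Lead("GENE1", 2, True, "Likely pathogenic", True, True)
pending = Lead("GENE2", 2, True, "Pathogenic", False, False)
print(counts_as_diagnosis(confirmed), counts_as_diagnosis(pending))
```

Nothing the model outputs appears as a condition in the gate. Every check is a step performed by people or by a certified lab, which is the point the authors stress.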
Before touching the unsolved cases, the team stress-tested the workflow on cases with known diagnoses. It recovered the correct gene and variant in duplicate runs for 48 of 51 cases across various rare conditions. In a set of 57 neuromuscular cases, it returned the correct diagnosis in duplicate runs for 45. In a 15-case long-read genome set, it named the correct gene in every case and both disease-causing alleles in 12. The model’s self-reported confidence scores tracked with correctness — mean minimum score of 85.6 for consistently correct calls versus 42.1 for incorrect or unknown calls. Not calibrated probabilities, but useful signals for guiding expert attention.
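Those benchmark counts are easier to compare as rates. The short Python sketch below converts the figures quoted above into percentages. The labels and the `recovery_rate` helper are illustrative names, not terms from the study.

```python
# (label, correct calls, total cases), using the counts quoted in the article.
benchmarks = [
    ("mixed rare conditions, gene and variant", 48, 51),
    ("neuromuscular, diagnosis", 45, 57),
    ("long-read genomes, gene", 15, 15),
    ("long-read genomes, both alleles", 12, 15),
]


def recovery_rate(correct: int, total: int) -> float:
    """Fraction of known-answer cases the workflow recovered."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


rates = {label: recovery_rate(c, t) for label, c, t in benchmarks}
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

That works out to roughly 94% on the mixed set, 79% on the neuromuscular set, and 80% for naming both alleles in the long-read set.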
Then they applied it to the 376 unsolved cases across four cohorts: 100 children with neurodevelopmental conditions (10% yield), 61 with rare neuromuscular disease (6.6% yield), 200 cases of sudden unexpected death in pediatrics (1% yield), and 15 children and adolescents with early psychosis (13.3% yield, though the small cohort size means wide confidence intervals).
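As a sanity check, the per-cohort yields can be turned back into case counts to see that they add up to the headline figure. The sketch below assumes each reported percentage is a rounded version of a whole number of diagnoses. The variable and function names are illustrative.

```python
# Cohort sizes and reported yields, as quoted in the article.
cohorts = {
    "neurodevelopmental": (100, 0.10),
    "neuromuscular": (61, 0.066),
    "sudden unexpected death in pediatrics": (200, 0.01),
    "early psychosis": (15, 0.133),
}


def implied_diagnoses(size: int, reported_yield: float) -> int:
    """Round size * yield to the nearest whole case."""
    return round(size * reported_yield)


counts = {name: implied_diagnoses(n, y) for name, (n, y) in cohorts.items()}
total_cases = sum(n for n, _ in cohorts.values())
total_diagnoses = sum(counts.values())
overall_yield = total_diagnoses / total_cases

for name, k in counts.items():
    print(f"{name}: {k} of {cohorts[name][0]}")
print(f"overall: {total_diagnoses} of {total_cases} = {overall_yield:.1%}")
```

Under that rounding assumption, the cohorts contribute 10, 4, 2, and 2 diagnoses, which sum to the 18 of 376 (4.8%) reported for the study as a whole.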
Of the 18 diagnoses, seven were rediscoveries — diagnoses established outside the local research workflow but absent from the record the team reviewed. In several cases, the variants were already listed as pathogenic in public databases, highlighting the operational challenge of synthesizing information across fragmented data sources.
The model also did things that surprised the researchers.
In one early-psychosis case, it inferred a structural variant — a 22q11.2 deletion associated with DiGeorge syndrome — that wasn’t even listed in the input data. It connected a run of low-quality calls on chromosome 22 with the child’s cardiac, immune, neurodevelopmental, and psychiatric features, then hypothesized the deletion. Follow-up genome sequencing confirmed it.
In another case, the model surfaced two genes — LAMA2 and FOXP1 — that together explained a complex presentation of muscle and neurodevelopmental features. Another had a previously unrecognized digenic explanation involving TTN and SRPK3. The prompt had asked for one monogenic cause. The model recognized when one wasn’t enough.
Beyond diagnoses, the model generated a novel mechanistic hypothesis for vitiligo in a neurodevelopmental case, highlighting an 11-amino-acid deletion in S1PR1 — a receptor involved in immune cell movement and tissue biology — and proposing how it could alter signaling in ways that reduce pigment production while helping immune cells persist in the skin. That hypothesis needs experimental validation. But it illustrates a role for AI that goes beyond pattern matching: translating scattered findings from structural biology, immunology, and clinical genetics into concrete, testable ideas.
The team also saw possible phenotype expansion in the neuromuscular cohort. Damaging variants in HSPB8 and CDK13 didn’t perfectly match the genes’ best-known disorders, suggesting a broader clinical spectrum that more cases and lab work will need to test.
And then there was Kyra. Her case was one of the four diagnoses in the neuromuscular cohort. The team linked her condition to a frameshift variant in HSPB8, diagnosing a form of myofibrillar myopathy — a condition where abnormal protein structures build up in muscle fibers and contribute to weakness. A genetic counselor from the Manton Center called her about a week before her 28th birthday. By then, she had spent much of her life adapting to the disease, dependent on a ventilator and in a wheelchair since age 13, though her condition has since plateaued. Her form of myofibrillar myopathy is so rare that little is known about its long-term course. But the diagnosis brought something she hadn’t had in nearly two decades: closure.
The study’s authors are careful about limitations. It was retrospective. The cohorts were heterogeneous. Reviewers weren’t blinded to model confidence. They didn’t measure time saved, cost, clinician effort, false-positive workload, or changes in care. They didn’t systematically evaluate structural variants, repeat expansions, deep-intronic changes, or mosaicism. Large language models can misread context or produce plausible explanations that fail on closer inspection — which is why every result passed through human adjudication and clinical confirmation.
“This study is not evidence that patients, clinicians, or customers should use OpenAI models to diagnose disease or make medical decisions,” the authors write. “It does not describe or endorse an intended customer use of OpenAI o3 Deep Research, ChatGPT, or any other OpenAI product for diagnosis. The model did not diagnose any participant; physicians and other qualified clinical experts made every diagnosis through established review, testing, and clinical-confirmation processes.”
That disclaimer matters. The hype cycle around AI in medicine has trained people to expect replacement narratives — AI replaces doctors, AI replaces radiologists, AI replaces genetic counselors. This study is explicitly not that. It’s about a reasoning layer that widens the search and focuses the subsequent human-led analysis. The model doesn’t decide what information or diagnosis gets returned to a family. Specialists do.
So, where does this go from here?
The researchers call for prospective, multi-center studies comparing LLM-assisted reanalysis with standard practice on diagnostic yield, time to candidate, clinician effort, false-positive burden, cost, and effects on care. Versioned prompts, reference checks, audit logs, and calibrated uncertainty will be important for reproducibility and safety. Qualified clinicians would still evaluate evidence, order appropriate tests, and make any diagnosis or treatment decision.
Newer general-purpose models can search and synthesize more scientific material. Purpose-built systems like GPT-Rosalind are designed for deeper life-sciences work, including variant effects on protein structure and function. Those capabilities will require their own evaluations and access controls.
The Manton Center will lead the next stage through a grant from the OpenAI Foundation, developing a platform-agnostic, low-cost genetics AI copilot that helps clinical teams analyze rare disease cases more quickly and consistently. The longer-term research opportunity is exploring whether expert-led AI-assisted reanalysis can help scientific understanding keep pace with discovery.
The market signals are already moving. The AI in rare disease diagnostics market was valued at $1.7 billion in 2025 and is projected to reach $19.4 billion by 2035, growing at a 28.7% CAGR. Companies like 3billion have built automated reanalysis pipelines that reanalyze genomic data every six months for three years, triggered by new gene-disease associations, new patient symptoms, or algorithm updates. The Undiagnosed Diseases Network International continues expanding its standardized data-sharing framework. Australia is building its own national Undiagnosed Diseases Network with AI-driven reanalysis platforms, aiming to boost diagnosis rates from 50% to more than 70%.
For families still waiting, the promise isn’t that AI replaces a doctor’s diagnosis. It’s that carefully evaluated research tools may help specialists identify evidence worth investigating. Today’s unanswered questions don’t have to remain unanswered forever.
Kyra got her answer at 27. The data that held it had been sitting there for years. The knowledge needed to interpret it arrived later. The bridge between them — in this case, a reasoning model that could synthesize across phenotype, variant data, and literature — made the connection possible.
There are thousands of Kyras. Their genomes are already sequenced. Their data is already stored. The answers are accumulating in databases and literature every day. The challenge has never been the sequencing. It’s the synthesis. And for the first time, we’re building tools that can actually help with that.
Not perfectly. Not autonomously. Not without experts in the loop. But help — real, measurable, clinically confirmed help — is starting to arrive.
