OpenEvidence AI Becomes the First AI in History to Score Above 90% on the United States Medical Licensing Examination (USMLE)

CAMBRIDGE, Mass., July 14, 2023 /PRNewswire/ -- OpenEvidence, a generative Artificial Intelligence (AI) company working on aligning Large Language Models (LLMs) to the medical domain, announced today that OpenEvidence AI has become the first AI in history to score above 90% on the United States Medical Licensing Examination (USMLE). Previously, AIs such as ChatGPT and Google’s Med-PaLM 2 reported scores of 59% and 86%, respectively.


“The horizon of the possible in Artificial Intelligence (AI) has been redefined yet again, as OpenEvidence AI becomes the first AI in history to score above 90% on the United States Medical Licensing Examination (USMLE). Single-point differences on this benchmark translate into highly impactful differences in AI performance, since the USMLE contains hundreds of questions, and each additional USMLE score point represents multiple additional correct answers—each one of which corresponds to medical knowledge that could translate into life or death for a patient, if the AI system is used as a physician co-pilot in a clinical setting,” said Daniel Nadler, PhD, Founder of OpenEvidence. “A widely cited study published in the BMJ in 2016 estimated that medical errors were the third leading cause of death in the United States, after heart disease and cancer. At that scale, any system that could augment a physician and reduce medical errors on an absolute basis by even 5-10% would be extraordinarily impactful to the lives of tens of thousands of patients in the United States alone. On a relative basis, and treating the previous state-of-the-art systems as a baseline, OpenEvidence AI makes 77% fewer errors on the US Medical Licensing Exam than ChatGPT, and 31% fewer errors than Google’s Med-PaLM 2, thereby achieving the lowest error rate in the history of any AI on the USMLE. It’s fair to consider the relative performance of these AIs in this manner, given the disproportionate effect of an error in medicine.”
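The relative-error comparison in the quote above can be reproduced with simple arithmetic. The sketch below is illustrative only: it treats the error rate as 100 minus the reported score, uses the reported scores of 59% and 86%, and assumes exactly 90% for OpenEvidence AI as a lower bound, since the exact score is not given in this release. The function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Return the fraction of the baseline's errors eliminated by the new system.

    Scores are percentages correct (0-100); error rate is taken as 100 - score.
    """
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    if baseline_error <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Reported scores from the release.
CHATGPT_SCORE = 59.0
MED_PALM_2_SCORE = 86.0
# Assumption: 90.0 is only a lower bound; the release says "above 90%".
OPENEVIDENCE_ASSUMED_SCORE = 90.0

for name, score in [("ChatGPT", CHATGPT_SCORE), ("Med-PaLM 2", MED_PALM_2_SCORE)]:
    reduction = relative_error_reduction(score, OPENEVIDENCE_ASSUMED_SCORE)
    print(f"vs {name}: {reduction:.1%} fewer errors")
```

At the assumed 90% floor this gives roughly 75.6% and 28.6% fewer errors. The quoted figures of 77% and 31% are somewhat higher, which is consistent with a score above 90%.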

Generative AI & the US Medical Licensing Exam

The USMLE is a three-step examination for medical licensure in the United States. It assesses a physician’s ability to apply knowledge, concepts, and principles, as well as demonstrate fundamental patient-centered skills that form the foundation of safe and effective patient care. The USMLE is a rigorous test that demands a broad understanding of biomedical and clinical sciences, testing not only factual recall, but also decision-making ability. Artificial Intelligence achieving a score above 90% on the USMLE—a feat almost unthinkable even 18 months ago—showcases the tremendous strides that Artificial Intelligence generally—and OpenEvidence specifically—have made in understanding and applying complex medical concepts.

As of July 11, 2023, both GPT-4 and ChatGPT incorrectly answer (A) Blood cultures, whereas OpenEvidence AI correctly answers (C) Human leukocyte antigen-B27 assay.

Best Paper of 2023: The Association for Health Learning and Inference (AHLI)

Earlier this year, The New England Journal of Medicine AI featured “Do We Still Need Clinical Language Models?”, a paper published by OpenEvidence in partnership with researchers from MIT and Harvard Medical School. The paper found that language models specialized for medical text outperform much larger general-domain models trained on general text (such as GPT-3) when compared on the same medical domain-specific tasks. OpenEvidence’s paper went on to win Best Paper at the 2023 Conference on Health, Inference, and Learning (CHIL), the preeminent conference for computer scientists working on medical applications.

Founding Team from Harvard and MIT

OpenEvidence was founded by Daniel Nadler, a Harvard PhD who previously founded Kensho Technologies (which in 2018 was acquired in the largest AI deal in history at the time). OpenEvidence’s key scientists, including CTO Zachary Ziegler, Jonas Wulff, Micah Smith, Evan Hernandez, and Eric Lehman, all come out of artificial intelligence labs at Harvard and MIT. Eric Lehman (MIT) was the lead author of both this study and OpenEvidence’s award-winning paper, “Do We Still Need Clinical Language Models?”

Mayo Clinic Platform

Earlier this year, OpenEvidence became a Mayo Clinic Platform Accelerate company. In a social media post, Mayo Clinic Platform said “OpenEvidence is using novel technology to organize the world’s medical knowledge into understandable, clinically useful formats. As part of Mayo Clinic Platform Accelerate, they are one step closer to improving how health care information is structured.” Dr. Antonio Jorge Forte, a Mayo Clinic physician and the Terrance D. and Judith A. Paul Director of MayoExpert, said: “OpenEvidence can be the foundational technology to power all clinical decision tools.”