Tracking the Explosive World of Generative AI

Google’s New Medical AI Passes Medical Exam and Outperforms Actual Doctors

A medical-domain AI developed by Google researchers broke records on medical exam questions and, more surprisingly, generated answers that were consistently rated as better than those written by human doctors. While the study notes several caveats, it marks a significant milestone in how AI could upend a number of professions.

A new language model created by Google researchers has set a new record for accuracy on US medical license test questions. Photo illustration: Artisana

🧠 Stay Ahead of the Curve

  • A Google-created medical AI model, Med-PaLM 2, scored an impressive 86.5% on questions styled after the US Medical Licensing Examination.

  • The AI model’s answers were also rated by physicians as better than doctor-written responses on 8 of 9 dimensions.

  • This advancement signals a potential paradigm shift in healthcare as AI models are increasingly capable of working in complex fields. 

By Michael Zhang

May 18, 2023

A new study from Google’s research division shows that Med-PaLM 2, its AI language model specifically trained on medical knowledge, scored an astounding 86.5% on a question set styled after the US Medical Licensing Examination (USMLE), well above the typical 60% pass threshold for human examinees. More importantly, a panel of human doctors consistently preferred Med-PaLM 2's answers to those offered by actual physicians, a sign of the massive leaps in progress AI models have made in mere months.

"Answering medical questions by applying medical knowledge and reasoning at a level comparable to doctors has long been seen as a significant challenge," the researchers observed. Their study’s findings represent substantial progress towards this ambitious goal.

Methodology of the Study

Med-PaLM 2 was built from Google’s foundational language model, PaLM 2, and then fine-tuned with specific medical domain data. The researchers also implemented an innovative prompting strategy called ensemble refinement, employing techniques like chain-of-thought and self-consistency to enhance the model’s medical reasoning on multiple-choice questions.
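
To make the prompting idea concrete, the minimal Python sketch below shows self-consistency-style majority voting over sampled chain-of-thought answers. It is an illustration only: the sample_answer() stub stands in for a real language-model call, and it covers just the self-consistency component rather than the full ensemble-refinement procedure.

```python
"""Minimal sketch of self-consistency-style majority voting over sampled
chain-of-thought answers. The sample_answer() stub is a hypothetical
placeholder for a language-model call, not Google's actual API or the
complete ensemble-refinement method."""
import random
from collections import Counter


def sample_answer(question: str, choices: list[str]) -> str:
    # Placeholder: in practice this would prompt a model to reason step by
    # step (chain of thought) at a non-zero temperature and return the
    # option it commits to. Here we pick randomly so the sketch runs.
    return random.choice(choices)


def majority_vote_answer(question: str, choices: list[str], n_samples: int = 11) -> str:
    # Sample several independent reasoning paths and keep the most common
    # final answer; self-consistency relies on errors washing out across paths.
    votes = Counter(sample_answer(question, choices) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    question = "Which electrolyte abnormality classically produces peaked T waves?"
    options = ["Hypokalemia", "Hyperkalemia", "Hyponatremia", "Hypercalcemia"]
    print(majority_vote_answer(question, options))
```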

The model was put through its paces against multiple-choice and long-form questions from the MultiMedQA evaluation set, a database of thousands of questions, many modeled after the official USMLE.

The researchers additionally had a panel of 15 physicians evaluate Med-PaLM 2 in two supplementary experiments:

  • A team of doctors assessed pairs of AI-generated and doctor-written responses in a pairwise ranking evaluation across nine aspects, including reasoning, consensus, and knowledge recall (a rough sketch of how such a tally might be computed appears after this list).

  • Two adversarial datasets were used to elicit answers probing the limits of the AI model. The responses were then assessed by doctors for risk factors such as demographic bias, irrelevant information, and potential harm.
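
For illustration only, the sketch below shows how a per-dimension preference tally from such a pairwise evaluation might be computed. The dimension names are paraphrased from the article's list, and the data format is an assumption, not the study's actual pipeline.

```python
"""Illustrative tally of pairwise physician preferences per evaluation dimension.
Dimension names are paraphrased from the article; the data format is assumed."""
import math
from collections import Counter

DIMENSIONS = [
    "reflects medical consensus", "reading comprehension", "knowledge recall",
    "reasoning", "extent of harm", "likelihood of harm",
    "demographic bias", "omits important information",
    "inaccurate or irrelevant information",
]


def ai_win_rates(ratings: list[dict[str, str]]) -> dict[str, float]:
    # Each rating maps a dimension to the preferred source, "ai" or "physician".
    # Returns, per dimension, the fraction of comparisons in which the
    # AI-generated answer was preferred.
    rates = {}
    for dim in DIMENSIONS:
        prefs = Counter(r[dim] for r in ratings if dim in r)
        total = prefs["ai"] + prefs["physician"]
        rates[dim] = prefs["ai"] / total if total else math.nan
    return rates


if __name__ == "__main__":
    toy_ratings = [  # two hypothetical pairwise comparisons
        {"reasoning": "ai", "knowledge recall": "ai", "extent of harm": "physician"},
        {"reasoning": "ai", "knowledge recall": "physician", "extent of harm": "ai"},
    ]
    for dim, rate in ai_win_rates(toy_ratings).items():
        print(f"{dim}: {'no data' if math.isnan(rate) else f'{rate:.0%}'}")
```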

Astounding Results

On the MedQA benchmark, Med-PaLM 2 displayed a marked leap over its predecessor, whose results were released three months ago. While Med-PaLM scored 67.2%, a score sufficient to pass the actual USMLE, Med-PaLM 2 scored 86.5% on the same question set. A score of 60% is the typical standard for human candidates to pass the USMLE.

Researchers noted that some of the most significant leaps were in the quality of Med-PaLM 2's long-form responses. Other language models, such as OpenAI’s GPT-4, have also made considerable progress over their predecessors in providing more robust long-form answers.

Med-PaLM 2's progress is another milestone in a series of rapid improvements in generative AI's ability to tackle questions in the medical domain.

In a pairwise study in which a panel of physicians graded AI-generated answers against physician-written answers across 1,066 questions, Med-PaLM 2's answers were often rated as higher quality than the physicians' answers.

Specifically, AI-generated answers outperformed human answers along eight of nine dimensions:

  • Better reflecting medical consensus

  • Better reading comprehension

  • Better knowledge recall

  • Better reasoning

  • Lower extent of harm

  • Lower likelihood of harm

  • Lower likelihood of demographic bias

  • Less likely to omit important information

Physician-generated answers outperformed Med-PaLM 2 answers on just one dimension: not containing inaccurate or irrelevant information. 

In a pairwise evaluation of human and AI-generated answers, a panel of human doctors rated Med-PaLM 2's answers as better in 8 of 9 dimensions.

Med-PaLM 2 was also tested on an adversarial dataset of questions specifically designed to elicit harmful or biased answers. Here, Med-PaLM 2's answers were compared against those of its predecessor, Med-PaLM, and human physicians graded Med-PaLM 2's answers as substantially better across the nine key dimensions.

Limitations and Future Prospects

Despite its impressive performance, researchers caution against over-extrapolating Med-PaLM 2's potential for real-world scenarios.

The model's proficiency in answering MedQA questions may not necessarily translate to the complexities of real-life situations, which demand more nuanced understanding, context, and empathy. Interestingly, empathy wasn't tested as part of the evaluation criteria for the study.

Furthermore, the study's methodology may not mirror the challenges of actual medical practice, the researchers note. For example, the physicians providing human answers were prompted with generic circumstances rather than specific clinical scenarios.

Moreover, the study involved answering questions only once, without follow-ups, a stark contrast to the iterative nature of medicine, which involves ongoing discovery and case management. Lastly, the quality of the physicians' answers might have been influenced by the absence of examples of low- or high-quality responses, which could have served as reference points for their submissions.

The breakthrough results of Med-PaLM 2 emerge amidst a surge in investment in domain-specific AI models across various fields. Companies are increasingly looking to leverage AI's potential to revolutionize a wide range of professions, while researchers continue to expand the boundaries of AI in their studies.

Venture capital giants Andreessen Horowitz and Sequoia Capital have just announced significant investments in the medical and legal tech sectors, respectively. Andreessen Horowitz has made a $50M seed investment in Hippocratic AI, a company developing a language model for the medical sector with a focus on non-diagnostic tasks and patient communication. Meanwhile, Sequoia Capital led a $21M Series A round in Harvey, a company creating a chatbot for corporate lawyers.
