Google’s New Medical AI Passes Medical Exam and Outperforms Actual Doctors
A medical-domain AI developed by Google researchers broke records on medical exam questions and, more surprisingly, generated answers that were consistently rated better than those of human doctors. While the study notes several caveats, it marks a significant milestone in how AI could upend a number of professions.
A new language model created by Google Researchers has set a new record for accuracy on US medical license test questions. Photo illustration: Artisana
- A Google-created medical AI model, Med-PaLM 2, scored an impressive 86.5% on questions styled after the US Medical Licensing Examination.
- The AI model’s answers were also rated by actual doctors to be better than doctor-generated responses in 8 of 9 dimensions.
- This advancement signals a potential paradigm shift in healthcare as AI models are increasingly capable of working in complex fields.
May 18, 2023
A new study from Google’s research division shows that Med-PaLM 2, their AI language model specifically trained in medical knowledge, scored an astounding 86.5% on a question set styled after the US Medical Licensing Examination (USMLE), well surpassing the typical 60% pass threshold for human examinees. More importantly, a panel of human doctors consistently preferred Med-PaLM 2's answers to those offered by actual physicians, a sign of the massive leaps in progress AI models have made in mere months.
"Answering medical questions by applying medical knowledge and reasoning at a level comparable to doctors has long been seen as a significant challenge," the researchers observed. Their study’s findings represent substantial progress towards this ambitious goal.
Methodology of the Study
Med-PaLM 2 was built from Google’s foundational language model, PaLM 2, and then fine-tuned with specific medical domain data. The researchers also implemented an innovative prompting strategy called ensemble refinement, employing techniques like chain-of-thought and self-consistency to enhance the model’s medical reasoning in multiple-choice queries.
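To make the idea concrete, here is a minimal sketch of the two-stage pattern the paper describes: sample several chain-of-thought answers, feed those drafts back to the model for refined answers, and take a plurality vote. The `query_model` function, prompts, and answer-extraction rule are illustrative assumptions, not Google’s actual implementation.

```python
import re
from collections import Counter

def query_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a call to the underlying LLM; not a real API."""
    raise NotImplementedError("wire up a real model client here")

def extract_choice(completion: str) -> str | None:
    """Pull the last standalone option letter (A-E) out of a completion."""
    letters = re.findall(r"\b([A-E])\b", completion)
    return letters[-1] if letters else None

def ensemble_refinement(question: str, n_chains: int = 5, n_refined: int = 5) -> str:
    """Stage 1: sample several chain-of-thought answers at nonzero temperature.
    Stage 2: condition the model on those chains to produce refined answers,
    then take a plurality vote over the refined answers (self-consistency)."""
    cot_prompt = f"{question}\nLet's think step by step, then answer with a single letter."
    chains = [query_model(cot_prompt, temperature=0.7) for _ in range(n_chains)]

    refine_prompt = (
        f"{question}\n\nDraft reasoning chains:\n\n"
        + "\n---\n".join(chains)
        + "\n\nConsidering the drafts above, reason again and give a final single-letter answer."
    )
    votes = [extract_choice(query_model(refine_prompt, temperature=0.7))
             for _ in range(n_refined)]
    votes = [v for v in votes if v is not None]
    if not votes:
        raise ValueError("no parseable answers sampled")
    return Counter(votes).most_common(1)[0][0]
```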
The model was put through its paces against multiple-choice and long-form questions from the MultiMedQA evaluation set, a database of thousands of questions, many modeled after the official USMLE.
Researchers additionally had a panel of 15 physicians evaluate Med-PaLM 2 in two supplementary experiments:
- A team of doctors assessed pairs of AI-generated and doctor-written responses in a pairwise ranking evaluation across nine aspects, including reasoning, consensus, and knowledge recall (a minimal scoring sketch follows this list).
- Two adversarial datasets were used to probe the limits of the AI model. Doctors then assessed the responses for risk factors such as demographic bias, irrelevant information, and potential harm.
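As a rough illustration of how such a pairwise evaluation can be scored, the snippet below tallies per-dimension win rates from physician judgments. The record schema and dimension names are illustrative assumptions, not the study’s actual data format.

```python
from collections import defaultdict

# One record per (question, dimension, rater) judgment; schema is illustrative.
# Each entry: (dimension, preferred), where preferred is "model", "physician", or "tie".
judgments = [
    ("reflects_consensus", "model"),
    ("reading_comprehension", "model"),
    ("knowledge_recall", "physician"),
    ("reflects_consensus", "tie"),
]

def win_rates(records):
    """Fraction of non-tie judgments per dimension in which the model's answer won."""
    wins, totals = defaultdict(int), defaultdict(int)
    for dimension, preferred in records:
        if preferred == "tie":
            continue
        totals[dimension] += 1
        if preferred == "model":
            wins[dimension] += 1
    return {dim: wins[dim] / totals[dim] for dim in totals}

print(win_rates(judgments))
# -> {'reflects_consensus': 1.0, 'reading_comprehension': 1.0, 'knowledge_recall': 0.0}
```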
Med-PaLM 2's performance on the MedQA benchmark marked a significant leap over its predecessor, whose results were released just three months earlier. While Med-PaLM scored 67.2%, already enough to pass the actual USMLE, Med-PaLM 2 scored 86.5% on the same question set; a score of 60% is the typical passing standard for human candidates.
Researchers noted that some of the most significant gains came in the quality of Med-PaLM 2's long-form responses. Other language models, such as OpenAI’s GPT-4, have also made considerable progress in providing more robust long-form answers than their predecessors.
In a pairwise study, in which a panel of physicians graded AI-generated answers against physician-written answers across 1,066 questions, Med-PaLM 2's answers were often rated as higher quality than the physicians’ own.
Specifically, AI-generated answers outperformed human answers along eight of nine dimensions:
- Better reflecting medical consensus
- Better reading comprehension
- Better knowledge recall
- Better reasoning
- Less extent of harm
- Less likelihood of harm
- Less likelihood of demographic bias
- Less likely to omit important information
Physician-generated answers outperformed Med-PaLM 2 answers on just one dimension: not containing inaccurate or irrelevant information.
The model was also tested on adversarial datasets containing questions specifically designed to elicit harmful or biased answers. Here, Med-PaLM 2's responses were compared against those of its predecessor, Med-PaLM, and physicians graded Med-PaLM 2's answers as substantially better across all nine dimensions.
Limitations and Future Prospects
Despite its impressive performance, researchers caution against over-extrapolating Med-PaLM 2's potential for real-world scenarios.
The model's proficiency in answering MedQA questions may not necessarily translate to the complexities of real-life situations, which demand more nuanced understanding, context, and empathy. Interestingly, empathy wasn't tested as part of the evaluation criteria for the study.
Furthermore, the study's methodology may not mirror the challenges of actual medical practice, the researchers note. For example, the physicians providing comparison answers were given only generic instructions rather than specific clinical scenarios.
Moreover, the study involved answering questions only once, without follow-ups, in stark contrast to the iterative process of real medicine, with its ongoing discovery and case management. Lastly, the quality of the physicians' answers might have been influenced by the absence of example low- or high-quality responses that could have served as a reference point for their submissions.
The breakthrough results of Med-PaLM 2 emerge amidst a surge in investment in domain-specific AI models across various fields. Companies are increasingly looking to leverage AI's potential to revolutionize a wide range of professions, while researchers continue to expand the boundaries of AI in their studies.
Venture capital giants Andreessen Horowitz and Sequoia Capital have just announced significant investments in the medical and legal tech sectors, respectively. Andreessen Horowitz invested $50M in seed funding in Hippocratic AI, a company developing a language model for the medical sector focused on non-diagnostic tasks and patient communication. Meanwhile, Sequoia Capital led a $21M Series A round in Harvey, a company creating a chatbot for corporate lawyers.