Big tech companies have been experimenting with large language models such as GPT-3 and PaLM for a while now. In the wake of OpenAI's ChatGPT, Google has recently joined the party with a ChatGPT-like effort for healthcare: MultiMedQA, a benchmark for medical question answering, and a model designed specifically to answer medical queries.
MultiMedQA combines HealthSearchQA (a new, free-response dataset of medical questions) with six existing open question-answering datasets that cover professional medical exams, consumer queries, and research.
The benchmark comes with a framework for human evaluation of model responses along several axes, including precision, factuality, and potential harm.
MultiMedQA covers both multiple-choice questions and questions that call for longer answers, posed by medical professionals and by non-professionals. It comprises MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and the MMLU clinical topics datasets. The benchmark was then extended with HealthSearchQA, a new dataset of curated, frequently searched medical queries.
HealthSearchQA's dataset of 3375 consumer questions was curated using seed medical diagnoses and their related symptoms. The seed data was used to retrieve frequently asked questions generated by a search engine, which were shown to users who entered the seed terms.
The work builds on PaLM (an LLM with 540 billion parameters) and Flan-PaLM (an instruction-tuned variant of PaLM), both of which were evaluated on MultiMedQA.
Flan-PaLM achieves state-of-the-art (SOTA) performance on MedQA, MedMCQA, PubMedQA, and the MMLU clinical topics. It does so by combining few-shot, chain-of-thought (CoT), and self-consistency prompting techniques, a combination that often surpasses strong LLM baselines by large margins. On the USMLE questions in the MedQA dataset, Flan-PaLM scores more than 17% higher than the previous SOTA. However, human evaluation identifies significant gaps in Flan-PaLM's answers.
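To make the prompting combination concrete, here is a minimal Python sketch of how few-shot examples, chain-of-thought reasoning, and self-consistency can fit together. It is an illustration under stated assumptions, not the setup used for Flan-PaLM: the function names, the "Question / Explanation / Answer" prompt layout, and the `sample_completion` callable (a stand-in for a sampled call to any language model) are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def build_few_shot_cot_prompt(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Join worked (question, reasoning, answer) examples, then append the new question.

    The "Question / Explanation / Answer" layout is an assumed format for illustration.
    """
    parts = [
        f"Question: {q}\nExplanation: {reasoning}\nAnswer: {answer}"
        for q, reasoning, answer in examples
    ]
    # Leave the explanation open so the model writes its own reasoning first.
    parts.append(f"Question: {question}\nExplanation:")
    return "\n\n".join(parts)


def extract_answer(completion: str) -> str:
    """Take the text after the last 'Answer:' marker as the final choice."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def self_consistent_answer(
    sample_completion: Callable[[str], str],  # hypothetical stand-in for a sampled model call
    prompt: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample several reasoning paths and return the majority answer with its vote share."""
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples
```

The idea behind the majority vote is that different sampled reasoning chains may disagree, and the answer reached most often is taken as the final one. The vote share gives a rough sense of how much the samples agree.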
Med-PaLM is the model that was developed to address this problem. It is reported to perform better than Flan-PaLM, although it has yet to surpass the judgment of human medical experts.
A panel of clinicians judged 92.6% of Med-PaLM responses to be in line with scientific consensus, comparable to clinician-generated answers (92.9%). In comparison, only 61.9% of long-form Flan-PaLM answers were considered to be in line with scientific consensus. The panel also found that 5.8% of Med-PaLM responses could potentially contribute to adverse consequences, comparable to 6.5% of clinician-generated answers. For Flan-PaLM, the figure was 29.7%.