Study finds health care evaluations of large language models lacking in real patient data and bias assessment
A new systematic review reveals that only 5% of health care evaluations for large language models use real patient data, with significant gaps in assessing bias, fairness, and a wide range of tasks, underscoring the need for more comprehensive evaluation methods.
In a recent study published in JAMA, researchers in the United States (U.S.) conducted a systematic review of existing evaluations of large language models (LLMs) in healthcare, examining aspects such as the healthcare tasks addressed and the types of data assessed, to identify the areas of healthcare where LLMs are most useful.
Background
The use of artificial intelligence (AI) in healthcare has advanced rapidly, especially with the development of LLMs. Unlike predictive AI, which forecasts outcomes, generative AI based on LLMs can create a wide range of new content, such as images, sounds, and text.
Based on user inputs, LLMs can generate structured and largely coherent text responses, which makes them valuable in healthcare. Some health systems in the U.S. already use LLMs for notetaking, and the models are being explored more broadly in medicine to improve efficiency and patient care.
However, the surge of interest in LLMs has also led to unstructured testing across various fields, and the performance of LLMs in clinical settings has been mixed. While some studies have found LLM responses to be largely superficial and often inaccurate, others have reported accuracy rates comparable to those of human clinicians.
This inconsistency highlights the need for a systematic evaluation of the performance of LLMs in the healthcare setting.
About the study
For this comprehensive systematic review, the researchers searched preprints and peer-reviewed studies on LLM evaluations in healthcare published between January 2022 and February 2024. This roughly two-year window was selected to capture papers published after the launch of the AI chatbot ChatGPT in November 2022.
Three independent reviewers screened the studies, which were included in the review if they focused on LLM evaluations in healthcare. Studies on basic biological research or multimodal tasks were excluded.
The studies were then categorized by the type of data evaluated, the healthcare tasks, the natural language processing (NLP) and natural language understanding tasks, the medical specialties, and the evaluation dimensions. The framework for categorization was developed from an existing list of healthcare tasks, established evaluation models, and input from healthcare professionals.
The categorization framework considered whether real patient data was evaluated and examined 19 healthcare tasks, including caregiving and administrative functions. Additionally, six NLP tasks, including summarization and question answering, were included in the categorization.
Furthermore, seven dimensions of evaluation were identified, including aspects such as factuality, accuracy, and toxicity. The studies were also grouped by medical specialty into 22 categories. The researchers then used descriptive statistics to summarize the findings and calculate the percentages and frequencies for each category.
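To make the categorization and summary step concrete, the following is a minimal, illustrative Python sketch (not taken from the study) of how coded studies might be tallied into frequencies and percentages; the example records, field names, and category labels are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical coded studies; the actual review coded 519 studies against
# 19 healthcare tasks, 6 NLP tasks, 7 evaluation dimensions, and 22 specialties.
studies = [
    {"uses_real_patient_data": False, "healthcare_task": "medical knowledge",
     "nlp_task": "question answering", "dimensions": ["accuracy"]},
    {"uses_real_patient_data": True, "healthcare_task": "clinical notetaking",
     "nlp_task": "summarization", "dimensions": ["accuracy", "comprehensiveness"]},
    {"uses_real_patient_data": False, "healthcare_task": "diagnosis",
     "nlp_task": "question answering", "dimensions": ["accuracy", "bias"]},
]

def percentages(counts: Counter, total: int) -> dict:
    """Convert raw counts into percentages of all included studies."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total = len(studies)
task_counts = Counter(s["healthcare_task"] for s in studies)
dimension_counts = Counter(d for s in studies for d in s["dimensions"])
real_data_pct = 100 * sum(s["uses_real_patient_data"] for s in studies) / total

print(percentages(task_counts, total))       # frequency of each healthcare task
print(percentages(dimension_counts, total))  # frequency of each evaluation dimension
print(f"Studies using real patient data: {real_data_pct:.1f}%")
```

Under this kind of tally, each study contributes to one healthcare task and one NLP task but may count toward several evaluation dimensions, which is why dimension percentages can sum to more than 100%.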
Results
The review found that the evaluation of LLMs in healthcare is heterogeneous, and there are significant gaps in task coverage and data usage. Among the 519 studies included in the review, only 5% used real patient data, and most of the studies relied on expert-generated snippets of data or medical examination questions.
Most of the studies focused on LLMs for medical knowledge tasks, especially through evaluations such as the U.S. Medical Licensing Examination.
Patient care tasks, such as diagnosing patients and making treatment recommendations, were also relatively common. However, administrative tasks, including clinical notetaking and billing code assignment, were rarely explored in the LLM evaluations.
Among the NLP tasks, most studies focused on question answering, including responses to generic inquiries. Approximately 25% of the studies examined LLMs for text classification and information extraction, but tasks such as conversational dialogue and summarization were not well explored in the LLM evaluations.
The most frequently examined evaluation dimension was accuracy (95.4% of studies), followed by comprehensiveness (47%). Very few studies assessed ethical dimensions such as bias, toxicity, and fairness.
While more than 20% of the studies were not specific to any medical specialty, internal medicine, ophthalmology, and surgery were the most represented specialties in the LLM evaluation studies. Medical genetics and nuclear medicine were the least explored.
Conclusions
Overall, the review highlighted the need for standardized evaluation methods and a consensus framework for assessing LLM applications in healthcare.
The researchers stated that the use of real patient data in LLM evaluations should be promoted, and that extending evaluations to administrative tasks and to underrepresented medical specialties would be highly beneficial.
Journal reference:
- Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., & Shah, N. H. (2024). Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. doi:10.1001/jama.2024.21700.
Source: News-Medical