GPT-4 demonstrates high accuracy in analyzing multilingual medical notes

The study evaluates GPT-4’s ability to process medical notes in English, Spanish, and Italian, with both reviewing physicians agreeing with its answers in 79% of cases.

Study: The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.

In a recent study published in the Lancet Digital Health, a group of researchers evaluated the ability of Generative Pre-trained Transformer 4 (GPT-4) to answer predefined questions based on medical notes written in three languages (English, Spanish, and Italian).

Background

Medical notes contain valuable clinical insights, yet their unstructured narrative format poses challenges for automated analysis.

Large language models (LLMs) like GPT-4 show promise in extracting explicit details such as medications but often struggle with the implicit contextual understanding vital for nuanced medical decision-making. Variability in documentation styles across providers adds to the complexity.

Existing research demonstrates LLMs’ potential for processing free-text medical data, including decoding abbreviations and extracting social determinants of health, yet these studies primarily focus on English-language notes.

Further research is crucial to enhance LLMs’ ability to handle complex tasks, improve contextual reasoning, and assess performance across multiple languages and settings.

About the study

The present retrospective model-evaluation study involved eight university hospitals from four countries: the United States of America (USA), Colombia, Singapore, and Italy.

Participating institutions were part of the 4CE Consortium. They included Boston Children’s Hospital, the University of Michigan, the University of Wisconsin, the National University of Singapore, the University of Kansas Medical Center, the University of Pittsburgh Medical Center, Universidad de Antioquia, and Istituti Clinici Scientifici Maugeri.

The Department of Biomedical Informatics at Harvard University served as the coordinating center. Each site contributed seven de-identified medical notes, written between February 1, 2020, and June 1, 2023, resulting in a total of 56 medical notes, with six sites submitting notes in English, one in Spanish, and one in Italian.

Participating sites selected notes based on suggested criteria, including patients aged 18-65 years with a diagnosis of obesity and coronavirus disease 2019 (COVID-19) at admission. Adherence to these criteria was optional.

Submitted notes included admission, progress, and consultation notes but no discharge summaries. Notes were de-identified following US Health Insurance Portability and Accountability Act guidelines, regardless of the country of origin.

The study used GPT-4’s API in Python to analyze medical notes via a predefined question-answer framework. Parameters such as temperature, top-p, and frequency penalty were adjusted to optimize performance.
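To make the setup concrete, the sketch below shows how such a question-answer loop might look, assuming the OpenAI Python client; the system prompt, example questions, and parameter values are illustrative assumptions, not the study’s exact configuration.

```python
# Minimal sketch of the question-answer setup, assuming the OpenAI Python
# client (openai>=1.0). The system prompt, questions, and parameter values
# below are illustrative assumptions, not the study's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(note_text: str, question: str) -> str:
    """Pose one predefined question about one de-identified medical note."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,        # low randomness; illustrative value
        top_p=1,              # nucleus-sampling cutoff; illustrative value
        frequency_penalty=0,  # repetition penalty; illustrative value
        messages=[
            {"role": "system",
             "content": "Answer using only the medical note provided."},
            {"role": "user",
             "content": f"Note:\n{note_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical example of one note and two of the predefined questions.
note = "De-identified admission note text goes here."
questions = [
    "Was the patient admitted with a diagnosis of COVID-19?",
    "Does the note document obesity?",
]
answers = {q: ask_gpt4(note, q) for q in questions}
```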

Physicians evaluated responses in free text and indicated whether they agreed with GPT-4’s answers. They were masked to each other’s evaluations but not to GPT-4’s responses.

Statistical analyses were performed to assess agreement between GPT-4 and physicians, exploring cases of disagreement and categorizing errors as extraction, inference, or hallucination issues.
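As a rough illustration of how per-response agreement could be tallied into the dual-, partial-, and no-agreement categories reported below (the study’s actual statistical analyses were conducted in RStudio; this Python sketch and its data format are assumptions):

```python
# Sketch of tallying physician agreement per GPT-4 response. The tuple
# format (physician_1_agrees, physician_2_agrees) is an assumption for
# illustration; the study ran its statistical analyses in RStudio.
from collections import Counter

def agreement_category(p1: bool, p2: bool) -> str:
    if p1 and p2:
        return "both agree"
    if p1 or p2:
        return "one agrees"
    return "neither agrees"

# Hypothetical evaluations: one (p1, p2) pair per GPT-4 response.
evaluations = [(True, True), (True, False), (False, False), (True, True)]

counts = Counter(agreement_category(p1, p2) for p1, p2 in evaluations)
total = len(evaluations)
for category, n in counts.items():
    print(f"{category}: {n}/{total} ({100 * n / total:.0f}%)")
```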

Subgroup analyses and sensitivity analyses addressed variations in accuracy, such as differences in language and specific inclusion criteria.

The study highlighted GPT-4’s ability to process medical notes in multiple languages but noted challenges in contextual inference and variability in documentation styles. Data analyses were conducted in RStudio, and no external funding supported the study.

Study results

A total of 56 medical notes were collected from eight sites across four countries: the USA, Colombia, Singapore, and Italy. Of these, 42 (75%) notes were in English, seven (13%) in Italian, and seven (13%) in Spanish. For each note, GPT-4 generated responses to 14 predefined questions, resulting in 784 responses.

Among these, both physicians agreed with GPT-4 in 622 (79%) responses, one physician agreed in 82 (11%) responses, and neither agreed in 80 (10%) responses. When the National University of Singapore’s data was excluded, agreement rates remained similar: 534 (78%) responses had dual agreement, 82 (12%) had partial agreement, and 70 (10%) had no agreement.

Physicians were more likely to agree with GPT-4 for Spanish (86/98, 88%) and Italian (82/98, 84%) notes than for English notes (454/588, 77%).

The type or length of the notes did not influence agreement rates. In cases where only one physician agreed with GPT-4 (82 responses), 59 (72%) disagreements arose from inference issues, such as differing interpretations of implicit information.

In one instance, a physician inferred that a patient did not currently have COVID-19 from a note mentioning a “recent COVID-19 infection,” while GPT-4 marked the status as indeterminate. Extraction problems accounted for 8 (10%) of these disagreements, such as a physician overlooking a documented medical history that GPT-4 identified. Differences in the level of agreement accounted for the remaining 15 (18%) cases.

In responses where both physicians disagreed with GPT-4 (80 responses), inference issues were most common (47/80, 59%), followed by extraction errors (23/80, 29%) and hallucinations (10/80, 13%).

For example, GPT-4 sometimes failed to link complications, such as multisystem inflammatory syndrome, to COVID-19, a connection both physicians made. Hallucination issues involved GPT-4 fabricating information not present in the notes, such as incorrectly asserting that a patient had COVID-19 when it was not mentioned.

When assessing GPT-4’s ability to select patients for hypothetical study enrollment based on four inclusion criteria (age, obesity, COVID-19 status, and admission note type), its sensitivity varied. GPT-4 demonstrated high sensitivity for obesity (97%), COVID-19 (96%), and age (94%) but markedly lower sensitivity for identifying admission notes (22%).

When the admission note criterion was excluded, GPT-4 accurately identified all three remaining criteria in 90% of cases.
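For reference, sensitivity and specificity here carry their usual definitions, sketched below with hypothetical counts rather than the study’s data:

```python
# Standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
# The counts below are hypothetical, for illustration only; they are not
# taken from the study.
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives that are correctly flagged."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that are correctly flagged."""
    return tn / (tn + fp)

# e.g. if GPT-4 flagged 33 of 34 genuinely obese patients:
print(f"obesity sensitivity: {sensitivity(tp=33, fn=1):.0%}")  # -> 97%
```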

Conclusions

To summarize, the study demonstrated that GPT-4 accurately analyzed medical notes in English, Italian, and Spanish, even without prompt engineering.

Surprisingly, it performed better with Italian and Spanish notes than with English notes, possibly reflecting the greater complexity of US medical notes, although note length did not influence performance. GPT-4 effectively extracted explicit information, but its main limitation was inferring implicit details.

This aligns with prior findings that medical task-optimized models may overcome such challenges. While GPT-4 excelled in identifying explicit study inclusion criteria like age and obesity, it struggled to classify admission notes, likely due to reliance on implicit structural cues.

Source: News-Medical
