AI models boost detection of cognitive decline in medical records, study finds

Combining large language models with traditional methods enhances accuracy in identifying early signs of cognitive decline, offering new hope for early diagnosis.

Study: Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes.

A recent study in eBioMedicine evaluated the effectiveness of large language models (LLMs) in identifying signs of cognitive decline in electronic health records (EHRs).

Background

Alzheimer’s disease and related dementias afflict millions of individuals, lowering their quality of life and imposing substantial financial and emotional costs. Earlier identification of cognitive deterioration could enable more effective therapy and a higher standard of care.

LLMs have demonstrated encouraging results in several healthcare domains and clinical language processing tasks, including information extraction, entity recognition, and question answering. However, their efficacy in detecting specific clinical conditions, such as cognitive decline, from electronic health record data remains uncertain.

Few studies have evaluated EHR data using LLMs on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud computing systems, and little research has compared LLMs with traditional artificial intelligence (AI) approaches such as machine learning and deep learning. Such comparisons could inform model-augmentation strategies.

About the study

In the present study, researchers investigated early detection of progressive cognitive decline using large language models and EHR data. They also compared the performance of LLMs with conventional models trained on domain-specific data.

The researchers evaluated proprietary and open-source LLMs at Mass General Brigham in Boston. They studied clinical notes written during the four years preceding a 2019 diagnosis of mild cognitive impairment (MCI) in individuals aged 50 years or older.

MCI diagnoses were identified using International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes. The team excluded transient, reversible, and recovering cases of cognitive decline.

HIPAA-compliant cloud computing systems were used to prompt GPT-4 (proprietary) and Llama 2 (open-source).

The LLMs were developed using prompt-augmentation methods such as error-analysis-based instructions, retrieval-augmented generation (RAG), and hard prompting; few-shot examples for hard prompting were chosen randomly, by targeted selection, or with the aid of K-means clustering, as sketched below.
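The paper's prompt-construction code is not reproduced here; the following is a minimal, illustrative sketch of how K-means clustering over note embeddings could be used to pick diverse few-shot examples for a hard prompt. The function names, the embedding input, and the prompt template are assumptions for illustration, not the study's implementation.

```python
# Illustrative sketch (not the study's code): K-means-aided selection of
# few-shot examples and assembly of a hard prompt for cognitive-decline detection.
import numpy as np
from sklearn.cluster import KMeans

def select_few_shot_indices(note_embeddings, k=5, seed=0):
    """Return indices of k notes, each closest to one K-means cluster centroid."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(note_embeddings)
    return [int(np.argmin(np.linalg.norm(note_embeddings - c, axis=1)))
            for c in km.cluster_centers_]

def build_prompt(task_instruction, example_texts, example_labels, new_note):
    """Assemble a few-shot prompt: instruction, labeled examples, then the target note."""
    shots = "\n\n".join(
        f"Note: {text}\nCognitive decline present: {label}"
        for text, label in zip(example_texts, example_labels)
    )
    return f"{task_instruction}\n\n{shots}\n\nNote: {new_note}\nCognitive decline present:"
```

In a RAG-style variant, the examples would instead be retrieved at query time based on their similarity to the note being classified.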

Baseline study models included XGBoost and an attention-based deep neural network (DNN) built on bidirectional long short-term memory (LSTM) networks. Based on performance, the researchers selected the best LLM-based approach.
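The baselines are not specified in full in this summary; the sketch below shows one plausible form of the attention-based DNN baseline, a bidirectional LSTM whose hidden states are pooled with a learned attention layer. The class name and hyperparameters are illustrative assumptions, not the study's architecture.

```python
# Illustrative sketch (not the study's architecture): an attention-based
# classifier over a bidirectional LSTM for note-section classification.
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # scores each token position
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded note sections
        h, _ = self.lstm(self.embedding(token_ids))        # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)       # attention over tokens
        context = (weights * h).sum(dim=1)                 # weighted sum of states
        return self.classifier(context)                    # class logits
```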

They then constructed a three-model ensemble based on majority voting and evaluated model performance using confusion-matrix metrics. Task descriptions were refined through intuitive, manual template engineering, and additional task guidance was provided to enhance LLM reasoning.
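Majority voting over three binary classifiers is straightforward; the minimal sketch below is illustrative, and the function and variable names are assumptions.

```python
# Illustrative sketch (not the study's code): combining three models'
# binary predictions by simple majority vote.
from collections import Counter

def majority_vote(pred_lists):
    """pred_lists: list of per-model prediction lists (0/1), all the same length."""
    return [Counter(sample_preds).most_common(1)[0][0]
            for sample_preds in zip(*pred_lists)]

# Example: hypothetical LLM, XGBoost, and DNN predictions for four note sections.
llm_preds = [1, 0, 1, 1]
xgb_preds = [1, 1, 0, 1]
dnn_preds = [0, 0, 1, 1]
print(majority_vote([llm_preds, xgb_preds, dnn_preds]))  # -> [1, 0, 1, 1]
```

With three voters and a binary label, ties cannot occur, which is one reason an odd-sized ensemble is convenient.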

Results

The development dataset comprised 4,949 clinical note sections from 1,969 individuals (53% female; mean age, 76 years); these notes were filtered using cognitive-function keywords to develop the study models. The test dataset, which was not keyword-filtered, comprised 1,996 clinical note sections from 1,161 individuals (53% female; mean age, 77 years).

The team found GPT-4 to be more accurate and efficient than Llama 2. However, GPT-4 did not outperform the conventional models trained on domain-specific, local EHR data. The error profiles of the general-domain LLMs and the machine learning and deep learning models were quite distinct, and merging them into an ensemble markedly improved performance.

The ensemble model attained 90% precision, 94% recall, and a 92% F1 score, outperforming every individual study model on all performance metrics, with statistically significant differences.
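For reference, precision, recall, and F1 follow directly from the confusion-matrix counts; the sketch below is illustrative, not the study's evaluation code.

```python
# Illustrative sketch: precision, recall, and F1 from binary labels and predictions.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

With 90% precision and 94% recall, an F1 score of about 92% is simply the harmonic mean of the two.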

Of note, compared with the most accurate individual model, the ensemble increased precision from under 80% to over 90%. Error analysis showed that 63 samples were misclassified by at least one model.

However, only two samples (3.20%) were misclassified by all models, underscoring the diversity of error profiles across the models. Among the LLM approaches, dynamic RAG with five-shot prompting and error-based instructions yielded the best results.

GPT-4 highlighted dementia therapies such as donepezil (Aricept). It also detected diagnoses such as mild neurocognitive disorder, major neurocognitive disorder, and vascular dementia better than the other models, and it addressed the emotional and psychological consequences of cognitive problems, such as anxiety, which the other models often disregarded.

Unlike the conventional models, GPT-4 handled ambiguous phrases and analyzed complex information without being confused by negations or contextual factors. However, it occasionally overinterpreted findings or was overly cautious, overlooking the underlying reasons for clinical events. Both GPT-4 and the attention-based DNN occasionally misinterpreted clinical test results.

Conclusions

Based on the study findings, large language models and traditional AI models trained on electronic health records had distinct error profiles, and combining three models into an ensemble improved diagnostic performance.

The findings indicate that general-domain LLMs need further development to support clinical decision-making. Future studies should combine LLMs with more localized models, incorporate medical knowledge and domain expertise to improve performance on specific tasks, and experiment with prompting and fine-tuning strategies.

Source: News-Medical
