When AI Outperforms Humans: A Lesson from Medical Text Summarization

- Updated Mar 14, 2024
Illustration: © AI For All
The Importance of Accurate Summaries in Healthcare
Generative AI-produced summaries can be a huge time-saver. We’ve seen them help draft emails and papers, research in-depth topics, and more. The results, despite needing to be verified, often provide a baseline to get started at the very least. In healthcare, this is of significant value to busy medical professionals.
But for healthcare users, there’s a lot more at stake than an email that needs revisions. If a summary turns out to be inaccurate or inappropriate, a typical user can simply refine their prompt and move on. A physician relying on a flawed summary, on the other hand, could compromise the quality of care, and the consequences could be life-changing.
Evaluating AI-Generated Summaries: Surprising Findings
In a recent paper, “Summarization is (almost) dead,” the authors tackle this topic head-on, evaluating the zero-shot performance of large language models (LLMs) on five text summarization tasks: single news, multi-news, dialogue, software code, and English-to-Chinese translation. Two main findings emerged, and they might be surprising.
First, despite the hype around AI-generated misinformation, LLM-generated summaries hallucinate less than human-generated ones. LLMs also outperform humans in factuality. In fact, human-written reference summaries exhibit an equal or higher number of hallucinations compared to LLM-generated summaries.
To better understand this observation, the authors investigated the types of factual errors by dividing them into two categories: intrinsic and extrinsic hallucinations. Intrinsic hallucinations are inconsistencies between the factual information in the summary and the source text. Extrinsic hallucinations, by contrast, occur when the summary includes factual information that is not present in the source text at all.
When the authors analyzed the proportion of intrinsic and extrinsic hallucinations, they found a notably higher occurrence of extrinsic hallucinations in tasks where human-written summaries demonstrated poor factual consistency. In other words, humans had a higher tendency to include facts in their summaries that did not appear in the source text.
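To make the two categories concrete, here is a minimal sketch of how such a breakdown could be tallied once annotators have labeled each factual error. The function name `hallucination_breakdown` and the toy labels are assumptions made up for illustration; they are not data or code from the paper.

```python
from collections import Counter

VALID_KINDS = ("intrinsic", "extrinsic")


def hallucination_breakdown(labels):
    """Return the share of each hallucination type among all labeled errors.

    `labels` is a list of strings, one per factual error found by an
    annotator: "intrinsic" (contradicts the source) or "extrinsic"
    (adds a fact that is absent from the source).
    """
    unknown = set(labels) - set(VALID_KINDS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        # No errors were found, so neither type has a share.
        return {kind: 0.0 for kind in VALID_KINDS}
    return {kind: counts[kind] / total for kind in VALID_KINDS}


# Hypothetical annotations, invented for this example only.
human_labels = ["extrinsic", "extrinsic", "intrinsic", "extrinsic"]
llm_labels = ["intrinsic", "extrinsic"]

print("human:", hallucination_breakdown(human_labels))
print("llm:  ", hallucination_breakdown(llm_labels))
```

A higher extrinsic share for one set of summaries would correspond to the pattern described above: facts in the summary that the source never stated.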
The Preference for AI-Generated Summaries in Healthcare
The authors sampled 100 summarization-related papers published in various academic journals in the previous three years and found that the main contribution of nearly 70 percent of them was to propose a summarization approach and validate its effectiveness on standard datasets. Given the superior performance of LLMs to these fine-tuned approaches, the authors raised a legitimate question: is text summarization still an open problem worthy of academic research?
This next finding may be the final nail in the coffin. Physicians also preferred LLM-generated clinical text summaries over human-written ones. Another paper, “Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts,” published just a month later by researchers at Stanford University, reached similar findings.
The study applied domain adaptation methods to eight LLMs across six datasets and four clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Physicians preferred summaries from the best-adapted LLMs to human-written summaries in terms of completeness and correctness.
The results were statistically significant across almost all task types and evaluation dimensions. This suggests that when LLMs are adapted to the healthcare domain, they also produce summaries that are better in terms of both accuracy and factuality. This brings us to our next question: is summarization really dead as an open problem?
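For readers wondering what a significance check on pairwise preferences can look like, here is a minimal sketch using an exact two-sided sign test. This is an illustrative choice, not necessarily the test the study used, and the function name `sign_test_p_value` and the counts in the usage lines are hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test p-value.

    `wins` is the number of comparisons where reviewers preferred the LLM
    summary, `losses` the number where they preferred the human summary.
    Ties are assumed to have been dropped before calling.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability, under a fair coin, of a split at least this lopsided
    # in one direction; doubled to cover both directions.
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


# Hypothetical counts, invented for this example only.
print(sign_test_p_value(9, 1))  # lopsided preference, small p-value
print(sign_test_p_value(5, 5))  # even split, no evidence of a preference
```

A small p-value here only says the preference is unlikely to be a coin flip; it says nothing about how large or clinically meaningful the difference is.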
Future Directions: Summarization as a Solved Problem?
From an academic perspective, once a problem is considered “solved,” the next gap to close is productizing it. In this case, that means providing the healthcare industry with reliable off-the-shelf solutions. Today, there's software for everything, so decision-makers should look for several things when evaluating solutions.
Make sure the models and code base are production-grade, actively maintained, and improved as new models become available. The models should also be tuned for different medical document types and use cases, validated on real-world data, optimized to run efficiently and at scale, and designed to run privately in a high-compliance environment.
We’re already seeing healthcare organizations like the US Department of Veterans Affairs (VA) put this to work. Serving over 9 million veterans and their families, the VA has vast amounts of electronic medical records containing both structured and unstructured text about patients. This data is also messy, incomplete, inconsistent, and duplicative, and it requires dedicated time from doctors and data professionals to get answers.
To reduce the burden, the VA has applied healthcare-specific LLMs for data discovery from patient notes and stories at scale. When summarizing raw progress and discharge notes using general-purpose LLMs, the organization experienced unacceptably low accuracy. However, using specific medical text summarization models improved accuracy significantly.
There’s still work to do when it comes to adapting LLMs to more clinical scenarios, care settings, and document types. There’s also room for improvement in evaluation metrics, since summaries are tricky to assess objectively. That said, better-than-human summaries are already available and in production today. So, is summarization dead? Not quite yet, but its days are numbered.
David Talby, PhD, MBA, is the CTO of John Snow Labs. He has spent his career making AI, big data, and data science solve real-world problems in healthcare, life science, and related fields.