The study, conducted by a team of international academics from hospitals in the U.K., U.S. and Switzerland, was a review of all existing scientific literature comparing the performance of AI models and healthcare professionals, published between January 2012 and June 2019.
The full results were published in The Lancet Digital Health journal on September 24.
Researchers said analysis of 14 studies comparing the performance of deep learning with that of humans found algorithms correctly detected disease in 87 percent of cases in the sample of images, compared with 86 percent achieved by human experts.
The ability to accurately exclude patients who didn’t have a disease was similar between algorithms (93 percent) and healthcare experts (91 percent).
Despite the close results, researchers warned there were limitations to the study, citing a severe lack of quality analysis directly comparing the performance of human experts and AI.
“We reviewed over 20,500 articles, but less than one percent of these were sufficiently robust in their design and reporting that independent reviewers had high confidence in their claims,” said Professor Alastair Denniston from University Hospitals Birmingham, who led the research.
The U.K.-based academic added: “What’s more, only 25 studies validated the AI models externally, using medical images from a different population, and just 14 studies actually compared the performance of AI and health professionals using the same test sample.
“Within that handful of high-quality studies, we found that deep learning could indeed detect diseases ranging from cancers to eye diseases as accurately as health professionals. But it is important to note that AI did not substantially out-perform human diagnosis.”
So there may be hope for humans just yet, researchers said. The report complained that most studies analyzed algorithmic results “in a way that does not reflect clinical practice.”
Too soon to tell?
The paper noted that scientific studies about artificial intelligence and algorithms have typically been conducted in isolation—not considering the trove of additional clinical information that medical professionals often have to take into account before making a final diagnosis.
But it conceded: “Diagnosis of disease using deep learning algorithms holds enormous potential. From this exploratory meta-analysis, we cautiously state that the accuracy of deep learning algorithms is equivalent to health-care professionals, while acknowledging that more studies considering the integration of such algorithms in real-world settings are needed.”
Dr. Tessa Cook, of the University of Pennsylvania in the U.S., said of the latest findings: “Perhaps the better conclusion is that in the narrow public body of work comparing AI to human physicians, AI is no worse than humans, but the data are sparse and it may be too soon to tell.”