This Neuroimaging Method Has 100% Diagnostic Accuracy (or your money back)
Image credit: Amen et al. (2015), doi:10.1371/journal.pone.0129659.g003
Did you know that SPECT imaging can diagnose PTSD with 100% accuracy (Amen et al., 2015)? Not only that, out of a sample of 397 patients from the Amen Clinic in Newport Beach, SPECT was able to distinguish between four different groups with 100% accuracy! That's right, the scans of (1) healthy participants, and patients with (2) classic post-traumatic stress disorder (PTSD), (3) classic traumatic brain injury (TBI), and (4) both disorders..... were all classified with 100% accuracy!
TRACK-TBI investigators, your 3T structural and functional MRI outcome measures are obsolete.
NIMH, the hard work of developing biomarkers for mental illness is done, you can shut down now. Except none of this research was funded by you...
The finding was #19 in a list of the top 100 stories by Discover Magazine.
How could the Amen Clinics, a for-profit commercial enterprise, accomplish what an army of investigators with billions in federal funding could not?
The authors (see Footnote 1) relied on a large database of scans collected from multiple sites over a 20-year period. The total sample included 20,746 individuals who visited one of nine Amen Clinics from 1995-2014 for psychiatric and/or neurological evaluation (Amen et al., 2015). The first analysis included a smaller, highly selected sample matched on a number of dimensions, including psychiatric comorbidities (Group 1).
- click on image for larger view -
You'll notice the percentage of patients with ADHD was remarkably high (58%, matched across the three patient groups). Perhaps that's because...
I did not know that.
Featuring Johnny Cash ADD.
SPECT uses a radioactive tracer injected 30 minutes before a scan that will assess either the “resting state” or an “on-task” condition (a continuous performance task, in this study). Clearly, SPECT is not the go-to method if you're looking for decent temporal resolution to compare two conditions of an active attention task. The authors used a region of interest (ROI) analysis to measure tracer activity (counts) in specific brain regions.
I wondered about the circularity of the clinical diagnosis (i.e., were the SPECT scans used to aid diagnosis), particularly since “Diagnoses were made by board certified or eligible psychiatrists, using all of the data available to them, including detailed clinical history, mental status examination and DSM-IV or V criteria...” But we were assured that wasn't the case: “These quantitative ROI metrics were in no way used to aid in the clinical diagnosis of PTSD or TBI.” The rest of the methods (see Footnote 2) were opaque to me, as I know nothing about SPECT.
A second analysis relied on visual readings (VR) of about 30 cortical and subcortical ROIs. “Raters did not have access to detailed clinical information, but did know age, gender, medications, and primary presenting symptoms (ex. depressive symptoms, apathy, etc.).” Hmm...
But the quantitative ROI analysis gave superior results to the clinician VR. So superior, in fact, that the sensitivity/specificity in distinguishing one group from another was 100% (indicated by red boxes below). The VR distinguished patients from controls with 100% accuracy, but was not as good for classifying the different patient groups during the resting state scan — only a measly 86% sensitivity, 81% specificity for TBI vs. PTSD, which is still much better than other studies. However, results from the massively sized Group 2 were completely unimpressive (see Footnote 3).
- click on image for larger view, you'll want to see this -
Why is this so important? PTSD and TBI can show overlapping symptoms in war veterans and civilians alike, and the disorders can co-occur in the same individual. More accurate diagnosis can lead to better treatments. This active area of research is nicely reviewed in the paper, but no major breakthroughs have been reported yet. So the claims of Amen et al. are remarkable. Stunning if true. But they're not. They can't be. The accuracy of the classifier exceeds the precision of the measurements, so this can't be possible. What is the test-retest reliability of SPECT? What is the concordance across sites? Was there no change in imaging protocol, no improvements or upgrades to the equipment over 20 years? SPECT is sensitive to motion artifact, so how was that handled, especially in patients who purportedly have ADHD?
SPECT has been noted for its poor spatial resolution compared to other functional neuroimaging techniques like PET and fMRI. A panel of 16 experts did not include SPECT among the recommended imaging modalities for the detection of TBI. Dr. Amen and his Clinics in particular have been criticized in journals (Farah, 2009; Adinoff & Devous, 2010a, 2010b; Chancellor & Chatterjee, 2011) and blogs (Science-Based Medicine, The Neurocritic, and Neurobollocks) for making unsubstantiated claims about the diagnostic accuracy and usefulness of SPECT.
Are his latest results too good to be true? You can check for yourself! The paper was published in PLOS ONE, which has an open data policy:
PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.
When submitting a manuscript online, authors must provide a Data Availability Statement describing compliance with PLOS's policy. If the article is accepted for publication, the data availability statement will be published as part of the final article.
Before you get too excited, here's the Data Availability Statement:
Data Availability: All relevant data are within the paper.
But this is not true. NONE of the data are available within the paper. There's no way to reproduce the authors' analyses, or to conduct your own. This is a problem, because...
Refusal to share data and related metadata and methods in accordance with this policy will be grounds for rejection. PLOS journal editors encourage researchers to contact them if they encounter difficulties in obtaining data from articles published in PLOS journals. If restrictions on access to data come to light after publication, we reserve the right to post a correction, to contact the authors' institutions and funders, or in extreme cases to retract the publication.
So all you “research parasites” out there (see Footnote 4) — you can request the data. I thought this modest proposal would create a brouhaha until I saw a 2014 press release announcing the World's Largest Database of Functional Brain Scans Produces New Insights to Help Better Diagnose and Treat Mental Health Issues:
With a generous grant from the Seeds Foundation [a Christian philanthropic organization] in Hong Kong, Dr. Amen and his research team led by neuroscientist Kristen Willeumier, PhD, have turned the de-identified scans and clinical information into a searchable database that is shared with other researchers around the world.
In the last two years, Amen and colleagues have presented 20 posters at the National Academy of Neuropsychology. The PR continues:
The magnitude and clinical significance of the Amen Clinics database – being the world's largest SPECT imaging database having such volume and breadth of data from patients 9 months old to 101 years of age – makes it a treasure trove for researchers to help advance and revolutionize the practice of psychiatry.
Does this mean that Dr. Amen will grant you access to the PLOS ONE dataset (or to the entire Amen Clinics database) if you ask nicely? If anyone tries to do this, please leave a comment.
Footnotes
1 The other authors included Dr. Andrew “Glossolalia” Newberg and Dr. Theodore “Neuro-Luminance Synaptic Space” Henderson.
2 Methods:
To account for outliers, T-score derived ROI count measurements were derived using trimmed means [91] that are calculated using all scores within the 98% confidence interval (-2.58 < Z < 2.58). The ROI mean for each subject and the trimmed mean for the sample are used to calculate T with the following formula: T = 10*((subject ROI_mean - trimmed regional_avg)/trimmed regional_stdev) + 50.
(A toy numerical sketch of this formula appears after the footnotes.)
3 Results from the less pristine Group 2 were not impressive at all, I must say. Group 2 had TBI (n=7,505), PTSD (n=1,077), or both (n=1,017) compared to n=11,147 patients without either (these were not clean controls as in Group 1). Given the massive number of subjects, the results were clinically useless, for the most part (see Table 6).
4 A brand new editorial in NEJM by Longo and Drazen (who decry “research parasites”) is causing a twitterstorm with the hashtags #researchparasites and #IAmAResearchParasite.
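As promised above, here is a minimal numerical sketch of the trimmed-mean T-score formula quoted in Footnote 2. It is only an illustration: the function name, the outlier handling, and the fake ROI counts are mine, not the authors'.

```python
import numpy as np

def trimmed_t_score(subject_roi_mean, sample_roi_means, z_cut=2.58):
    """T-score for one subject's ROI, using a trimmed mean/SD for the sample.

    Scores outside -z_cut < Z < z_cut are dropped before recomputing the
    regional mean and SD (the "trimmed" statistics in the quoted formula).
    """
    sample = np.asarray(sample_roi_means, dtype=float)
    z = (sample - sample.mean()) / sample.std(ddof=1)
    trimmed = sample[np.abs(z) < z_cut]            # keep only non-outliers
    trimmed_avg = trimmed.mean()
    trimmed_std = trimmed.std(ddof=1)
    # T = 10*((subject ROI_mean - trimmed regional_avg)/trimmed regional_stdev) + 50
    return 10 * (subject_roi_mean - trimmed_avg) / trimmed_std + 50

# Hypothetical example: tracer counts for one ROI across 400 subjects
rng = np.random.default_rng(0)
sample_means = rng.normal(100, 15, size=400)
print(round(trimmed_t_score(115.0, sample_means), 1))  # roughly 60, i.e. ~1 SD above the trimmed mean
```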
References
Adinoff B, Devous M. (2010a). Scientifically unfounded claims in diagnosing and treating patients. Am J Psychiatry 167(5):598.
Adinoff B, Devous M. (2010b). Response to Amen letter. Am J Psychiatry 167(9):1125-1126.
Amen D, Raji C, Willeumier K, Taylor D, Tarzwell R, Newberg A, Henderson T. (2015). Functional neuroimaging distinguishes posttraumatic stress disorder from traumatic brain injury in focused and large community datasets. PLOS ONE 10(7): e0129659. doi:10.1371/journal.pone.0129659
Chancellor B, Chatterjee A. (2011). Brain branding: When neuroscience and commerce collide. AJOB Neuroscience 2(4): 18-27.
Farah MJ. (2009). A picture is worth a thousand dollars. J Cogn Neurosci. 21(4):623-4.
15 Comments:
You wrote: "So the claims of Amen et al. are remarkable. Stunning if true. But they're not. They can't be. The accuracy of the classifier exceeds the precision of the measurements, so this can't be possible."
Accuracy is a function of TP, FP, TN, and FN. Measurement precision is not part of the calculation. And if FP=FN=0, then accuracy = 100% by definition.
In other words, if you call a coin flip heads and it turns up heads, then your accuracy for that coin flip is precisely 100%. Naturally, your results for the coin flip are limited in generalizability due to a sample size of one. But despite the caveats, the accuracy is still 100%. What else could it possibly be?
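(For concreteness, here is that arithmetic as a quick sketch in Python. The counts are made up; nothing below comes from the paper itself.)

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts: if FP = FN = 0, all three metrics are 100% by definition
print(diagnostic_metrics(tp=100, fp=0, tn=100, fn=0))   # (1.0, 1.0, 1.0)

# The single coin flip: one prediction, one hit -> accuracy = 1/1 = 100%
print((1 + 0) / (1 + 0 + 0 + 0))
```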
And likewise there are caveats for any scientific paper. But why do you belabor them for Amen et al.? If they had reported an accuracy of 67%, would you likewise have been compelled to point out that their accuracy may turn out to be a different value if/when their method is repeated by others? After all, the same issues of motion artifact, etc. are equally applicable regardless of what value of accuracy is reported. And most papers reporting SPECT results expect their readers to understand these methodological issues implicitly, without a giant disclaimer at the end regarding the modality.
In short, does a paper reporting 100% accuracy require special scrutiny that is not required of a paper reporting 67% accuracy? If so, then I suggest that getting worked up over a result of 100%, but not 67%, is nothing more than numerological superstition.
I realize that accuracy is a function of TP, FP, TN, and FN, but I was questioning HOW sensitivity and specificity could be so high, when no other neuroimaging method has come close to that. I wasn't referring to precision in the statistical sense. The medical point is, the doctors were never wrong. This is a claim that impacts patients' lives, and my concern about whether it's true isn't mere numerological superstition.
And if the entire exercise wasn't entirely circular (using scans for clinical diagnosis, then finding scans agree 100% with clinical diagnosis), as Amen et al. told us it was not, then SPECT is completely superfluous, as it added nothing to the psychiatrist's clinical diagnosis! Or as someone on Twitter said:
"if it's 100% vs the clinical diagnosis it is completely redundant. (Dr Amen, so impressed by this insight closes his clinic)"
One can also wonder why the accuracy of SPECT was perfect in the smaller "selected sample" (extreme overfitting? how were they selected? why was the incidence of ADHD so high?), but not so perfect when analyzing the entire database. In Table 6, we can see that for TBI vs. PTSD sensitivity on-task = 82% and specificity on-task = 60%.
"no other neuroimaging method has come close to that"
There have been relatively few studies using HMPAO-SPECT to study TBI/PTSD. Most have used fMRI, DTI, and/or structural MRI. There is no reason to suppose that HMPAO-SPECT should have the same results as the other modalities; it relies on completely different principles.
"The medical point is, the doctors were never wrong."
And how many times must doctors be wrong before you'll believe them?
"SPECT is completely superfluous, as it added nothing to the psychiatrist's clinical diagnosis"
Well, if Amen were declaring victory and closing his research program, then I agree it was a waste of time. But presumably, he will follow this up with studies using the same methodology to evaluate asymptomatic patients at risk, therapeutic responders vs nonresponders, and other clinically useful predictors. You know, the same formula used as "Future Directions" for pretty much every analogous grant proposal in the MRI world.
"One can also wonder why the accuracy of SPECT was perfect in the smaller "selected sample""
The "smaller sample" was still much larger than most comparable imaging studies of TBI and PTSD. And it was apparently designed to address some of the very objections you initially raised regarding sources of variability: they all came from just one site, and their medical history is more tightly controlled. More importantly, they were compared to matched healthy controls, whereas the larger sample compared patients PTSD & TBI to random patients who had SPECT scans for other reasons. No one should be surprised that results are cleaner in the smaller sample.
Overall, the methodology used in this paper is pretty run-of-the-mill for neuroimaging studies, except that (1) they used HMPAO-SPECT instead of *MRI, and (2) they achieved much better results. Oh, and their sample size was pretty good, even for the subgroup. But if you insist on smugly dampening people's enthusiasm for interesting results, then I suppose you could point out that like almost every other neuroimaging study there was no independent validation group. Without a doubt, if someone bothers to repeat the study they will find that accuracy drops a little. If you read between the lines, even the authors you mock acknowledge this, because that is the nature of this type of analysis. And when 100% drops to 90%, we can finally return to feeling comfortably dismal about the prospects of neuroimaging in helping people with psych disease. Right?
Disclosure: I use MRI to study patients with TBI and PTSD, I think this study is quite interesting, and I don't give a hoot whether reported accuracy is 90% or 100%.
The claim of 100% accuracy has huge diagnostic and commercial implications (if true), so I'm not sure why you don't care about that. Or that patients might be misled by dubious claims in advertising, as noted by less "smug" and more respectable commentators like Farah, 2009; Adinoff & Devous, 2010a, 2010b; Chancellor & Chatterjee, 2011 and the former NIMH Director.
Further, would you not expect a claim of a "clinically relevant range" of sensitivity and specificity in a patient population of over 20,000 to exceed 82% and 60%, respectively? (see BishopBlog).
This blog is called The Neurocritic. What you may consider "dampening people's enthusiasm for interesting results" I call "subjecting amazing new findings to scrutiny." But I'm not a complete naysayer on the usefulness of neuroscience research (or of brain imaging in particular) to improve the diagnostic accuracy of TBI and psychiatric disorders. It's not like I've never considered or discussed important issues related to neuroimaging biomarkers and machine learning, without regard to Amen. And I will own up to my mistakes here if I'm ultimately proven wrong and HMPAO-SPECT is a superior diagnostic tool to FDG-PET and *MRI. However, the general consensus is that FDG-PET is more accurate than HMPAO-SPECT.
e.g., "We recommend (18)F-FDG PET be performed instead of perfusion SPECT for the differential diagnosis of degenerative dementia if functional imaging is indicated." (e.g., O'Brien et al. 2014).
"FDG-PET is quantitatively more accurate and thus better suited to multicenter studies than perfusion SPECT." (Herholz, 2011).
A recent review of FDG-PET in mild TBI stated that "...10 published reports evaluating FDG-PET after mTBI, excluding cases of complicated mTBI where damage was observed on the CT or MRI scan after an apparent mild injury ... demonstrate varying degrees of sensitivity to detection at acute, subacute, and chronic phases of injury."
Finally, a recent FDG-PET study (Petrie et al., 2014) found that FDG-PET was not able to distinguish between participants with versus without PTSD. As you mentioned, however, the number of patients in Amen's database vastly exceeds the number in most studies.
Ultimately the decision to publish in an open data journal like PLOS ONE could propel the field forward if independent investigators are able to validate Amen's findings. If you think HMPAO-SPECT is the future, perhaps you can consider redirecting your research efforts away from MRI if that's the best way to help patients with TBI and PTSD. Or at the very least request Amen et al.'s dataset to analyze yourself, since you think this study is quite interesting.
Scientific results should certainly be subject to scrutiny. My concern is that you subject "amazing" studies to different scrutiny than the rest. If instead you compare this study to similar studies with less amazing results, I believe that you would find that the methodology is pretty similar in some respects, and better (sample size) in others. That's why I wondered whether you are roundly critical of neuroimaging, or biased against certain results. If the latter, keep in mind that one hundred percent accuracy is not unexpected once in a while. Statistically speaking, it would be unsettling if out of thousands of studies published yearly, *every single one* had FP>0 or FN>0.
Regarding the relative value of imaging modalities:
FDG-PET is clearly superior for some applications, such as characterization of dementia sub-types.
MRI is clearly superior to FDG-PET for other applications, such as detection of acute stroke and neoplasm.
CT is clearly superior to both MRI and FDG-PET for still other applications, such as detection of fracture, aneurysm, thrombosis, and sinusitis.
And HMPAO-SPECT happens to be the imaging standard for establishing brain death when preparing for organ donation.
In short, no imaging modality is a panacea, and each already has well-defined use cases. But we are talking about TBI and PTSD, where neither FDG-PET nor MRI has proven clinically effective, as supported by your own citations. And there is no reason to presuppose that they would be superior to HMPAO-SPECT; that is clearly an empirical question.
In this case, it appears HMPAO-SPECT may be the best-suited modality for this indication. Time will tell if this is the case. Sadly, it might take longer than we would like, and you and I both know why: academics rarely get paid to reproduce someone else's work.
Finally, self-censorship out of concern for how your results will be represented by commercial interests IMHO is incompatible with the spirit of scientific inquiry. Record the numbers, report the numbers, and let the chips fall where they may.
Quick comment on statistics: how likely is it that all four measures from the ROI analysis (sensitivity on-task, sensitivity at rest, specificity on-task, specificity at rest) would be 100% for six different group comparisons? Could another neurologist or psychiatrist replicate with 100% accuracy the classification of 397 participants into one of four groups (based on Clinician 1's diagnoses)? In other words, how can SPECT exceed the reliability of psychiatric diagnosis?
This paper didn't make any claims regarding replication, reliability, or even diagnosis (ie classification of unknowns).
Let's be clear about what the authors claim: AFTER measuring SPECT values in patients and controls, they were able to construct a metric that completely separated the patients from controls. This is quite different from claiming that the same metric must work perfectly in a NEW group of patients with unknown diagnosis. The latter claim awaits validation from an independent sample.
As an analogy, AFTER asking your family all their favorite movies, you might be able to isolate a movie that all of the females and none of the males in your family enjoyed. I bet I could, with my family. But that's quite different from claiming that movie preferences can perfectly determine a random person's gender. And it claims nothing about whether your family will have the same preferences next year.
Have you ever seen another neuroimaging metric that is able to correctly classify 100% of patients? No matter how much overfitting and lack of generalizability? I haven't seen a perfect metric before, and I find this result statistically improbable.
A priori, precisely 100% is as improbable as precisely 65%. We only attach emotional significance to the former.
Has it happened before? Sure, for example:
http://www.ncbi.nlm.nih.gov/pubmed/23236384
http://www.ncbi.nlm.nih.gov/pubmed/21839143
I'm glad you mentioned that Bonsal, Peterson et al. paper, I was thinking about that one in particular. Yes, the classification accuracies were extraordinarily high, but still, not every single one was 100%. And at the time, there was a mixture of amazement and skepticism about it.
This tweet by @fnielsen sums it up: "Twittersphere discussing high precision/recall rates for a neuroimaging classification study"
Here's another: "Accurate diagnosis of mental illness via MRI http://is.gd/gsTeeP Potentially huge, replication please"
I tweeted about it myself, asking "Anyone read & evaluated that 21 page paper claiming Anatomical Brain Images Alone Can Accurately Diagnose Chronic Neuropsychiatric Illnesses" (there's a discussion that follows).
I felt underqualified: "I'm afraid I don't know enough about Marching Cubes algorithms and Dirac delta functions to evaluate this paper myself" I don't know whether anyone ever blogged about it.
The point is, people took the Bonsal et al. paper seriously. Given the shady reputation of the Amen Clinics, few academic researchers (to my knowledge, other than you) are taking the SPECT findings seriously.
Oops, the first author's name is actually Bansal, not Bonsal...
Bansal reported sensitivity and specificity of 99.99% and 100%. As far as I'm concerned, 99.99% is within margin of error of 100%. They are scientifically indistinguishable, at least in neuroimaging.
But it sounds like the difference between Amen et al. and Bansal et al. is not 0.01%, it's the list of authors. That's fine, you don't have to trust what Amen says. But if you don't, then what's the point of reading the abstract?
Just write "Amen published another paper today, which I didn't need to read because I don't trust his work". You've already made up your mind, so be willing to admit it. No need to pretend you want a scientific analysis of methodology when it really comes down to a judgment of character.
A couple more observations regarding the Bansal vs Amen papers:
If the best performance of Amen's method were actually the same as the best performance of Bansal's method (ie only 99.99% sensitivity and 100% specificity) then you would still expect Amen to report 100% sensitivity. That's because Amen used ~100 subjects for their Group 1 analysis, and with 99.99% sensitivity you would expect zero errors. Actually, Amen's sensitivity could be ten times worse than Bansal's (ie 99.90% sensitivity) and you would still expect zero errors with their sample size. It takes at least 10,000 subjects before you should start expecting errors at sensitivity = 99.99%. (A quick numerical check of this arithmetic appears at the end of this comment.)
Secondly, Amen reported three comparisons with perfect accuracy: Group 1 TBI vs CTL, Group 1 TBI/PTSD vs CTL, and Group 1 PTSD vs CTL. They also reported other comparisons with lower accuracy.
Likewise, Bansal reported three comparisons with near-perfect accuracy (ie you would expect no errors in a sample with n ~ 100): pediatric Tourette's vs ADHD, adult schizophrenia vs Tourette's, and adult schizophrenia vs bipolar.
So if one is willing to accept Bansal as "potentially huge", the results of Amen et al shouldn't stretch credulity.
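As promised above, a tiny sanity check of the expected-error arithmetic (Python; the numbers are purely illustrative, not taken from either paper):

```python
def expected_misses(n_patients, sensitivity):
    """Expected number of false negatives among n_patients true cases."""
    return n_patients * (1 - sensitivity)

for sens in (0.9999, 0.9990):
    for n in (100, 10_000):
        print(f"n={n:>6}, sensitivity={sens:.2%}: "
              f"expected misses = {expected_misses(n, sens):.2f}")

# At n ~ 100, even 99.90% sensitivity predicts only ~0.1 missed cases,
# so observing zero errors is unremarkable; errors only become expected
# once the sample grows into the thousands.
```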
This is getting silly. I said "not every single one is 100%." Here's the text:
"They discriminated the brains of children with ADHD from HC with 93.6% sensitivity and 88.5% specificity (Fig. 7, left); children with TS from children with ADHD with 99.83% sensitivity and 99.5% specificity (Fig. 7, right); adults with BD from HA with 100% sensitivity and 96.4% specificity (Fig. 8, 1st column); adults with SZ from adults with TS with 99.99% sensitivity and 100% specificity (Fig. 8, 2nd column); adults with SZ from adults with BD with 99.99% sensitivity and 100% specificity (Fig. 8, 3rd column); adults with SZ from healthy adults with 93.1% sensitivity and 94.5% specificity (Fig. 8, 4th column); adults with TS from HA with 83.2% sensitivity and 90% specificity (Fig. 9, left); children with TS from HC with 94.6% sensitivity and 79% specificity (Fig. 9, right); and participants at HR for depression from those at LR for depression with 81% sensitivity and 71% specificity (Fig. 10)."
Higher than anything prior to it? Yes. Shockingly high? Of course. My point was that not all discriminations had 100% sensitivity and 100% specificity.
I point to the red rectangles in Table 5, in the post, where every ROI discrimination was 100%. The visual readings (which were a separate qualitative measure, the scans as read by a human being) were less accurate at discriminating patient groups from each other (but still surprisingly high).
Not EVERY SINGLE ONE of Bansal et al.'s comparisons was 100%. That's all. Bansal et al. also went into much, much greater detail to describe every step of their analytic methods.