R2D3 recently had a fantastic Visual Introduction to Machine Learning, using the classification of homes in San Francisco vs. New York as their example. As they explain quite simply:
In machine learning, computers apply statistical learning techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions.

You should really head over there right now to view it, because it's very impressive.
Computational neuroscience types are using machine learning algorithms to classify all sorts of brain states, and diagnose brain disorders, in humans. How accurate are these classifications? Do the studies all use separate training sets and test sets, as shown in the example above?
Let's say your fMRI measure is able to differentiate individuals with panic disorder (n=33) from those with panic disorder + depression (n=26) with 79% accuracy.1 Or with structural MRI scans you can distinguish 20 participants with treatment-refractory depression from 21 never-depressed individuals with 85% accuracy.2 Besides the issues outlined in the footnotes, the “reality check” is that the model must be able to predict group membership for a new (untrained) data set. And most studies don't seem to do this.
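As a rough illustration of that reality check, here is a minimal sketch of in-sample vs. held-out accuracy, using scikit-learn and purely synthetic data (the group size and feature count are made up; nothing here comes from the studies cited):

```python
# Toy illustration: a classifier that looks impressive on the data it was
# trained on can be at chance on data it has never seen.
# Synthetic data only; not from any of the studies discussed.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(46, 200))            # 46 "participants", 200 noisy features
y = rng.integers(0, 2, size=46)           # labels carry no real signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("accuracy on training data:", clf.score(X_train, y_train))   # typically ~1.0
print("accuracy on held-out data:", clf.score(X_test, y_test))     # ~chance (0.5)
```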
I was originally drawn to the topic by a three-page article entitled Machine learning algorithm accurately detects fMRI signature of vulnerability to major depression (Sato et al., 2015). Wow! Really? How accurate? Which fMRI signature? Let's take a look.
- machine learning algorithm = Maximum Entropy Linear Discriminant Analysis (MLDA)
- accurately predicts = 78.3% (72.0% sensitivity and 85.7% specificity; see the quick arithmetic check after this list)
- fMRI signature = “guilt-selective anterior temporal functional connectivity changes” (seems a bit overly specific and esoteric, no?)
- vulnerability to major depression = 25 participants with remitted depression vs. 21 never-depressed participants
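For what it's worth, those percentages are consistent with one another given the group sizes; here's the quick arithmetic check (nothing assumed beyond the numbers reported above):

```python
# Sanity check of the reported figures: 25 remitted vs. 21 never-depressed
remitted, controls = 25, 21
true_positives = round(0.720 * remitted)   # sensitivity 72.0% -> 18 of 25 remitted
true_negatives = round(0.857 * controls)   # specificity 85.7% -> 18 of 21 controls
accuracy = (true_positives + true_negatives) / (remitted + controls)
print(accuracy)  # 36/46 = 0.783, i.e. the reported 78.3%
```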
The classifier was never tested on an independent group of participants. Nor did they try to compare individuals who are currently depressed to those who are currently remitted. That didn't matter, apparently, because the authors suggest the fMRI signature is a trait marker of vulnerability, not a state marker of current mood. But the classifier missed 28% of the remitted group who did not have the “guilt-selective anterior temporal functional connectivity changes.”
What is that, you ask? This is a set of mini-regions (i.e., not too many voxels in each) functionally connected to a right superior anterior temporal lobe seed region of interest during a contrast of guilt vs. anger feelings (selected from a number of other possible emotions) for self or best friend, based on written imaginary scenarios like “Angela [self] does act stingily towards Rachel [friend]” and “Rachel does act stingily towards Angela” conducted outside the scanner (after the fMRI session is over). Got that?
You really need to read a bunch of other articles to understand what that means, because the current paper is less than 3 pages long. Did I say that already?
[Figure: modified from Fig. 1B of Sato et al. (2015). Weight vector maps highlighting voxels among the 1% most discriminative for remitted major depression vs. controls, including the subgenual cingulate cortex, both hippocampi, the right thalamus, and the anterior insulae.]
The patients were previously diagnosed according to DSM-IV-TR (which was current at the time) and had been in remission for at least 12 months. The study was conducted by investigators from Brazil and the UK, so they didn't have to worry about RDoC, i.e. “new ways of classifying mental disorders based on behavioral dimensions and neurobiological measures” (instead of DSM-5 criteria). A “guilt-proneness” behavioral construct, along with the “guilt-selective” network of idiosyncratic brain regions, might be more in line with RDoC than a past diagnosis of major depression.
Could these results possibly generalize to other populations of remitted and never-depressed individuals? Well, the fMRI signature seems a bit specialized (and convoluted). And overfitting is another likely problem here...
In their next post, R2D3 will discuss overfitting:
Ideally, the [decision] tree should perform similarly on both known and unknown data.
So this one is less than ideal. [NOTE: the one that's 90% in the top figure]
These errors are due to overfitting. Our model has learned to treat every detail in the training data as important, even details that turned out to be irrelevant.
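In code, that failure mode looks roughly like this. A toy sketch with scikit-learn on noisy synthetic data (not R2D3's SF-vs-NY housing data): a decision tree allowed to grow without limit memorizes every detail of the training set, and its accuracy drops on data it has never seen, while a depth-limited tree performs similarly on both.

```python
# Toy overfitting demo: unconstrained vs. depth-limited decision trees
# on noisy synthetic data (not the R2D3 housing example).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)   # flip_y adds label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):   # None = grow until every training point is memorized
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train accuracy {tree.score(X_tr, y_tr):.2f}, "
          f"test accuracy {tree.score(X_te, y_te):.2f}")
```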
In my next post, I'll present an unsystematic review of machine learning as applied to the classification of major depression. It's notable that Sato et al. (2015) used the word “classification” instead of “diagnosis.”3
ADDENDUM (Aug 3 2015): In the comments, I've presented more specific critiques of: (1) the leave-one-out procedure and (2) how the biomarker is temporally disconnected from the moment when the participants identify their feeling as 'guilt', 'anger', etc. (and why shame is more closely related to depression than guilt).
Footnotes
1 The sensitivity (true positive rate) was 73% and the specificity (true negative rate) was 85%. After correcting for confounding variables, these numbers were 77% and 70%, respectively.
2 The abstract concludes this is a “high degree of accuracy.” Not to pick on these particular authors (this is a typical study), but Dr. Dorothy Bishop explains why this is not very helpful for screening or diagnostic purposes. And what you'd really want to do here is to discriminate between treatment-resistant vs. treatment-responsive depression. If an individual does not respond to standard treatments, it would be highly beneficial to avoid a long futile period of medication trials.
3 In case you're wondering, the title of this post was based on The Dark Side of Diagnosis by Brain Scan, which is about Dr Daniel Amen. The work of the investigators discussed here is in no way, shape, or form related to any of the issues discussed in that post.
Reference
Sato, J., Moll, J., Green, S., Deakin, J., Thomaz, C., & Zahn, R. (2015). Machine learning algorithm accurately detects fMRI signature of vulnerability to major depression. Psychiatry Research: Neuroimaging. DOI: 10.1016/j.pscychresns.2015.07.001
Yikes on that sens/spec. But also, am reminded of my usual shoutout to https://www.psych.umn.edu/faculty/grove/112clinicalversusstatisticalprediction.pdf
Over time, I suspect clinical prediction and classification will be supplanted by algorithm, although currently our rates are better.
Thanks for that link.
I'm not sure how some of these articles can call the results "accurate" or say they have "a high degree of accuracy."
Next time, I'll raise the point that people working in this field should all try out everyone else's algorithms on their own datasets. But the “choose one feeling that (they) would feel most strongly” task, with its fixed list of options (guilt, contempt/disgust towards self, shame, indignation/anger towards self, indignation/anger towards others, contempt/disgust towards others, none, other feeling), is rather limiting...
How many features does each data point have? Surely it's nonsense to train a classifier on such a small data set if they are using more than a couple of features, even before the failure to use a test data set?
Training a simple classifier like LDA on nearly 50 data points isn't nonsense - why would it be? Yes, overfitting. That's what the cross-validation is for. Also, they ran the classifier without the guilt vs. indignation marker and the performance dropped dramatically, which doesn't really speak for mere overfitting either.
Cross-validation is good and standard practice for small data sets, so I don't understand your point about not having tested an "independent group". What does that mean? Do you mean that the same participants have been imaged in another study before? In that case I agree that this might introduce a bias, but for exploratory research this seems perfectly reasonable, especially as they say they are carrying out a validation study right now (with other participants, I'm sure).
I also don't understand why you summarize the procedure for their biomarker in a way that makes it sound complicated and questionable without concretely criticizing anything about it. This is just unprofessional. And I don't think what they describe is very complicated.
This blog is generally a fun read, but this post has been disappointing.
Anonymouse - Thanks for your comment. Others who are much more expert in this realm (@NeuroStats) have pointed to problems with the leave-one-out procedure and linked to this paper (PDF):
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (1995), by Ron Kohavi.
As well, it was remarked: "Also leave one out cross validation should be abandoned in favor of 5-fold or 10-fold CV."
and: "The ML community realized this 2 decades ago. Test set size needs to be representative of overall heterogeniety"
I didn't engage in an in-depth critique of the biomarker itself because that was beyond the scope of the post. I didn't have time to read 3-4 other papers to determine how and why they chose a contrast of "guilt" vs. "indignation/anger towards others" from this long list of possible answers: guilt, contempt/disgust towards self, shame, indignation/anger towards self, indignation/anger towards others, contempt/disgust towards others, none, other feeling.
Why guilt and not shame, which is the more toxic emotion, according to many researchers and clinicians?
"Shame-proneness was strongly related to psychological maladjustment in general. Guilt-proneness was only moderately related to psychopathology; correlations were ascribable entirely to the shared variance between shame and guilt. Although clearly related to a depressogenic attributional style, shame accounted for substantial variance in depression, above and beyond attributional style." (Tangney et al., 1992)
To be honest, I wanted to write about one of the earlier papers by these authors over 2 yrs ago (and at subsequent times since), but found that an adequate review of the clinical and neuroimaging literatures was overwhelming. Other fMRI researchers have not found the same results for guilt, for instance.
Here's an illustration of the important differences between guilt and shame:
Anonymouse: "I also don't understand why you summarize the procedure for their biomarker in a way that makes it sound complicated and questionable without concretely criticizing anything about it. This is just unprofessional. And I don't think what they describe is very complicated.
This blog is generally a fun read, but this post has been disappointing."
Guilt reaction: "Huh, maybe I should have read more articles and explained the problem with the biomarker. Perhaps I can do that in a future post."
If I have wronged the authors, then I can correct this mistake and give them a more thorough reading. Guilt can be adaptive (one can correct the transgression).
Shame reaction: "OMG, I'm unprofessional and I've disappointed a reader. I'm a failure!"
I have wronged the authors and the readers; therefore I'm a terrible person unworthy of love. Just more proof that one is a bad person. This reaction is much more strongly related to depression than guilt (Tangney et al., 1992).
The participants made these types of guilt, anger, shame, etc. judgments on the study stimuli (e.g., "I acted stingily towards my best friend") after the scan was over. So we don't really know what the hemodynamic response was during this sort of evaluation. During the scan, subjects rated the intensity of their feeling (“extremely unpleasant” or “mildly unpleasant”), but did not rate the specific emotion they felt. This makes the biomarker further removed from actual guilt feelings, because the participants were busy making a decision about their generic unpleasantness level.
If I'm understanding everything correctly, I believe this is a more specific critique...
I generally agree, 10-fold CV should be used instead of LOOCV, but due to the small size of the data set, I doubt that the results would look much different from LOOCV (they should have at least reported the SD). So I don't think this is a reason to dismiss their results, especially not by suggesting that they are not "reality check[ing]" their system: they are, they could just be doing a better job. (I don't expect brain imaging people to do a good job when it comes to statistics.)
I don't know about the particular choice of emotions; that wasn't the point I criticized (not my field of study or interest). It sounds plausible that the biomarker itself is debatable; I just didn't like the way you presented it.
PS: You should really add Disqus to your blog.