Can a Computer Algorithm Identify Suicidal People from Brain Scans? The Answer Won't Surprise You
Death by suicide is a preventable tragedy if the suicidal individual is identified and receives appropriate treatment. Unfortunately, some suicidal individuals do not signal their intent, and others do not receive essential assistance. Youths with severe suicidal ideation are often not taken seriously, and thus are not admitted to emergency rooms. A common scenario is that resources are scarce, the ER is backed up, and a cursory clinical assessment determines who is admitted and who is sent away. From a practical standpoint, using fMRI to determine suicide risk is a non-starter.
Yet here we are, with media coverage blaring that an "Algorithm can identify suicidal people using brain scans" and that "Brain Patterns May Predict People At Risk Of Suicide." These media pieces herald a new study claiming that fMRI can predict suicidal ideation with 91% accuracy (Just et al. 2017). The authors applied a complex algorithm (machine learning) to analyze brain scans obtained using a highly specialized protocol to examine semantic and emotional responses to life and death concepts.
Let me unpack that a bit. The scans of 17 young adults with suicidal ideation (thoughts about suicide) were compared to those from another 17 participants without suicidal ideation. A computer algorithm (Gaussian Naive Bayes) was trained on the neural responses to death-related and suicide-related words, and correctly classified 15 out of 17 suicidal ideators (88% sensitivity) and 16 out of 17 controls (94% specificity). Are these results too good to be true? Yes, probably. And yet they're not good enough, because two at-risk individuals were not picked up.
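For readers who want to see the basic recipe, here is a minimal sketch of a Gaussian Naive Bayes classifier evaluated with leave-one-out cross-validation on a 17-vs-17 sample. This is not the authors' actual pipeline; the feature matrix is random placeholder data standing in for the fMRI-derived features, and the sensitivity/specificity comments simply note the figures reported in the paper.

```python
# A minimal sketch (not the authors' pipeline): Gaussian Naive Bayes with
# leave-one-out cross-validation on a 17-vs-17 sample. The feature matrix
# here is random placeholder data standing in for the fMRI-derived features.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 25))        # 34 participants x 25 placeholder features
y = np.array([1] * 17 + [0] * 17)    # 1 = suicidal ideator, 0 = control

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])

sensitivity = np.sum((preds == 1) & (y == 1)) / np.sum(y == 1)  # paper reports 15/17
specificity = np.sum((preds == 0) & (y == 0)) / np.sum(y == 0)  # paper reports 16/17
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```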
The computational methods used to classify the suicidal vs. control groups are suspect, according to many machine learning experts on social media. One problem is known as “overfitting”: using too many parameters estimated from a small sample, which may not generalize to new samples. The key metric is whether the algorithm can classify individuals from independent, out-of-sample populations, and we don't know that for sure. Another concern is the leave-one-out cross-validation procedure itself. I'm not an expert here, so the Twitter threads that start below (and here) are your best bet.
ML re suicide, 90% correct, 2 groups of 17. Shiny journal. Anyone see any problems ? https://t.co/mgQ8tW6s5w @tyrell_turing— KordingLab (@KordingLab) October 31, 2017
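To make the overfitting worry concrete, here is a toy demonstration of my own (not a claim about what Just et al. actually did): if the "most discriminating" features are chosen using all participants and only the classifier is cross-validated, pure noise can produce impressive leave-one-out accuracy in a 17-vs-17 design, whereas nesting the selection inside each fold stays near chance.

```python
# A toy demonstration of the overfitting worry (my illustration, not a claim
# about what the authors did): with pure-noise "voxels" and a 17-vs-17 design,
# selecting the 25 "most discriminating" features on the FULL sample and only
# cross-validating the classifier yields inflated leave-one-out accuracy.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(34, 2000))      # noise features: no real group difference
y = np.array([1] * 17 + [0] * 17)

# Biased: feature selection sees all participants before cross-validation
X_sel = SelectKBest(f_classif, k=25).fit_transform(X, y)
biased = cross_val_predict(GaussianNB(), X_sel, y, cv=LeaveOneOut())

# Unbiased: feature selection is refit inside every leave-one-out fold
pipe = make_pipeline(SelectKBest(f_classif, k=25), GaussianNB())
unbiased = cross_val_predict(pipe, X, y, cv=LeaveOneOut())

print("selection outside CV:", np.mean(biased == y))   # typically well above 0.5
print("selection inside CV: ", np.mean(unbiased == y)) # typically near 0.5
```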
For the rest of this post, I'll raise other issues about this study that concerned me.
Why use an expensive technology in the first place?
The rationale for this included some questionable statements.
- ...predictions by both clinicians and patients of future suicide risk have been shown to be relatively poor predictors of future suicide attempt2,3.
- ...the implicit association of death/suicide with self was associated with an approximately 6-fold increase in the odds of making a suicide attempt in the next 6 months, exceeding the predictive validity of known risk factors (e.g., depression, suicide-attempt history) and both patients’ and clinicians’ predictions.

But let's go ahead with an fMRI study that will be far more accurate than a short and easy-to-administer computerized test!¹
- Nearly 80% of patients who die by suicide deny suicidal ideation in their last contact with a mental healthcare professional4.
How do you measure the neural correlates of suicidal thoughts?
This is a tough one, but the authors propose to uncover the neural signatures of specific concepts, as well as the emotions they evoke:
...the neural signature of the test concepts was treated as a decomposable biomarker of thought processes that can be used to pinpoint particular components of the alteration [in participants with suicidal ideation]. This decomposition attempts to specify a particular component of the neural signature that is altered, namely, the emotional component...
How do you choose which concepts and emotions to measure?
The “concepts” were words from three different categories (although the designation of Suicide vs. Negative seems arbitrary for some of the stimuli). The set of 30 words was presented six times, with each word shown for three seconds followed by a four-second blank screen. Subjects were “asked to actively think about the concepts ... while they were displayed, thinking about their main properties (and filling in details that come to mind) and attempting consistency across presentations.”
The “emotion signatures” were derived from a prior study (Kassam et al., 2013) that asked method actors to self-induce nine emotional states (anger, disgust, envy, fear, happiness, lust, pride, sadness, and shame). The emotional states selected for the present study were anger, pride, sadness, and shame (all chosen post hoc). Should we expect emotion signatures that are self-induced by actors to be the same as emotion signatures that are evoked by words? Should we expect a universal emotional response to Comfort or Evil or Apathy?
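As a rough illustration of what "applying emotion signatures" could mean in practice, here is one way to express a concept's activation pattern as a weighted combination of the four emotion signatures via least squares. This is only my guess at the general idea, with random placeholder data; the actual decomposition in Just et al. may well differ.

```python
# A rough sketch of expressing one concept's activation pattern as a weighted
# combination of four emotion signatures (anger, pride, sadness, shame) via
# least squares. Signatures and pattern are random placeholders; the actual
# decomposition in Just et al. may differ.
import numpy as np

rng = np.random.default_rng(2)
n_voxels = 25
emotions = ["anger", "pride", "sadness", "shame"]

# Placeholder signatures (in the study these came from Kassam et al., 2013)
signatures = rng.normal(size=(n_voxels, len(emotions)))

# Placeholder activation pattern for one concept (e.g., 'death') in one subject
concept_pattern = rng.normal(size=n_voxels)

# Least-squares weights: how strongly each emotion signature is expressed
weights, *_ = np.linalg.lstsq(signatures, concept_pattern, rcond=None)
for name, w in zip(emotions, weights):
    print(f"{name:8s} {w:+.3f}")
```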
Six words (death, carefree, good, cruelty, praise, and trouble, in descending order of discriminative power) and five brain regions (left superior medial frontal, medial frontal/anterior cingulate, right middle temporal, left inferior parietal, and left inferior frontal) from a whole-brain analysis (that excluded bilateral occipital lobes for some reason) provided the most accurate discrimination between the two groups. Why these specific words and these twenty-five voxels in particular? It doesn't matter.
The neural representation of each concept, as used by the classifier, consisted of the mean activation level of the five most stable voxels in each of the five most discriminating locations.

...and...

All of these regions, especially the left superior medial frontal area and medial frontal/anterior cingulate, have repeatedly been strongly associated with self-referential thought...

...and...
...the concept of ‘death’ evoked more shame, whereas the concept of ‘trouble’ evoked more sadness in the suicidal ideator group. ‘Trouble’ also evoked less anger in the suicidal ideator group than in the control group. The positive concept ‘carefree’ evoked less pride in the suicidal ideator group. This pattern of differences in emotional response suggests that the altered perspective in suicidal ideation may reflect a resigned acceptance of a current or future negative state of affairs, manifested by listlessness, defeat and a degree of anhedonia (less pride evoked in the concept of ‘carefree’) [why not less pride to 'praise' or 'superior'? who knows...]
Not that this involves circularity or reverse inference or HARKing or anything...
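For concreteness, here is my reading of the "stable voxel" feature construction described in the quotes above: a voxel counts as stable if its profile of responses across the 30 concepts is consistent across the six presentations, and each region then contributes the mean activation of its most stable voxels per concept. This is a sketch under those assumptions, with placeholder data, not a verified reimplementation of the authors' code.

```python
# A sketch of the "stable voxel" feature construction, under my reading of the
# Methods (not a verified reimplementation): stability = consistency of a
# voxel's 30-concept response profile across the six presentations; features
# are the mean activation of each region's most stable voxels per concept.
import numpy as np

rng = np.random.default_rng(3)
n_concepts, n_presentations, n_voxels = 30, 6, 200
# Placeholder activations for one region: concept x presentation x voxel
act = rng.normal(size=(n_concepts, n_presentations, n_voxels))

def stability(voxel_responses):
    """Mean pairwise correlation of a voxel's concept profile across presentations."""
    profiles = voxel_responses.T                 # presentations x concepts
    corr = np.corrcoef(profiles)                 # presentations x presentations
    return corr[np.triu_indices_from(corr, k=1)].mean()

scores = np.array([stability(act[:, :, v]) for v in range(n_voxels)])
top5 = np.argsort(scores)[-5:]                   # five most stable voxels here

# One feature per concept from this region: mean activation over stable voxels
region_features = act[:, :, top5].mean(axis=(1, 2))
print(region_features.shape)                     # (30,)
```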
How can a method that excludes data from 55% of the target participants be useful??
This one seems like a showstopper. A total of 38 suicidal participants were scanned, but those who did not show the desired semantic effects were excluded due to “poor data quality”:
The neurosemantic analyses ... are based on 34 participants, 17 participants per group whose fMRI data quality was sufficient for accurate (normalized rank accuracy > 0.6) identification of the 30 individual concepts from their fMRI signatures. The selection of participants included in the primary analyses was based only on the technical quality of the fMRI data. The data quality was assessed in terms of the ability of a classifier to identify which of the 30 individual concepts they were thinking about with a rank accuracy of at least 0.6, based on the neural signatures evoked by the concepts. The participants who met this criterion also showed less head motion (t(77) = 2.73, P < 0.01). The criterion was not based on group discriminability.
This logic seems circular to me, despite the claim that inclusion wasn't based on group classification accuracy. Seriously, if you throw out over half of your subjects, how can your method ever be useful? Nonetheless, the 21 “poor data quality” ideators with excessive head motion and bad semantic signatures were used in an out-of-sample analysis that also revealed relatively high classification accuracy (87%) compared to the data from the same 17 “good” controls (the data from 24 “bad” controls were excluded, apparently).
We attribute the suboptimal fMRI data quality (inaccurate concept identification from its neural signature) of the excluded participants to some combination of excessive head motion and an inability to sustain attention to the task of repeatedly thinking about each stimulus concept for 3 s over a 30-min testing period.
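For readers unfamiliar with the inclusion metric quoted above, here is a sketch of normalized rank accuracy under a common definition: where the true concept falls in the classifier's ranked guesses, scaled so that 1.0 means the top guess is always correct and 0.5 is the expected chance level. The scores below are random placeholders, not the study's data.

```python
# A sketch of "normalized rank accuracy" under a common definition: where the
# true concept falls in the classifier's ranked guesses, scaled so 1.0 means
# the top guess is always correct and 0.5 is the expected chance level.
import numpy as np

def normalized_rank_accuracy(scores, true_idx):
    """scores: (n_items, n_classes) classifier scores; true_idx: true class per item."""
    n_items, n_classes = scores.shape
    accs = []
    for i in range(n_items):
        order = np.argsort(scores[i])[::-1]                 # best guess first
        rank = int(np.where(order == true_idx[i])[0][0])    # 0 = top guess
        accs.append(1.0 - rank / (n_classes - 1))
    return float(np.mean(accs))

# Toy usage: random scores over 30 concepts hover around 0.5, below the 0.6 cutoff
rng = np.random.default_rng(4)
print(normalized_rank_accuracy(rng.normal(size=(30, 30)), np.arange(30)))
```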
Furthermore, another classifier was even more accurate (94%) in discriminating suicidal ideators who had made a suicide attempt (n=9) from those who had not (n=8), although the out-of-sample accuracy for the excluded 21 was only 61%. Perhaps I'm misunderstanding something here, but I'm puzzled...
I commend the authors for studying a neglected clinical group, but wish they were more rigorous, didn't overinterpret their results, and didn't overhype the miracle of machine learning.
Crisis Text Line [741741 in the US] uses machine learning to prioritize its call load based on word usage and emojis. There is a great variety of intersectional risk factors that may lead someone to death by suicide, and at present no method can capture the full diversity of those who will cross that line.
If you are feeling suicidal or know someone who might be, here is a link to a directory of online and mobile suicide help services.
Footnote
1 I won't discuss the problematic nature of the IAT here.
References
Just MA, Pan L, Cherkassky VL, McMakin DL, Cha C, Nock MK, Brent D (2017). Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth. Nature Human Behaviour. Published online 30 October 2017.
Kassam KS, Markey AR, Cherkassky VL, Loewenstein G, Just MA. (2013). Identifying Emotions on the Basis of Neural Activation. PLoS One. 8(6):e66032.
Nock MK, Park JM, Finn CT, Deliberto TL, Dour HJ, Banaji MR. (2010). Measuring the suicidal mind: implicit cognition predicts suicidal behavior. Psychol Sci. 21(4):511-7.
5 Comments:
Thanks for this useful summary. I hope you will link to it from PubMed Commons so it will be found by those searching via PubMed.
There was just one thing you said that I disagreed with: 'I commend the authors for studying a neglected clinical group'. I had the opposite reaction, namely, I am concerned that the authors are taking a vulnerable group of young people, putting them in a brain scanner where they are asked to 'actively think about' words like 'death', 'hopeless' and 'overdose', and then declaring that their brain responses can be used to predict their suicidality. Bear in mind that half the group with suicidal ideation had previously attempted suicide (defined as 'potentially self-injurious behaviour with some non-zero intention of dying'). I wonder exactly how this study was explained to the participants, and whether they were supported afterwards to ensure that participation in the study, and the subsequent publicity of findings, did not have any adverse effects on them.
Prof Bishop -- Thanks for your comments. You raise a number of good points about subject safety and follow-up that were not addressed in the main paper or Supplementary Materials. The omission of this information would not be acceptable in a clinical journal. If the authors are going to state...
"In future prospective studies, it would be of great interest to learn whether our neurosemantic assessments are useful in monitoring for current suicidal risk and in predicting future suicide attempts. If so, this approach could be useful for monitoring ongoing suicidal risk and response to treatment."
...they should be very clear about what they did to debrief the current participants, to assess their emotional state, and to follow them in the future. In the 2010 Psych Sci paper by Nock et al., the authors discuss follow-ups that were part of the study design, but not an immediate debriefing about the IAT (which included "death" words (i.e., die, dead, deceased, lifeless, and suicide), "life" words, "self" words, and "other" words).
In Glenn, Nock et al. 2017, the authors administered three self-harm IATs: Self + Cutting using pictures, Self + Suicide using words, Self + Death using words. The cutting pictures were especially troubling, and very likely triggering for active cutters. This study presented a debriefing note to participants after IAT completion, with an option to view their results. And:
"If an individual responded to any item regarding current desire to die/hurt oneself with “extremely,” he or she received a special note on the first debriefing page:
Your responses on the survey you filled out suggest that you may want to hurt yourself or die. We are concerned that you are having these thoughts. We encourage you to contact someone who can help you cope with these thoughts and with any stressful events that may have caused them.
Additional mental health resource information was provided below this note..."
Obviously, Just et al. are aware of the potential consequences of 'actively thinking about' words like 'death', but they didn't state what they did to support the subjects after participation (and after publication of results, as you noted).
My comment commending the authors for studying a group of young adults with suicidal ideation was meant to highlight the fact that most investigators don't want to study this population. They're excluded from many (if not most) studies of depression and various treatment modalities. Some of these individuals would participate in research if they could.
Don't you think it is important that this study began with 79 people? More than half of them were excluded from the final analyses. The paper says "The selection of participants included in the primary analyses was based only on the technical quality of the fMRI data. The data quality was assessed in terms of the ability of a classifier to identify which of the 30 individual concepts they were thinking about with a rank accuracy of at least 0.6, based on the neural signatures evoked by the concepts." So the machine learning algorithm can't even be used for more than half of this sample of subjects. Shouldn't this be emphasized when considering whether this approach could ever be clinically useful?
A second point: even if everything in this paper is taken at face value, the most it shows is that the imaging data can be used to "predict" questionnaire scores. Why do we need to predict these scores when we already have them? Genuine prediction could only be demonstrated if the imaging data and the questionnaire data were both used to predict what each subject will do in the future.
Anonymous -- Yes, I did think it was important that the study began with 79 people and excluded over half of them, which is why I wrote:
"How can a method that excludes data from 55% of the target participants be useful??
This one seems like a showstopper. A total of 38 suicidal participants were scanned, but those who did not show the desired semantic effects were excluded due to 'poor data quality':
The neurosemantic analyses ... are based on 34 participants, 17 participants per group whose fMRI data quality was sufficient for accurate (normalized rank accuracy > 0.6) identification of the 30 individual concepts from their fMRI signatures."
And I also said:
"...Seriously, if you throw out over half of your subjects, how can your method ever be useful? Nonetheless, the 21 “poor data quality” ideators with excessive head motion and bad semantic signatures were used in an out-of-sample analysis that also revealed relatively high classification accuracy (87%) compared to the data from the same 17 “good” controls (the data from 24 “bad” controls were excluded, apparently)."
They didn't throw out half of their sample. They used half of the sample to train the model. They used the remainder as validation.