Taetem Simms and Dr. Christophe Giraud-Carrier, Department of Computer Science
Introduction
Due to the rise of social media data, health scientists have become more engaged in using computational modeling to better understand health and health behavior. Pentland et al. have noted that daily use of technology leaves “digital breadcrumbs – tiny records of our daily experiences” that, when mined and analyzed, can provide insight into health behavior and health outcomes. Both computer science expertise and health expertise are critical in the collection and analysis of the associated data.
Following in the tradition of mining social media to address various public health concerns, we focus our study on Tumblr and the non-trivial issues associated with cognitive distortions. Cognitive distortions are exaggerated or irrational thought patterns believed to exacerbate the effects of negative psychological states or disorders, such as depression and anxiety. Recognizing these distortions is a key part of therapy for many individuals experiencing these disorders.
A system that automates the detection and labeling of cognitive distortions would allow patients and counselors to have more effective, and possibly fewer, sessions, resulting in improved quality of life and lower costs.
Methodology
We follow a traditional machine learning process, wherein we identify a data source and collect raw data from it, prepare that data by extracting features to create labeled vectors that form our training data, apply machine learning techniques, and finally evaluate the performance of the models we have created.
Personal blogs are a readily available analog to the practice of journaling in counseling. We chose as our data source the microblogging site Tumblr, which is public, supports tagging of posts, and offers an accessible API. Posts were gathered through the Tumblr API using the tags “personal,” “lonely,” “pathetic,” and “sad.” Of the 493 posts retrieved, 459 were deemed acceptable for analysis. The posts were hand-labeled as containing or not containing distortions under the direction of Dr. R. Lynn Richards, a clinical psychologist who also verified a cross-section of the labels.
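As a rough illustration of the collection step, the sketch below queries Tumblr's public `/v2/tagged` endpoint for each tag and merges the results, dropping posts that appear under more than one tag. The source does not say which endpoint or client was used, so the endpoint choice, the helper names (`build_tagged_url`, `fetch_tagged`, `merge_unique`), and the placeholder API key are all assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public "tagged" endpoint of the Tumblr API v2.
API_ROOT = "https://api.tumblr.com/v2/tagged"


def build_tagged_url(tag, api_key, before=None, limit=20):
    """Build the request URL for one page of posts carrying `tag`."""
    params = {"tag": tag, "api_key": api_key, "limit": limit}
    if before is not None:
        # `before` is a timestamp used to page backwards through older posts.
        params["before"] = before
    return API_ROOT + "?" + urllib.parse.urlencode(params)


def fetch_tagged(tag, api_key, before=None):
    """Fetch one page of tagged posts (performs a network request)."""
    with urllib.request.urlopen(build_tagged_url(tag, api_key, before)) as resp:
        return json.load(resp)["response"]


def merge_unique(batches):
    """Merge several lists of posts, keeping the first copy of each post id."""
    seen, merged = set(), []
    for batch in batches:
        for post in batch:
            if post["id"] not in seen:
                seen.add(post["id"])
                merged.append(post)
    return merged


# Hypothetical usage (requires a real API key and network access):
# tags = ["personal", "lonely", "pathetic", "sad"]
# posts = merge_unique(fetch_tagged(tag, "YOUR_API_KEY") for tag in tags)
```

Deduplicating by post id matters here because a single post can carry several of the four tags and would otherwise be counted more than once.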
After this labeling, Linguistic Inquiry and Word Count (LIWC) analysis was performed on each post, providing 93 input attributes. Nine of these were kept in the final feature set after the attributes were ranked with the attribute selection technique RELIEF. Five machine learning models (Decision Tree Learner, Logistic Regression, Naïve Bayes, Multilayer Perceptron, and K-Nearest-Neighbors) were then tested with these features. The accuracy of each model was evaluated using stratified 10-fold cross-validation: the dataset is split into 10 roughly equal subsets that preserve the class proportions, and each subset is used once for testing while the model is trained on the other nine.
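The stratified 10-fold procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the tooling used in the study: the function names, the `train_fn` and `predict_fn` callbacks, and the round-robin way of dealing examples into folds are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Split example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin, starting where the previous class
        # stopped, so that overall fold sizes differ by at most one.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Return accuracy pooled over the k held-out folds."""
    correct = 0
    for held_out in stratified_k_fold(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, features[i]) == labels[i]
                       for i in held_out)
    return correct / len(labels)
```

Any learner can be plugged in through `train_fn(X, y)` and `predict_fn(model, x)`. Because every example is held out exactly once, the pooled accuracy is computed over the whole dataset.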
Results
For our machine learning models, the default model (which always predicts a post as undistorted) gave 54.9% accuracy compared to the actual labels on the data. The best accuracy, 73.0%, was achieved by a trained logistic regression model. False positive (24.2%) and false negative (30.4%) error rates were also measured.
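To make these quantities concrete, the sketch below computes the accuracy of a majority-class baseline and the error rates of a set of predictions. The source does not report class counts or say exactly how its false positive and false negative rates were defined. The 252/207 split is therefore an assumption, chosen only because 252 of 459 rounds to the reported 54.9%, and the rate definitions (FP / (FP + TN) and FN / (FN + TP)) are the conventional ones, not necessarily those used in the study.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def error_rates(y_true, y_pred, positive=1):
    """Accuracy, false positive rate and false negative rate of predictions."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Assumed class counts (not reported in the source): 0 = undistorted, 1 = distorted.
labels = [0] * 252 + [1] * 207
label, accuracy = majority_baseline(labels)
print(label, round(accuracy, 3))  # prints: 0 0.549
```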
Discussion
Four attributes were highly significant (p < 0.01), namely I, Negate, Tone, and Leisure (see Table II). The first two seem particularly intuitive, as cognitive distortion is likely to focus on the self and to involve a significant number of negative expressions. In addition, our trained decision tree model, with 69.9% accuracy (see Fig. 1), provides perspective on the contribution of the input attributes to the measured outcome.
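The source does not state which test produced these p-values. One common choice for logistic regression coefficients is the Wald test, which divides a coefficient by its standard error and compares the result to a standard normal distribution. The sketch below shows that calculation under this assumption. The function name is hypothetical and the calls in the usage comment use made-up numbers, not values from the study.

```python
import math


def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for H0: coefficient = 0, using a normal approximation."""
    z = coefficient / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative call with made-up numbers:
# wald_p_value(0.9, 0.3) corresponds to z = 3, which falls below the 0.01 threshold.
```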
Conclusion
We have shown how machine learning can be used to detect cognitive distortion from personal blogs. Our induced model exhibits a good false positive rate (24.2%) and a false negative rate (30.4%) significantly lower than the default (45.1%). Furthermore, the model is consistent with intuition, as the strongest indicators of cognitive distortion are a focus on first-person accounts and above-average use of negations.
We also recognize that our ground truth is likely not 100% accurate. There is inherent uncertainty in the labeling of text for cognitive distortion.
We are encouraged by these preliminary results and argue that continued work in this area will strengthen mental health care and psychotherapy, yielding lower costs, earlier detection, and better use of counseling time.