Note to reader: to avoid replicating code and plots here, this appendix is meant to be read alongside the interactive Shiny app we built (found here: danielnjoo.shinyapps.io/shiny/).

Abstract

We explored one of the few available datasets that matches (self-reported) scores on a personality test with Twitter account metrics. We found that in the roughly 3000 available observations, personality scores on all 8 available dimensions (Big5 and Dark Triad) were unimodal and symmetric except for openness and psychopathy, which were left and right skewed respectively. We also found moderately strong relationships involving Twitter metrics such as Kloutscore and Percent Original Tweets, but we were unable to reproduce these variables for new Twitter users. We then built a prediction algorithm that assigns an out-of-sample Twitter user a ‘personality type’ based on their nearest neighbors in our dataset. Our model fared only marginally better than random chance, but the results were promising, and accurate classification appears feasible with a larger dataset and access to more meaningful variables, such as those derived from textual analysis.

Introduction

We were interested in exploring how personality relates to online presence because, as digitally-native millennials, we divulge enormous amounts of personal information about ourselves every day, and we know that employers run a quick Google search before any hiring decision. So we wondered: what do employers hope to infer from our online presence? Our hypothesis was that they hope to infer some sense of what ‘kind of person’ somebody is from their Facebook, Twitter, or Instagram.

To test this hypothesis, we built a prediction algorithm to see whether we could predict a user’s personality type from metrics of online presence. We found a dataset that had both metrics of personality and of online presence, which is hard to find due to anonymity concerns. Of the clustering methods known to us, k-means made the most sense, so we used it to cluster the dataset, with cluster membership representing the personality type a user belongs to. In this context, each type’s ‘average user’ is represented by its cluster’s centroid.
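A minimal sketch of this clustering step, assuming a data frame personality that holds the 8 personality-score columns (all object names here are illustrative):

# k-means on the 8 personality dimensions; k = 3 gave the most
# meaningful clusters in our exploration (multiple starts for stability)
set.seed(1)
km <- kmeans(scale(personality), centers = 3, nstart = 25)
km$centers    # each centroid is the 'average user' of one personality type
types <- km$cluster    # cluster membership = assigned personality type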

We predicted a ‘wild’ Twitter user’s personality type by obtaining, via the twitteR package, his/her reproducible Twitter metrics that also appeared in our dataset (favourites_count, followers_count, friends_count, statuses_count). We then used these variables to find the user’s 100 nearest neighbors in our dataset (whose types come from the same k-means clustering, i.e. the same k = # of centroids) and chose the most frequently occurring type among them as our prediction.
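A minimal sketch of this prediction step, assuming train is a data frame of the 4 reproducible metrics for our dataset and types holds the cluster labels from above (getUser requires prior authentication via setup_twitter_oauth; the function name predict_type is ours):

library(twitteR)
predict_type <- function(screen_name, train, types, k_nn = 100) {
  u <- getUser(screen_name)    # fetch the wild user's account metrics
  x <- c(u$favoritesCount, u$followersCount, u$friendsCount, u$statusesCount)
  d <- sqrt(rowSums(sweep(train, 2, x)^2))    # Euclidean distance to every row
  nn <- order(d)[1:k_nn]                      # the 100 nearest neighbors
  names(which.max(table(types[nn])))          # majority vote among them
}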

With k=3, when we tested our prediction algorithm’s accuracy by running it 100 times on randomly sampled observations from the dataset (in-sample), we obtained a maximum accuracy of around 40% with certain seeds, but less than the 33% expected from random guessing most of the time.

We hypothesized that this was due to the inherent variability in the underlying dataset (evidenced by the large overlapping regions in the cluster visualizations, even though the first 2 principal components represent less than 70% of total variability), to our use of k-means to create ‘meaningful’ categories, and to the lack of strong correlations between the reproducible variables our prediction algorithm used and actual personality types.

Data

library(readxl)
library(tidyverse)
# load the Online Privacy Foundation dataset
data <- readxl::read_xlsx('./shiny/twitter_data.xlsx')

We obtained the dataset used in a Kaggle competition (https://www.kaggle.com/c/twitter-personality-prediction) hosted 5 years ago by the Online Privacy Foundation. This dataset was sent to us in private communication with a staff member of the Online Privacy Foundation.

data %>% dim()
## [1] 2930  587
names(data)[-grep("X__",names(data))]
##  [1] "Big Five"                                           
##  [2] "Dark Triad"                                         
##  [3] "Privacy"                                            
##  [4] "Kloutscore"                                         
##  [5] "Twitter Attributes"                                 
##  [6] "Percentage Tweets"                                  
##  [7] "All Tweet Data"                                     
##  [8] "Original Tweets"                                    
##  [9] "LIWC Replies"                                       
## [10] "LIWC Retweets"                                      
## [11] "LIWC for days a tweeter, tweeted more than 10 times"
## [12] "LIWC for days a tweeter, tweeted more than 40 times"
## [13] "Frequency ofTweets (All)"                           
## [14] "Freq of original content"                           
## [15] "Frequency of replies"                               
## [16] "Frequency of retweets"                              
## [17] "Frequency of follow fridays"                        
## [18] "Frequency of the C word"

Our dataset featured 587 columns, 2930 observations, and 18 broad categories of variable type.

Variables

Main variables of interest:

  • Big5 and Dark Triad: these came from self-reported scores and were scaled from 0-7 and 0-5 respectively.

Other variables of interest:

  • Favourites Count, Followers Count, Friends Count, Statuses Count (these were the reproducible ones), as well as Kloutscore, Percent Original, Percent Retweet, and Percent Replies. These were all numeric variables with intuitive scales; Kloutscore runs from 0-100 and is available through the Twitter API but not through the twitteR package.

Univariate analysis

In terms of univariate analysis of the personality variables, we found unimodal symmetric distributions in all cases except for openness and psychopathy, which were left and right skewed respectively.

In terms of univariate analysis of the Twitter attributes, we found extremely right skewed unimodal distributions for all the counts because of a few observations with millions of favourites, followers, friends, statuses.

## [1] "max favourites count: 14338, mean favourites count: 299, while sd was: 1236"
## [1] "max followers count: 2919319, mean followers count: 5273, while sd was: 113209"
## [1] "max friends count: 30435, mean friends count: 547, while sd was: 1374"
## [1] "max statuses count: 127954, mean statuses count: 11121, while sd was: 15765"

Meanwhile, the other Twitter metrics were unimodal except for Kloutscore (slightly bimodal), and all were right skewed except for Percent Original, which was symmetrical.

All of these univariate plots can be explored in our interactive Shiny app (https://danielnjoo.shinyapps.io/shiny/).

Results

Our main deliverables were the exploratory analysis we did and the prediction algorithm we built.

Exploratory analysis

The exploratory analysis can be categorized into three parts:

  1. relationships between personality traits and Twitter attributes: we found meaningful relationships when Twitter attributes were logarithmically transformed, notably between Openness and Favourites Count, and between Narcissism and Kloutscore.

  2. cluster creation: in the Shiny app, k can be selected, and bar plots show what each cluster’s centroid looks like, which can be read as the ‘average’ personality traits of a user in that personality type. We found meaningful results at k=3.

  3. how clusters relate to Twitter behavior: we found meaningful differences in medians and IQRs (via boxplots, sketched after this list) in Followers Count, Kloutscore, and Percent Original at k=3. This can be interpreted as meaning that, across the 3 personality types created via k-means clustering, there were meaningful differences in those 3 Twitter attributes.
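A minimal sketch of the item-3 boxplots, assuming data_with_cat carries each user’s assigned type in a cat column alongside the Twitter metrics (both names are illustrative):

# boxplots of one Twitter attribute by personality type (k = 3);
# the log10 transform tames the heavy right skew of the counts
data_with_cat %>%
  ggplot(aes(factor(cat), log10(followers_count))) +
  geom_boxplot() +
  labs(x = 'personality type', y = 'log10 followers count')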

Prediction algorithm

Our prediction algorithm, as explained earlier, is evaluated in the next section.

Diagnostics

Below we test our prediction algorithm by running it 100 times on randomly sampled observations from the dataset (in-sample). It is first run with all 8 variables outlined earlier (data_with_cat_full), then with only the 4 variables that were reproducible using the twitteR package.

Proportion tables of the runs are also printed and demonstrate the model’s heavy bias toward one particular personality type.
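A minimal sketch of this diagnostic loop, reusing the kNN idea from the Introduction (train and types are the same assumed objects as before; each sampled row is excluded from its own neighbor set):

set.seed(1)
idx <- sample(nrow(train), 100)    # 100 random in-sample observations
preds <- sapply(idx, function(i) {
  d <- sqrt(rowSums(sweep(train[-i, ], 2, unlist(train[i, ]))^2))
  nn <- order(d)[1:100]
  names(which.max(table(types[-i][nn])))
})
mean(preds == types[idx])     # cumulative accuracy across the 100 runs
prop.table(table(preds))      # reveals the bias toward one type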

## [1] "with all 8 attributes"
## [1] "at step 10, cumulative accuracy is: 0.4"
## [1] "at step 20, cumulative accuracy is: 0.25"
## [1] "at step 30, cumulative accuracy is: 0.3"
## [1] "at step 40, cumulative accuracy is: 0.35"
## [1] "at step 50, cumulative accuracy is: 0.32"
## [1] "at step 60, cumulative accuracy is: 0.333333333333333"
## [1] "at step 70, cumulative accuracy is: 0.371428571428571"
## [1] "at step 80, cumulative accuracy is: 0.375"
## [1] "at step 90, cumulative accuracy is: 0.388888888888889"
## [1] "at step 100, cumulative accuracy is: 0.4"
## [1] "accuracy: 0.4"
## preds
##    1    2    3 
## 0.72 0.02 0.26
## [1] "with the 4 reproducible attributes"
## [1] "at step 10, cumulative accuracy is: 0.1"
## [1] "at step 20, cumulative accuracy is: 0.45"
## [1] "at step 30, cumulative accuracy is: 0.433333333333333"
## [1] "at step 40, cumulative accuracy is: 0.45"
## [1] "at step 50, cumulative accuracy is: 0.4"
## [1] "at step 60, cumulative accuracy is: 0.383333333333333"
## [1] "at step 70, cumulative accuracy is: 0.385714285714286"
## [1] "at step 80, cumulative accuracy is: 0.4"
## [1] "at step 90, cumulative accuracy is: 0.377777777777778"
## [1] "at step 100, cumulative accuracy is: 0.37"
## [1] "accuracy: 0.37"
## preds
##    1    2    3 
## 0.77 0.01 0.22
# baseline prediction should be based on highest occurring category
temp2$cat[-(1:3)] %>% table() %>% prop.table()
## .
##         1         2         3 
## 0.3771780 0.2651179 0.3577041

(With a seed of 1)

Against a baseline prediction of 37.7% (the proportion of the highest occurring type, 1), our model using reproducible Twitter variables actually does worse than the heuristic of simply predicting type 1 all the time, but its 37% accuracy is still about 4 percentage points better than random chance (33%).

The model using the more meaningful but irreproducible Twitter variables fared better, with 40% predictive accuracy.

Of course, one problem with our method is that we evaluate it on in-sample data that was also used to make the clusters. To make the method sounder, we would have to implement a train-test split, as sketched below. But as far as proof of concept goes, we think there is some validity in using Twitter attributes to predict personality type; we evaluate this conclusion in the next section.
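A minimal sketch of that sounder evaluation, under the same assumed objects as before (the 20% split fraction is arbitrary, and a real analysis should reuse the training means and sds when scaling the held-out rows):

# hold out 20% of rows; fit the clusters on the training portion only
set.seed(1)
test_idx <- sample(nrow(personality), floor(0.2 * nrow(personality)))
km <- kmeans(scale(personality[-test_idx, ]), centers = 3, nstart = 25)
# label each held-out row by its nearest training centroid...
true_test <- apply(scale(personality[test_idx, ]), 1, function(p)
  which.min(colSums((t(km$centers) - p)^2)))
# ...then compare those labels against kNN predictions built from the
# held-out rows' Twitter metrics and the training rows only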

Conclusion

The elephant in the room regarding our approach is that the dataset was not only small (n < 3000) but also subject to substantial voluntary response bias. A sounder approach to prediction would involve a larger dataset collected in a way that mitigates voluntary response bias.

Further, an issue with using k-means to cluster our dataset into ‘personality types’ is that the centroid-based approach of k-means produces groups that are, by construction, different from each other; this does not mean that the groups are meaningful. For example, in the cluster visualization plot in the Shiny app, we see significant regions where points could conceivably belong to 2 or more groups.

Worse yet, this visualization only maps the first 2 principal components of the clustered data, and as we see below in the case of the Big5 subset, these components don’t even explain 80% of the variability in the 5 personality dimensions; they explain only 51.3%.

# PCA on the Big5 subset (the first 3 rows of the sheet are header metadata)
big5 <- data[4:nrow(data), 2:6] %>% sapply(as.numeric) %>% as.data.frame()
names(big5) <- data[3, 2:6] %>% unlist()    # row 3 holds the trait names
big5 %>%
  log() %>%
  prcomp(center = TRUE, scale. = TRUE) %>%
  summary()
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.2223 1.0358 0.9699 0.8752 0.8524
## Proportion of Variance 0.2988 0.2146 0.1881 0.1532 0.1453
## Cumulative Proportion  0.2988 0.5134 0.7015 0.8547 1.0000

We conclude that our approach provides some proof of concept that personality types can be inferred from online presence, provided an appropriate dataset (or a theoretical understanding of what constitutes a personality type) is used to create those personality types, together with meaningful variables that measure online presence. Some of the most meaningful variables we saw in the dataset were the results of textual analysis (the LIWC API; see https://liwc.wpengine.com/), which we were unable to reproduce due to lack of access to the API.

One such plot is shown below: negemo, a measure of the negative emotion expressed in a user’s tweet language, against a logarithmic transform of psychopathy.

data_with_names %>%
  ggplot(aes(negemo, log(psychopathy))) +
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm')