With the rise of social media usage and the explosion of available information, user profiling has become an important tool in security, forensics, marketing and other fields. For instance, from a forensic linguistics perspective, one can determine the linguistic profile of a user sending harassing text messages or identify a fake user account. From a marketing viewpoint, companies can learn what kind of people like or dislike their products just on the basis of online product reviews. For example, our team at Kraken Systems ran such a profile analysis of Twitter posts to match a user to a fashion designer or a book author. That resulted in great recommendations and led to higher user engagement for the client.
User profiling consists of discovering as much insight as possible about an unknown user just by analysing their online posts. Users can be separated into classes by studying their sociolect, i.e. how long their sentences are, what words they use, and so on. With this information, it is possible to identify a user’s gender, age, native language and personality type.
Studies have found that women use more words expressing emotions (e.g., ‘excited’), emoticons (almost three times more than men), exclamation marks and first-person singular pronouns, and they mention more psychological and social processes (e.g., ‘love you’); while men use more swear words, object references (e.g., ‘xbox’ or ‘pc’) and political or sports references.
The predominant topics for each age group follow the life span: from school/college to work and family. Some topics are even more age specific, such as excessive drinking for 19-22 year olds (e.g. ‘puked’, ‘hangover’, ‘wasted’), and more reserved beer-related phrases for 23-29 year olds (e.g. ‘beer’, ‘drinking’, ‘ale’).
Younger users tend to use more internet slang (such as ‘lol’ and ‘omg’), hashtags, all caps and lengthened words (e.g. ‘whaaaaaaat’); while older users tend to have a longer average post length and a greater number of replies. As people age, the use of ‘we’ increases approximately linearly, along with references to friendships and relationships, whereas the use of ‘I’ decreases.
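As a rough illustration of how such age cues could be turned into numeric features, here is a minimal Python sketch. The function name `style_features`, the tiny `SLANG` set and the regular expressions are assumptions made for this example, not the feature set used in the cited study or in our own system.

```python
import re

# Illustrative slang list (an assumption); a real system would use a much larger lexicon.
SLANG = {"lol", "omg", "brb", "idk"}

def style_features(post: str) -> dict:
    """Count a few surface cues in a single post."""
    # Tokens are words (with an optional apostrophe part) or hashtags.
    tokens = re.findall(r"#?\w+(?:'\w+)?", post)
    words = [t for t in tokens if not t.startswith("#")]
    lowered = [w.lower() for w in words]
    n = max(len(words), 1)  # avoid division by zero on empty posts
    return {
        "n_words": len(words),
        "slang_rate": sum(w in SLANG for w in lowered) / n,
        "hashtags": sum(t.startswith("#") for t in tokens),
        # Words of two or more characters written entirely in upper case.
        "all_caps": sum(len(w) > 1 and w.isupper() for w in words),
        # Words where one character repeats three or more times in a row.
        "lengthened": sum(bool(re.search(r"(\w)\1{2,}", w)) for w in lowered),
        "i_rate": lowered.count("i") / n,
        "we_rate": lowered.count("we") / n,
    }

features = style_features("OMG whaaaaaaat #monday lol I can't even")
```

Averaging these values over all of a user's posts gives one feature vector per user, which is the kind of input a classifier would then consume.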
Further, a user’s personality profile can be described using five traits, the so-called ‘Big Five personality traits’:
- extraversion - outgoing, amicable, assertive, energetic;
- neuroticism - anxious, insecure, sensitive;
- agreeableness - cooperative, helpful, nurturing, trusting;
- conscientiousness - responsible, organized, persevering;
- openness to experience - curious, intelligent, creative.
Extraverts are more likely to mention social words such as ‘party’, ‘love you’, ‘boys’ and ‘ladies’, whereas introverts use words related to solitary activities such as ‘computer’, ‘internet’ and ‘reading’. People who rate higher in neuroticism (i.e. who are less emotionally stable) tend to use more anxiety words (like ‘worry’) and shorter sentences. Openness is associated with ‘music’, ‘art’ and ‘writing’ (i.e., creativity) rather than with ‘dream’, ‘universe’ and ‘soul’ (i.e., imagination). Conscientiousness is negatively correlated with words about death (e.g. ‘bury’, ‘coffin’, ‘kill’), negative emotions and sadness, suggesting that conscientious people tend to talk less about unhappy subjects; it is positively correlated with the use of ‘you’, indicating that the same people tend to talk about others or to them. Agreeable people also tend to use ‘you’ a lot, but are less likely to talk about achievements and money. The list of features goes on…
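Word cues like these are often turned into features by counting how much of a user's vocabulary falls into each category. The sketch below assumes a hand-made `LEXICON` built only from the example words mentioned above; the category labels and the `category_rates` helper are hypothetical names for illustration, and multi-word cues such as ‘love you’ are left out for simplicity.

```python
import re
from collections import Counter

# Tiny category lexicons built from the example words in the text.
# The category names are hypothetical labels chosen for this sketch.
LEXICON = {
    "social": {"party", "boys", "ladies"},
    "solitary": {"computer", "internet", "reading"},
    "anxiety": {"worry"},
    "death": {"bury", "coffin", "kill"},
    "creative": {"music", "art", "writing"},
    "second_person": {"you"},
}

def category_rates(posts):
    """Return the share of a user's words that falls in each category."""
    words = [w for p in posts for w in re.findall(r"[a-z']+", p.lower())]
    counts = Counter()
    for w in words:
        for cat, vocab in LEXICON.items():
            if w in vocab:
                counts[cat] += 1
    total = max(len(words), 1)  # avoid division by zero for empty input
    return {cat: counts[cat] / total for cat in LEXICON}

rates = category_rates(["Party with the boys tonight", "love you all"])
```

The resulting rates can be appended to other features; on their own they are weak signals and only become useful when aggregated over many posts.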
So, if you have a well-annotated data set to train your model, it is possible to create an algorithm which determines a user’s gender, age group and personality trait scores. With models like a linear SVM or Gaussian Naive Bayes for age/gender classification and SVR for personality insights, you can reach an accuracy of around 70% fairly easily and increase it further with better preprocessing and feature tuning. A better understanding of the data leads to higher accuracy, but the biggest problems are the variety of dialects, grammar challenges and writing errors. Still, even a lower accuracy can give you valuable insight into your users’ personalities and improve your business.
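To make the classification step more concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes in plain Python. The two features, the toy numbers and the age-group labels are invented purely for illustration, and the helper names `fit_gaussian_nb` and `predict` are assumptions of this example. In practice you would more likely reach for a library implementation such as scikit-learn's `GaussianNB`.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a log prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Made-up toy features per user: [slang rate, average words per post].
X = [[0.30, 8.0], [0.25, 10.0], [0.28, 9.0],
     [0.02, 25.0], [0.05, 22.0], [0.03, 27.0]]
y = ["younger", "younger", "younger", "older", "older", "older"]

model = fit_gaussian_nb(X, y)
prediction = predict(model, [0.27, 9.5])
```

A toy set like this says nothing about real-world accuracy; it only shows the mechanics of fitting per-class Gaussians and picking the most probable class.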
- Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., et al. (2013): Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 8(9): e73791, doi:10.1371/journal.pone.0073791.