There have been a number of recent cases of people tweeting under a false gender, and now AI comes to the rescue with a simple classifier that can tell if you are male or female from a single tweet of 140 characters or fewer.
Twitter seems to be becoming a standard source for data analysis and machine learning experiments. The latest shows that a classifier can learn to decide whether a tweet comes from a male or a female based only on the words used in the tweet. If you throw in some additional information like the user's name then the recognition accuracy goes up a lot, but what is interesting is that just the words used in a single tweet are enough to identify gender with a reasonably high accuracy (67%), and if you include all of a user's tweets the accuracy goes up to 75%. So although you might think that on the internet you can be a dog if you want to, you can't easily fool AI into thinking you are a female if you are not.
The first problem the researchers, from the Mitre Corporation, had to solve was working out the true gender of the Twitter users - after all, how else could they train or test their classifier? The difficulty is that Twitter doesn't insist that users provide their gender. The solution was to look at the URLs of any associated websites that the user had provided and see if they were any of the standard blog sites that do insist on a gender identification. You might think that this would result in a small sample but Twitter is so big that even applying this filter produced a sample of around 100,000 females and 83,000 males - and yes, it is estimated that there are more females than males using Twitter and in about this ratio.
The study didn't discriminate between the languages used for the tweets, and restricting the classifier to a single language would most likely improve the classification accuracy. The features used for classification were word and character n-grams (i.e. sequences of n consecutive words or characters) from various fields in a tweet. As well as using the tweet text, classifiers were also trained using screen name, full name and description. The resulting recognition accuracies were:
- One tweet 67%
- All tweets 75%
- All data 92%
The most informative field was the full name which provided an 89% accurate classifier when used on its own.
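To make the idea of word and character n-gram features concrete, here is a minimal Python sketch. It is not the researchers' code: the function names, the `field|type|gram` feature format, the lowercasing step and the maximum n-gram lengths are all assumptions chosen for illustration.

```python
def char_ngrams(text, n_max=3):
    """Return every run of 1..n_max consecutive characters in text."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_max=2):
    """Return every run of 1..n_max consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def extract_features(fields, char_max=3, word_max=2):
    """Turn a dict of field name -> text into a set of binary features.

    Each feature is tagged with the field it came from, so "hai" seen in
    the tweet text is a different feature from "hai" seen in a name.
    The tagging scheme here is hypothetical, not taken from the study.
    """
    features = set()
    for field, text in fields.items():
        text = text.lower()  # assumption: case is folded
        features.update(f"{field}|c|{g}" for g in char_ngrams(text, char_max))
        features.update(f"{field}|w|{g}" for g in word_ngrams(text, word_max))
    return features


# Toy usage with an invented tweet and name.
example = extract_features({"text": "Love my hair", "name": "Sam"})
```

Even a short tweet produces dozens of features this way, which is one reason a sparse, mistake-driven learner is a natural fit for the task.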
The classifier used was a balanced winnow, which is a modification of the classic perceptron algorithm that uses multiplicative weight updates. This was tested against other classifiers - Naive Bayes and Support Vector Machines - and found to be better. Training was also a processing challenge, with 3 million tweets available for analysis. Included in the experiment was another resource growing in popularity - Amazon's Mechanical Turk. Workers were asked to perform the same classification task and most proved to be very poor at it.
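For readers who haven't met it, balanced winnow keeps two positive weights per feature and multiplies them up or down whenever it makes a mistake. The following is a minimal sketch over sets of binary features, not the implementation used in the study: the class name, the promotion and demotion factors, the zero threshold and the toy examples are all assumptions.

```python
class BalancedWinnow:
    """Minimal balanced winnow for binary features and labels +1 / -1."""

    def __init__(self, promote=1.5, demote=0.5):
        self.promote = promote  # assumed factor > 1
        self.demote = demote    # assumed factor between 0 and 1
        self.pos = {}           # "positive" weights, default 1.0
        self.neg = {}           # "negative" weights, default 1.0

    def score(self, features):
        # Effective weight of a feature is pos - neg.
        return sum(self.pos.get(f, 1.0) - self.neg.get(f, 1.0)
                   for f in features)

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        # Mistake-driven: weights change only on a wrong prediction.
        if self.predict(features) == label:
            return
        up, down = (self.pos, self.neg) if label == 1 else (self.neg, self.pos)
        for f in features:
            up[f] = up.get(f, 1.0) * self.promote   # multiplicative update
            down[f] = down.get(f, 1.0) * self.demote

    def fit(self, examples, epochs=5):
        for _ in range(epochs):
            for features, label in examples:
                self.update(features, label)


# Invented toy data: +1 and -1 stand for the two classes.
toy = [({"love", "hair"}, 1), ({"http", "googl"}, -1),
       ({"love", "my"}, 1), ({"http", "so"}, -1)]
model = BalancedWinnow()
model.fit(toy)
```

The difference from the perceptron is in `update`: weights are multiplied rather than incremented, which tends to suit problems with a very large number of features of which only a few matter.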
Now we come to the punch line. If you look at the word and letter features that best discriminated between male and female, you will find that fragments such as "love" and "hair" are strong female indicators while fragments such as "http" and "Googl" are strong male indicators. Less obviously, "my", "so" and "hank" are strong female indicators, and females seem to use more emoticons and exclamation marks than males!
Discriminating Gender On Twitter