Detecting Languages in London’s Twittersphere

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I’ve been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I’ve been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I’ve employed the Chromium Compact Language Detector – a open-source Python library adapted from the Google Chrome algorithm to detect a website’s language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer.

James has mapped up the data – shown below, or in zoomable form here – and he more fully describes some of the interesting trends that may be observed over on his blog.

Detecting Languages in London's Twittersphere

With respect to the detection process, the CLD tool appears to work pretty well.  In total, 66 languages were detected among the complete dataset (including a bit of Basque, Haitian Creole and Swahili, surprisingly enough), and on the whole these classifications appear to be correct.  In cases where the tool is not completely confident in what is it reading – usually due to the brevity or colloquiality of a tweet – classification is marked as unknown or unreliable, and in these cases we end up losing around 1.4 million of additional tweets.

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’.  I don’t know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.

I won’t dwell too much on discussing the results, only that Twitter appears to reveal itself here to be the severely skewed dataset we all always really knew it was.  In total, 92.5% of tweets are detected as English, far above existing estimations (60%) of English speakers in London.  While languages you’d expect to score highly – such as Bengali and Somali – barely feature at all.  Either people only tweet in English, or usage of Twitter varies significantly among language groups in London.  There is a great deal you can say about bias within the Twitter dataset, but I think I’ll save that for another day.

For the time being, enjoy the map.


21 Replies to “Detecting Languages in London’s Twittersphere”

  1. Hi Joe – There were 91 Welsh tweets in the database, from 64 users. Not enough data to really identify those hidden Welsh communities in London!

  2. Mandarin ranked at 23rd most popular. Does go to show some of the biases inherent in the Twitter user base. Languages such as Gujarati, Urdu and Punjabi barely feature in the dataset.

  3. Very interesting, thanks Hywel. Does go to indicate the impact of short tweets on the probability of correct detection, perhaps something we should look at in future work. Please let me know if you carry out any other similar work.

  4. Preponderance of English language messages also due to people with different home languages using English for many or most of their messages, as a lingua franca — especially with a ‘hey everyone’ medium like Twitter. It’s the world’s main ‘second language’ generally, even more so when activities areundertaken in an anglophone location. Also to the fact that much of London’s ethnic minority (BME) population is second/third/fourth generation ie, grew up here immersed in English and making friends and contacts across ethnic groups. Yes, its interesting to see the the non-English languages tweeted here — but use of English is ‘unmarked’ (ie, system default, conveying no additional information) for ethnicity and culture.

  5. Judith, I think you are completely right, and this is something we have been discussing here. Part of this work is to examine the relationship between Twitter and real world interactions. The English tweets are intended to provide that default to the patterns observed here.

  6. I’m 26 years old and I was born in Somalia. I can speak Somali but I am illiterate in the Somali language. The reason why Somali doesn’t feature is probably because Somali youth in the UK (under 25s the demographic most likely to use twitter ) are mostly illiterate in Somali as well. All our written communication to each other is in English.

  7. Good job! Nice map :)How did you determine the location of the tweets? Was it solely based on location metadata off of the Twitter API?

  8. Hi Fathi, thanks for your comments, very interesting. This is only the start of this analysis, we hope to gain more insight into these types of patterns in the future. Dayah, thanks! Location data is taken from the metadata supplied by Twitter for each tweet. In this case, I took only those with a specified coordinate, so likely only those that are sent from mobile devices.

  9. Most South Asian languages have only a minimal user base in the native script. Many texts and tweets (and emails, and forum posts, and blogs and…) are in a romanized version of the language, often interspersed with English. Google products (Chrome, Translate, etc.) nearly always identify these as Malay or Filipino.

  10. Pingback: Online Logo

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.