Detecting Languages in London’s Twittersphere

October 23rd, 2012 | Posted by edmanley in Cities | Language | Mapping and Cartography | Understanding Twitter

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I’ve been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I’ve been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I’ve employed the Chromium Compact Language Detector – a open-source Python library adapted from the Google Chrome algorithm to detect a website’s language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer.

James has mapped up the data – shown below, or in zoomable form here – and he more fully describes some of the interesting trends that may be observed over on his blog.

Detecting Languages in London's Twittersphere

With respect to the detection process, the CLD tool appears to work pretty well.  In total, 66 languages were detected among the complete dataset (including a bit of Basque, Haitian Creole and Swahili, surprisingly enough), and on the whole these classifications appear to be correct.  In cases where the tool is not completely confident in what is it reading – usually due to the brevity or colloquiality of a tweet – classification is marked as unknown or unreliable, and in these cases we end up losing around 1.4 million of additional tweets.

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’.  I don’t know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.

I won’t dwell too much on discussing the results, only that Twitter appears to reveal itself here to be the severely skewed dataset we all always really knew it was.  In total, 92.5% of tweets are detected as English, far above existing estimations (60%) of English speakers in London.  While languages you’d expect to score highly – such as Bengali and Somali – barely feature at all.  Either people only tweet in English, or usage of Twitter varies significantly among language groups in London.  There is a great deal you can say about bias within the Twitter dataset, but I think I’ll save that for another day.

For the time being, enjoy the map.

 

You can follow any responses to this entry through the RSS 2.0 You can leave a response, or trackback.

19 Responses

  • Joe says:

    Really interesting stuff. Any way I can find out what the data is for Welsh, Cymraeg?

  • Ed Manley says:

    Hi Joe – There were 91 Welsh tweets in the database, from 64 users. Not enough data to really identify those hidden Welsh communities in London!

  • Brian says:

    Just curious – what about Mandarin Chinese?

  • Ed Manley says:

    Mandarin ranked at 23rd most popular. Does go to show some of the biases inherent in the Twitter user base. Languages such as Gujarati, Urdu and Punjabi barely feature in the dataset.

  • Hywel Jones says:

    You might be interested to see my comments comparing the Chrome Compact Language Detector with langid.py, in classifying Welsh language tweets: my contributions start here (I hope) http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html?showComment=1331060013540#c4023371368987708538@hywelm

  • Ed Manley says:

    Very interesting, thanks Hywel. Does go to indicate the impact of short tweets on the probability of correct detection, perhaps something we should look at in future work. Please let me know if you carry out any other similar work.

  • Piers Storey says:

    Fascinating – I projected your map in my class today. Where did Vietnamese come?

  • Judith Hanna says:

    Preponderance of English language messages also due to people with different home languages using English for many or most of their messages, as a lingua franca — especially with a ‘hey everyone’ medium like Twitter. It’s the world’s main ‘second language’ generally, even more so when activities areundertaken in an anglophone location. Also to the fact that much of London’s ethnic minority (BME) population is second/third/fourth generation ie, grew up here immersed in English and making friends and contacts across ethnic groups. Yes, its interesting to see the the non-English languages tweeted here — but use of English is ‘unmarked’ (ie, system default, conveying no additional information) for ethnicity and culture.

  • Ed Manley says:

    Hi Piers, that is great to hear! I’ve put the full list of languages up at http://goo.gl/N04Zd.

  • Ed Manley says:

    Judith, I think you are completely right, and this is something we have been discussing here. Part of this work is to examine the relationship between Twitter and real world interactions. The English tweets are intended to provide that default to the patterns observed here.

  • Fathi Mohamoud says:

    I’m 26 years old and I was born in Somalia. I can speak Somali but I am illiterate in the Somali language. The reason why Somali doesn’t feature is probably because Somali youth in the UK (under 25s the demographic most likely to use twitter ) are mostly illiterate in Somali as well. All our written communication to each other is in English.

  • Dayah says:

    Good job! Nice map :)How did you determine the location of the tweets? Was it solely based on location metadata off of the Twitter API?

  • Ed Manley says:

    Hi Fathi, thanks for your comments, very interesting. This is only the start of this analysis, we hope to gain more insight into these types of patterns in the future. Dayah, thanks! Location data is taken from the metadata supplied by Twitter for each tweet. In this case, I took only those with a specified coordinate, so likely only those that are sent from mobile devices.

  • Alex says:

    Most South Asian languages have only a minimal user base in the native script. Many texts and tweets (and emails, and forum posts, and blogs and…) are in a romanized version of the language, often interspersed with English. Google products (Chrome, Translate, etc.) nearly always identify these as Malay or Filipino.

  • Awesome Work. Both interesting and well presented, as well as potentially being provocative. Keep it up =)

  • Researcher says:

    Very impressive – it would be interesting to see how this changes with current events and other seasonal factors.

  • lindajoe says:

    I impressed very much when i’m seeing your twitter map!! Love it very much!! Great job you people done here!!<a href="http://www.socialbuilders.net/">buy twitter followers</a>

  • Mike says:

    Very cool :)I’d be very interested to know if you saw some changes during the Olympics?Thanks! Mike

  • Antonio Penta says:

    HI, Are you going to release the data for research purpose? thanks Antonio



Leave a Reply