October 23, 2012April 22, 2013 by edmanley

Detecting Languages in London’s Twittersphere

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I’ve been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I’ve been handling the data generation side, and the method really is quite simple. Just like some similar work carried out by Eric Fischer, I’ve employed the Chromium Compact Language Detector – a open-source Python library adapted from the Google Chrome algorithm to detect a website’s language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer.

James has mapped up the data – shown below, or in zoomable form here – and he more fully describes some of the interesting trends that may be observed over on his blog.

With respect to the detection process, the CLD tool appears to work pretty well. In total, 66 languages were detected among the complete dataset (including a bit of Basque, Haitian Creole and Swahili, surprisingly enough), and on the whole these classifications appear to be correct. In cases where the tool is not completely confident in what is it reading – usually due to the brevity or colloquiality of a tweet – classification is marked as unknown or unreliable, and in these cases we end up losing around 1.4 million of additional tweets.

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language. On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’. I don’t know much about Tagalog but it sounds like a fun language. Nevertheless, Tagalog was excluded from our analysis.

I won’t dwell too much on discussing the results, only that Twitter appears to reveal itself here to be the severely skewed dataset we all always really knew it was. In total, 92.5% of tweets are detected as English, far above existing estimations (60%) of English speakers in London. While languages you’d expect to score highly – such as Bengali and Somali – barely feature at all. Either people only tweet in English, or usage of Twitter varies significantly among language groups in London. There is a great deal you can say about bias within the Twitter dataset, but I think I’ll save that for another day.

For the time being, enjoy the map.

21 Replies to “Detecting Languages in London’s Twittersphere”

Joe says:

November 30, -0001 at 00:00

Really interesting stuff. Any way I can find out what the data is for Welsh, Cymraeg?

Reply
Ed Manley says:

November 30, -0001 at 00:00

Hi Joe – There were 91 Welsh tweets in the database, from 64 users. Not enough data to really identify those hidden Welsh communities in London!

Reply
Brian says:

November 30, -0001 at 00:00

Just curious – what about Mandarin Chinese?

Reply
Ed Manley says:

November 30, -0001 at 00:00

Mandarin ranked at 23rd most popular. Does go to show some of the biases inherent in the Twitter user base. Languages such as Gujarati, Urdu and Punjabi barely feature in the dataset.

Reply
Hywel Jones says:

November 30, -0001 at 00:00

You might be interested to see my comments comparing the Chrome Compact Language Detector with langid.py, in classifying Welsh language tweets: my contributions start here (I hope) http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html?showComment=1331060013540#c4023371368987708538@hywelm

Reply
Ed Manley says:

November 30, -0001 at 00:00

Very interesting, thanks Hywel. Does go to indicate the impact of short tweets on the probability of correct detection, perhaps something we should look at in future work. Please let me know if you carry out any other similar work.

Reply
Piers Storey says:

November 30, -0001 at 00:00

Fascinating – I projected your map in my class today. Where did Vietnamese come?

Reply
Judith Hanna says:

November 30, -0001 at 00:00

Preponderance of English language messages also due to people with different home languages using English for many or most of their messages, as a lingua franca — especially with a ‘hey everyone’ medium like Twitter. It’s the world’s main ‘second language’ generally, even more so when activities areundertaken in an anglophone location. Also to the fact that much of London’s ethnic minority (BME) population is second/third/fourth generation ie, grew up here immersed in English and making friends and contacts across ethnic groups. Yes, its interesting to see the the non-English languages tweeted here — but use of English is ‘unmarked’ (ie, system default, conveying no additional information) for ethnicity and culture.

Reply
Ed Manley says:

November 30, -0001 at 00:00

Hi Piers, that is great to hear! I’ve put the full list of languages up at http://goo.gl/N04Zd.

Reply
Ed Manley says:

November 30, -0001 at 00:00

Judith, I think you are completely right, and this is something we have been discussing here. Part of this work is to examine the relationship between Twitter and real world interactions. The English tweets are intended to provide that default to the patterns observed here.

Reply
Fathi Mohamoud says:

November 30, -0001 at 00:00

I’m 26 years old and I was born in Somalia. I can speak Somali but I am illiterate in the Somali language. The reason why Somali doesn’t feature is probably because Somali youth in the UK (under 25s the demographic most likely to use twitter ) are mostly illiterate in Somali as well. All our written communication to each other is in English.

Reply
Dayah says:

November 30, -0001 at 00:00

Good job! Nice map :)How did you determine the location of the tweets? Was it solely based on location metadata off of the Twitter API?

Reply
Ed Manley says:

November 30, -0001 at 00:00

Hi Fathi, thanks for your comments, very interesting. This is only the start of this analysis, we hope to gain more insight into these types of patterns in the future. Dayah, thanks! Location data is taken from the metadata supplied by Twitter for each tweet. In this case, I took only those with a specified coordinate, so likely only those that are sent from mobile devices.

Reply
Alex says:

November 30, -0001 at 00:00

Most South Asian languages have only a minimal user base in the native script. Many texts and tweets (and emails, and forum posts, and blogs and…) are in a romanized version of the language, often interspersed with English. Google products (Chrome, Translate, etc.) nearly always identify these as Malay or Filipino.

Reply
Dr James Goulding says:

November 30, -0001 at 00:00

Awesome Work. Both interesting and well presented, as well as potentially being provocative. Keep it up =)

Reply
Researcher says:

November 30, -0001 at 00:00

Very impressive – it would be interesting to see how this changes with current events and other seasonal factors.

Reply
lindajoe says:

November 30, -0001 at 00:00

I impressed very much when i’m seeing your twitter map!! Love it very much!! Great job you people done here!!<a href="http://www.socialbuilders.net/">buy twitter followers</a>

Reply
Mike says:

November 30, -0001 at 00:00

Very cool :)I’d be very interested to know if you saw some changes during the Olympics?Thanks! Mike

Reply
Antonio Penta says:

November 30, -0001 at 00:00

HI, Are you going to release the data for research purpose? thanks Antonio

Reply
Pingback: Online Logo
Pingback: therapeutic massage

21 Replies to “Detecting Languages in London’s Twittersphere”

Leave a Reply Cancel reply