Analysing Languages in the New York Twittersphere

Following the interest in our Twitter language map of London a few months back, James Cheshire and I have been working on expanding our horizons a bit.  This time we have teamed up with John Barratt at Trendsmap, and our new map looks at the Twitter languages of New York, New York!  It covers 8.5 million tweets, captured between January 2010 and February 2013.

Without further ado, here is the map. You can also find a fully zoomable, interactive version, courtesy of the technical wizardry of Ollie O’Brien.


James has blogged over on Spatial Analysis about the map creation process and highlighted some of the predominant trends observed on the map.  What I thought I’d do is take a deeper look at the underlying language trends, to see if slightly different visualisation techniques provide us with any alternative insight, and then say a little about the data handling process.

Spatial Patterns of Language Density

Further to the map, I’ve had a more in-depth look at how tweet density and multilingualism vary spatially across New York.  Breaking New York down into points every 50 metres, I wrote a simple script (using Java and Geotools) that analysed tweet patterns within a 100 metre radius of each point.  These point summaries are then converted into a raster image – a collection of grid squares – to provide an alternative representation of spatial variation in tweeting behaviour.
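For anyone wanting to try something similar, here is a minimal sketch of that point summary in Python (not the Java and Geotools script actually used).  It assumes tweet locations have already been projected into a metre-based coordinate system, and the function name and parameters are purely illustrative.

```python
import math
from collections import defaultdict


def tweet_density_grid(tweets, x_min, y_min, x_max, y_max,
                       spacing=50.0, radius=100.0):
    """Count tweets within `radius` metres of each grid point.

    `tweets` is an iterable of (x, y) pairs in projected metres
    (an assumption; raw longitude/latitude would need projecting first).
    Returns a list of rows, each a list of counts, from y_min upwards.
    """
    # Bucket tweets into radius-sized bins so that each grid point only
    # has to inspect its own bin and the eight neighbouring bins.
    bins = defaultdict(list)
    for x, y in tweets:
        bins[(math.floor(x / radius), math.floor(y / radius))].append((x, y))

    n_cols = int((x_max - x_min) // spacing) + 1
    n_rows = int((y_max - y_min) // spacing) + 1
    grid = [[0] * n_cols for _ in range(n_rows)]

    for row in range(n_rows):
        gy = y_min + row * spacing
        for col in range(n_cols):
            gx = x_min + col * spacing
            bx, by = math.floor(gx / radius), math.floor(gy / radius)
            count = 0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for tx, ty in bins.get((bx + dx, by + dy), ()):
                        if (tx - gx) ** 2 + (ty - gy) ** 2 <= radius ** 2:
                            count += 1
            grid[row][col] = count
    return grid
```

Each row of the returned grid can then be written out as one row of raster cells.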

Looking at pure tweet density to begin with, we take all languages into consideration.  From this map it is immediately clear how Manhattan dominates as the centre for Twitter activity in New York.  Yet we can also see how tweeting is far from constrained to this area, spreading out to areas of Brooklyn, Jersey City and Newark.  By contrast, little Twitter activity is found in areas like Staten Island and Yonkers.

Density of tweets per grid square (Coastline courtesy of ORNL)

By the same token, we can look at how multilingualism varies across New York, by identifying the number of languages within each grid square.  And we actually get a slightly different pattern.  Manhattan dominates again, but with a particularly high concentration of multilingualism around the Theatre District and Times Square – predominantly tourists, one presumes.  Other areas where tweet density is otherwise high – such as Newark, Jersey City and the Bronx – see a big drop-off when it comes to the pure number of languages being spoken.

Number of languages per 50m grid square (Coastline courtesy of ORNL)

Finally, taking this a little bit further, we can look at how multilingualism varies with respect to English language tweets.  Mapping the percentage of non-English tweets per grid square, we begin to get a sense of the areas of New York less dominated by the English language, and remove the influence of tweet density alone.  The most prominent locations, according to this measure, are now shown to be South Brooklyn, Coney Island, Jackson Heights and (less surprisingly) Liberty Island.  It is also interesting to see how Manhattan pretty much drops off the map here – it seems there are lots of tweets sent from Manhattan, but by far the majority are sent in English.
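The two language measures used above (number of languages, and percentage of non-English tweets) can be sketched along the following lines.  This is a simplified illustration that bins tweets directly into 50 metre squares rather than using a radius around each point, and it assumes each tweet already carries a detected language code, with "en" standing for English.

```python
from collections import defaultdict


def language_stats_per_cell(tweets, cell_size=50.0):
    """Summarise languages per grid square.

    `tweets` is an iterable of (x, y, lang) tuples, with x and y assumed
    to be in projected metres and `lang` a language code such as "en".
    Returns {(col, row): (n_languages, pct_non_english)}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for x, y, lang in tweets:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][lang] += 1

    stats = {}
    for cell, by_lang in counts.items():
        total = sum(by_lang.values())
        non_english = total - by_lang.get("en", 0)
        stats[cell] = (len(by_lang), 100.0 * non_english / total)
    return stats
```

Because the percentage is normalised by the number of tweets in each square, a busy but overwhelmingly English-speaking square scores low, which is the effect seen for Manhattan.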

Percentage of non-English tweets per 50m grid square (Coastline courtesy of ORNL)

Top Languages

So, having viewed the maps, you might now be thinking, ‘Where’s my [insert your language here]?’  Well, check out this list, the complete set of languages ranked by count.  If your language still isn’t there then maybe you should go to New York and tweet something.

As you will see from the list, in common with London, English really dominates in the New York Twittersphere, making up almost 95% of all tweets sent.  Spanish fares well in comparison to other languages, but still only makes up 2.7% of the entire dataset.  Clearly, you wouldn’t expect the Twitter dataset to represent anything close to real-world interactions, but it would be interesting to hear from any New Yorkers (or linguists) about their interpretation of the rankings and volumes of tweets in each language.
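A ranking like this is straightforward to produce once every tweet has a language label.  The snippet below is an illustrative sketch only; the function name is made up, and the input is assumed to be one language code per tweet.

```python
from collections import Counter


def rank_languages(langs):
    """Rank languages by tweet count, with each one's share of the total.

    `langs` is an iterable of language codes, one per tweet (an assumption
    about how the detection output is stored).
    Returns a list of (lang, count, percentage), most common first.
    """
    counts = Counter(langs)
    total = sum(counts.values())
    return [(lang, n, 100.0 * n / total) for lang, n in counts.most_common()]
```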

Language Processing

Finally, a small word on the data processing front.  Keen readers will be aware that in the course of conducting the last Twitter language analysis, we experienced a pesky problem with Tagalog.  Not that I have a problem with the language per se, but I refused to believe that it was the third most popular language in London.  The issue was to do with a quirk of the Google Compact Language Detector, and specifically its treatment of ‘hahaha’s and ‘lolololol’s and the like.  For this new analysis – working with John Barratt and the wealth of data afforded to us by Trendsmap – we’ve increased the reliability of the detection, removing tweets shorter than 40 characters, @ replies and anything Trendsmap has already identified as spam.  So long, Tagalog.
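As a rough sketch, the filtering rules amount to something like the following.  Two assumptions are baked in: an @ reply is taken to be any tweet whose text starts with "@", and the spam decision is passed in as a flag, since how Trendsmap identifies spam is not something I can reproduce here.

```python
def keep_for_detection(text, is_spam=False, min_length=40):
    """Decide whether a tweet should be passed to the language detector.

    Drops tweets flagged as spam, @ replies (assumed here to be tweets
    whose text starts with "@") and tweets shorter than `min_length`
    characters. `is_spam` is a hypothetical flag supplied by the caller.
    """
    text = text.strip()
    if is_spam:
        return False
    if text.startswith("@"):
        return False
    if len(text) < min_length:
        return False
    return True
```

Short bursts of laughter such as ‘hahahahaha’ fall well under the length threshold, so they never reach the detector.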


Detecting Languages in London’s Twittersphere

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I’ve been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I’ve been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I’ve employed the Chromium Compact Language Detector – an open-source Python library adapted from the Google Chrome algorithm to detect a website’s language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer.

James has mapped up the data – shown below, or in zoomable form here – and he more fully describes some of the interesting trends that may be observed over on his blog.

With respect to the detection process, the CLD tool appears to work pretty well.  In total, 66 languages were detected among the complete dataset (including a bit of Basque, Haitian Creole and Swahili, surprisingly enough), and on the whole these classifications appear to be correct.  In cases where the tool is not completely confident in what it is reading – usually due to the brevity or colloquiality of a tweet – classification is marked as unknown or unreliable, and in these cases we end up losing around 1.4 million tweets.

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications consisted of nothing more than English expressions such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’.  I don’t know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.

I won’t dwell too much on discussing the results, only that Twitter appears to reveal itself here to be the severely skewed dataset we all always really knew it was.  In total, 92.5% of tweets are detected as English, far above existing estimates (60%) of English speakers in London, while languages you’d expect to score highly – such as Bengali and Somali – barely feature at all.  Either people only tweet in English, or usage of Twitter varies significantly among language groups in London.  There is a great deal you can say about bias within the Twitter dataset, but I think I’ll save that for another day.

For the time being, enjoy the map.


The Diamond Jubilee in London: A Tweet Location Analysis

I’ve been collecting Twitter data for a little while now, and have managed to identify some interesting (if slightly frivolous) trends.  But, when considering the wider applications of such a dataset, one question that has continued to bug me is – Why do we tweet when we tweet?

I won’t attempt to answer that question here (yet), but one clear reason is when we want to communicate our involvement in an event or activity.  You can see it quite clearly in the data – gigs at the O2, football matches at the Emirates – all of these events show up as clusters of tweet points.  So, with the Diamond Jubilee celebrations occurring in London last weekend, I thought this would be a nice opportunity to demonstrate how these crowd patterns form and disappear over space and time.  The images below – I hope you will agree – are quite pretty, but I think the analysis presents some more interesting implications with regard to the use of this type of dataset and the nature of visualisation, aspects I’ll address at the end.


Tweeting the Diamond Jubilee

What I’ve done here is look at all tweets mentioning ‘Jubilee’ occurring in London on the 3rd and 4th June 2012.  As you good patriots will recall, these were the dates of the Thames flotilla and Jubilee concert outside Buckingham Palace.  For you more technically-minded people, I’ve taken the tweet point locations and applied a Kernel Density Estimation to them, to provide a sense of where the highest densities of tweets were occurring on each day.

The colour scheme – in the colours of the flag, of course – shows the shift from high density areas of Jubilee-related tweets (in red) to areas where not many such tweets are detected (in blue).
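For those curious about what a Kernel Density Estimation actually does, here is a bare-bones Python sketch.  It is not the tool used to make these images; the Gaussian kernel, the bandwidth value and the function name are all assumptions made for illustration, and coordinates are taken to be in projected metres.

```python
import math


def kernel_density(points, grid_xs, grid_ys, bandwidth):
    """Gaussian kernel density estimate evaluated on a regular grid.

    `points` is a list of (x, y) tweet locations, `grid_xs` and `grid_ys`
    are the coordinates at which to evaluate the surface, and `bandwidth`
    controls how far each tweet's influence spreads (all in the same
    units, assumed to be metres).
    Returns a list of rows (one per grid_y), each a list of densities.
    """
    # Normalising constant for a 2D Gaussian kernel averaged over n points.
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)
    surface = []
    for gy in grid_ys:
        row = []
        for gx in grid_xs:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface
```

Each tweet contributes a smooth bump centred on its location, and the bumps add up where tweets cluster.  A larger bandwidth gives a smoother, more generalised surface, while a smaller one picks out tighter hotspots.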

Flotilla Day

On the day of the flotilla, you can clearly see a strong distribution of tweeting monarchists along the course of the flotilla on the River Thames.  It can be noted that this distribution is not spatially uniform, however, indicating perhaps the locations of the best, or most popular, viewing areas.  You can see other clusters around London too, which may indicate where other gatherings were taking place.

We can also look at this data in 3D, allowing us to better explore where the absolute highest densities of tweets were occurring within those big clusters of red…

Interestingly, this map helps to better draw out where the exact hotspots lie, revealing that the highest densities are at each of the bridges along the route, with Vauxhall and London Bridges seeing the greatest activity.

Concert Day

The day of the concert – taking place on the evening of the 4th June – indicates clearly a completely different pattern of behaviour.

Here the biggest activity is along the Mall and towards the Jubilee concert outside Buckingham Palace.  One can also identify big clusters of tweets in Hyde Park and around Soho, again with lots of other clusters dotted around the city.  Overall, there appears to be a lesser concentration of tweets than seen on the day of the flotilla, which appears consistent with what was reported in the press.

Again, consulting the 3D representation of the data shows us more exactly where the largest clusters of tweets are located…

This image again demonstrates the importance of an alternative perspective.  In this case, we can see that the most important cluster is found along the Mall at the concert itself, with the other activity highlighted in the 2D perspective seemingly of much lesser significance.


What does all this actually mean?

OK, OK, so you may be thinking at this point ‘Yes, very nice pictures and everything, but isn’t this all fairly obvious?’  Well, in some ways yes: we know from the television pictures that there were a lot of people along the Thames on the 3rd June watching the flotilla.  What we have a lesser grasp on is the exact volume and spatial distribution of these people, and how they moved throughout the day.

My feeling is that, although biased in many respects, this dataset provides us with a unique opportunity to measure the spatial distribution of crowds at events. It may well only be a proxy for activity, but rather than relying on a few, subjective viewpoints, we are able to get a better overall indication of the true patterns of crowds in space and time.  Such analysis may also help us to identify emerging, organic events, outside of our current viewpoint, that require our attention.

In regard to these images in particular, I hope that the Kernel Density approach has been of interest to some of you of a less geographic mindset.  They do quite effectively highlight the locations of tweet hotspots.  The differences between the 2D and 3D images do demonstrate, however, how the visualisation of data can become misleading.  What appear to be large events in one representation are much less significant when viewed from an alternative perspective.  This is a facet of data visualisation that we all should be conscious of.

As ever, your thoughts on anything I’ve presented here are very welcome.


Edit (11-06-12)

You can now find video animations of the 3D results here and here.


Mapped: London’s ‘Rudest’ Boroughs

A couple of weeks ago, I put up a post detailing how swearing on Twitter increases during the course of the average day.  It seemed people get more angry and sweary outside of work time, rather than during.

To delve a little deeper into this topic, I’ve now had a look at where Twitter gets angry.  For each of London’s 33 boroughs I have carried out the same analysis – this time for a month’s worth of tweets – looking at the percentage of tweets containing swear words in each borough.  The results follow some interesting trends…
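The calculation behind the map is simple enough to sketch in a few lines of Python.  This is an illustration rather than the script I used: it assumes each tweet has already been assigned to a borough, and the word list is supplied by the caller (mild placeholders are used in any example here).

```python
import re
from collections import defaultdict


def swear_rate_by_borough(tweets, swear_words):
    """Percentage of tweets in each borough containing a listed word.

    `tweets` is an iterable of (borough, text) pairs; assigning a tweet
    to a borough is assumed to have been done already. `swear_words` is
    a non-empty list of whole words to look for, matched
    case-insensitively.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in swear_words) + r")\b",
        re.IGNORECASE,
    )
    totals = defaultdict(int)
    sweary = defaultdict(int)
    for borough, text in tweets:
        totals[borough] += 1
        if pattern.search(text):
            sweary[borough] += 1
    return {b: 100.0 * sweary[b] / totals[b] for b in totals}
```

Using a percentage rather than a raw count matters here, since otherwise the boroughs with the most tweets would simply look the rudest.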

At least in the Twittersphere, inner London appears to be a veritable paradise of civility relative to the bile-filled tweet streams emanating from outer London.  The biggest offenders appear to be located to the east of the city, with east London faring considerably worse.  Yet the leafy boroughs of Barnet, Sutton and Bromley perform badly too.

Right, so let’s first look at what doesn’t seem to be going on here.  First off, the idea that people mostly swear from the comfort of their own sofa does not seem to hold true.  There does not seem to be a very strong relationship between swearing density and residential locations.  If there were then you’d see higher scores in the likes of Haringey, Richmond, Hammersmith and Fulham and Newham.  Nor does swearing follow any sort of deprivation index: again, Haringey is relatively poor compared to the likes of Sutton, Bromley and Barnet, which fare much worse.

So what is going on?

In my opinion, what we are seeing is a reflection of demographic and cultural trends across these boroughs.  Taking demography in the first instance, according to the 2009 figures on nationality demographics at the borough level, those London boroughs with the highest percentages of British-born citizens are Havering, Bexley, Bromley and Sutton, respectively*.  It would make sense that the higher the percentage of British-born citizens in an area – on average those probably more likely to use an English swear word in a tweet – the greater the number of swearing tweets there are likely to be.  True, but I don’t think this tells the whole story.

Looking beyond these four boroughs, Kingston and Richmond also report high percentages of British citizens living within their boundaries – yet we don’t see similar volumes of sweary tweets coming from these boroughs.  How can this be so?  Make of this what you will, but beyond the demographic variation, the data appears to highlight a cultural variation across London in attitudes towards swearing in tweets.  Simply put, the data seems to suggest that the good residents of eastern and southern boroughs of outer London are generally more inclined to throw a swear word into a tweet than their counterparts over to the western side of London.

As I say, this is just my theory – there is a whole lot more you could do to this data to gain a better understanding of the trends observed here (unfortunately I don’t have the time to do so!).  I’d be very interested to hear any alternative ideas about what might be going on though.

Overall, I hope these analyses begin to give you an insight into the extent to which Twitter data (and other data sources like it) can be used to reveal and explain social, spatial and temporal trends.

* Newham, Westminster and Kensington and Chelsea score highest for non-British born residents


When does Twitter get angry?

I’ve been spending a bit of time with Twitter data of late – perhaps not a healthy activity – but it is amazing what a rich data source of social and spatial behaviour it is.

Someone asked me today whether it was possible to identify when and where Twitter gets angry.  Well, here is my answer to the first part – the when.

The graph below shows the variation, across the day, in the prevalence of swearing in the ‘Twittersphere’.  The data used represents tweets during two weeks in March 2012 covering London only – so maybe this is just when London gets angry…

In the graph we have the percentage of all tweets containing ALL types of swearing in blue, in red we have the prevalence of the f-word (by far the most common swear word), and finally the percentage of tweets containing the s-word is shown in green.  Time is along the bottom.
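For the curious, the hourly figures can be produced along these lines.  This is a hedged sketch, not the original script: it assumes each tweet comes with a timestamp in local time, uses a very simple whitespace tokeniser, and takes the word list from the caller (placeholders are used in the example).

```python
import string
from datetime import datetime


def swear_percentage_by_hour(tweets, swear_words):
    """Percentage of tweets in each hour of the day containing a listed word.

    `tweets` is an iterable of (timestamp, text) pairs, where `timestamp`
    is a datetime assumed to be in local time. Hours with no tweets are
    left out of the result.
    """
    words = {w.lower() for w in swear_words}
    totals = [0] * 24
    hits = [0] * 24
    for timestamp, text in tweets:
        totals[timestamp.hour] += 1
        # Split on whitespace and trim surrounding punctuation from each token.
        tokens = {t.strip(string.punctuation).lower() for t in text.split()}
        if tokens & words:
            hits[timestamp.hour] += 1
    return {h: 100.0 * hits[h] / totals[h] for h in range(24) if totals[h]}


# Illustrative data only, with mild placeholder words.
sample = [
    (datetime(2012, 3, 5, 9, 5), "Darn, the train is late"),
    (datetime(2012, 3, 5, 9, 40), "Morning all"),
    (datetime(2012, 3, 5, 22, 10), "What a blast of a film"),
    (datetime(2012, 3, 5, 22, 30), "darn darn darn"),
]
hourly = swear_percentage_by_hour(sample, ["darn", "blast"])
```

Calling the function once with the full word list and once with each single word gives separate series like the three lines in the graph.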

Putting the slightly frivolous nature of this work aside for a second, the data does demonstrate some interesting trends.  There is a clear upward trend in ‘anger’ as the day goes on, reaching a peak at around 10pm.  But why is this?  Why do we swear more in the evening, when we should be relaxed and enjoying our precious free time?  Are we (we being Twitter users only, of course) swearing at the TV?  Arguing with our friends over Twitter?  Or are enough of us getting drunk and losing our inhibitions?

We also see a smaller peak at around 5pm – now this is more easily explained.  The ‘thank f**k work is over’ tweet one might surmise.  An even smaller peak at around 9am suggests the opposite effect.

But I think this simple analysis gives us some insight into the way we use social media throughout the day.  During the day we think about work.  We tweet and communicate about work.  Yet in the evening, Twitter becomes a different place.  We let our guard down, and once we’re outside of the constraints of work, perhaps we begin to use Twitter in a different way.  Places like Twitter allow us the space to exclaim and let off our true feelings, whatever they may be, that might otherwise be constrained in other environments.

Twitter gets a lot of stick for its high volume of frivolous content – probably with good reason – but at a higher level some subtle but interesting social trends can start to be observed.