Identifying Communities in Traffic Flow

One recent bit of research I have been working on has been looking at the application of community detection algorithms to traffic flow in London.

The idea is that within the traffic system exist a number of sub-systems of highly interconnected roads.  To a certain extent, these sub-systems are engineered into the system.  Transport for London, for example, specifically manage and maintain 23 key routes into and around central London, known as ‘corridors’.  However, to what extent do further systems exist outside of these defined zones?

Community detection algorithms were developed to identify clusters within a network dataset.  These methods are most often applied to examples within the social network sphere, in the identification of cliques, where a cluster demonstrates high internal connectivity, with lower connectivity with the rest of the network.  My thinking behind this bit of work was that we might be able to identify similar characteristics in traffic flow, where we can observe high coupling within clusters of nodes.

The map below visualises the modules (distinguished by colour) identified through the application of community detection methods to a topological representation of the road network.  Node connectivity is established using a dataset of 1.5 million private hire cab routes through London.
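For anyone wanting to experiment with the general idea, here is a minimal sketch in Python.  It is not necessarily the method behind the map: it assumes each route is simply a list of junction IDs, weights each road segment by the number of routes that use it, and uses a basic label propagation scheme as a stand-in community detection algorithm.  The function names and the toy routes are hypothetical.

```python
from collections import Counter, defaultdict


def build_graph(routes):
    """Weighted undirected graph: weight = number of routes using each segment."""
    graph = defaultdict(Counter)
    for route in routes:
        for a, b in zip(route, route[1:]):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def label_propagation(graph, max_rounds=100):
    """Each node repeatedly adopts the label carrying most weight among its neighbours."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed order keeps the sketch deterministic
            votes = Counter()
            for neighbour, weight in graph[node].items():
                votes[labels[neighbour]] += weight
            # Tie-break: heaviest label, then the node's current label, then smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab != labels[node], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Toy data: two heavily used loops joined by a single, rarely used link.
routes = [["a", "b", "c", "a"]] * 3 + [["x", "y", "z", "x"]] * 3 + [["c", "x"]]
modules = label_propagation(build_graph(routes))
```

On the toy data the two loops end up with different labels, which is the same kind of separation the coloured modules represent.  A real analysis would more likely use a modularity-based method in a dedicated network library.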


The resulting visualisation, apart from being quite pretty (thank Gephi for that), reveals some interesting trends.  To a certain extent, a number of expected patterns in traffic flow are apparent, with some of the ‘corridors’ into central London, such as the M3, M4 and A2, clearly defined as distinct clusters.  Yet the image also shows how both the M25, the ring road around London, and the North Circular, usually considered as single entities, can be segmented into modules defined by their usage.

We also see further interesting patterns in central London, where certain regions – specifically Knightsbridge, Soho, Shoreditch, the City and Hyde Park – are clearly defined as distinct modules.  These would appear to be areas of high internal movement, and thus a clear product of cab usage patterns.

These results, while presented only in their initial stages, demonstrate how measures of network characteristics can help us to understand dynamic patterns of movement in the city.



Thanks to all for the interest in this work!

Just by way of follow up, the image below shows a zoom in on Central London, demonstrating more clearly some of the regions mentioned above.  I’ve annotated this version for people who may not be familiar with London.



Further to my last post and various requests, I’ve published the complete list of languages detected within the whole collection of geolocated tweets in London.

The list contains the full counts ranked for each language (excluding Tagalog), as well as the count of detections classed as ‘Unknown’ – probably due to the tweet being too short, or too colloquial, for the detector to work out what language is being written.

You can find that full list here.

Detecting Languages in London's Twittersphere

Over the last couple of weeks, and as a bit of a distraction from finishing off my PhD, I’ve been working with James Cheshire looking at the use of different languages within my aforementioned dataset of London tweets.

I’ve been handling the data generation side, and the method really is quite simple.  Just like some similar work carried out by Eric Fischer, I’ve employed the Chromium Compact Language Detector – an open-source Python library adapted from the Google Chrome algorithm to detect a website’s language – in detecting the predominant language contained within around 3.3 million geolocated tweets, captured in London over the course of this summer.

James has mapped up the data – shown below, or in zoomable form here – and he more fully describes some of the interesting trends that may be observed over on his blog.


With respect to the detection process, the CLD tool appears to work pretty well.  In total, 66 languages were detected among the complete dataset (including a bit of Basque, Haitian Creole and Swahili, surprisingly enough), and on the whole these classifications appear to be correct.  In cases where the tool is not completely confident in what it is reading – usually due to the brevity or colloquiality of a tweet – classification is marked as unknown or unreliable, and in these cases we end up losing around 1.4 million additional tweets.

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language.  On further investigation, I found that many of these classifications were just English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’.  I don’t know much about Tagalog but it sounds like a fun language.  Nevertheless, Tagalog was excluded from our analysis.
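The tallying step described above can be sketched roughly as follows.  Detection itself is not shown: the sketch assumes each tweet has already been given a (language, is_reliable) pair by the detector, and the function name, the pair format and the sample data are all hypothetical.

```python
from collections import Counter


def rank_languages(detections, excluded=("Tagalog",)):
    """Count reliable detections per language, most common first.

    `detections` is an iterable of (language, is_reliable) pairs.
    Returns (ranking, unknown_count).
    """
    counts = Counter()
    unknown = 0
    for language, is_reliable in detections:
        if not is_reliable or language == "Unknown":
            unknown += 1  # too short or too colloquial to classify
        elif language not in excluded:
            counts[language] += 1
    return counts.most_common(), unknown


# Made-up sample, purely to show the shape of the output.
sample = (
    [("English", True)] * 5
    + [("French", True)] * 2
    + [("Tagalog", True)] * 3
    + [("English", False)]
)
ranking, unknown = rank_languages(sample)
```

Excluded languages are dropped rather than folded into the unknown count, so the unknown total reflects only what the detector itself could not classify.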

I won’t dwell too much on discussing the results, only that Twitter appears to reveal itself here to be the severely skewed dataset we all always really knew it was.  In total, 92.5% of tweets are detected as English, far above existing estimations (60%) of English speakers in London.  Meanwhile, languages you’d expect to score highly – such as Bengali and Somali – barely feature at all.  Either people only tweet in English, or usage of Twitter varies significantly among language groups in London.  There is a great deal you can say about bias within the Twitter dataset, but I think I’ll save that for another day.

For the time being, enjoy the map.


London 2012: Using Fear to Tame Transportation Demand

One of the biggest advantages, I feel, about studying urban transport phenomena in London is the simple ability to be able to look out of the window and see what is actually going on.  This week, the Olympics, and its (supposed) transportation chaos, came to London.

What has struck me early on, mainly since the introduction of the Games Lanes last week, is a big reduction in the number of vehicles on the road.  There have been reports of certain inevitable problems in various parts of the capital, but my experience has been a general reduction in demand on most roads (see a couple of photos I took below).  This sentiment has been shared by a number of my colleagues.  There has been no word yet from Transport for London as to whether the data is backing this up.


Second, the big public transport problems predicted at certain stations and at certain times have not yet come to fruition.  Warnings were issued widely this morning about potential overcrowding at a number of stations, yet early reports suggest that this is far from the reality – the Guardian highlight a number of citizen reports of empty Tube seats and quiet stations this morning.


Typical fear-inducing GetAheadOfTheGames literature (copyright Transport for London 2012)

It appears that the strategy has worked.  In fact, one might even suggest that it has worked better than expected.  I would say that this is partly down to the impact of irrationality, specifically the impact of fear.  Individuals, scared of potentially having to wait considerable amounts of time at stations only to cram into packed Tube trains, or fearful of long queues on the roads, have changed their habitual plans en masse.

Social Phenomena

The effect has served to demonstrate, at least to me, the impact that small changes in the behaviour of many individuals can have on the nature of the city.  As individuals, we make a choice, we carry out that action, and we are mostly unaware of the impact that decision has on shaping broader phenomena.  Yet, in observing the patterns these many individuals make, we can begin to see how individual and social attitudes impact on shaping transportation flows.

This relationship, specifically the impact that fear has had in the context of the Olympics, appears to have caught some analysts on the hop.  INRIX, a big transport data provider, predicted earlier in the year a ‘perfect traffic storm’ during the first few days of the Games (reported in more detail here).  This patently failed to happen.  The models INRIX employed in making these predictions clearly failed to account for the role that fear would play in reducing traffic demand.  This approach is far from uncommon where transport demand modelling is concerned.

The Games have a long way to run yet, and we may well see a counter movement occur in time as people begin to realise that transportation isn’t as bad as first expected.  But I think the impact that fear has had on shaping, at least, the first few days of transportation flows makes for interesting viewing.

I’ve always had a problem with the pervasive assumption in transportation research that everyone takes the shortest metric distance path when travelling between A and B.  This idea doesn’t seem to have any solid foundations in research, and intuitively it doesn’t make much sense – how do you even know what the shortest distance path is anyway?

So a good deal of my research has looked into what people really do. I’m not going to reveal all here – journal papers are generally more important than blogs in assuring future employment – but I’ll share one interesting finding.

The data I have used relates to 700,000 taxi routes through London (you might remember I blogged about this dataset previously).  For each of these routes, between origin and destination, I have also calculated an optimum path, according to a range of metrics, one being distance.  Then, as far as this blog post goes, I have compared each route and calculated the percentage match between the real route and the optimum shortest distance journey.
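As a rough illustration of this kind of comparison, the sketch below finds a shortest distance path with Dijkstra’s algorithm and then measures how much of a real route lies on it.  The exact definition of ‘percentage match’ is an assumption here (the share of the real route’s length that falls on segments also used by the shortest path), as are the function names and the toy network.

```python
import heapq


def shortest_path(graph, origin, destination):
    """Dijkstra over graph = {node: {neighbour: length}}; returns a list of nodes.

    Assumes the destination is reachable from the origin.
    """
    best = {origin: 0.0}
    previous = {}
    queue = [(0.0, origin)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == destination:
            break
        if dist > best[node]:
            continue  # stale queue entry
        for neighbour, length in graph[node].items():
            candidate = dist + length
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1]


def percent_match(graph, real_route):
    """Percentage of the real route's length on segments shared with the shortest path."""
    optimal = shortest_path(graph, real_route[0], real_route[-1])
    optimal_edges = {frozenset(edge) for edge in zip(optimal, optimal[1:])}
    total = shared = 0.0
    for a, b in zip(real_route, real_route[1:]):
        total += graph[a][b]
        if frozenset((a, b)) in optimal_edges:
            shared += graph[a][b]
    return 100.0 * shared / total


# Toy network: the direct B-D link makes A-B-D the shortest path from A to D.
roads = {
    "A": {"B": 1},
    "B": {"A": 1, "C": 1, "D": 2},
    "C": {"B": 1, "D": 2},
    "D": {"B": 2, "C": 2},
}
match = percent_match(roads, ["A", "B", "C", "D"])
```

In the toy case the driver goes via C, so only the first segment is shared with the shortest path and the match is a quarter of the route’s length.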


So is the shortest distance path a decent representation of reality?  No.

On average, the shortest distance path is able to estimate only 39.8% of each route.  Pretty poor when you consider that it is often used solely in predicting the behaviour of many individuals.

Not only this, the data shows that the shortest distance path is followed in its entirety only on very rare occasions.  Only 5% of real journeys show a match with 90% of their equivalent shortest distance path, with this value only rising to 13% when that threshold is dropped to 75%.

Minimising Distance

So, do people have no consideration for distance when they route through the city?  Well, no, that isn’t quite the case.

The graph below shows a scatter plot of real route distances against optimal (shortest path) distances.  As you can see, the relationship and resulting R-squared is pretty good.


Note: Overly long routes (three times optimal distance) have been removed.
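For reference, the filtering and the R-squared calculation can be sketched as below.  This assumes a simple straight-line least-squares fit, and reads the note above as dropping any route more than three times its optimal distance; the function names and sample distances are made up.

```python
def filter_long_routes(pairs, max_ratio=3.0):
    """Drop (optimal, real) distance pairs where real exceeds max_ratio times optimal."""
    return [(opt, real) for opt, real in pairs if real <= max_ratio * opt]


def r_squared(xs, ys):
    """Coefficient of determination for a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Made-up (optimal, real) distances in km; the last pair is an overly long route.
pairs = [(1.0, 1.2), (2.0, 2.4), (3.0, 3.6), (1.0, 5.0)]
kept = filter_long_routes(pairs)
fit = r_squared([opt for opt, _ in kept], [real for _, real in kept])
```

A high R-squared here says only that real distances scale closely with optimal distances, not that the two routes follow the same roads.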

It appears, therefore, that people do minimise distance – or at least do not stray extremely far from the minimum – but do not generally take the optimal shortest distance path.

This is research I’m still pulling together, but I hope this post is of interest to the wider community.  For anyone that is interested, do get in touch and I’ll let you know when the paper on this may be out.

The 1st International Conference on Urban Sustainability and Resilience will be held at UCL between the 5th and 7th November 2012.  The Call for Abstracts is currently active, with the deadline for 500-word abstracts being the 4th July 2012.

Please see for more information.  The usual blurb follows below:


The continuing trend toward urbanisation has brought to the fore the linkages between human societies, the technological world which they have created and live in, and the natural environment. Understanding these linkages is crucial to the survival of our species. Recent events (Hurricane Katrina, Fukushima disaster, UK flooding 2007) have shown what dire consequences can ensue when weak links are overlooked.

Engineers, policy makers, designers and planners are some of the key professions shaping the future of the urban world. The decisions they make today will often affect many generations to come. As such it is essential that their decisions be backed by knowledge which is both scientifically sound and also fully aware of the human factors inherent in urban issues.

The first international conference in Urban Sustainability and Resilience will bring together world experts from across a wide range of engineering, science and social science disciplines with three main objectives:

  • Bring together a strong research community committed to addressing some of the most pressing issues that human societies have ever faced;
  • Take stock of the current state of knowledge in the field of urban sustainability and resilience;
  • Put forward a coherent future research agenda in the field.


The central themes of the conference will be:

  • Facets of urban resilience
  • Integrating and engineering sustainable and resilient urban systems
  • Feeding the city
  • Towards a low-carbon urban environment 


In addition the conference welcomes papers and posters appropriate to one or more of the following topics:

  • Eco-cities
  • Measuring resilience
  • Transport
  • Water
  • Security
  • ICT
  • Retrofitting
  • Adapting to Climate Change
  • Managing Ageing Infrastructure
  • Sustainability Indicators
  • Waste
  • Energy
  • Food
  • Material
  • Urban Visions


The Diamond Jubilee in London:  A Tweet Location Analysis

I’ve been collecting Twitter data for a little while now, and have managed to identify some interesting (if slightly frivolous) trends.  But, when considering the wider applications of such a dataset, one question that has continued to bug me is – Why do we tweet when we tweet?

I won’t attempt to answer that question here (yet), but one clear reason is when we want to communicate our involvement in an event or activity.  You can see it quite clearly in the data – gigs at the O2, football matches at the Emirates – all of these events show up as clusters of tweet points.  So, with the Diamond Jubilee celebrations occurring in London last weekend, I thought this would be a nice opportunity to demonstrate how these crowd patterns form and disappear over space and time.  The images below – I hope you will agree – are quite pretty, but I think the analysis presents some more interesting implications with regard to the use of this type of dataset and the nature of visualisation, aspects I’ll address at the end.


Tweeting the Diamond Jubilee

What I’ve done here is look at all tweets mentioning ‘Jubilee’ occurring in London on the 3rd and 4th June 2012.  As you good patriots will recall, these were the dates of the Thames flotilla and Jubilee concert outside Buckingham Palace.  For you more technically-minded people, I’ve taken the tweet point locations and applied Kernel Density Estimation to them, to provide a sense of where the highest densities of tweets were occurring on each day.
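For the curious, Kernel Density Estimation boils down to placing a small ‘bump’ on every tweet location and adding the bumps up across a grid.  The sketch below is a minimal version of that idea, assuming a Gaussian kernel, a fixed bandwidth and point coordinates already in a projected system (e.g. metres); it is not necessarily how the maps here were produced, and the names and sample points are hypothetical.

```python
import math


def kernel_density(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid position.

    Returns a list of rows, one per value in grid_y, each holding one
    density value per value in grid_x.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface


# Made-up points: a tight cluster near the origin and one stray point.
tweets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0)]
surface = kernel_density(tweets, [0.0, 5.0, 10.0], [0.0, 5.0, 10.0], bandwidth=1.0)
```

The choice of bandwidth matters a great deal: a larger value smooths small clusters away, while a smaller one turns every tweet into its own hotspot.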

The colour scheme – in the colours of the flag, of course – shows the shift from high density areas of Jubilee-related tweets (in red) to areas where not many such tweets are detected (in blue).

Flotilla Day

On the day of the flotilla, you can clearly see a strong distribution of tweeting monarchists along the course of the flotilla on the River Thames.  It can be noted that this distribution is not spatially uniform, however, indicating perhaps the locations of the best, or most popular, viewing areas.  You can see other clusters around London too, which may indicate where other gatherings were taking place.


We can also look at this data in 3D, allowing us to better explore where the absolute highest densities of tweets were occurring within those big clusters of red…


Interestingly, this map helps to better draw out where the exact hotspots lie, revealing that the highest densities are at each of the bridges along the route, with Vauxhall and London bridges seeing the greatest activity.

Concert Day

The day of the concert – taking place on the evening of the 4th June – indicates clearly a completely different pattern of behaviour.


Here the biggest activity is along the Mall and towards the Jubilee concert outside Buckingham Palace.  One can also identify big clusters of tweets in Hyde Park and around Soho, again with lots of other clusters dotted around the city.  Overall, there appears to be a lesser concentration of tweets than seen on the day of the flotilla, something that appears to follow what was reported in the press.

Again, consulting the 3D representation of the data shows us more exactly where the largest clusters of tweets are located…


This image again demonstrates the importance of an alternative perspective.  In this case, we can see that the most important cluster is found along the Mall at the concert itself, with the other activity highlighted in the 2D perspective seemingly of much lesser significance.


What does all this actually mean?

OK, OK, so you may be thinking at this point ‘Yes, very nice pictures and everything, but isn’t this all fairly obvious?’.  Well in some ways yes, we know from the television pictures that there were a lot of people along the Thames on the 3rd June watching the flotilla.  What we have a lesser grasp on is the exact volume and spatial distribution of these people, and how they moved throughout the day.

My feeling is that, although biased in many respects, this dataset provides us with a unique opportunity to measure the spatial distribution of crowds at events. It may well only be a proxy for activity, but rather than relying on a few, subjective viewpoints, we are able to get a better overall indication of the true patterns of crowds in space and time.  Such analysis may also help us to identify emerging, organic events, outside of our current viewpoint, that require our attention.

In regard to these images in particular, I hope that the Kernel Density approach has been of interest to some of you of a less geographic mindset.  They do quite effectively highlight the locations of tweet hotspots.  The differences between the 2D and 3D images do demonstrate, however, how the visualisation of data can become misleading.  What appear to be large events in one representation are much less significant when viewed from an alternative perspective.  This is a facet of data visualisation that we all should be conscious of.

As ever, your thoughts on anything I’ve presented here are very welcome.


Edit (11-06-12)

You can now find video animations of the 3D results here and here.


‘Modelling Movement in the City: The Influence of Individuals’ was the title of a talk I gave at the AGILE conference in Avignon, France last week.  For the conference I actually initially prepared a poster that never ended up seeing the light of day – except for now that is.

The poster presents some recent work I carried out through agent-based simulation, demonstrating how different behavioural models influence the formation of macroscopic patterns.  As you can see from the results, even basic assumptions have a significant impact upon the unfolding network picture.

Probably now going to write this up as a journal paper, but hopefully putting the poster up here won’t mess with any copyright stuff – please let me know if it might!


A couple of weeks ago, I put up a post detailing how swearing on Twitter increases during the course of the average day.  It seemed people get more angry and sweary outside of work time, rather than during.

To delve a little deeper in this topic, I’ve now had a look at where Twitter gets angry.  For each of London’s 33 boroughs I have carried out the same analysis – this time for a month’s worth of tweets – looking at the percentage of tweets containing swear words in each borough.  The results follow some interesting trends…
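The per-borough calculation is simple enough to sketch.  The version below assumes each tweet has already been assigned to a borough, and uses a couple of mild placeholder words in place of a real swear word list; the function name and sample tweets are hypothetical.

```python
import re
from collections import defaultdict


def swearing_rate_by_borough(tweets, swear_words):
    """Percentage of tweets per borough containing at least one listed word.

    `tweets` is an iterable of (borough, text) pairs.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in swear_words) + r")\b",
        re.IGNORECASE,
    )
    totals = defaultdict(int)
    sweary = defaultdict(int)
    for borough, text in tweets:
        totals[borough] += 1
        if pattern.search(text):
            sweary[borough] += 1
    return {borough: 100.0 * sweary[borough] / totals[borough] for borough in totals}


# Made-up tweets with placeholder words.
sample = [
    ("Bexley", "Bloody trains again"),
    ("Bexley", "Lovely day"),
    ("Camden", "damn, missed the bus"),
    ("Camden", "nice coffee"),
    ("Camden", "ok"),
    ("Camden", "off to the market"),
]
rates = swearing_rate_by_borough(sample, ["bloody", "damn"])
```

Matching on whole words avoids counting innocent words that merely contain a listed term, though it will also miss creative spellings.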

At least in the Twittersphere, inner London appears to be the veritable paradise of civility relative to the bile-filled tweet streams emanating from outer London.  The biggest offenders appear to be located to the east of the city, with east London faring considerably worse.  Yet the leafy boroughs of Barnet, Sutton and Bromley perform badly too.

Right, so let’s first look at what doesn’t seem to be going on here.  First off, the influence of this idea that people mostly swear from the comfort of their own sofa does not seem to hold very true.  There does not seem to be a very strong relationship between swearing density and residential locations.  If there were then you’d see higher scores in the likes of Haringey, Richmond, Hammersmith and Fulham and Newham. Nor does swearing follow any sort of deprivation index: again, Haringey is relatively poor compared to the likes of Sutton, Bromley and Barnet, which fare much worse.

So what is going on?

In my opinion, what we are seeing is a reflection of demographic and cultural trends across these boroughs.  Taking demography in the first instance, according to the 2009 figures on nationality demographics at the borough level, those London boroughs with the highest percentages of British-born citizens are Havering, Bexley, Bromley and Sutton, respectively*.   It would make sense that the higher the percentage of British-born citizens in an area – on average those probably more likely to use an English swear word in a tweet – the greater the number of swearing tweets there are likely to be.  True, but I don’t think this tells the whole story.

Looking beyond these four boroughs, Kingston and Richmond also report high percentages of British citizens living within their boundaries – yet we don’t see similar volumes of sweary tweets coming from these boroughs.  How can this be so?  Make of this what you will, but beyond the demographic variation, the data appears to highlight a cultural variation across London in attitudes towards swearing in tweets.  Simply put, the data seems to suggest that the good residents of eastern and southern boroughs of outer London are generally more inclined to throw a swear word into a tweet than their counterparts over to the western side of London.

As I say, this is just my theory – there is a whole lot more you could do to this data to gain a better understanding of the trends observed here (unfortunately I don’t have the time to do so!).  I’d be very interested to hear any alternative ideas about what might be going on though.

Overall, I hope these analyses begin to give you an insight into the extent to which Twitter data (and other data sources like it) can be used to reveal and explain social, spatial and temporal trends.

* Newham, Westminster and Kensington and Chelsea score highest for non-British born residents


Amanda Erickson put up a nice, simple visualisation of what life might be like in a future of driverless, automated cars. Check it out.

Two things sprang to mind while watching this – first, how terrifying this might be for a passenger in one of these cars, and second, haven’t I seen this sort of thing somewhere else before?

Well, yes, I showed the following video in a lecture last month as a demonstration of self-organisation.  To me, the patterns look similar – at the higher level you see chaos, but when you observe the actions of individuals there is usually a rational stream of thought behind the actions they are taking – normally to get to their exit road.  Judge for yourself.

I think the stark similarity seen between these two videos raises interesting questions about what we consider as progress in the urban realm.  Bear with me as I attempt to explain.

The driverless or automated car is often seen as the natural future of private transportation*, with one of its main benefits being the apparent offer of optimal organisation of traffic flows (e.g. no congestion).  And indeed when you look at the first video, everything works and works well, perhaps even optimally.  But then you look at the second video, and you essentially have the same thing, created solely through the activity of individuals.

It is strange therefore that a fully optimised technical system is generally deemed necessary and superior.  When people are left to their own devices, to ‘sort it out between them’, people invariably do.  Traffic in Hanoi is not the only example of this type of self-organisation – the Internet itself is a creation of human ingenuity.  Following Monderman’s ideas on Shared Space, perhaps all of these traffic regulations, signage and restrictions actually reduce our need to think about what we are doing.  They reduce and remove our ability or will to self-organise, and to the detriment of us all.

So why don’t ‘natural’ answers to technical problems receive a better press?  I suspect it is an issue of trust in the citizen: the threat that one person may mess up, and mess it up for the rest of us.  Instead of facing the risk and accepting it as part of the solution, we surround ourselves with unnecessary and invasive mechanisms that carry out the task for us.  They may cost a lot of money and not be any better than our current solution, but they feel like progress.  It feels like things are getting better.  So, yes, perhaps automated cars are indeed a thing of the future.

As ever, very interested to hear your thoughts on this.

* I’ve personally never been so sure – mainly because of the safety element, and the fact that many people actually enjoy the process of driving…