Predicting London Crime Rates Using Machine Learning
- by 7wData
Predicting the number and even the type of crimes being committed in the Greater London area each month is no easy task, but here’s how I cracked it, with Dataiku Data Science Studio (DSS).
This blog post was updated in February 2017 to include all 2016 data and make predictions for 2017.
In 2014, London police started trialling software designed by Accenture to identify gang members who were likely to commit violent crimes or reoffend. It began an unprecedented study drawing on five years of data, including previous crime rates and social media activity. Using big data to fight crime is clearly not entirely novel, but I wanted to take this further, especially with all the open data about crime that’s out there.
The Greater London police forces (the Metropolitan Police and the City of London Police) are doing a great job of fighting crime, providing data, and mapping the results, but what’s more interesting is trying to make predictions rather than just looking back at past data.
We might already have an idea of who is likely to commit a crime, but how many crimes would this result in, and what would be their nature? This was the kind of information I hoped to predict, so I tried two different predictive models: crime counts month by month at the LSOA level, and crime counts by type (burglary, bicycle theft, arson, etc.) month by month at the LSOA level.
About LSOA: an LSOA (Lower Layer Super Output Area) is a census area containing 1,000 to 3,000 people. Here’s the full definition from the ONS.
So, where to begin? I sourced the data from the open crime database on the UK police portal, selecting data from 2011 to 2016 pertaining to Greater London (central London and the surrounding metropolitan area).
The main source of data I used is available here - I selected the Metropolitan and London areas. I also used UK census information, Points of Interest (POIs), and the geographical locations of police stations.
I enriched the dataset with various open data sources, adding police station coordinates, postcodes, POIs, and the LSOA statistics.
To prepare the dataset for training the machine learning models, I created a geohash from the latitude and longitude coordinates. I also cleaned the data (recoding values, filling empty cells, and restructuring), which is super simple in Dataiku DSS.
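To give an idea of what geohashing does, here is a minimal pure-Python encoder. This is only a sketch of the technique - the actual transformation was done inside Dataiku DSS, and the precision level shown is an assumption, not the one used in the project:

```python
# Minimal geohash encoder: interleaves longitude/latitude bisection bits
# and packs them into a base-32 string. Nearby points share a prefix,
# so the hash can be used as a categorical area feature.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # even-indexed bits refine longitude, odd ones latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # Pack every 5 bits into one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

print(geohash_encode(51.5074, -0.1278, precision=6))  # a central London point
```

The key property is that truncating the hash gives progressively coarser areas, which makes it a convenient categorical feature at several spatial resolutions.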
I then clustered the LSOAs in order to define criminality profiles and their levels - three clusters and one outlier were found. The different datasets could then be joined.
I built two models: the first predicting crimes per LSOA per month, and the second predicting crimes per LSOA per month per crime type.
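Both targets boil down to counting incident records at different granularities. A minimal sketch of that aggregation, assuming hypothetical field names ("lsoa", "month", "crime_type") rather than the actual schema used in DSS:

```python
from collections import Counter

# Toy incident records standing in for the raw crime data
incidents = [
    {"lsoa": "E01000001", "month": "2015-01", "crime_type": "Burglary"},
    {"lsoa": "E01000001", "month": "2015-01", "crime_type": "Bicycle theft"},
    {"lsoa": "E01000001", "month": "2015-02", "crime_type": "Burglary"},
    {"lsoa": "E01000002", "month": "2015-01", "crime_type": "Arson"},
]

# Target 1: total crimes per LSOA per month
per_lsoa_month = Counter((i["lsoa"], i["month"]) for i in incidents)

# Target 2: crimes per LSOA per month per crime type
per_lsoa_month_type = Counter(
    (i["lsoa"], i["month"], i["crime_type"]) for i in incidents
)

print(per_lsoa_month[("E01000001", "2015-01")])                   # 2
print(per_lsoa_month_type[("E01000001", "2015-01", "Burglary")])  # 1
```

The second target is simply a finer-grained version of the first, which is why both models could be trained from the same prepared dataset.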
I collected the POIs, cleaned the data, created a geohash for each latitude/longitude coordinate, and loaded everything into an HPE Vertica database. I was then ready to collect the crimes from 2011 to 2016 and clean that data.
Here is an overview of the first data preparation step:
I developed a geohashing plugin for transforming the X/Y coordinates into categorical values. If you are not familiar with DSS plugins, you can find out more here - plugins are super useful for packaging a methodology and adding new functions to Dataiku DSS.
Let’s have a first look at the volume of crime data we collected. For this, I created a chart of the number of crimes by year with Dataiku DSS:
I decided to work with crime data from 2012 to 2015 and then predict for 2016. The second step was to predict the number of crimes in 2017 based on the 2016 model. The first pleasant surprise was seeing the number of crimes decreasing. I was less surprised by the re-categorization of crimes; this is often the case in other industries when, for operational reasons, a category is split or merged.
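The evaluation scheme above is a time-based split: train on history, hold out the most recent year. A sketch of that split with a naive last-year baseline - the records, field names, and baseline are illustrative assumptions, not the models built in DSS:

```python
# Toy per-LSOA monthly counts standing in for the prepared dataset
rows = [
    {"lsoa": "E01000001", "year": y, "month": m, "crimes": c}
    for y, m, c in [
        (2014, 1, 10), (2015, 1, 12), (2016, 1, 11),
        (2014, 6, 7),  (2015, 6, 8),  (2016, 6, 9),
    ]
]

# Train on 2012-2015, hold out 2016 for evaluation
train = [r for r in rows if 2012 <= r["year"] <= 2015]
test = [r for r in rows if r["year"] == 2016]

# Naive baseline: predict the same LSOA/month count as the previous year
lookup = {(r["lsoa"], r["year"], r["month"]): r["crimes"] for r in train}
preds = {
    (r["lsoa"], r["month"]): lookup.get((r["lsoa"], r["year"] - 1, r["month"]))
    for r in test
}
print(preds[("E01000001", 1)])  # 12, i.e. the 2015 count for that LSOA/month
```

Any real model has to beat this kind of baseline to be worth keeping; refitting through 2016 then yields the 2017 predictions in the same way.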