In this post, we will analyze the Chicago crimes dataset using SparkSQL. If you are not familiar with SparkSQL, please refer to our post on Introduction to SparkSQL.
Dataset Description:
This dataset is publicly available and reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago from 2001 to the present. The data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
You can download the dataset from here.
Below is a sample record from the dataset
You can click here for the complete data set column wise description.
In this post, we will be using pyspark shell for writing our queries.
Problem Statement:
1. Find the number of crimes that happened under each FBI code.
Code:
# import SQLContext and Row
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# load the dataset and split the records on commas
lines = sc.textFile("hdfs://localhost:9000/Crime_dataset")
parts = lines.map(lambda l: l.split(","))

# construct the Rows by passing key/value pairs as kwargs
Crimes = parts.map(lambda p: Row(Id=p[0], case_no=p[1], date=p[2], block=p[3], IUCR=p[4], Primary_type=p[5], description=p[6], Loc_des=p[7], arrest=p[8], domestic=p[9], beat=p[10], district=p[11], ward=p[12], community=p[13], fbicode=p[14], XCor=p[15], YCor=p[16], year=p[17], Updated_on=p[18], latitude=p[19], longi=p[20], loc=p[21]))

# create the DataFrame and register it as a table
schema1 = sqlContext.createDataFrame(Crimes)
schema1.registerTempTable("Crimes")

# run the query to get the required result
result = sqlContext.sql("select fbicode, count(fbicode) as count from Crimes group by fbicode")
result.show()
Output:
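One caveat worth noting: the code above splits each record on raw commas, but the dataset’s final location column is a quoted "(latitude, longitude)" pair that itself contains a comma, so a naive split can shift the columns. The pure-Python sketch below illustrates the difference using a made-up record in the same 22-column layout (the values are hypothetical, not taken from the dataset); Python’s csv module respects the quoting:

```python
import csv

# A made-up record where the final quoted location field contains a comma.
line = '10000092,HY189866,03/18/2015,047XX W OHIO ST,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,false,false,1111,011,28,25,04B,1144606,1903566,2015,02/10/2018,41.88,-87.74,"(41.88, -87.74)"'

# Naive split breaks the quoted field into two pieces (23 parts instead of 22).
naive = line.split(",")

# csv.reader honors the quoting and keeps the location field together.
proper = next(csv.reader([line]))
```

The same idea carries over to the Spark job: parsing each line with csv.reader inside the map (or loading the file through a CSV-aware reader) keeps quoted fields intact.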
2. Find number of ‘NARCOTICS’ cases filed in the year 2015.
We have already read the data, created the DataFrame, and registered it as a table named ‘Crimes’ in the first problem statement. Now, we can run queries directly on this table.
Query:
result = sqlContext.sql("select count(*) as count from Crimes where Primary_type = 'NARCOTICS' and year = 2015")
result.show()
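The logic of this query — filter rows on two conditions, then count the matches — can be mimicked in plain Python. A minimal sketch with made-up rows (hypothetical values, not real records from the dataset; fields read from a text file arrive as strings, which is why the year is compared as a string here):

```python
# Toy rows carrying only the two fields the query inspects (made-up values).
rows = [
    {"Primary_type": "NARCOTICS", "year": "2015"},
    {"Primary_type": "NARCOTICS", "year": "2014"},
    {"Primary_type": "THEFT",     "year": "2015"},
    {"Primary_type": "NARCOTICS", "year": "2015"},
]

# Equivalent of: where Primary_type = 'NARCOTICS' and year = 2015
count = sum(1 for r in rows if r["Primary_type"] == "NARCOTICS" and r["year"] == "2015")
```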
3. Find the number of theft-related arrests that happened in each district.
result = sqlContext.sql("select district, count(*) as count from Crimes where Primary_type = 'THEFT' and arrest = 'true' group by district")
result.show()
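What this query computes — filter to theft arrests, then count per district — is a group-by-count, which Python's collections.Counter expresses concisely. A sketch with made-up (district, primary_type, arrest) rows, purely for illustration:

```python
from collections import Counter

# Toy (district, primary_type, arrest) tuples; the values are hypothetical.
rows = [
    ("001", "THEFT",   "true"),
    ("001", "THEFT",   "false"),
    ("002", "THEFT",   "true"),
    ("001", "THEFT",   "true"),
    ("002", "BATTERY", "true"),
]

# Equivalent of: where Primary_type = 'THEFT' and arrest = 'true' group by district
per_district = Counter(d for d, t, a in rows if t == "THEFT" and a == "true")
```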
We hope this blog helped you get a grip on SparkSQL concepts. Keep visiting our website for more blogs on Big Data, Spark, and other technologies.