In this post, we will analyze the Chicago crimes dataset using SparkSQL. If you are not familiar with SparkSQL, please refer to our post on Introduction to SparkSQL.
Dataset Description:
This dataset is publicly available and reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago from 2001 to the present. The data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
You can download the dataset from here.
Below is a sample record from the dataset
You can click here for the complete data set column wise description.
In this post, we will be using pyspark shell for writing our queries.
Problem Statement:
1. Find the number of crimes that happened under each FBI code.
Code:
# import SQLContext and Row
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# load the dataset and split the records on commas
lines = sc.textFile("hdfs://localhost:9000/Crime_dataset")
parts = lines.map(lambda l: l.split(","))

# construct the Rows by passing key/value pairs as kwargs
Crimes = parts.map(lambda p: Row(Id=p[0], case_no=p[1], date=p[2], block=p[3], IUCR=p[4], Primary_type=p[5], description=p[6], Loc_des=p[7], arrest=p[8], domestic=p[9], beat=p[10], district=p[11], ward=p[12], community=p[13], fbicode=p[14], XCor=p[15], YCor=p[16], year=p[17], Updated_on=p[18], latitude=p[19], longi=p[20], loc=p[21]))

# create the DataFrame and register it as a table
schema1 = sqlContext.createDataFrame(Crimes)
schema1.registerTempTable("Crimes")

# run the query to get the required result
result = sqlContext.sql("select fbicode, count(fbicode) as count from Crimes group by fbicode")
result.show()
Output:
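One caveat worth noting: the code above splits each record on raw commas, but the dataset’s final location column is a quoted "(latitude, longitude)" pair that itself contains a comma, so a naive split can shift the columns. The pure-Python sketch below illustrates the difference using a made-up record in the same 22-column layout (the values are hypothetical, not taken from the dataset); Python’s csv module respects the quoting:

```python
import csv

# A made-up record where the final quoted location field contains a comma.
line = '10000092,HY189866,03/18/2015,047XX W OHIO ST,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,false,false,1111,011,28,25,04B,1144606,1903566,2015,02/10/2018,41.88,-87.74,"(41.88, -87.74)"'

# Naive split breaks the quoted field into two pieces (23 parts instead of 22).
naive = line.split(",")

# csv.reader honors the quoting and keeps the location field together.
proper = next(csv.reader([line]))
```

The same idea carries over to the Spark job: parsing each line with csv.reader inside the map (or loading the file through a CSV-aware reader) keeps quoted fields intact.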
2. Find number of ‘NARCOTICS’ cases filed in the year 2015.
We have already read the data, created the DataFrame, and registered it as a table named ‘Crimes’ in the first problem statement. Now, we can run queries directly on this table.
Query:
result = sqlContext.sql("select count(*) as count from Crimes where Primary_type = 'NARCOTICS' and year = 2015")
result.show()
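The logic of this query — filter rows on two conditions, then count the matches — can be mimicked in plain Python. A minimal sketch with made-up rows (hypothetical values, not real records from the dataset; fields read from a text file arrive as strings, which is why the year is compared as a string here):

```python
# Toy rows carrying only the two fields the query inspects (made-up values).
rows = [
    {"Primary_type": "NARCOTICS", "year": "2015"},
    {"Primary_type": "NARCOTICS", "year": "2014"},
    {"Primary_type": "THEFT",     "year": "2015"},
    {"Primary_type": "NARCOTICS", "year": "2015"},
]

# Equivalent of: where Primary_type = 'NARCOTICS' and year = 2015
count = sum(1 for r in rows if r["Primary_type"] == "NARCOTICS" and r["year"] == "2015")
```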
3. Find the number of theft-related arrests that happened in each district.
result = sqlContext.sql("select district, count(*) as count from Crimes where Primary_type = 'THEFT' and arrest = 'true' group by district")
result.show()
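What this query computes — filter to theft arrests, then count per district — is a group-by-count, which Python's collections.Counter expresses concisely. A sketch with made-up (district, primary_type, arrest) rows, purely for illustration:

```python
from collections import Counter

# Toy (district, primary_type, arrest) tuples; the values are hypothetical.
rows = [
    ("001", "THEFT",   "true"),
    ("001", "THEFT",   "false"),
    ("002", "THEFT",   "true"),
    ("001", "THEFT",   "true"),
    ("002", "BATTERY", "true"),
]

# Equivalent of: where Primary_type = 'THEFT' and arrest = 'true' group by district
per_district = Counter(d for d, t, a in rows if t == "THEFT" and a == "true")
```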
We hope this blog helped you get a grip on SparkSQL concepts. Keep visiting our website for more blogs on Big Data, Spark, and other technologies.