06 April 2016

Spark Use Case – Titanic Data Analysis

There have been huge disasters throughout the history of mankind, but the Titanic disaster ranks among the greatest in magnitude. So much so that subsequent disasters have often been described as “Titanic in proportion,” implying huge losses.

Anyone who has ever read about the Titanic knows that a combination of natural events and human errors led to its sinking on its maiden voyage from Southampton to New York: the ship struck an iceberg on the night of April 14, 1912, and sank in the early hours of April 15.

Several questions have been put forward to understand the cause(s) of the tragedy. Foremost among them is what made it sink and, even more intriguing, how a 46,000-ton ship could sink to a depth of about 13,000 feet in under 3 hours. This is a mind-boggling question indeed!

There have been many investigations, and many kinds of analysis have been applied to reach a conclusion, yet many questions remain. This post is not about why the Titanic sank; it is about analysing the data available about its passengers. This Titanic data is publicly available, and the data set is described below under the heading Data Set Description.

Using this dataset, we will perform some analysis and draw out some insights, such as the average age of the males and females who died on the Titanic, and the number of people who died or survived in each passenger class, broken down by gender.

Data Set Description:

Column 1: PassengerId

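Column 2: Survived (this post reads 0 as survived and 1 as died; the widely used Kaggle copy of this data codes it the other way, 0 = died and 1 = survived, so check your file)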

Column 3: Pclass

Column 4: Name

Column 5: Sex

Column 6: Age

Column 7: SibSp

Column 8: Parch

Column 9: Ticket

Column 10: Fare

Column 11: Cabin

Column 12: Embarked

You can download the data set from here
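As an aid for reading the code that follows, here is a minimal sketch of how the column numbers above map to the zero-based array indices you get after splitting a line on commas. The object name TitanicColumns, its helper methods, and the sample row used with it are illustrative assumptions, not part of the original dataset or code.

```scala
object TitanicColumns {
  // Column names in file order, as listed in the data set description.
  val names: Vector[String] = Vector(
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked")

  // Zero-based array index of a column, or -1 if the name is unknown.
  // "Age" is column 6 in the list above, so its index is 5.
  def indexOf(column: String): Int = names.indexOf(column)

  // Read a field from an already-split line.
  // Returns None if the column is unknown or the line is too short.
  def field(fields: Array[String], column: String): Option[String] = {
    val i = indexOf(column)
    if (i >= 0 && i < fields.length) Some(fields(i)) else None
  }
}
```

For a made-up row such as "1,0,3,Passenger A,male,22,1,0,T-0001,7.25,,S", calling TitanicColumns.field(row.split(","), "Sex") yields Some("male"). Keeping the one-based column number and the zero-based index apart matters whenever you write expressions like x(5) or x(6).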

Problem Statement 1:

In this problem statement, we will find the average age of males and females who died in the Titanic tragedy.

Source Code:

// 1. Load the dataset from HDFS as an RDD of lines
val textFile = sc.textFile("hdfs://localhost:9000/TitanicData.txt")

// 2. Keep lines with at least 6 comma-separated fields, then split them
val split = textFile.filter(_.split(",").length >= 6).map(_.split(","))

// 3. Keep rows flagged "1" in column 2 with a whole-number age; emit (gender, age)
val key_value = split.filter(x => x(1) == "1" && x(5).matches("\\d+")).map(x => (x(4), x(5).toInt))

// 4. Sum ages and counts per gender, then divide to get the average
key_value.mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues { case (sum, count) => sum.toDouble / count }.collectAsMap()

Walk Through of the Above Code:

In line 1, we create a new RDD by loading the dataset from HDFS.
In line 2, we keep only the lines that have at least 6 columns, so that reading the 6th column later cannot throw an ArrayIndexOutOfBoundsException. After the filter, we use the map method to split each line; since the dataset is comma (‘,’) delimited, we split on ‘,’.
In line 3, we filter the split records on two conditions. The first is that the person died (per the data description, 1 in the 2nd column means died), and the second is that the value in the 6th column is numeric, which we check with the regular expression \\d+. Because \\d+ matches whole numbers only, rows with a missing or fractional age are skipped. For the rows that satisfy both conditions, we create key-value pairs: the key is the person’s gender (5th column) and the value is his or her age (6th column).
In line 4, we compute the average of the values for each unique key, which gives the average age of the males and females who died. The mapValues method passes each value in the key-value pair RDD through a function without changing the keys; we use it first to turn each age into an (age, 1) pair. The reduceByKey method then adds these pairs per key, giving a (sum, count) for each gender. Finally, we use mapValues again to compute sum/count, and we collect the result as a map.
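The steps in this walkthrough can be tried without a cluster. Below is a minimal sketch that mirrors the same filter and the same (sum, count) averaging on plain Scala collections instead of an RDD. The object name AverageAgeSketch and its method names are assumptions made for illustration, and groupBy followed by reduce stands in for Spark's reduceByKey.

```scala
object AverageAgeSketch {
  // Keep rows with at least 6 fields, flag "1" in column 2, and a
  // whole-number age in column 6; emit (gender, age) pairs.
  def keyValues(lines: Seq[String]): Seq[(String, Int)] =
    lines.map(_.split(","))
      .filter(f => f.length >= 6 && f(1) == "1" && f(5).matches("\\d+"))
      .map(f => (f(4), f(5).toInt))

  // Per-key average built from (sum, count) pairs, the same idea as
  // mapValues + reduceByKey + mapValues on a pair RDD.
  def averageByKey(pairs: Seq[(String, Int)]): Map[String, Double] =
    pairs
      .map { case (key, age) => (key, (age, 1)) }
      .groupBy(_._1)
      .map { case (key, group) =>
        val (sum, count) =
          group.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        (key, sum.toDouble / count)
      }
}
```

Given made-up rows with male ages 30 and 40 and a female age of 25 (all flagged "1"), averageByKey(keyValues(rows)) gives 35.0 for male and 25.0 for female. Rows with a fractional age such as 0.75, or with no age at all, are dropped by the filter and do not affect the average.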

Output:

male -> 28.78409090909091, female -> 29.11855670103093

You can see the result in the screenshot below.

Problem Statement 2:

In this problem statement, we will find the number of people who died or survived in each class, along with their gender and age.

In the Titanic data, there are 3 classes (1, 2, and 3), recorded in the 3rd column, and the information about mortality (whether the person survived or died in the tragedy) is in the 2nd column (0 for survived and 1 for died). Gender is in the 5th column and age is in the 6th column.

Here, we will club all four columns into a single key, so that we get the mortality counts in each class along with age and gender.

Source Code:

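// 1. Load the dataset from HDFS as an RDD of lines
val textFile = sc.textFile("hdfs://localhost:9000/TitanicData.txt")

// 2. Keep lines with at least 7 fields (index 6 is read below), then split them
val split = textFile.filter(_.split(",").length >= 7).map(_.split(","))

// 3. Count rows per composite key built from indices 1, 4, 6, 2.
//    Index 6 is the 7th column (SibSp); use x(5) to key on Age (6th column) instead.
val count = split.map(x => (x(1) + " " + x(4) + " " + x(6) + " " + x(2), 1)).reduceByKey(_ + _).collect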

Walk Through of the Above Code:

In line 1, we create a new RDD by loading the dataset from HDFS.
In line 2, we keep only the lines that have enough columns for the indices read in line 3 (index 6 is the highest, so at least 7 columns are needed), which avoids an ArrayIndexOutOfBoundsException. After the filter, we use the map method to split each line; since the dataset is comma (‘,’) delimited, we split on ‘,’.
In line 3, we club array indices 1, 4, 6, and 2 into a single key. These are the 2nd, 5th, 7th, and 3rd columns: Survived, Sex, SibSp, and Pclass. Note that index 6 is SibSp rather than Age; Age is the 6th column, which is index 5, so use x(5) in place of x(6) to group by age. We give each key the value 1 and pass the pairs to the reduceByKey method, which adds the 1s per key and returns the count for each combination.
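To experiment with different key columns without a cluster, here is a minimal sketch of the same composite-key count on plain Scala collections. The object name CompositeKeyCount and its methods are hypothetical, groupBy plus sum stands in for reduceByKey(_ + _), and the list of indices is assumed to be non-empty.

```scala
object CompositeKeyCount {
  // Join the chosen fields (zero-based indices) into one space-separated key.
  def key(fields: Array[String], indices: Seq[Int]): String =
    indices.map(i => fields(i)).mkString(" ")

  // Count how many lines share each composite key.
  // Lines too short for the largest index are skipped.
  def countBy(lines: Seq[String], indices: Seq[Int]): Map[String, Int] =
    lines.map(_.split(","))
      .filter(_.length > indices.max)
      .map(f => (key(f, indices), 1))
      .groupBy(_._1)
      .map { case (k, ones) => (k, ones.map(_._2).sum) }
}
```

Calling countBy(lines, Seq(1, 4, 6, 2)) reproduces the key layout used in the code above (Survived, Sex, SibSp, Pclass), while Seq(1, 4, 5, 2) keys on Age instead of SibSp.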

Output:

(0 male 0 1,59), (0 female 3 3,7), (1 male 2 2,1), (0 female 1 1,2), (1 female 2 1,3), (1 female 3 2,1), (0 female 5 3,1), (0 male 1 2,20), (1 female 0 3,48), (0 male 2 3,7), (0 male 0 3,235), (0 female 1 3,21), (1 female 4 3,2), (1 female 1 2,25), (1 male 1 3,10), (1 male 0 2,9), (1 male 1 1,15), (1 female 0 1,48), (0 female 0 2,3), (0 male 4 3,11), (1 female 2 3,4), (0 male 2 1,1), (0 male 8 3,4), (1 male 2 3,1), (1 male 1 2,7), (0 male 1 3,35), (0 female 8 3,3), (0 male 5 3,4), (1 male 4 3,1), (0 male 3 3,4), (1 female 3 3,1), (0 male 3 1,1), (1 female 0 2,41), (0 female 0 3,33), (0 female 2 3,3), (1 female 3 1,2), (1 female 1 1,38), (1 male 0 1,29), (0 male 2 2,4), (0 female 4 3,4), (1 female 1 3,17), (1 male 0 3,35), (1 female 2 2,3)

The same output can be seen in the screenshot below.

We hope this post has been helpful in understanding how to perform simple data analysis using Spark and Scala.

Keep visiting our website www.acadgild.com and blog for more updates on Big Data and other technologies.

Kiran Krishna
