In this blog, we will discuss the analysis of the Olympics dataset using Apache Spark in Scala.
The Olympics dataset is publicly available. Using it, we will work through a few problem statements, such as finding the number of medals won by each country in swimming and finding the number of medals won by India year wise.
Data Set Description
The data set consists of the following fields:
Athlete: Name of the athlete
Age: Age of the athlete
Country: The name of the country participating in Olympics
Year: The year in which Olympics is conducted
Closing Date: Closing date of Olympics
Sport: Sports name
Gold Medals: No. of gold medals
Silver Medals: No. of silver medals
Bronze Medals: No. of bronze medals
Total Medals: Total no. of medals
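Each record is therefore a single tab-separated line with these ten fields in order. A minimal plain-Scala sketch of parsing one such line into a case class (the sample line below is invented for illustration; the field order is assumed from the list above):

```scala
// Model of one record; the field order follows the dataset description above.
case class OlympicRecord(
  athlete: String, age: Int, country: String, year: Int,
  closingDate: String, sport: String,
  gold: Int, silver: Int, bronze: Int, total: Int
)

// Parse one tab-separated line into a record.
def parseLine(line: String): OlympicRecord = {
  val f = line.split("\t")
  OlympicRecord(f(0), f(1).toInt, f(2), f(3).toInt, f(4), f(5),
    f(6).toInt, f(7).toInt, f(8).toInt, f(9).toInt)
}

// Invented sample record for illustration only.
val sample = "Michael Phelps\t23\tUnited States\t2008\t8/24/2008\tSwimming\t8\t0\t0\t8"
val rec = parseLine(sample)
```

Note that indices are zero-based, so `f(9)` is the 10th column (Total Medals); the problem statements below rely on this.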
Dataset Link
https://drive.google.com/drive/folders/0ByJLBTmJojjzVGNsWmpUUUxTZDA
Problem Statement 1
Find the total number of medals won by each country in swimming.
Source code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(5).equalsIgnoreCase("swimming") && x(9).matches("\\d+"))
val pairs = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the records that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose sport is 'swimming', since we need the number of medals won by each country in swimming. We also check that the 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the country name and whose value is the number of medals it won in swimming.
Line 5: We sum the medals each country won in swimming using the reduceByKey method, and finally display the result using the collect() method.
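The reduceByKey step behaves like a grouped sum. A small plain-Scala analogue (sample pairs invented for illustration) shows the same (country, medals) aggregation without needing a cluster:

```scala
// Invented sample: (country, total medals) pairs as the map step would produce them.
val pairs = Seq(("United States", 8), ("Australia", 3), ("United States", 2), ("Australia", 1))

// Plain-collections equivalent of reduceByKey(_ + _): group by key, sum the values.
val medalsByCountry = pairs.groupBy(_._1).map { case (country, ps) => (country, ps.map(_._2).sum) }
```

In Spark, reduceByKey performs this sum partially on each partition before shuffling, so only one (country, sum) pair per key crosses the network.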
Output
(Australia,163), (Hungary,9), (Brazil,8), (Canada,5), (Japan,43), (Netherlands,46), (Belarus,2), (Sweden,9), (Serbia,1), (Slovakia,2), (Norway,2), (Denmark,1), (Poland,3), (Trinidad and Tobago,1), (Great Britain,11), (Argentina,1), (Croatia,1), (Lithuania,1), (Zimbabwe,7), (China,35), (Slovenia,1), (South Korea,4), (Italy,16), (Spain,3), (Germany,32), (Costa Rica,2), (France,39), (Tunisia,3), (Ukraine,7), (United States,267), (South Africa,11), (Romania,6), (Russia,20), (Austria,3)
You can see the same in the screenshot below.
Figure 1
Problem Statement 2
Find the number of medals that India won year wise.
Source code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(2).equalsIgnoreCase("india") && x(9).matches("\\d+"))
val pairs = fil.map(x => (x(3), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the records that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose country is 'India', since we need the number of medals won by India. We also check that the 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the year and whose value is the number of medals won in that year.
Line 5: We sum the medals India won in each year using the reduceByKey method, and finally display the result using the collect() method.
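Note that collect() returns the pairs in no particular order. If a chronological listing is wanted, the pairs can be sorted by their year key before collecting (sortByKey in Spark). A plain-Scala sketch over the year-wise output shown below:

```scala
// India's year-wise output as (year, medals) pairs.
val byYear = Seq(("2008", 3), ("2004", 1), ("2000", 1), ("2012", 6))

// Sort by the year key, as Spark's sortByKey would before collect().
val chronological = byYear.sortBy(_._1)
```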
Output
(2008,3), (2004,1), (2000,1), (2012,6)
You can see the same in the screenshot below.
Figure 2
Problem Statement 3
Find the total number of medals won by each country.
Source Code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(9).matches("\\d+"))
val pairs = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the lines that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the country and whose value is the number of medals won by that country.
Line 5: We sum the medals won by each country using the reduceByKey method, and finally display the result using the collect() method.
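A common follow-up is ranking countries by their medal count; in Spark that would be a sortBy on the value with ascending = false before collect(). A plain-Scala sketch using a few of the totals from the output below:

```scala
// A few (country, total medals) pairs taken from the output below.
val totals = Seq(("Australia", 609), ("Canada", 370), ("Brazil", 221), ("Japan", 282))

// Top 3 countries by medal count, in descending order.
val top3 = totals.sortBy(-_._2).take(3)
```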
Output
(Australia,609), (Great Britain,322), (Brazil,221), (Canada,370), (Uzbekistan,19), (Barbados,1), (Japan,282), (Cyprus,1), (Finland,118), (Singapore,7), (Montenegro,14), (Uruguay,1), (Moldova,5), (Colombia,13), (Sweden,181), (Vietnam,2), (Serbia,31), (Iran,24), (Slovakia,35), (Mozambique,1), (Cameroon,20), (Denmark,89), (Turkey,28), (Panama,1), (Saudi Arabia,6), (Hungary,145), (Portugal,9), (Paraguay,17), (Jamaica,80), (Georgia,23), (Dominican Republic,5), (Kyrgyzstan,3), (Netherlands,318), (Iceland,15), (Morocco,11), (Belarus,97), (Mongolia,10), (Kazakhstan,42), (Kenya,39), (Syria,1), (Indonesia,22), (Eritrea,1), (Uganda,1), (Norway,192), (Puerto Rico,2), (Poland,80), (Tajikistan,3), (Grenada,1), (Trinidad and Tobago,19), (Afghanistan,2), (Israel,4)
You can see the same in the screenshot below.
Figure 3
We hope this blog helped you understand dataset analysis using Spark. Keep visiting our site www.acadgild.com for more blogs on Big Data and other technologies.
Hi,
Can anyone help me with Spark-HBase integration? I am able to write data to HBase, but I am having trouble reading it back from HBase.
If you have any sample code, please forward it to kumardwh@outlook.com
Hi Kiran,
Thank you so much for this valuable information, but can a CSV file really be loaded using textFile(…csv)? I don't think so. Could you please let me know (if you have a chance) how we can convert a CSV file into an RDD? Thank you!
Vasu
Hi Vasu,
textFile() in Spark can load any plain-text file, including a CSV file. Everything in Spark is built on RDDs: textFile() loads the dataset as an RDD of lines and distributes it among the worker nodes. For more information on RDDs, you can refer to our Beginner's Guide.
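For comma-separated data, the only change from this post's code is the delimiter passed to split. A minimal sketch (the field layout is assumed to match this post's dataset; the sample line is invented):

```scala
// In Spark: sc.textFile("hdfs://localhost:9000/olympix_data.csv").map(_.split(","))
// would give an RDD of field arrays. The same per-line logic in plain Scala:
val line = "Michael Phelps,23,United States,2008,8/24/2008,Swimming,8,0,0,8"
val fields = line.split(",")
val countryAndTotal = (fields(2), fields(9).toInt)
```

One caveat with this naive split: it breaks on fields that themselves contain commas (e.g. quoted athlete names), which is why a proper CSV parser or Spark's CSV support is preferable for real data.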