In this blog, we will discuss the analysis of the Olympics dataset using Apache Spark in Scala.
The Olympics dataset is publicly available. Using it, we will work through a few problem statements, such as finding the number of medals won by each country in swimming and finding the number of medals won by India year wise.
Data Set Description
The data set consists of the following fields:
Athlete: Name of the athlete
Age: Age of the athlete
Country: The name of the country participating in Olympics
Year: The year in which Olympics is conducted
Closing Date: Closing date of Olympics
Sport: Sports name
Gold Medals: No. of gold medals
Silver Medals: No. of silver medals
Bronze Medals: No. of bronze medals
Total Medals: Total no. of medals
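Each record is therefore a single tab-separated line with these ten fields in order. A minimal plain-Scala sketch of parsing one such line into a case class (the sample line below is invented for illustration; the field order is assumed from the list above):

```scala
// Model of one record; the field order follows the dataset description above.
case class OlympicRecord(
  athlete: String, age: Int, country: String, year: Int,
  closingDate: String, sport: String,
  gold: Int, silver: Int, bronze: Int, total: Int
)

// Parse one tab-separated line into a record.
def parseLine(line: String): OlympicRecord = {
  val f = line.split("\t")
  OlympicRecord(f(0), f(1).toInt, f(2), f(3).toInt, f(4), f(5),
    f(6).toInt, f(7).toInt, f(8).toInt, f(9).toInt)
}

// Invented sample record for illustration only.
val sample = "Michael Phelps\t23\tUnited States\t2008\t8/24/2008\tSwimming\t8\t0\t0\t8"
val rec = parseLine(sample)
```

Note that indices are zero-based, so `f(9)` is the 10th column (Total Medals); the problem statements below rely on this.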
Dataset Link
https://drive.google.com/drive/folders/0ByJLBTmJojjzVGNsWmpUUUxTZDA
Problem Statement 1
Find the total number of medals won by each country in swimming.
Source code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(5).equalsIgnoreCase("swimming") && x(9).matches("\\d+"))
val pairs = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the records that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose sport is 'swimming', since we need the number of medals won by each country in swimming. We also check that the 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the country name and whose value is the number of medals it won in swimming.
Line 5: We sum the medals each country won in swimming using the reduceByKey method, and finally display the result using the collect() method.
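The reduceByKey step behaves like a grouped sum. A small plain-Scala analogue (sample pairs invented for illustration) shows the same (country, medals) aggregation without needing a cluster:

```scala
// Invented sample: (country, total medals) pairs as the map step would produce them.
val pairs = Seq(("United States", 8), ("Australia", 3), ("United States", 2), ("Australia", 1))

// Plain-collections equivalent of reduceByKey(_ + _): group by key, sum the values.
val medalsByCountry = pairs.groupBy(_._1).map { case (country, ps) => (country, ps.map(_._2).sum) }
```

In Spark, reduceByKey performs this sum partially on each partition before shuffling, so only one (country, sum) pair per key crosses the network.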
Output
(Australia,163), (Hungary,9), (Brazil,8), (Canada,5), (Japan,43), (Netherlands,46), (Belarus,2), (Sweden,9), (Serbia,1), (Slovakia,2), (Norway,2), (Denmark,1), (Poland,3), (Trinidad and Tobago,1), (Great Britain,11), (Argentina,1), (Croatia,1), (Lithuania,1), (Zimbabwe,7), (China,35), (Slovenia,1), (South Korea,4), (Italy,16), (Spain,3), (Germany,32), (Costa Rica,2), (France,39), (Tunisia,3), (Ukraine,7), (United States,267), (South Africa,11), (Romania,6), (Russia,20), (Austria,3)
You can see the same in the screenshot below.
Figure 1
Problem Statement 2
Find the number of medals that India won year wise.
Source code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(2).equalsIgnoreCase("india") && x(9).matches("\\d+"))
val pairs = fil.map(x => (x(3), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the records that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose country is 'India', since we need the number of medals won by India. We also check that the 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the year and whose value is the number of medals won in that year.
Line 5: We sum the medals India won in each year using the reduceByKey method, and finally display the result using the collect() method.
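Note that collect() returns the pairs in no particular order. If a chronological listing is wanted, the pairs can be sorted by their year key before collecting (sortByKey in Spark). A plain-Scala sketch over the year-wise output shown below:

```scala
// India's year-wise output as (year, medals) pairs.
val byYear = Seq(("2008", 3), ("2004", 1), ("2000", 1), ("2012", 6))

// Sort by the year key, as Spark's sortByKey would before collect().
val chronological = byYear.sortBy(_._1)
```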
Output
(2008,3), (2004,1), (2000,1), (2012,6)
You can see the same in the screenshot below.
Figure 2
Problem Statement 3
Find the total number of medals won by each country.
Source Code
val textFile = sc.textFile("hdfs://localhost:9000/olympix_data.csv")
val counts = textFile.filter(x => x.split("\t").length >= 10).map(line => line.split("\t"))
val fil = counts.filter(x => x(9).matches("\\d+"))
val pairs = fil.map(x => (x(2), x(9).toInt))
val cnt = pairs.reduceByKey(_ + _).collect()
Description of the Above Code
Line 1: We create an RDD from the dataset stored in HDFS.
Line 2: We take each record as input, filter out the lines that do not have at least 10 columns (this prevents an ArrayIndexOutOfBoundsException later), and split each remaining line on the tab character.
Line 3: From the records that passed the column check, we keep only those whose 10th column (Total Medals) contains only digits.
Line 4: We create a pair RDD (String, Int) whose key is the country and whose value is the number of medals won by that country.
Line 5: We sum the medals won by each country using the reduceByKey method, and finally display the result using the collect() method.
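A common follow-up is ranking countries by their medal count; in Spark that would be a sortBy on the value with ascending = false before collect(). A plain-Scala sketch using a few of the totals from the output below:

```scala
// A few (country, total medals) pairs taken from the output below.
val totals = Seq(("Australia", 609), ("Canada", 370), ("Brazil", 221), ("Japan", 282))

// Top 3 countries by medal count, in descending order.
val top3 = totals.sortBy(-_._2).take(3)
```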
Output
(Australia,609), (Great Britain,322), (Brazil,221), (Canada,370), (Uzbekistan,19), (Barbados,1), (Japan,282), (Cyprus,1), (Finland,118), (Singapore,7), (Montenegro,14), (Uruguay,1), (Moldova,5), (Colombia,13), (Sweden,181), (Vietnam,2), (Serbia,31), (Iran,24), (Slovakia,35), (Mozambique,1), (Cameroon,20), (Denmark,89), (Turkey,28), (Panama,1), (Saudi Arabia,6), (Hungary,145), (Portugal,9), (Paraguay,17), (Jamaica,80), (Georgia,23), (Dominican Republic,5), (Kyrgyzstan,3), (Netherlands,318), (Iceland,15), (Morocco,11), (Belarus,97), (Mongolia,10), (Kazakhstan,42), (Kenya,39), (Syria,1), (Indonesia,22), (Eritrea,1), (Uganda,1), (Norway,192), (Puerto Rico,2), (Poland,80), (Tajikistan,3), (Grenada,1), (Trinidad and Tobago,19), (Afghanistan,2), (Israel,4)
You can see the same in the screenshot below.
Figure 3
We hope this blog helped you understand dataset analysis using Spark. Keep visiting our site www.acadgild.com for more blogs on Big Data and other technologies.
Hi,
Can anyone help me with Spark-HBase integration? I am able to write data to HBase, but I am having trouble reading it back from HBase.
If you have any sample code, please forward it to kumardwh@outlook.com
Hi Kiran,
Thank you so much for this valuable information, but can a CSV file really be loaded using textFile(…csv)? I don't think so. Could you please let me know (if you have a chance) how we can convert a CSV file into an RDD? Thank you!
Vasu
Hi Vasu,
textFile() in Spark can load any plain-text file, including a CSV file. Everything in Spark is built on RDDs: textFile() loads the dataset as an RDD of lines and distributes it among the worker nodes. For more information on RDDs, you can refer to our Beginner's Guide.
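For comma-separated data, the only change from this post's code is the delimiter passed to split. A minimal sketch (the field layout is assumed to match this post's dataset; the sample line is invented):

```scala
// In Spark: sc.textFile("hdfs://localhost:9000/olympix_data.csv").map(_.split(","))
// would give an RDD of field arrays. The same per-line logic in plain Scala:
val line = "Michael Phelps,23,United States,2008,8/24/2008,Swimming,8,0,0,8"
val fields = line.split(",")
val countryAndTotal = (fields(2), fields(9).toInt)
```

One caveat with this naive split: it breaks on fields that themselves contain commas (e.g. quoted athlete names), which is why a proper CSV parser or Spark's CSV support is preferable for real data.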