This post is about analyzing YouTube data, and the entire analysis is performed using Apache Spark. The data is publicly available, and the data set is described below under the heading Data Set Description.
Using that data set, we will perform some analysis and draw out insights, such as the top 10 rated videos on YouTube and which uploader has uploaded the most videos.
Before getting into the use case, let's get a brief understanding of Spark with our Beginner's Guide.
This post will help you understand how to handle data sets that do not have a proper structure and how to sort the output of a reducer.
Data Set Description
Column 1: Video id of 11 characters.
Column 2: Uploader of the video.
Column 3: Interval between the day YouTube was established and the date the video was uploaded.
Column 4: Category of the video.
Column 5: Length of the video.
Column 6: Number of views for the video.
Column 7: Rating on the video.
Column 8: Number of ratings given for the video.
Column 9: Number of comments on the video.
Column 10: Ids of videos related to the uploaded video.
You can download the data set from here.
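Before diving into the queries, it can help to see this schema as code. The sketch below is only our own illustration: the case class and field names, the helper name parseRecord, the choice of Int/Long/Double for the numeric columns, and the treatment of columns 10 onwards as related video ids are assumptions, not part of the original post or dataset documentation.
// A hypothetical typed view of one tab-separated record (the raw file has no header row).
case class YoutubeVideo(
  videoId: String,          // column 1
  uploader: String,         // column 2
  age: Int,                 // column 3: interval between YouTube's establishment and the upload date
  category: String,         // column 4
  videoLength: Int,         // column 5
  views: Long,              // column 6
  rating: Double,           // column 7
  ratingCount: Long,        // column 8
  commentCount: Long,       // column 9
  relatedIds: Seq[String]   // columns 10 onwards
)

def parseRecord(line: String): Option[YoutubeVideo] = {
  val c = line.split("\t")
  if (c.length < 9) None    // skip malformed or truncated rows instead of throwing
  else scala.util.Try(
    YoutubeVideo(c(0), c(1), c(2).toInt, c(3), c(4).toInt, c(5).toLong,
      c(6).toDouble, c(7).toLong, c(8).toLong, c.drop(9).toSeq)
  ).toOption
}
This is only for orientation; the queries below work directly on the split arrays instead of a typed record.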
Problem Statement 1:
Here, we will find the top five categories with the maximum number of videos uploaded.
Source Code:
val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.map(line => { var YoutubeRecord = ""; val temp = line.split("\t"); if (temp.length > 3) { YoutubeRecord = temp(3) }; YoutubeRecord })
val test = counts.map(x => (x, 1))
val res = test.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(5)
Walk Through of the Above Program:
- In line 1, we are creating an RDD from the existing dataset, which is stored in HDFS.
- In line 2, we are taking each record as input using the map method and extracting the 4th column, which is the category of the video.
- In line 3, we are creating a pair of (category_name, 1), which is used to count how many times each category appears.
- In line 4, we are using the reduceByKey method to aggregate all the counts for each key. We then swap category_name and its count and sort by key, which gives us the records of category_name and count in descending order of count. Finally, we take the top five from the list. A small toy example of this reduceByKey, swap, and sortByKey pattern is given after the output below.
Output:
(908,Entertainment), (862,Music), (414,Comedy), (398,People & Blogs), (333,News & Politics)
You can see this result in the screenshot below.
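To make the reduceByKey, swap, and sortByKey pattern from line 4 more concrete, here is a small self-contained sketch with made-up category names (illustrative values only, not taken from the dataset). It assumes a spark-shell session where sc is the usual SparkContext, as in the code above.
// Toy data: one category name per element.
val cats = sc.parallelize(Seq("Music", "Comedy", "Music", "Entertainment", "Music", "Comedy"))
// Pair each category with 1 and sum the counts per category.
val counted = cats.map(c => (c, 1)).reduceByKey(_ + _)
// Swap so the count becomes the key, then sort the keys in descending order.
val top2 = counted.map(_.swap).sortByKey(false).take(2)
// top2 is Array((3,Music), (2,Comedy))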
Problem Statement 2:
In this problem statement, we will find the top 10 rated videos on YouTube.
val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.filter(x => x.split("\t").length > 6).map(line => line.split("\t"))
val pairs = counts.map(x => (x(0), x(6).toDouble))
val res = pairs.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(10)
Walk Through of the Above Program:
- In line 1, we are creating an RDD from the existing dataset, which is stored in HDFS.
- In line 2, we are first filtering the lines, keeping only those with more than six elements, to avoid an ArrayIndexOutOfBoundsException, and then using the map method to pass the split line as output to the next RDD.
- In line 3, we are creating a key-value pair, using the video id (the 1st column) as the key and the rating (the 7th column) as the value.
- In line 4, we are using the reduceByKey method to aggregate the ratings of each video. To sort by value, we use the map method to swap the key and value, so the rating becomes the key; we then apply the sortByKey method to sort the videos by their rating in descending order and take the top 10 videos. An alternative that avoids the swap is sketched after the output below.
Output:
(5.0,ZzuGxkWLops), (5.0,O4GzZxcKmFU), (5.0,smGcj6vohLs), (5.0,_KVr7VOTwTQ), (5.0,6yuy9DEK114), (5.0,xd1kn2bFpSM), (5.0,wEQ54SUxtiI), (5.0,lbVnhaqP8F4), (5.0,3V0SjoaPx9A), (5.0,265li8v9m1k)
This output can be seen in the screenshot below.
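As a side note, the swap followed by sortByKey in line 4 can also be written with the RDD sortBy method, which sorts by a derived key directly. This is just an alternative sketch under the same assumptions as the code above (pairs is the RDD built in line 3), not the approach used in the post.
// Alternative sketch: sort the (videoId, rating) pairs by rating, descending, without swapping.
val resAlt = pairs.reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10)
The result then contains (video id, rating) tuples instead of the swapped (rating, video id) tuples shown above.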
We hope this post has been helpful in understanding how to perform simple data analysis using Spark and Scala.
Keep visiting our website www.acadgild.com for more updates on Big Data and other technologies.
Hi,
In problem statement 1, line #3, you have used 'var' to check the column count and extract column #3. Instead of using 'var', you can write it like this:
val counts = textFile.map(line => line.split("\t")).filter(columns => columns.length > 3).map(columns => columns(3))
or
val counts = textFile.map(_.split("\t")).filter(_.length > 3).map(_(3))
Hi Raj,
Your approach is also correct; we can use filter as well, but here we have followed a different approach using conditional statements. In our next use cases on the Titanic and Olympic data sets, we have extracted the results using the filter approach. You can have a look at https://acadgild.com/blog/spark-use-case-titanic-data-analysis/
Please let us know your feedback on the same.