This post is about analyzing YouTube data, and the entire analysis is performed using Apache Spark. The data is publicly available, and the data set is described below under the heading Data Set Description.
Using that data set, we will perform some analysis and draw out insights, such as the top 10 rated videos on YouTube and which uploader has uploaded the most videos.
Before getting into the use case, let's get a brief understanding of Spark with our Beginner's Guide.
This post will help you understand how to handle data sets that do not have a proper structure and how to sort the output of a reducer.
Data Set Description
Column 1: Video id of 11 characters.
Column 2: Uploader of the video.
Column 3: Interval between the day YouTube was established and the date the video was uploaded.
Column 4: Category of the video.
Column 5: Length of the video.
Column 6: Number of views for the video.
Column 7: Rating on the video.
Column 8: Number of ratings given for the video.
Column 9: Number of comments on the video.
Column 10: Ids of videos related to the uploaded video.
You can download the data set from here.
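Before diving into the queries, it can help to see this schema as code. The sketch below is only our own illustration: the case class and field names, the helper name parseRecord, the choice of Int/Long/Double for the numeric columns, and the treatment of columns 10 onwards as related video ids are assumptions, not part of the original post or dataset documentation.
// A hypothetical typed view of one tab-separated record (the raw file has no header row).
case class YoutubeVideo(
  videoId: String,          // column 1
  uploader: String,         // column 2
  age: Int,                 // column 3: interval between YouTube's establishment and the upload date
  category: String,         // column 4
  videoLength: Int,         // column 5
  views: Long,              // column 6
  rating: Double,           // column 7
  ratingCount: Long,        // column 8
  commentCount: Long,       // column 9
  relatedIds: Seq[String]   // columns 10 onwards
)

def parseRecord(line: String): Option[YoutubeVideo] = {
  val c = line.split("\t")
  if (c.length < 9) None    // skip malformed or truncated rows instead of throwing
  else scala.util.Try(
    YoutubeVideo(c(0), c(1), c(2).toInt, c(3), c(4).toInt, c(5).toLong,
      c(6).toDouble, c(7).toLong, c(8).toLong, c.drop(9).toSeq)
  ).toOption
}
This is only for orientation; the queries below work directly on the split arrays instead of a typed record.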
Problem Statement 1:
Here, we will find the top five categories with the maximum number of videos uploaded.
Source Code:
val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.map(line => { var YoutubeRecord = ""; val temp = line.split("\t"); if (temp.length > 3) { YoutubeRecord = temp(3) }; YoutubeRecord })
val test = counts.map(x => (x, 1))
val res = test.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(5)
Walk Through of the Above Program:
- In line 1, we are creating an RDD from the existing dataset, which is stored in HDFS.
- In line 2, we are taking each record as input using the map method and extracting the 4th column, which is the category of the video.
- In line 3, we are creating a pair of (category_name, 1), which is used to count how many times each category appears.
- In line 4, we are using the reduceByKey method to aggregate all the counts for each key. We then swap category_name and its count and sort by key, which gives us the records of category_name and count in descending order of count. Finally, we take the top five from the list. A small toy example of this reduceByKey, swap, and sortByKey pattern is given after the output below.
Output:
(908,Entertainment), (862,Music), (414,Comedy), (398,People & Blogs), (333,News & Politics)
You can see this result in the screenshot below.
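To make the reduceByKey, swap, and sortByKey pattern from line 4 more concrete, here is a small self-contained sketch with made-up category names (illustrative values only, not taken from the dataset). It assumes a spark-shell session where sc is the usual SparkContext, as in the code above.
// Toy data: one category name per element.
val cats = sc.parallelize(Seq("Music", "Comedy", "Music", "Entertainment", "Music", "Comedy"))
// Pair each category with 1 and sum the counts per category.
val counted = cats.map(c => (c, 1)).reduceByKey(_ + _)
// Swap so the count becomes the key, then sort the keys in descending order.
val top2 = counted.map(_.swap).sortByKey(false).take(2)
// top2 is Array((3,Music), (2,Comedy))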
Problem Statement 2:
In this problem statement, we will find the top 10 rated videos on YouTube.
val textFile = sc.textFile("hdfs://localhost:9000/youtubedata.txt")
val counts = textFile.filter(x => x.split("\t").length > 6).map(line => line.split("\t"))
val pairs = counts.map(x => (x(0), x(6).toDouble))
val res = pairs.reduceByKey(_ + _).map(item => item.swap).sortByKey(false).take(10)
Walk Through of the Above Program:
- In line 1, we are creating an RDD from the existing dataset, which is stored in HDFS.
- In line 2, we are first filtering the lines, keeping only those with more than six elements, to avoid an ArrayIndexOutOfBoundsException, and then using the map method to pass the split line as output to the next RDD.
- In line 3, we are creating a key-value pair, using the video id (the 1st column) as the key and the rating (the 7th column) as the value.
- In line 4, we are using the reduceByKey method to aggregate the ratings of each video. To sort by value, we use the map method to swap the key and value, so the rating becomes the key; we then apply the sortByKey method to sort the videos by their rating in descending order and take the top 10 videos. An alternative that avoids the swap is sketched after the output below.
Output:
(5.0,ZzuGxkWLops), (5.0,O4GzZxcKmFU), (5.0,smGcj6vohLs), (5.0,_KVr7VOTwTQ), (5.0,6yuy9DEK114), (5.0,xd1kn2bFpSM), (5.0,wEQ54SUxtiI), (5.0,lbVnhaqP8F4), (5.0,3V0SjoaPx9A), (5.0,265li8v9m1k)
This output can be seen in the screenshot below.
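As a side note, the swap followed by sortByKey in line 4 can also be written with the RDD sortBy method, which sorts by a derived key directly. This is just an alternative sketch under the same assumptions as the code above (pairs is the RDD built in line 3), not the approach used in the post.
// Alternative sketch: sort the (videoId, rating) pairs by rating, descending, without swapping.
val resAlt = pairs.reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10)
The result then contains (video id, rating) tuples instead of the swapped (rating, video id) tuples shown above.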
We hope this post has been helpful in understanding how to perform simple data analysis using Spark and Scala.
Keep visiting our website www.acadgild.com for more updates on Big Data and other technologies.
Hi,
In problem statement 1, line #3, you have used 'var' to check the column count and extract column #3. Instead of using 'var', you can write it like this:
val counts = textFile.map(line => line.split("\t")).filter(columns => columns.length > 3).map(columns => columns(3))
or
val counts = textFile.map(_.split("\t")).filter(_.length > 3).map(_(3))
Hi Raj,
Your approach is also correct; we can use filter as well, but here we have followed a different approach using conditional statements. In our next use cases on the Titanic and Olympic data sets, we have extracted the results using the filter approach. You can have a look at https://acadgild.com/blog/spark-use-case-titanic-data-analysis/
Please let us know your feedback on the same.