In this blog, we will discuss a use case involving the MovieLens dataset and analyze how the movies fare on a rating scale of 1 to 5.
We will start our discussion with the data definition by considering a sample of four records.
196  242  3  881250949
186  302  3  891717742
22   377  1  878887116
244  51   2  880606923
Data Definition
Column 1: User ID
Column 2: Movie ID
Column 3: Rating
Column 4: Timestamp
You can download the input file (u.data, part of the MovieLens 100K dataset) from the GroupLens website.
Below is the code used to count the number of ratings given at each value on the scale of 1 to 5.
from pyspark import SparkConf, SparkContext
import collections

# Create the SparkContext (in the interactive pyspark shell, sc already exists)
conf = SparkConf().setMaster("local").setAppName("MovieRatingsHistogram")
sc = SparkContext(conf=conf)

# Load the data file; each line of text becomes one value in the RDD
my_lines = sc.textFile('/home/acadgild/ml-100k/u.data')

# Extract the third column (the rating) from every line
ratings = my_lines.map(lambda x: x.split()[2])

# Count the occurrences of each rating
res = ratings.countByValue()

# Sort the results by rating and print them
my_sortedres = collections.OrderedDict(sorted(res.items()))
for key, value in my_sortedres.items():
    print("%s %i" % (key, value))
The first two lines of the code import SparkConf and SparkContext from the pyspark library that ships with Spark.
SparkContext is the fundamental starting point in Spark that enables us to create RDDs.
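In a standalone script the context has to be created explicitly (in the interactive pyspark shell, sc is already provided); the master URL and application name below are placeholder choices:

conf = SparkConf().setMaster("local").setAppName("MovieRatingsHistogram")
sc = SparkContext(conf=conf)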
The collections module is imported so that the final results can be sorted before they are printed.
The statement below loads the data file by creating an RDD through the sc.textFile method. The data file is loaded into the RDD my_lines, and textFile turns every line of the text file into a separate value in the RDD.
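my_lines = sc.textFile('/home/acadgild/ml-100k/u.data')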
In the statement below, each line x is passed to the lambda expression inside map, which splits the line on whitespace. The third column (index 2, i.e. the rating) is extracted, and a new RDD named ratings is created with the results.
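ratings = my_lines.map(lambda x: x.split()[2])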
In the statement below, countByValue() is executed on the ratings RDD to count the occurrences of each rating from 1 to 5. Since countByValue is an action rather than a transformation, the results are returned to the driver and stored in a plain Python object, res.
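res = ratings.countByValue()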
Below is a sample of the output of the above action applied on the ratings RDD.
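Running against the full 100k-record dataset, the counts should come out as follows (countByValue returns an unordered dictionary, so the key order may differ on your machine):

defaultdict(<class 'int'>, {'3': 27145, '1': 6110, '2': 11370, '4': 34174, '5': 21201})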
The code below creates an OrderedDict to sort the results based on the key, which is the rating.
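my_sortedres = collections.OrderedDict(sorted(res.items()))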
The Python code below iterates through the key/value pairs and prints the total number of ratings that fall under each particular rating category.
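for key, value in my_sortedres.items():
    print("%s %i" % (key, value))

For the full 100k dataset, the final sorted output should look like this:

1 6110
2 11370
3 27145
4 34174
5 21201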
We hope this blog was helpful in understanding the use of Spark with Python. Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.