In this blog, we will discuss implementing your first Spark application by writing a word-count program, and then plot a histogram of the word frequencies using the Matplotlib package in Python.
We recommend that readers refer to our previous blogs on Spark installation and RDD operations.
Link 1: https://acadgild.com/blog/beginners-guide-for-spark/
Link 2: https://acadgild.com/blog/introduction-spark-rdd-basic-operations-rdd/
We have used the file first_app, shown below, as the input for building our first application.
Creating RDD from the input file, first_app
We take the file first_app and create an RDD from it using SparkContext's textFile method.
Counting the number of lines from RDD
In this step, we display the number of lines in the RDD created in the previous step.
Applying a map transformation and reducing to get the total character count
In this step, we count the characters in each line of the myfile RDD with a map transformation, then sum them with the reduce action, storing the result in a Python object named num_char.
Displaying the number of characters in num_char
In this step, we display the total number of characters. Note that num_char is not an RDD: reduce is an action, so num_char is a plain Python integer.
Splitting the words from myfile RDD
In this step, we extract all the words from the myfile RDD using a regular expression.
The script below displays the words. Refer to the screenshot below, where all the words are split.
Filtering words with length greater than 3
We keep only the words whose length is greater than 3, filtering out the shorter ones, and store the result in the RDD filtered_word.
Setting split words with count 1
We apply the map transformation to pair every word with the count 1, creating a new RDD, filtered_word1, of (word, 1) pairs.
Refer to the script below: a list of key/value pairs, with each word set to 1, is displayed.
Adding the number of occurrences for every key
In this step, reduceByKey creates a new RDD holding the total count for every key in the previous RDD, filtered_word1.
The screenshot below shows the Python script, written in the Spark shell, that displays the histogram representing the frequency of each word.
The histogram representing the frequency of each word is shown below.
We hope this blog helped you in getting started with Spark development using Python.
Keep visiting our website www.acadgild.com for more blogs on Big Data and other technologies.