In this blog, we will discuss implementing your first Spark application by writing a word-count program, and then plot a histogram of the word frequencies using the Matplotlib package in Python.
We recommend that readers refer to our previous blogs on Spark installation and RDD operations.
Link 1: https://acadgild.com/blog/beginners-guide-for-spark/
Link 2: https://acadgild.com/blog/introduction-spark-rdd-basic-operations-rdd/
We have used the file first_app, shown below, as the input for building our first application.
Creating RDD from the input file, first_app
We take the file first_app and create an RDD from it using SparkContext's textFile method.
Counting the number of lines from RDD
In this step, we display the number of lines in the RDD created in the previous step.
Applying a map transformation and reducing to get the total character count
In this step, we count the characters in each line of the myfile RDD with a map transformation, then sum them with the reduce action, storing the result in a Python object named num_char.
Displaying the number of characters in num_char
In this step, we display the total number of characters. Note that num_char is not an RDD: reduce is an action, so num_char is a plain Python integer.
Splitting the words from myfile RDD
In this step, we extract all the words from the myfile RDD using a regular expression.
The script below displays the words. Refer to the screenshot below, where all the words are split.
Filtering words with length greater than 3
We keep only the words whose length is greater than 3, filtering out the shorter ones, and store the result in the RDD filtered_word.
Setting split words with count 1
We apply the map transformation to pair every word with the count 1, creating a new RDD, filtered_word1, of (word, 1) pairs.
Refer to the script below: a list of key/value pairs, with each word set to 1, is displayed.
Adding the number of occurrences for every key
In this step, reduceByKey creates a new RDD holding the total count for every key in the previous RDD, filtered_word1.
The screenshot below shows the Python script, written in the Spark shell, that displays the histogram representing the frequency of each word.
The histogram representing the frequency of each word is shown below.
We hope this blog helped you in getting started with Spark development using Python.
Keep visiting our website www.acadgild.com for more blogs on Big Data and other technologies.