27 April 2016

Spark RDD Operations in Scala Part – 2

In our previous post, we discussed the basic RDD operations in Scala. Now, let’s look at some of the advanced RDD operations in Scala.

Here we have taken two datasets, dept and emp, to work through these operations. The datasets look like this:

[DeptNo DeptName] [Emp_no DOB FName Lname gender HireDate DeptNo]

Both datasets are tab-delimited.

Union:

The Union operation results in an RDD that contains the elements of both RDDs; duplicates are not removed. Refer to the screenshot below to see how the Union operation works.

Here, we have created two RDDs and loaded the two datasets into them. We have performed the Union operation on them and printed the first 10 records of the newly obtained RDD. From the result you can see that both datasets are combined: the 10th record is the first record of the second dataset.
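The same steps can be sketched in code roughly as follows. This is a minimal sketch, not the exact session from the post: it assumes a local SparkContext (in spark-shell, sc already exists) and uses a few invented tab-delimited records in place of the dept and emp files, whose paths are not given here. With real files you would load them with sc.textFile instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("union-sketch"))

// Hypothetical sample records standing in for the dept and emp files.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

// union keeps every element of both RDDs, duplicates included.
val combined = dept.union(emp)

// Print the first 10 records (fewer here, because the sample is small).
combined.take(10).foreach(println)
```

The records of the first RDD come first in the result, followed by the records of the second.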

Intersection:

Intersection returns only the elements that are present in both RDDs. Refer to the screenshot below to see how to perform an intersection.

Here we have split the datasets by using the tab delimiter and have extracted the 1st column from the first dataset and the 7th column from the second dataset. We have then performed an intersection on the two resulting RDDs, and the result is as displayed.
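A minimal sketch of this step is given below, assuming a local SparkContext and invented sample records (the department numbers are made up for illustration). The 1st column of dept and the 7th column of emp are both DeptNo, so the intersection yields the department numbers that occur in both datasets.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("intersection-sketch"))

// Hypothetical sample records standing in for the dept and emp files.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

// Column indexes are zero-based: index 0 is the 1st column, index 6 the 7th.
val deptNos = dept.map(_.split("\t")(0))
val empDeptNos = emp.map(_.split("\t")(6))

// Only values present in both RDDs survive; the result contains no duplicates.
val common = deptNos.intersection(empDeptNos)
common.collect().foreach(println)
```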

Cartesian:

The Cartesian operation returns an RDD containing the Cartesian product of the elements of both RDDs, that is, every possible pair made of one element from the first RDD and one element from the second. Refer to the screenshot below.

Here we have split the datasets by using the tab delimiter and have extracted the 1st column from the first dataset and the 7th column from the second dataset. Then, we have performed the Cartesian operation on the RDDs and the results are displayed.
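As a rough sketch under the same assumptions as before (a local SparkContext and invented sample records), the Cartesian product of two RDDs with 2 elements each contains 2 x 2 = 4 pairs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("cartesian-sketch"))

// Hypothetical sample records standing in for the dept and emp files.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

val deptNos = dept.map(_.split("\t")(0))
val empDeptNos = emp.map(_.split("\t")(6))

// Every element of deptNos is paired with every element of empDeptNos.
val pairs = deptNos.cartesian(empDeptNos)
pairs.collect().foreach(println)
```

Because the result size is the product of the two input sizes, this operation becomes expensive quickly on large datasets.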

Subtract:

The Subtract operation returns the elements of the first RDD that are not present in the second RDD; in other words, it removes the common elements from the first RDD. Refer to the screenshot below.

Here, we have split the datasets by using the tab delimiter and have extracted the 1st column from the first dataset and the 7th column from the second dataset. Then we have performed the Subtract operation on the RDDs and the results are displayed.
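A minimal sketch of this step, again assuming a local SparkContext and invented sample records. Here the result is the department numbers that appear in dept but not in emp:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("subtract-sketch"))

// Hypothetical sample records standing in for the dept and emp files.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

val deptNos = dept.map(_.split("\t")(0))
val empDeptNos = emp.map(_.split("\t")(6))

// Keep the elements of deptNos that do not occur in empDeptNos.
val onlyInDept = deptNos.subtract(empDeptNos)
onlyInDept.collect().foreach(println)
```

Note that the operation is not symmetric: empDeptNos.subtract(deptNos) would instead return the values found only in emp.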

Foreach:

The foreach operation is used to apply a function to every element in the RDD. Refer to the screenshot below.

In the above screenshot, you can see that every element in the RDD emp is printed on a separate line.
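A minimal sketch, assuming a local SparkContext and invented sample records. One caveat worth knowing: foreach runs on the executors, so on a cluster the println output appears in the executor logs rather than in your console. Collecting the data to the driver first is a common way to print a small RDD locally.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("foreach-sketch"))

// Hypothetical sample records standing in for the emp file.
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

// Runs println once per element, on whichever executor holds the element.
emp.foreach(println)

// For small RDDs only: bring the data to the driver, then print it there.
emp.collect().foreach(println)
```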

Operations on Pair RDDs:

Creating Pair RDD:

Here, we will create a pair RDD, which consists of key-value pairs. The pair operations are made available through implicit conversions (built in from Spark 1.3 onwards); to refer to the RDD type by name, import it by using the statement below:

import org.apache.spark.rdd.RDD

Refer to the screenshot below.

Here, we have split the dataset by using tab as the delimiter and built the key-value pairs from the resulting fields, as shown in the above screenshot.
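A minimal sketch of building a pair RDD, assuming a local SparkContext and invented sample records. A pair RDD is simply an RDD whose elements are two-element tuples; here DeptNo is taken as the key and DeptName as the value.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pair-rdd-sketch"))

// Hypothetical sample records standing in for the dept file.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))

// Split each line on tab and emit a (key, value) tuple.
val deptPairs: RDD[(String, String)] = dept.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1)) // (DeptNo, DeptName)
}

deptPairs.collect().foreach(println)
```

The explicit type annotation is optional; it is the only reason the RDD import is needed in this sketch.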

Keys:

The Keys operation returns an RDD containing only the keys of the pair RDD, which can then be printed. Refer to the screenshot below.

Values:

The Values operation returns an RDD containing only the values of the pair RDD, which can then be printed. Refer to the screenshot below.
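The Keys and Values operations can be sketched together as follows, assuming a local SparkContext and invented sample records keyed by DeptNo:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("keys-values-sketch"))

// Hypothetical (DeptNo, DeptName) pairs.
val deptPairs = sc.parallelize(Seq(("d001", "Sales"), ("d002", "Finance")))

// keys and values each return a new RDD; nothing is printed until we ask.
val deptNos = deptPairs.keys
val deptNames = deptPairs.values

deptNos.collect().foreach(println)
deptNames.collect().foreach(println)
```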

SortByKey:

The SortByKey operation returns an RDD that contains the key-value pairs sorted by key. SortByKey accepts a Boolean argument: true (the default) sorts the keys in ascending order and false sorts them in descending order. Refer to the screenshot below.
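A minimal sketch of both sort orders, assuming a local SparkContext and invented, deliberately unsorted sample pairs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("sortbykey-sketch"))

// Hypothetical (DeptNo, DeptName) pairs in no particular order.
val deptPairs = sc.parallelize(Seq(
  ("d002", "Finance"), ("d003", "Research"), ("d001", "Sales")))

// Ascending is the default, so sortByKey() and sortByKey(true) are equivalent.
val ascending = deptPairs.sortByKey()
val descending = deptPairs.sortByKey(false)

ascending.collect().foreach(println)
descending.collect().foreach(println)
```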

RDDs holding Objects:

Here, we will define a case class that describes one record and map every line of the dataset to an instance of it, so that the RDD holds objects of that case class instead of raw strings. Refer to the screenshot below.
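A minimal sketch of this idea is given below. The case class name Dept and its field names are assumptions chosen for illustration; the post does not show the original definition. It also assumes a local SparkContext and invented sample records.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class describing one record of the dept dataset.
case class Dept(deptNo: String, deptName: String)

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("case-class-sketch"))

// Hypothetical sample records standing in for the dept file.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))

// Split each line and build a Dept object from the fields.
val depts = dept.map(_.split("\t")).map(f => Dept(f(0), f(1)))

// Fields can now be accessed by name instead of by position.
depts.map(_.deptName).collect().foreach(println)
```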

Join:

The Join operation is used to join two pair RDDs on their keys. The default Join is an inner join, so only keys present in both RDDs appear in the result. Refer to the screenshot below.

Here, we have taken two case classes for the two datasets and have created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We have then performed the Join operation on the RDDs, and the result is as displayed on the screen.
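The join can be sketched roughly as follows. The case class names and fields are assumptions made for illustration, as are the sample records, and for brevity the value kept for each record is a single field rather than the whole remainder of the record.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case classes for the two datasets.
case class Dept(deptNo: String, deptName: String)
case class Emp(empNo: String, dob: String, fName: String, lName: String,
               gender: String, hireDate: String, deptNo: String)

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("join-sketch"))

// Hypothetical sample records standing in for the dept and emp files.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009"))

val depts = dept.map(_.split("\t")).map(f => Dept(f(0), f(1)))
val emps = emp.map(_.split("\t")).map(f => Emp(f(0), f(1), f(2), f(3), f(4), f(5), f(6)))

// Key both RDDs by the common element, DeptNo.
val deptByNo = depts.map(d => (d.deptNo, d.deptName))
val empByDeptNo = emps.map(e => (e.deptNo, e.fName))

// Inner join: only d001 occurs on both sides, so only d001 is returned.
val joined = deptByNo.join(empByDeptNo)
joined.collect().foreach(println)
```

Each result element has the shape (key, (valueFromFirstRDD, valueFromSecondRDD)).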

RightOuterJoin:

The RightOuterJoin operation returns the joined elements of both RDDs for every key that is present in the second RDD. Keys that have no match in the first RDD are still returned, with the first RDD's value missing. Refer to the screenshot below.

Here, we have taken two case classes for the two datasets and have created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We have then performed the RightOuterJoin operation on the RDDs, and the result is as displayed on the screen.
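A minimal sketch of a right outer join, assuming a local SparkContext. For brevity it uses invented pairs already keyed by DeptNo instead of the case classes described above. The value coming from the first RDD is wrapped in an Option, because it may be absent.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("right-outer-join-sketch"))

// Hypothetical pairs keyed by DeptNo: (DeptNo, DeptName) and (DeptNo, FName).
val deptByNo = sc.parallelize(Seq(("d001", "Sales"), ("d002", "Finance")))
val empByDeptNo = sc.parallelize(Seq(("d001", "Asha"), ("d009", "Ravi")))

// Every key of the second (right) RDD is kept.
// Result type: RDD[(String, (Option[String], String))]
val rightJoined = deptByNo.rightOuterJoin(empByDeptNo)
rightJoined.collect().foreach(println)
```

In this sample, d009 has no department record, so its left-hand value is None, while d002 is dropped because no employee refers to it.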

LeftOuterJoin:

The LeftOuterJoin operation returns the joined elements of both RDDs for every key that is present in the first RDD. Keys that have no match in the second RDD are still returned, with the second RDD's value missing. Refer to the screenshot below.

Here, we have taken two case classes for the two datasets and have created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We have then performed the LeftOuterJoin operation on the RDDs, and the result is as displayed on the screen.
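A minimal sketch of a left outer join, assuming a local SparkContext. For brevity it uses invented pairs already keyed by DeptNo instead of the case classes described above. This time it is the value coming from the second RDD that is wrapped in an Option.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("left-outer-join-sketch"))

// Hypothetical pairs keyed by DeptNo: (DeptNo, DeptName) and (DeptNo, FName).
val deptByNo = sc.parallelize(Seq(("d001", "Sales"), ("d002", "Finance")))
val empByDeptNo = sc.parallelize(Seq(("d001", "Asha"), ("d009", "Ravi")))

// Every key of the first (left) RDD is kept.
// Result type: RDD[(String, (String, Option[String]))]
val leftJoined = deptByNo.leftOuterJoin(empByDeptNo)
leftJoined.collect().foreach(println)
```

In this sample, d002 has no employees, so its right-hand value is None, while d009 is dropped because it has no department record.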

CountByKey:

The CountByKey operation returns the number of elements present for each key, as a map from key to count. Refer to the screenshot below.

Here, we have loaded the dataset, split the records by using tab as the delimiter and created pairs with DeptNo as the key and DeptName as the value. Then, we have performed the CountByKey operation and the result is as displayed.
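A minimal sketch of counting by key, assuming a local SparkContext. As a small variation on the description above, it keys invented emp records by DeptNo, so that one key occurs more than once and the counts are easier to tell apart.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("countbykey-sketch"))

// Hypothetical sample records standing in for the emp file.
val emp = sc.parallelize(Seq(
  "1001\t1980-01-15\tAsha\tRao\tF\t2005-03-01\td001",
  "1002\t1978-07-30\tRavi\tNair\tM\t2003-11-12\td009",
  "1003\t1985-02-20\tMeera\tIyer\tF\t2010-06-18\td001"))

// Build (DeptNo, FName) pairs; only the key matters for the count.
val empByDeptNo = emp.map(_.split("\t")).map(f => (f(6), f(2)))

// countByKey is an action: it returns a local Map on the driver, not an RDD.
val counts: scala.collection.Map[String, Long] = empByDeptNo.countByKey()
counts.foreach(println)
```

Because the whole map is brought to the driver, this is best suited to datasets with a modest number of distinct keys.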

SaveAsTextFile:

The SaveAsTextFile operation writes the elements of the RDD as text to the given output path. The path is created as a directory containing one part file per partition. Refer to the screenshot below.
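A minimal sketch of saving an RDD and reading it back, assuming a local SparkContext and invented sample records. The output location is a hypothetical temporary directory chosen for this sketch; in practice you would pass your own local or HDFS path.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext; in spark-shell, `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("save-sketch"))

// Hypothetical sample records standing in for the dept file.
val dept = sc.parallelize(Seq("d001\tSales", "d002\tFinance"))

// saveAsTextFile fails if the target already exists, so we point it at a
// not-yet-created sub-directory of a fresh temporary directory.
val outputPath = Files.createTempDirectory("rdd-sketch").resolve("dept-out").toString

// Writes part-* files (one per partition) under outputPath.
dept.saveAsTextFile(outputPath)

// Reading the directory back returns the saved lines.
sc.textFile(outputPath).collect().foreach(println)
```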

We hope this post has been helpful in understanding the advanced RDD operations in Scala. In case of any queries, feel free to drop us a comment below or email us at support@acadgild.com.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

AcadGild

Kiran Krishna
