Skewed Join in Pig

11 May 2016

In our previous blogs, we discussed Replicated Join and Merge Join in Pig.

In this post, we continue that discussion by implementing a skewed join.

A skewed join is useful when the underlying data is sufficiently skewed and the user needs control over how reducers are allocated to counteract the skew.

Meaning of skewed data:

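Data skew is a situation in a distributed processing environment where the data is not evenly divided among the keys emitted from the map phase, so a few keys account for most of the tuples.

This can lead to inconsistent processing times, because the reducers that receive the heavy keys take much longer than the rest.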
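One quick way to see whether a file is skewed is to count how many rows share each key. The sketch below assumes a tab-separated file whose first column is the join key; the file sample_weblog.tsv and its rows are made up for illustration and are not the data set used in this blog.

```shell
#!/bin/sh
# Build a tiny, made-up tab-separated sample (hypothetical data).
printf '10.0.0.1\t/index.html\n'   >  sample_weblog.tsv
printf '10.0.0.2\t/about.html\n'   >> sample_weblog.tsv
printf '10.0.0.1\t/contact.html\n' >> sample_weblog.tsv
printf '10.0.0.1\t/index.html\n'   >> sample_weblog.tsv

# Count the rows per key (column 1) and list the most frequent keys first.
cut -f1 sample_weblog.tsv | sort | uniq -c | sort -rn > key_counts.txt
cat key_counts.txt
```

Here the key 10.0.0.1 shows up with a count of 3 while 10.0.0.2 has a count of 1. On a real data set, a key whose count dwarfs all the others is the kind of skew that a skewed join is meant to handle.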

In this blog, we skew the apache_nobots_tsv.txt file by using a shell script that appends the same row a few thousand times, and we save the result as a new file named skewed_apache_nobots_tsv.txt.

We will use skewed_apache_nobots_tsv.txt to implement the skewed join.

Create the shell script in the vi editor on Linux to generate the skewed data set.
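The original script is not reproduced here, so the following is only a minimal sketch of one way to do it. It assumes the row to repeat is the first row of the input file and takes 5000 as the value for "a few thousand"; the function name make_skewed is hypothetical.

```shell
#!/bin/sh
# make_skewed INPUT OUTPUT COPIES
# Copies INPUT to OUTPUT, then appends the first row of INPUT COPIES times.
# (Hypothetical helper; choosing the first row is an assumption.)
make_skewed() {
    input=$1
    output=$2
    copies=$3

    cp "$input" "$output"
    row=$(head -n 1 "$input")

    i=0
    while [ "$i" -lt "$copies" ]; do
        printf '%s\n' "$row"
        i=$((i + 1))
    done >> "$output"
}

# Only run against the real file when it is present in the current directory.
if [ -f apache_nobots_tsv.txt ]; then
    make_skewed apache_nobots_tsv.txt skewed_apache_nobots_tsv.txt 5000
fi
```

Writing to a copy keeps the original apache_nobots_tsv.txt untouched, so the skewing step can be repeated with a different row or a different count.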

To execute the script file, run it from the Linux terminal.

If you get a permission denied error, change the permissions on the script, or on the folder that contains it, so that you are allowed to execute it.

After the permissions are changed, the script runs and produces the skewed file.
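The permission fix can be sketched as follows. The file name skew_data.sh and its one-line body are stand-ins so that the example runs on its own; substitute the name of your own script.

```shell
#!/bin/sh
# Stand-in for the data-skewing script (hypothetical name and body).
cat > skew_data.sh <<'EOF'
#!/bin/sh
echo "skew script ran"
EOF

# A file created this way (or saved from vi) normally has no execute bit,
# so running it directly is refused with a permission denied error.
./skew_data.sh || echo "first attempt failed: no execute permission yet"

# Grant execute permission and run it again.
chmod +x skew_data.sh
./skew_data.sh
```

Alternatively, passing the file to the interpreter with sh skew_data.sh runs it without needing the execute bit at all.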

First, we load the skewed dataset into the Pig relation skewed_nobots_weblogs.

Next, we load the smaller dataset into the Pig relation ip_country_tbl.

Then we perform the skewed join on the two relations.

To display the first 10 records, we apply the LIMIT operator and then DUMP the resulting relation, filtered_weblog, to print the joined records.
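The four steps above can be sketched as a single Pig script, written to a file from the shell. The relation names skewed_nobots_weblogs, ip_country_tbl and filtered_weblog come from this post; the column schemas, the join key ip, the file name ip_country_tsv.txt, the relation name weblog_country_jnd and the script name skewed_join.pig are assumptions made for illustration.

```shell
#!/bin/sh
# Write the Pig Latin statements to a script file (file name is hypothetical).
cat > skewed_join.pig <<'EOF'
-- Load the skewed web log (schema is an assumption; adjust it to your file).
skewed_nobots_weblogs = LOAD 'skewed_apache_nobots_tsv.txt'
    AS (ip:chararray, timestamp:long, page:chararray,
        http_status:int, payload_size:int, useragent:chararray);

-- Load the smaller lookup table (file name and schema are hypothetical).
ip_country_tbl = LOAD 'ip_country_tsv.txt'
    AS (ip:chararray, country:chararray);

-- Skewed join on the shared ip field.
weblog_country_jnd = JOIN skewed_nobots_weblogs BY ip,
                          ip_country_tbl BY ip USING 'skewed';

-- Keep the first 10 joined records and print them.
filtered_weblog = LIMIT weblog_country_jnd 10;
DUMP filtered_weblog;
EOF

# Run it on a machine where Pig is installed, for example in local mode:
#   pig -x local skewed_join.pig
```

The only change from an ordinary join is the USING 'skewed' clause. If you want to set the number of reducers yourself, Pig's PARALLEL clause can be appended to the JOIN statement.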

We hope this blog helped you understand the concept of the skewed join.

Keep visiting our website www.acadgild.com/blog for more blogs and e-books on Big Data and other technologies.

Satyam

Satyam Kumar is a Big Data professional at AcadGild with rich experience in Big Data technologies such as Hadoop, Spark, and NoSQL. He codes in programming languages such as Java and Python and has been responsible for the development of various projects and blogs related to the Hadoop ecosystem and Spark.


© Copyright 2016. ACADGILD.