A join brings together two data sets. Pig supports several kinds of join: inner joins and left, right, and full outer joins. These are the regular joins; Pig also supports specialized joins, which are:
- Replicated
- Merge
- Skewed
In this post we will discuss the replicated join in Pig.
Suitable Scenario for a Replicated Join
Suppose there is a big data file containing the landline numbers of people across all cities in India, and a smaller file containing the STD code (a short numeric prefix) for each city in India. If the STD code of the respective city has to be prefixed to each number in the bigger file, a replicated join is best suited.
This is because a regular join has to shuffle and sort the big file so that matching records meet in the reduce phase. With a replicated join, the smaller file of STD codes is copied to each machine and held in memory, so the STD code can be prefixed to each landline number in the map phase itself, without a reduce phase.
To demonstrate replicated joins in Pig we will use the apache_nobots_tsv.txt and nobots_ip_country_tsv.txt datasets.
The datasets can be downloaded from the links below:
Link to download apache_nobots_tsv.txt
Link to download nobots_ip_country_tsv.txt
In the demonstration below, the bigger file is apache_nobots_tsv.txt and the smaller file is nobots_ip_country_tsv.txt.
The apache_nobots_tsv.txt dataset contains around 515 records. Its columns are described below:
1st Column: IP address
2nd Column: Timestamp
3rd Column: Page name
4th Column: HTTP status
5th Column: Payload
6th Column: User agent
Step 1: Loading the large dataset into a Pig relation.
In this step we load apache_nobots_tsv.txt into the relation weblogs_nobots.
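A minimal sketch of what the load statement for this step could look like in the Grunt shell. The file path, the field names, and the field types are assumptions derived from the column description above, not taken from the original post; adjust them to match your copy of the data.

```pig
-- Load the large, tab-separated web log file into weblogs_nobots.
-- Path, field names, and types are illustrative assumptions.
weblogs_nobots = LOAD 'apache_nobots_tsv.txt' USING PigStorage('\t')
    AS (ip:chararray, timestamp:chararray, page:chararray,
        http_status:int, payload:int, useragent:chararray);
```

Running DESCRIBE weblogs_nobots; afterwards is a quick way to confirm that the schema was picked up as intended.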
Step 2: Loading the smaller dataset into a Pig relation.
In this step we load nobots_ip_country_tsv.txt into the relation ip_address_country.
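A possible load statement for the smaller file is sketched below. The post does not describe this file's columns, so the two-field layout (an IP address followed by a country), the field names, and the path are assumptions inferred from the file name.

```pig
-- Load the small lookup file into ip_address_country.
-- The (ip, country) layout is an assumption based on the file name.
ip_address_country = LOAD 'nobots_ip_country_tsv.txt' USING PigStorage('\t')
    AS (ip:chararray, country:chararray);
```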
Step 3: Joining the two relations.
In this step we perform a replicated join on the two relations.
Pig loads the right-most relation, ip_address_country, into memory and joins it with the weblogs_nobots relation. Every relation listed after the first must be small enough to fit into a mapper’s memory.
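As a sketch, the join statement could be written as follows. It assumes both relations were loaded in steps 1 and 2 with a chararray field named ip (a hypothetical field name), and the result name weblog_country_jnd is likewise only illustrative.

```pig
-- The large relation is listed first; the relation listed last is the one
-- Pig replicates into memory for the map-side join.
weblog_country_jnd = JOIN weblogs_nobots BY ip,
                          ip_address_country BY ip USING 'replicated';
```

The order of the relations matters here: swapping them would make Pig try to hold the large web log relation in memory instead of the small lookup relation.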
Step 4: Dumping the final results.
In this step we display the final results of the join operation, limiting the output to the first 5 records.
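One way to do this is sketched below, assuming the joined relation from step 3 is named weblog_country_jnd (a hypothetical name); first_five is also just an illustrative alias.

```pig
-- Keep only five records of the joined relation and print them.
first_five = LIMIT weblog_country_jnd 5;
DUMP first_five;
```

Without a preceding ORDER BY, LIMIT does not guarantee which five records are returned, so the output may differ between runs.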
We hope this blog helped you understand replicated joins in Pig. In our next blog we will discuss merge joins in Pig. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.