In this post, we explain what embedded Pig is, how to embed a Pig script in Java, and how to run the embedded Pig Java program from Eclipse.
Let’s take a brief look at Pig first.
What is Apache Pig?
Apache Pig is a high-level platform for creating MapReduce programs that run on Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a higher-level notation, much as SQL does for relational databases (RDBMSs).
What is Embedded Pig?
Embedded Pig lets you add control flow around Pig Latin statements by driving them from a host program. Pig can be embedded in Java, and in scripting languages such as Python and JavaScript, using a JDBC-like compile, bind, run model.
Now, let us write a Pig script for a word count program:
input1 = LOAD '/input' as (line:chararray);
words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;
word_groups = group words by word;
word_count = foreach word_groups generate group, COUNT(words);
ordered_word_count = order word_count by group desc;
store ordered_word_count into '/wct_output';
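To make the script's semantics concrete, here is a minimal plain-Java sketch of the same steps (tokenize, group, count, order descending) applied to in-memory lines. It is not Pig code: the class name WordCountSketch, the method wordCount, and the delimiter regex are assumptions made for illustration, and the regex only approximates the default delimiters of TOKENIZE.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of what the Pig word count script computes.
// This is not Pig code; it only mirrors the script's steps on in-memory lines.
class WordCountSketch {

    // Assumption: approximates TOKENIZE's default delimiters
    // (space, double quote, comma, parentheses, star).
    static final String DELIMITERS = "[ \",()*]+";

    // Returns "word<TAB>count" rows ordered by word, descending,
    // like `order word_count by group desc`.
    static List<String> wordCount(List<String> lines) {
        // A TreeMap with a reversed comparator keeps its keys in descending order.
        Map<String, Long> counts = new TreeMap<>(Comparator.<String>reverseOrder());
        for (String line : lines) {
            for (String word : line.split(DELIMITERS)) {
                if (!word.isEmpty()) {                  // skip the empty token before a leading delimiter
                    counts.merge(word, 1L, Long::sum);  // group by word and COUNT
                }
            }
        }
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            rows.add(entry.getKey() + "\t" + entry.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Hello all from acadgild!");
        lines.add("Hello from Pig");
        for (String row : wordCount(lines)) {
            System.out.println(row);
        }
    }
}
```

Because the ordering is descending, lowercase words such as from come before capitalized words such as Hello in the output of this sketch.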
Now, let’s embed this Pig script in Java. There are a few things to note first.
Pig provides a PigServer class, which we can use to embed Pig in Java. Create a PigServer object as follows:
PigServer pigServer = new PigServer(ExecType.MAPREDUCE);
The PigServer constructor takes an argument that tells Pig which mode to run in: local mode or MapReduce mode. Here, ExecType specifies the execution type of Pig. In our case, we are running Pig in MapReduce mode, so this program reads its input from HDFS and writes its output to HDFS as well.
Next, we define our own helper method called runQuery, which holds the Pig script and registers it with the PigServer. PigServer itself provides several methods for registering Pig queries.
To include a Pig Latin statement, use the pigServer.registerQuery method.
To include a user-defined JAR file, use the pigServer.registerJar method. This JAR should also be in your HDFS.
PigServer provides different methods for including your Pig scripts. You can also write the entire script in a text file and feed it to the PigServer using the pigServer.registerScript method.
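As a rough sketch of the text-file route, the snippet below writes the word count statements to a temporary .pig file using only the JDK. The class name ScriptFileSketch and the method writeScript are hypothetical; with Pig on the classpath, the resulting path would then be handed to pigServer.registerScript.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Sketch: store a Pig script in a text file so it can be registered as a whole.
// ScriptFileSketch and writeScript are hypothetical names used for illustration.
class ScriptFileSketch {

    // Writes one statement per line to a temporary .pig file and returns its path.
    static Path writeScript(List<String> statements) throws IOException {
        Path script = Files.createTempFile("wordcount", ".pig");
        Files.write(script, statements, StandardCharsets.UTF_8);
        return script;
    }

    public static void main(String[] args) throws IOException {
        Path script = writeScript(Arrays.asList(
            "input1 = LOAD '/input' as (line:chararray);",
            "words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;",
            "word_groups = group words by word;",
            "word_count = foreach word_groups generate group, COUNT(words);",
            "ordered_word_count = order word_count by group desc;",
            "store ordered_word_count into '/wct_output';"));
        System.out.println("Pig script written to " + script);
        // With Pig on the classpath, the file would be registered with:
        // pigServer.registerScript(script.toString());
    }
}
```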
Now we can embed the above script in Java as shown below:
public static void runQuery(PigServer pigServer) {
    try {
        pigServer.registerQuery("input1 = LOAD '/input' as (line:chararray);");
        pigServer.registerQuery("words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;");
        pigServer.registerQuery("word_groups = group words by word;");
        pigServer.registerQuery("word_count = foreach word_groups generate group, COUNT(words);");
        pigServer.registerQuery("ordered_word_count = order word_count by group desc;");
        pigServer.registerQuery("store ordered_word_count into '/wct_output';");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
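One possible variation on the method above is to keep the statements in a list and register them in a loop, which avoids repeating the registerQuery call. The sketch below assumes a hypothetical QuerySink interface so that it compiles without Pig on the classpath; in a real project, pigServer::registerQuery should fit that shape, but treat that as an assumption to verify against your Pig version.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch: register a list of Pig Latin statements one by one.
// QuerySink is a hypothetical stand-in for PigServer, introduced only
// so this example runs without the Pig libraries.
class RegisterAllSketch {

    interface QuerySink {
        void registerQuery(String query) throws IOException;
    }

    // The word count script from this post, one statement per entry.
    static final List<String> WORD_COUNT = Arrays.asList(
        "input1 = LOAD '/input' as (line:chararray);",
        "words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;",
        "word_groups = group words by word;",
        "word_count = foreach word_groups generate group, COUNT(words);",
        "ordered_word_count = order word_count by group desc;",
        "store ordered_word_count into '/wct_output';");

    // Registers each statement in order and returns how many were registered.
    static int registerAll(List<String> statements, QuerySink sink) throws IOException {
        int registered = 0;
        for (String statement : statements) {
            sink.registerQuery(statement);
            registered++;
        }
        return registered;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in sink that only prints each statement.
        // Assumption: with Pig available, pigServer::registerQuery would be passed instead.
        int count = registerAll(WORD_COUNT, q -> System.out.println("register: " + q));
        System.out.println(count + " statements registered");
    }
}
```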
The main method is as follows:
public static void main(String[] args) {
    try {
        // Point Pig at HDFS before the PigServer is created
        Properties props = new Properties();
        props.setProperty("fs.default.name", "hdfs://localhost:9000");
        // Pass the properties to the PigServer so the setting takes effect
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
        runQuery(pigServer);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Using the Properties class, we give Pig the HDFS path. For the setting to take effect, the Properties object has to be passed to the PigServer constructor along with the ExecType.
props.setProperty("fs.default.name", "hdfs://localhost:9000");
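As a small illustration of preparing these properties, the helper below builds the Properties object in one place. The class name HdfsPropsSketch and the method hdfsProperties are hypothetical. The sketch sets both fs.default.name, the key used in this post, and fs.defaultFS, the newer Hadoop 2 name for the same setting, on the assumption that setting both is harmless in your environment.

```java
import java.util.Properties;

// Sketch: build the Hadoop connection properties for an embedded Pig program.
// HdfsPropsSketch and hdfsProperties are hypothetical names used for illustration.
class HdfsPropsSketch {

    static Properties hdfsProperties(String nameNodeUri) {
        Properties props = new Properties();
        props.setProperty("fs.default.name", nameNodeUri); // key used in this post
        props.setProperty("fs.defaultFS", nameNodeUri);    // Hadoop 2 name for the same setting
        return props;
    }

    public static void main(String[] args) {
        Properties props = hdfsProperties("hdfs://localhost:9000");
        System.out.println(props.getProperty("fs.default.name"));
        // With Pig on the classpath, the object would be passed along like this:
        // PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
    }
}
```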
Your Pig script is now embedded in Java. You can run it by following the procedure below.
How to Run the Embedded Pig
To run embedded Pig from Eclipse, convert your Java project into a Maven project.
Note: Maven needs to be installed on your system and in Eclipse.
To convert it into a Maven project, right-click the project-->Configure-->Convert to Maven Project.
After converting, you can see a pom.xml file in the project root, alongside a target folder for build output. In pom.xml, add the dependencies below:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.16</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>0.15.0</version>
    </dependency>
    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr-runtime</artifactId>
        <version>3.4</version>
    </dependency>
</dependencies>
Note: Here we have used Hadoop 2.7.1 and Pig 0.15.0. Change the versions to match your Hadoop and Pig installations.
Next, add the JARs from Pig's lib folder as follows:
Right-click src-->Build Path-->Configure Build Path-->Libraries-->Add External JARs-->browse to your Pig lib folder-->select the files with the .jar extension
Also, if you are using Hadoop 2, add pig-0.15.0-SNAPSHOT-core-h2.jar, which is in your Pig folder.
By default, Pig assumes Hadoop 0.20 at run time. You can run Pig with a different version of Hadoop by setting HADOOP_HOME to point to the directory where you have installed Hadoop.
Set HADOOP_HOME in Eclipse
Run Configurations-->ClassPath-->User Entries-->Advanced-->Add ClassPath Variables-->New-->Name(HADOOP_HOME)-->Path(Your Hadoop directory path)
Before running the program, make sure that all your Hadoop daemons and the Job History Server are up and running. Then run the program as a normal Java application.
You can follow the progress of the Hadoop job in the console.
My input file contains “Hello all from acadgild!”. After the success message, you can check the output directory in your HDFS.
We hope this post helped you understand embedded Pig and how to run it using Java. Keep visiting our blog for more updates on Big Data and other technologies.