In this post, we will run Pig scripts and Hive queries on both the YARN engine and the Tez engine, compare the run times, and see which of the two is faster. Here "YARN" refers to the default MapReduce execution engine running on YARN; Tez jobs are also scheduled by YARN.
Pig Script on Tez and YARN
Now let us write a Pig script for a dictionary called AFINN, in which 2477 words are rated from -5 to +5 based on each word's meaning. The script counts how many positive words (rated 0 to 5) and negative words (rated -5 to -1) the dictionary contains.
The Pig script for calculating the number of negative and positive words in the dictionary is shown below:
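-- Load the tab-separated AFINN dictionary as (word, rating) pairs
A = LOAD '/AFINN.txt' USING PigStorage() AS (name:chararray,rating:int);
-- Label each word: ratings >= 0 are 'positive', the rest 'negative'
B = FOREACH A GENERATE name,rating,(rating>=0?'positive':'negative') as term:chararray;
-- Group by label and count the words in each group
C = GROUP B by term;
D = FOREACH C GENERATE group,COUNT(B.term);
-- Write the two counts to HDFS
STORE D INTO '/AFINN/';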
We will store the output of the script in the HDFS directory /AFINN/yarn/ for the YARN run and in /AFINN/tez/ for the Tez run, so the path in the STORE statement has to be set accordingly before each run. Save the above Pig script in a file named dictionary.pig.
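To make the logic of the script concrete, here is a minimal Python sketch of the same classify-then-count step. It is only an illustration of what the Pig statements compute, not part of the benchmark: the sample rows are made up for the example (they are not quoted from AFINN.txt), and the helper name `classify` is hypothetical.

```python
from collections import Counter

# Hypothetical sample rows in the same (word, rating) shape as AFINN.txt.
# The words and ratings are illustrative only, not taken from the real file.
rows = [("good", 3), ("bad", -3), ("fine", 2), ("awful", -4), ("neutral", 0)]


def classify(rating: int) -> str:
    # Same rule as the Pig bincond: rating >= 0 is 'positive', else 'negative'.
    return "positive" if rating >= 0 else "negative"


# Equivalent of GROUP B BY term followed by COUNT(B.term).
counts = Counter(classify(rating) for _, rating in rows)

for term, n in sorted(counts.items()):
    print(term, n)
```

As in the Pig script, a rating of 0 falls into the positive group.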
Pig on YARN
Let us run the above script on the YARN engine and note the time taken.
kiran@ACD-KIRAN:~/Desktop$ pig dictionary.pig
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/02/01 19:16:44 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334404805.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.1 0.15.0 kiran 2016-02-01 19:16:47 2016-02-01 19:17:08 GROUP_BY

Success!

Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1454067435808_0016 1 1 3 3 3 3 2 2 2 2 A,B,C,D GROUP_BY,COMBINER /AFINN/yarn,

Input(s):
Successfully read 2477 records (28452 bytes) from: "/AFINN.txt"

Output(s):
Successfully stored 2 records (27 bytes) in: "/AFINN/yarn"

Counters:
Total records written : 2
Total bytes written : 27
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1454067435808_0016

2016-02-01 19:17:08,640 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,643 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,675 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,678 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,714 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-02-01 19:17:08,718 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-02-01 19:17:08,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-02-01 19:17:08,801 [main] INFO org.apache.pig.Main - Pig script completed in 24 seconds and 88 milliseconds (24088 ms)
kiran@ACD-KIRAN:~/Desktop$
We can see that YARN took 24 seconds and 88 milliseconds to complete this job. Now, let us run the same script on the Tez engine.
Pig on TEZ
The command for running Pig on the Tez engine is as follows:
pig -x tez dictionary.pig
kiran@ACD-KIRAN:~/Desktop$ pig -x tez dictionary.pig
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ
16/02/01 19:19:25 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType
2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334565883.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-02-01 19:19:40,239 [PigTezLauncher-0] INFO org.apache.tez.common.counters.Limits - Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=120
2016-02-01 19:19:40,242 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=SUCCEEDED, progress=TotalTasks: 2 Succeeded: 2 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=Counters: 56
org.apache.tez.common.counters.DAGCounter
    NUM_SUCCEEDED_TASKS=2
    TOTAL_LAUNCHED_TASKS=2
    DATA_LOCAL_TASKS=1
    AM_CPU_MILLISECONDS=1040
    AM_GC_TIME_MILLIS=0
File System Counters
    FILE_BYTES_READ=146
    FILE_BYTES_WRITTEN=82
    FILE_READ_OPS=0
    FILE_LARGE_READ_OPS=0
    FILE_WRITE_OPS=0
    HDFS_BYTES_READ=28094
    HDFS_BYTES_WRITTEN=27
    HDFS_READ_OPS=4
    HDFS_LARGE_READ_OPS=0
    HDFS_WRITE_OPS=2
org.apache.tez.common.counters.TaskCounter
    REDUCE_INPUT_GROUPS=2
    REDUCE_INPUT_RECORDS=2
    COMBINE_INPUT_RECORDS=0
    SPILLED_RECORDS=4
    NUM_SHUFFLED_INPUTS=1
    NUM_SKIPPED_INPUTS=0
    NUM_FAILED_SHUFFLE_INPUTS=0
    MERGED_MAP_OUTPUTS=1
    GC_TIME_MILLIS=140
    CPU_MILLISECONDS=3480
    PHYSICAL_MEMORY_BYTES=353894400
    VIRTUAL_MEMORY_BYTES=1667567616
    COMMITTED_HEAP_BYTES=353894400
    INPUT_RECORDS_PROCESSED=2477
    OUTPUT_RECORDS=2479
    OUTPUT_BYTES=39632
    OUTPUT_BYTES_WITH_OVERHEAD=46
    OUTPUT_BYTES_PHYSICAL=50
    ADDITIONAL_SPILLS_BYTES_WRITTEN=0
    ADDITIONAL_SPILLS_BYTES_READ=50
    ADDITIONAL_SPILL_COUNT=0
    SHUFFLE_CHUNK_COUNT=1
    SHUFFLE_BYTES=50
    SHUFFLE_BYTES_DECOMPRESSED=46
    SHUFFLE_BYTES_TO_MEM=0
    SHUFFLE_BYTES_TO_DISK=0
    SHUFFLE_BYTES_DISK_DIRECT=50
    NUM_MEM_TO_DISK_MERGES=0
    NUM_DISK_TO_DISK_MERGES=0
    SHUFFLE_PHASE_TIME=160
    MERGE_PHASE_TIME=172
    FIRST_EVENT_RECEIVED=153
    LAST_EVENT_RECEIVED=153
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
org.apache.hadoop.mapreduce.TaskCounter
    COMBINE_INPUT_RECORDS=2
    COMBINE_OUTPUT_RECORDS=2477
2016-02-01 19:19:40,267 [PigTezLauncher-0] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-01 19:19:41,054 [main] INFO org.apache.pig.tools.pigstats.tez.TezPigScriptStats - Script Statistics:

HadoopVersion: 2.7.1
PigVersion: 0.15.0
TezVersion: 0.8.1-alpha
UserId: kiran
FileName: dictionary.pig
StartedAt: 2016-02-01 19:19:28
FinishedAt: 2016-02-01 19:19:41
Features: GROUP_BY

Success!

DAG PigLatin:dictionary.pig-0_scope-0:
ApplicationId: job_1454067435808_0017
TotalLaunchedTasks: 2
FileBytesRead: 146
FileBytesWritten: 82
HdfsBytesRead: 28094
HdfsBytesWritten: 27

Input(s):
Successfully read 2477 records (28094 bytes) from: "/AFINN.txt"

Output(s):
Successfully stored 2 records (27 bytes) in: "/AFINN/tez"

2016-02-01 19:19:41,072 [main] INFO org.apache.pig.Main - Pig script completed in 15 seconds and 295 milliseconds (15295 ms)
2016-02-01 19:19:41,072 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool
2016-02-01 19:19:41,085 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@493238ed
2016-02-01 19:19:41,086 [Thread-15] INFO org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:dictionary.pig, applicationId=application_1454067435808_0017
kiran@ACD-KIRAN:~/Desktop$
We can see that Tez completed the job in just 15 seconds and 295 milliseconds.
HIVE ON YARN and TEZ
Here we will create a Hive table, load the dictionary dataset into it, and run a Hive query that counts the positive and negative words in the dictionary.
The Hive table is created and the dataset is loaded as shown below:
hive> create external table dictionary_yarn(name string,rating INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.507 seconds
hive> LOAD DATA INPATH '/AFINN.txt' into table dictionary_yarn;
Loading data to table default.dictionary_yarn
Table default.dictionary_yarn stats: [numFiles=1, numRows=0, totalSize=28094, rawDataSize=0]
OK
Time taken: 0.195 seconds
hive>
HIVE ON YARN
Let us run the query that counts the positive and negative words in the dictionary on the YARN engine.
hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;
Query ID = kiran_20160201195817_827eba29-f2ce-47cd-b491-3c4da6e5d0b2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1454067435808_0023, Tracking URL = http://ACD-KIRAN:8088/proxy/application_1454067435808_0023/
Kill Command = /home/kiran/hadoop-2.7.1/bin/hadoop job -kill job_1454067435808_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-02-01 19:58:23,335 Stage-1 map = 0%, reduce = 0%
2016-02-01 19:58:28,518 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec
2016-02-01 19:58:33,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.04 sec
MapReduce Total cumulative CPU time: 3 seconds 40 msec
Ended Job = job_1454067435808_0023
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.04 sec HDFS Read: 36808 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 40 msec
OK
879 1598
Time taken: 18.282 seconds, Fetched: 1 row(s)
hive>
You can see that Hive on YARN took 18.282 seconds.
HIVE ON TEZ
Now, let us run the same query on the Tez engine.
To run a Hive query on the Tez engine, we need to set the Hive execution engine explicitly with the command below:
set hive.execution.engine=tez;
hive> set hive.execution.engine=tez;
hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;
Query ID = kiran_20160201200130_a5c56388-26f5-48dd-a925-26c5e1d7e2b8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1454067435808_0024)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.48 s
--------------------------------------------------------------------------------
OK
879 1598
Time taken: 10.177 seconds, Fetched: 1 row(s)
hive>
We can see that Hive on Tez took 10.177 seconds to run the same query.
We can check whether a job ran on the YARN or the Tez engine in the ResourceManager's web UI:
localhost:8088
In the above screenshot, we can see each job's application ID and its application type. The application type gives the engine on which the job ran. The screenshot shows the application IDs and engines for the Hive queries we ran earlier.
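The same information can also be read programmatically. The sketch below assumes the ResourceManager REST API is reachable at the same address as the web UI and uses its `/ws/v1/cluster/apps` endpoint, whose JSON response lists each application with an `applicationType` field. The helper names `fetch_apps` and `app_types` are hypothetical, and the sample payload is hand-written for illustration: it reuses the two application IDs from the Hive runs above and assumes the types the UI would show for them.

```python
import json
import urllib.request


def fetch_apps(rm_url: str = "http://localhost:8088") -> dict:
    """Fetch the application list from the ResourceManager REST API."""
    with urllib.request.urlopen(f"{rm_url}/ws/v1/cluster/apps") as resp:
        return json.load(resp)


def app_types(payload: dict) -> dict:
    """Map each application id to its applicationType (e.g. MAPREDUCE, TEZ)."""
    # The "apps" value can be null when no applications are listed.
    apps = (payload.get("apps") or {}).get("app", [])
    return {a["id"]: a["applicationType"] for a in apps}


# Hand-written sample in the shape of the REST response (assumed values).
sample = {
    "apps": {
        "app": [
            {"id": "application_1454067435808_0023", "applicationType": "MAPREDUCE"},
            {"id": "application_1454067435808_0024", "applicationType": "TEZ"},
        ]
    }
}

for app_id, app_type in app_types(sample).items():
    print(app_id, app_type)
```

Against a live cluster, `app_types(fetch_apps())` would return the same kind of mapping for the applications the ResourceManager currently knows about.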
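From these runs we can say that the Tez engine was faster than the YARN engine for both workloads on this small dataset: about 15.3 seconds versus 24.1 seconds for the Pig script, and about 10.2 seconds versus 18.3 seconds for the Hive query.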
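To put a number on the difference, the run times can be pulled straight out of the log lines and turned into a speedup factor. This is a small sketch, assuming the completion lines have exactly the format seen in the output above; the function names `pig_seconds`, `hive_seconds` and `speedup` are hypothetical helpers, and the results describe only these single runs.

```python
import re

# Patterns for the completion lines printed by Pig and by the Hive CLI.
PIG_RE = re.compile(r"Pig script completed in .*\((\d+) ms\)")
HIVE_RE = re.compile(r"Time taken: ([\d.]+) seconds")


def pig_seconds(line: str) -> float:
    """Extract the run time in seconds from a Pig completion line."""
    match = PIG_RE.search(line)
    if match is None:
        raise ValueError("not a Pig completion line")
    return int(match.group(1)) / 1000


def hive_seconds(line: str) -> float:
    """Extract the run time in seconds from a Hive 'Time taken' line."""
    match = HIVE_RE.search(line)
    if match is None:
        raise ValueError("not a Hive 'Time taken' line")
    return float(match.group(1))


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate run was than the baseline run."""
    return baseline / candidate


# Completion lines copied from the four runs in this post.
pig_yarn = pig_seconds("Pig script completed in 24 seconds and 88 milliseconds (24088 ms)")
pig_tez = pig_seconds("Pig script completed in 15 seconds and 295 milliseconds (15295 ms)")
hive_yarn = hive_seconds("Time taken: 18.282 seconds, Fetched: 1 row(s)")
hive_tez = hive_seconds("Time taken: 10.177 seconds, Fetched: 1 row(s)")

print(f"Pig:  Tez was {speedup(pig_yarn, pig_tez):.2f}x faster")
print(f"Hive: Tez was {speedup(hive_yarn, hive_tez):.2f}x faster")
```

For these runs the script reports roughly 1.57x for Pig and 1.80x for Hive.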
We hope this post has given you a clear picture of running Pig scripts and Hive queries on both the YARN and Tez engines and of comparing their performance.