06 February 2016

Performance Analysis of Tez

In this post, we will run a Pig script and a Hive query on both the default MapReduce engine (referred to as YARN throughout this post) and the Tez engine, and compare how long each one takes.

Pig Script on Tez and YARN

Now let us write a Pig script for a dictionary called AFINN, in which 2477 words are rated from -5 to +5 based on each word's meaning. The script counts how many positive words (rating 0 to 5) and negative words (rating -5 to -1) the dictionary contains.

The Pig script for calculating the number of negative and positive words in the dictionary is shown below:

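-- Load the tab-separated AFINN dictionary: one word and its rating per line
A = LOAD '/AFINN.txt' USING PigStorage() AS (name:chararray,rating:int);

-- Label each word as positive (rating >= 0) or negative
B = FOREACH A GENERATE name,rating,(rating>=0?'positive':'negative') as term:chararray;

-- Group the words by label
C = GROUP B by term;

-- Count the words in each group
D = FOREACH C GENERATE group,COUNT(B.term);

STORE D INTO '/AFINN/';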
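
To make the logic of the script easier to follow, here is a minimal Python sketch of the same label-and-count steps. It is only an illustration, not part of the Pig job: the function name count_terms and the sample rows are assumptions made up for this example, and the rows are not taken from the real AFINN file.

```python
from collections import Counter


def count_terms(lines):
    """Label each rating as positive/negative, then count words per label."""
    counts = Counter()
    for line in lines:
        # PigStorage() splits on tabs by default: word<TAB>rating
        name, rating = line.rstrip("\n").split("\t")
        # Same rule as the Pig script: rating >= 0 counts as positive
        term = "positive" if int(rating) >= 0 else "negative"
        counts[term] += 1
    return dict(counts)


# Hypothetical rows in the AFINN layout (illustrative only)
sample = ["good\t3", "bad\t-3", "awful\t-3", "fine\t2"]
print(count_terms(sample))
```

On the real dictionary this would give one count per label, which is what relation D holds before it is stored.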

Now, let’s store the output of the script in the HDFS directory /AFINN/yarn/ for the YARN run and in /AFINN/tez/ for the Tez run (the path in the STORE statement is changed accordingly before each run). Let’s save the above Pig script in a file named dictionary.pig.

Pig on YARN

Let’s run the above script using the YARN engine and note down the time.

kiran@ACD-KIRAN:~/Desktop$ pig dictionary.pig

16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

16/02/01 19:16:44 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

16/02/01 19:16:44 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2016-02-01 19:16:44,807 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334404805.log

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.7.1 0.15.0 kiran 2016-02-01 19:16:47 2016-02-01 19:17:08 GROUP_BY

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_1454067435808_0016 1 1 3 3 3 3 2 2 2 2 A,B,C,D GROUP_BY,COMBINER /AFINN/yarn,

Input(s):

Successfully read 2477 records (28452 bytes) from: "/AFINN.txt"

Output(s):

Successfully stored 2 records (27 bytes) in: "/AFINN/yarn"

Counters:

Total records written : 2

Total bytes written : 27

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_1454067435808_0016

2016-02-01 19:17:08,640 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032

2016-02-01 19:17:08,643 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server

2016-02-01 19:17:08,675 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032

2016-02-01 19:17:08,678 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server

2016-02-01 19:17:08,714 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032

2016-02-01 19:17:08,718 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server

2016-02-01 19:17:08,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

2016-02-01 19:17:08,801 [main] INFO org.apache.pig.Main - Pig script completed in 24 seconds and 88 milliseconds (24088 ms)

kiran@ACD-KIRAN:~/Desktop$

We can see that YARN took 24 seconds and 88 milliseconds to complete this job. Now, let us run the same script using the Tez engine.

Pig on Tez

The command for running Pig using the Tez engine is as follows:

pig -x tez dictionary.pig

kiran@ACD-KIRAN:~/Desktop$ pig -x tez dictionary.pig

16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ_LOCAL

16/02/01 19:19:25 INFO pig.ExecTypeProvider: Trying ExecType : TEZ

16/02/01 19:19:25 INFO pig.ExecTypeProvider: Picked TEZ as the ExecType

2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2016-02-01 19:19:25,884 [main] INFO org.apache.pig.Main - Logging error messages to: /home/kiran/Desktop/pig_1454334565883.log

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/kiran/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/kiran/tez/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

2016-02-01 19:19:40,239 [PigTezLauncher-0] INFO org.apache.tez.common.counters.Limits - Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=120

2016-02-01 19:19:40,242 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=SUCCEEDED, progress=TotalTasks: 2 Succeeded: 2 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=Counters: 56

org.apache.tez.common.counters.DAGCounter

NUM_SUCCEEDED_TASKS=2

TOTAL_LAUNCHED_TASKS=2

DATA_LOCAL_TASKS=1

AM_CPU_MILLISECONDS=1040

AM_GC_TIME_MILLIS=0

File System Counters

FILE_BYTES_READ=146

FILE_BYTES_WRITTEN=82

FILE_READ_OPS=0

FILE_LARGE_READ_OPS=0

FILE_WRITE_OPS=0

HDFS_BYTES_READ=28094

HDFS_BYTES_WRITTEN=27

HDFS_READ_OPS=4

HDFS_LARGE_READ_OPS=0

HDFS_WRITE_OPS=2

org.apache.tez.common.counters.TaskCounter

REDUCE_INPUT_GROUPS=2

REDUCE_INPUT_RECORDS=2

COMBINE_INPUT_RECORDS=0

SPILLED_RECORDS=4

NUM_SHUFFLED_INPUTS=1

NUM_SKIPPED_INPUTS=0

NUM_FAILED_SHUFFLE_INPUTS=0

MERGED_MAP_OUTPUTS=1

GC_TIME_MILLIS=140

CPU_MILLISECONDS=3480

PHYSICAL_MEMORY_BYTES=353894400

VIRTUAL_MEMORY_BYTES=1667567616

COMMITTED_HEAP_BYTES=353894400

INPUT_RECORDS_PROCESSED=2477

OUTPUT_RECORDS=2479

OUTPUT_BYTES=39632

OUTPUT_BYTES_WITH_OVERHEAD=46

OUTPUT_BYTES_PHYSICAL=50

ADDITIONAL_SPILLS_BYTES_WRITTEN=0

ADDITIONAL_SPILLS_BYTES_READ=50

ADDITIONAL_SPILL_COUNT=0

SHUFFLE_CHUNK_COUNT=1

SHUFFLE_BYTES=50

SHUFFLE_BYTES_DECOMPRESSED=46

SHUFFLE_BYTES_TO_MEM=0

SHUFFLE_BYTES_TO_DISK=0

SHUFFLE_BYTES_DISK_DIRECT=50

NUM_MEM_TO_DISK_MERGES=0

NUM_DISK_TO_DISK_MERGES=0

SHUFFLE_PHASE_TIME=160

MERGE_PHASE_TIME=172

FIRST_EVENT_RECEIVED=153

LAST_EVENT_RECEIVED=153

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

org.apache.hadoop.mapreduce.TaskCounter

COMBINE_INPUT_RECORDS=2

COMBINE_OUTPUT_RECORDS=2477

2016-02-01 19:19:40,267 [PigTezLauncher-0] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2016-02-01 19:19:41,054 [main] INFO org.apache.pig.tools.pigstats.tez.TezPigScriptStats - Script Statistics:

HadoopVersion: 2.7.1

PigVersion: 0.15.0

TezVersion: 0.8.1-alpha

UserId: kiran

FileName: dictionary.pig

StartedAt: 2016-02-01 19:19:28

FinishedAt: 2016-02-01 19:19:41

Features: GROUP_BY

Success!

DAG PigLatin:dictionary.pig-0_scope-0:

ApplicationId: job_1454067435808_0017

TotalLaunchedTasks: 2

FileBytesRead: 146

FileBytesWritten: 82

HdfsBytesRead: 28094

HdfsBytesWritten: 27

Input(s):

Successfully read 2477 records (28094 bytes) from: "/AFINN.txt"

Output(s):

Successfully stored 2 records (27 bytes) in: "/AFINN/tez"

2016-02-01 19:19:41,072 [main] INFO org.apache.pig.Main - Pig script completed in 15 seconds and 295 milliseconds (15295 ms)

2016-02-01 19:19:41,072 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool

2016-02-01 19:19:41,085 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@493238ed

2016-02-01 19:19:41,086 [Thread-15] INFO org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:dictionary.pig, applicationId=application_1454067435808_0017

kiran@ACD-KIRAN:~/Desktop$

We can see that Tez completed the job in just 15 seconds and 295 milliseconds.
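
If you want to pull the timings out of the logs instead of reading them by eye, a small Python sketch like the one below could do it. The regular expression and the helper name elapsed_ms are assumptions for this illustration; the two log lines are copied from the runs above.

```python
import re

# Matches the closing line Pig prints, e.g. "... (24088 ms)"
PIG_DONE = re.compile(r"Pig script completed in .*\((\d+) ms\)")


def elapsed_ms(log_line):
    """Return the elapsed time in milliseconds from a Pig completion line."""
    match = PIG_DONE.search(log_line)
    if match is None:
        raise ValueError("not a Pig completion line")
    return int(match.group(1))


yarn_line = ("2016-02-01 19:17:08,801 [main] INFO org.apache.pig.Main - "
             "Pig script completed in 24 seconds and 88 milliseconds (24088 ms)")
tez_line = ("2016-02-01 19:19:41,072 [main] INFO org.apache.pig.Main - "
            "Pig script completed in 15 seconds and 295 milliseconds (15295 ms)")

yarn_ms = elapsed_ms(yarn_line)
tez_ms = elapsed_ms(tez_line)
saved_ms = yarn_ms - tez_ms  # time saved by the Tez run
print(yarn_ms, tez_ms, saved_ms)
```

For these two runs the difference works out to 8793 ms in favour of Tez.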

Hive on YARN and Tez

Here we will create a Hive table, load the dictionary dataset into it, and run a Hive query that counts the positive and negative words in the dictionary.

The table is created and the dataset is loaded as shown below:

hive> create external table dictionary_yarn(name string,rating INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Time taken: 0.507 seconds

hive> LOAD DATA INPATH '/AFINN.txt' into table dictionary_yarn;

Loading data to table default.dictionary_yarn

Table default.dictionary_yarn stats: [numFiles=1, numRows=0, totalSize=28094, rawDataSize=0]

Time taken: 0.195 seconds

hive>

Hive on YARN

Let’s run the query for counting the number of positive and negative words in the dictionary on the YARN engine.

hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;

Query ID = kiran_20160201195817_827eba29-f2ce-47cd-b491-3c4da6e5d0b2

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1454067435808_0023, Tracking URL = http://ACD-KIRAN:8088/proxy/application_1454067435808_0023/

Kill Command = /home/kiran/hadoop-2.7.1/bin/hadoop job -kill job_1454067435808_0023

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2016-02-01 19:58:23,335 Stage-1 map = 0%, reduce = 0%

2016-02-01 19:58:28,518 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec

2016-02-01 19:58:33,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.04 sec

MapReduce Total cumulative CPU time: 3 seconds 40 msec

Ended Job = job_1454067435808_0023

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.04 sec HDFS Read: 36808 HDFS Write: 9 SUCCESS

Total MapReduce CPU Time Spent: 3 seconds 40 msec

879 1598

Time taken: 18.282 seconds, Fetched: 1 row(s)

hive>

You can see that Hive on YARN took 18.282 seconds.

Hive on Tez

Now, let’s run the same query on the Tez engine.

To make a Hive query run on the Tez engine, we need to set the Hive execution engine explicitly with the command below:

set hive.execution.engine=tez;

hive> set hive.execution.engine=tez;

hive> select sum(case when rating >= 0 then 1 else 0 end) as positive,sum(case when rating < 0 then 1 else 0 end) as negative from dictionary_yarn;

Query ID = kiran_20160201200130_a5c56388-26f5-48dd-a925-26c5e1d7e2b8

Total jobs = 1

Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1454067435808_0024)

--------------------------------------------------------------------------------

VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED

--------------------------------------------------------------------------------

Map 1 .......... SUCCEEDED 1 1 0 0 0 0

Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0

--------------------------------------------------------------------------------

VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.48 s

--------------------------------------------------------------------------------

879 1598

Time taken: 10.177 seconds, Fetched: 1 row(s)

hive>

We can see that Hive on Tez took 10.177 seconds to run the same query.

We can check whether a job ran on the YARN or the Tez engine in the ResourceManager’s web UI:

localhost:8088

In the screenshot of this page, we can see each job’s application ID and its Application Type. The Application Type names the engine on which the job ran. The screenshot shows the application IDs and engines for the Hive queries we ran earlier.
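
The same information can also be read without the browser. The sketch below assumes the ResourceManager REST endpoint /ws/v1/cluster/apps is reachable at the same address as the web UI; the function names and the sample payload are made up for this illustration (the payload is hand-written in the shape that endpoint returns, using the two Hive application IDs from the runs above).

```python
import json
from urllib.request import urlopen


def application_types(payload):
    """Map application ID -> application type from a cluster apps response."""
    apps = (payload.get("apps") or {}).get("app") or []
    return {app["id"]: app["applicationType"] for app in apps}


def fetch_application_types(rm_url="http://localhost:8088"):
    # Needs a running ResourceManager; it is not called in this sketch.
    with urlopen(rm_url + "/ws/v1/cluster/apps") as response:
        return application_types(json.load(response))


# Hand-written payload for illustration, not captured from a live cluster
sample = {"apps": {"app": [
    {"id": "application_1454067435808_0023", "applicationType": "MAPREDUCE"},
    {"id": "application_1454067435808_0024", "applicationType": "TEZ"},
]}}
print(application_types(sample))
```

Calling fetch_application_types() on a machine where the ResourceManager is running should list every application with its engine, which is the same column the web UI shows.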

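From these runs we can say that the Tez engine was faster than the YARN (MapReduce) engine on this dataset: the Pig script finished in 15.295 seconds instead of 24.088 seconds, and the Hive query in 10.177 seconds instead of 18.282 seconds.

We hope this post has given you a clear picture of running Pig scripts and Hive queries on both the YARN and Tez engines and comparing their performance.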
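
To put a number on the difference, here is a short Python sketch that computes the speedup and the percentage of time saved. The timings are copied from the runs in this post; the dictionary layout and the function names are assumptions for this illustration, and "mapreduce" stands for the runs labelled YARN here.

```python
# Wall-clock times in seconds, copied from the runs in this post
runs = {
    "Pig": {"mapreduce": 24.088, "tez": 15.295},
    "Hive": {"mapreduce": 18.282, "tez": 10.177},
}


def speedup(mapreduce_s, tez_s):
    """How many times faster the Tez run was."""
    return mapreduce_s / tez_s


def reduction_pct(mapreduce_s, tez_s):
    """Percentage of the MapReduce run time that Tez saved."""
    return 100.0 * (mapreduce_s - tez_s) / mapreduce_s


for tool, t in runs.items():
    s = speedup(t["mapreduce"], t["tez"])
    r = reduction_pct(t["mapreduce"], t["tez"])
    print(f"{tool}: {s:.2f}x faster, {r:.1f}% less time")
```

With these figures Tez comes out roughly 1.57x faster for Pig and 1.80x faster for Hive. Each figure is from a single run on a small input, so treat them as indicative rather than as a benchmark.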

AcadGild

Kiran Krishna