Hive can sometimes take a long time to complete a job.
A job may consist of many stages. By default, Hive executes these stages one at a time.
The stages may include a map stage, a reduce stage, a sampling stage, a merge stage, a limit stage, or other tasks Hive needs to perform.
A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly.
Hive converts a query into one or more stages and, to save time, can execute independent stages as multiple jobs in parallel.
For the basics of Hive and running multiple instances of Hive, follow the linked blogs.
NOTE: We ran this exercise on a single-node cluster, where all jobs of a query share the same resources, so running multiple jobs at once takes longer than it would otherwise. A fully distributed Hadoop cluster is the best platform for seeing the actual difference in job execution time.
Below is the result of a sample query run inside the Hive shell with parallel processing turned off.
Note that only one job was assigned to the mapper.
The query used is:
SELECT
table1.a
FROM
table1 JOIN table2 ON (table1.a = table2.a)
JOIN table3 ON (table3.a = table1.a)
JOIN table4 ON (table4.b = table3.b);
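To check which stages of this query depend on each other, one option is to prefix it with EXPLAIN. The sketch below assumes the four tables from the sample query already exist in the current database; it only prints the plan and does not launch any job.

```sql
-- EXPLAIN prints the execution plan without running the query.
-- The STAGE DEPENDENCIES section of the output lists each stage and the
-- stages it depends on; stages with no dependency between them are the
-- candidates for parallel execution.
EXPLAIN
SELECT table1.a
FROM table1
JOIN table2 ON (table1.a = table2.a)
JOIN table3 ON (table3.a = table1.a)
JOIN table4 ON (table4.b = table3.b);
```

The exact stages reported will vary with the Hive version and the join optimizations in effect, so treat the plan as a guide rather than a fixed result.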
If the query is optimized and more stages are run simultaneously, the job may complete much faster.
However, if a job runs more stages in parallel, it will increase its cluster utilization.
NOTE: Developers must take care not to consume the entire capacity of the cluster.
Create a configuration file named hive-site.xml in the hive/conf/ directory and override the parallel-execution properties there.
Refer to the screenshot for the default property.
We can enable parallel execution of job stages by setting hive.exec.parallel to true.
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
The maximum number of jobs that can run in parallel can also be controlled with the following property.
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
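The same two properties can also be set from inside the Hive shell for the current session only, which can be handy for comparing runs without editing hive-site.xml. A minimal sketch, reusing the values shown above:

```sql
-- Enable parallel execution of independent stages for this session only
SET hive.exec.parallel=true;
-- Cap the number of jobs that may run at the same time (8 matches the value above)
SET hive.exec.parallel.thread.number=8;
-- Print the current value to confirm the setting took effect
SET hive.exec.parallel;
```

Settings made this way last only until the session ends, so the hive-site.xml entries are still needed for a permanent change.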
Once the properties are set in hive-site.xml, save the file and restart the Hive shell in a new terminal.
Run the optimized query inside the shell to see multiple jobs launching.
Refer to the result below (highlighted in red):
SELECT
r1.a
FROM
(SELECT table1.a FROM table1 JOIN table2 ON table1.a = table2.a) r1
JOIN
(SELECT table3.a FROM table3 JOIN table4 ON table3.b = table4.b) r2
ON (r1.a = r2.a);
You can run these queries on sample data in your own Hadoop cluster and see for yourself the difference between serial and parallel execution of jobs in Hive.
For more technical blogs, keep visiting www.acadgild.com/blog.