In this blog we will discuss how to implement a Hive UDAF that finds the largest integer in an input file.
We expect readers to have a basic knowledge of Hive; if you need a refresher, refer to the links below for the basics of Hive operations.
Let's start by understanding what a UDAF is.
User-Defined Aggregation Functions (UDAFs) are an exceptional way to integrate advanced data-processing into Hive. Aggregate functions perform a calculation on a set of values and return a single value.
An aggregate function is more difficult to write than a regular UDF. Values are aggregated in chunks (potentially across many tasks), so the implementation has to be capable of combining partial aggregations into a final result.
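To make the idea of combining partial aggregations concrete, here is a minimal, Hive-independent Java sketch (the class `ChunkedMax` and its method names are our own illustration, not part of Hive): each chunk of values is reduced to a partial maximum, and the partial results are then merged into the final answer, mirroring what `iterate`, `terminatePartial`, and `merge` do in a real UDAF.

```java
// Minimal sketch of chunked aggregation, independent of Hive.
// Each chunk is reduced to a partial max (like iterate/terminatePartial),
// then the partials are merged into the final result (like merge/terminate).
public class ChunkedMax {

    // Reduce one chunk of values to its partial maximum.
    static int partialMax(int[] chunk) {
        int max = Integer.MIN_VALUE;
        for (int v : chunk) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Merge the partial maxima from several chunks into the final maximum.
    static int maxOf(int[][] chunks) {
        int result = Integer.MIN_VALUE;
        for (int[] chunk : chunks) {
            result = Math.max(result, partialMax(chunk));
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] chunks = { {3, 42, 7}, {99, 12}, {56} };
        System.out.println(maxOf(chunks)); // prints 99
    }
}
```

Because `max` is both commutative and associative, it does not matter how the rows are split across chunks or tasks; the merged result is always the same.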
We will start with the source code used to find the largest integer in the input file.
The code is explained in the example below; we need to package it into a jar file and then use that jar while executing the Hive statements shown in the upcoming sections.
UDAF to find the largest Integer in the table.
package com.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Max extends UDAF {

    public static class MaxIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable output;

        // Reset the evaluator state before aggregation starts.
        public void init() {
            output = null;
        }

        // Process one row of the input table.
        public boolean iterate(IntWritable maxvalue) {
            if (maxvalue == null) {
                return true;
            }
            if (output == null) {
                output = new IntWritable(maxvalue.get());
            } else {
                output.set(Math.max(output.get(), maxvalue.get()));
            }
            return true;
        }

        // Return the partial aggregation result for this task.
        public IntWritable terminatePartial() {
            return output;
        }

        // Combine a partial result computed by another task.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Return the final result.
        public IntWritable terminate() {
            return output;
        }
    }
}
Let's now walk through the steps for executing the UDAF.
- Creating a new Input Dataset
We need an input dataset to execute the above example. The dataset used for demonstration is Numbers_List. It has one column, which contains a list of integer values.
- Create a new table and load the input dataset
In the screenshot below we have created a new table Num_list with only one field (column), Num.
Next, we have loaded the contents of the input dataset Numbers_List into the table Num_list.
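The statements behind these two steps look roughly like the following (the local file path is an assumption for illustration; use the actual location of your Numbers_List file):

```sql
-- Create the table with a single integer column.
CREATE TABLE Num_list (Num INT);

-- Load the input dataset into the table (the path shown is illustrative).
LOAD DATA LOCAL INPATH '/home/user/Numbers_List' INTO TABLE Num_list;
```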
- Display the contents of the table Num_list to verify that the input file has been loaded successfully.
Using a SELECT statement, we can check whether the contents of the dataset Numbers_List have been loaded into the table Num_list.
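The check itself is a plain SELECT over the table:

```sql
SELECT * FROM Num_list;
```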
- Add the jar file in Hive with its complete path (the jar file built from the source code needs to be added).
As shown in the screenshot above, we have added h-udaf.jar in Hive.
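The ADD JAR statement would look like the following (the directory is an assumption for illustration; point it at wherever you built h-udaf.jar):

```sql
-- The path shown is illustrative; use the actual location of h-udaf.jar.
ADD JAR /home/user/h-udaf.jar;
```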
- Create temporary function as shown below
Let us create a temporary function max for the newly created UDAF.
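The function is registered against the fully qualified class name from the source code above (com.hive.udaf.Max):

```sql
-- 'com.hive.udaf.Max' matches the package and class in the source code above.
CREATE TEMPORARY FUNCTION max AS 'com.hive.udaf.Max';
```

Note that naming the function max shadows Hive's built-in max for the current session; choosing a distinct name (for example, max_int) avoids any ambiguity.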
- Use the SELECT statement to find the largest number in the table Num_list.
After successfully following the steps above, we can use a SELECT statement to find the largest number in the table.
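The final query, using the temporary function registered in the previous step, would look like:

```sql
SELECT max(Num) FROM Num_list;
```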
Thus, from the above screenshot we can see that the largest number in the table Num_list is 99.
We hope this blog helped you in understanding the Hive UDAF and its execution.
Keep visiting our website for more blogs on Big Data and other technologies.