In this blog we will discuss how to implement a Hive UDAF that finds the largest integer in an input file.
We expect readers to have a basic knowledge of Hive; if you need a refresher, refer to the links below for the basics of Hive operations.
Let's start by understanding what a UDAF is.
User-Defined Aggregation Functions (UDAFs) are an exceptional way to integrate advanced data-processing into Hive. Aggregate functions perform a calculation on a set of values and return a single value.
An aggregate function is more difficult to write than a regular UDF. Values are aggregated in chunks (potentially across many tasks), so the implementation has to be capable of combining partial aggregations into a final result.
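To make the idea of combining partial aggregations concrete, here is a minimal, Hive-independent Java sketch (the class `ChunkedMax` and its method names are our own illustration, not part of Hive): each chunk of values is reduced to a partial maximum, and the partial results are then merged into the final answer, mirroring what `iterate`, `terminatePartial`, and `merge` do in a real UDAF.

```java
// Minimal sketch of chunked aggregation, independent of Hive.
// Each chunk is reduced to a partial max (like iterate/terminatePartial),
// then the partials are merged into the final result (like merge/terminate).
public class ChunkedMax {

    // Reduce one chunk of values to its partial maximum.
    static int partialMax(int[] chunk) {
        int max = Integer.MIN_VALUE;
        for (int v : chunk) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Merge the partial maxima from several chunks into the final maximum.
    static int maxOf(int[][] chunks) {
        int result = Integer.MIN_VALUE;
        for (int[] chunk : chunks) {
            result = Math.max(result, partialMax(chunk));
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] chunks = { {3, 42, 7}, {99, 12}, {56} };
        System.out.println(maxOf(chunks)); // prints 99
    }
}
```

Because `max` is both commutative and associative, it does not matter how the rows are split across chunks or tasks; the merged result is always the same.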
We will start with the source code used to find the largest integer in the input file.
The code is explained in the example below; we need to package it into a jar file and then use that jar while executing the Hive statements shown in the upcoming sections.
UDAF to find the largest Integer in the table.
package com.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Max extends UDAF {

    public static class MaxIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable output;

        // Reset the evaluator state before aggregation starts.
        public void init() {
            output = null;
        }

        // Process one row of the input table.
        public boolean iterate(IntWritable maxvalue) {
            if (maxvalue == null) {
                return true;
            }
            if (output == null) {
                output = new IntWritable(maxvalue.get());
            } else {
                output.set(Math.max(output.get(), maxvalue.get()));
            }
            return true;
        }

        // Return the partial aggregation result for this task.
        public IntWritable terminatePartial() {
            return output;
        }

        // Combine a partial result computed by another task.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Return the final result.
        public IntWritable terminate() {
            return output;
        }
    }
}
Let's now walk through the steps for executing the UDAF.
- Creating a new Input Dataset
We need an input dataset to execute the above example. The dataset used for demonstration is Numbers_List. It has one column, which contains a list of integer values.
- Create a new table and load the input dataset
In the screenshot below we have created a new table Num_list with only one field (column), Num.
Next, we have loaded the contents of the input dataset Numbers_List into the table Num_list.
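The statements behind these two steps look roughly like the following (the local file path is an assumption for illustration; use the actual location of your Numbers_List file):

```sql
-- Create the table with a single integer column.
CREATE TABLE Num_list (Num INT);

-- Load the input dataset into the table (the path shown is illustrative).
LOAD DATA LOCAL INPATH '/home/user/Numbers_List' INTO TABLE Num_list;
```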
- Display the contents of the table Num_list to verify that the input file has been loaded successfully.
Using a SELECT statement, we can check whether the contents of the dataset Numbers_List have been loaded into the table Num_list.
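The check itself is a plain SELECT over the table:

```sql
SELECT * FROM Num_list;
```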
- Add the jar file in Hive with its complete path (the jar file built from the source code needs to be added).
As shown in the screenshot above, we have added h-udaf.jar in Hive.
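The ADD JAR statement would look like the following (the directory is an assumption for illustration; point it at wherever you built h-udaf.jar):

```sql
-- The path shown is illustrative; use the actual location of h-udaf.jar.
ADD JAR /home/user/h-udaf.jar;
```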
- Create temporary function as shown below
Let us create a temporary function max for the newly created UDAF.
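The function is registered against the fully qualified class name from the source code above (com.hive.udaf.Max):

```sql
-- 'com.hive.udaf.Max' matches the package and class in the source code above.
CREATE TEMPORARY FUNCTION max AS 'com.hive.udaf.Max';
```

Note that naming the function max shadows Hive's built-in max for the current session; choosing a distinct name (for example, max_int) avoids any ambiguity.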
- Use the SELECT statement to find the largest number in the table Num_list.
After successfully following the steps above, we can use a SELECT statement to find the largest number in the table.
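The final query, using the temporary function registered in the previous step, would look like:

```sql
SELECT max(Num) FROM Num_list;
```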
Thus, from the above screenshot we can see that the largest number in the table Num_list is 99.
We hope this blog helped you in understanding the Hive UDAF and its execution.
Keep visiting our website for more blogs on Big Data and other technologies.