Hive is a data warehouse solution built on top of Hadoop for managing large volumes of data stored in HDFS. It provides an SQL-like query language called HiveQL.
At times, the query a user wants to write cannot be expressed easily, or at all, with Hive's built-in functions. In such scenarios, Hive allows users to write their own code, known as UDFs, to process data and return results through Hive queries.
There are three types of UDFs (User Defined Functions) in Hive:
- Regular UDF
- UDAF – User Defined Aggregation Function
- UDTF – User Defined Table-Generating Function
In this post, we will discuss how to implement a Hive UDTF that populates a table containing multiple values in a single column, keyed by a primary/unique id.
Some basic knowledge of Hive is necessary to understand the concepts below, so it is best to brush up on the following topics beforehand.
Differences Between UDF, UDAF and UDTF:
UDF:
UDF is a user-defined function that takes a single input value and produces a single output value. When used in a query, it is invoked once for each row in the result set.
Example:
input.toString().toUpperCase();
input.toString().toLowerCase();
The above calls convert a string to uppercase and to lowercase, respectively.
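As a minimal sketch of how such a UDF is packaged, assuming the classic UDF API (the class name ToUpper is illustrative, not from this post):
package com.Myhiveudtf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative example: returns the upper-cased form of its single string argument.
public class ToUpper extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}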
UDAF:
UDAF is a user-defined aggregate function that accepts a group of values and returns a single value. Users can implement UDAFs to summarize and condense sets of rows in the same style as the built-in COUNT, MAX, SUM, and AVG functions.
Example:
You can refer to our post on Hive UDAF to know more.
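As a quick illustration of the aggregate style (the table name phone_calls is hypothetical), a built-in UDAF condenses many rows into one per group:
-- one output row per id, summarizing all of that id's rows
SELECT id, COUNT(*) AS call_count
FROM phone_calls
GROUP BY id;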
UDTF:
UDTF is a User Defined Table-Generating Function that operates on a single row and produces multiple rows (a table) as output.
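Hive's built-in explode() is a familiar example of this; on Hive versions that allow SELECT without FROM, you can try it directly:
-- one input value (an array), three output rows: a, b, c
SELECT explode(array('a', 'b', 'c')) AS item;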
Problem Statement:
Let’s look at how a Hive UDTF works with the help of the example below. Each input record holds a unique key followed by one or more values, and the UDTF will emit one output row per key-value pair.
Data Set Used as Input in the Example:
Each input record is a single whitespace-separated string containing a unique id followed by one or two phone numbers.
Data Set Description:
- The unique id of a local resident.
- Phone number 1 of that resident.
- Phone number 2 of that resident (if present).
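For illustration, with made-up values, here are two input records and the rows the UDTF should produce from them:
Input (one whitespace-separated record per line):
101 9876543210 9123456780
102 9998887776
Expected output (id, phone_number):
101 9876543210
101 9123456780
102 9998887776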
Source Code:
We can create a custom Hive UDTF by extending the GenericUDTF abstract class and implementing the initialize(), process(), and close() methods.
package com.Myhiveudtf;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class Myudtf extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("Myudtf() takes exactly one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("Myudtf() takes a string as a parameter");
        }

        // input inspector
        stringOI = (PrimitiveObjectInspector) args[0];

        // output inspectors -- a struct with two fields: id and phone_number
        List<String> fieldNames = new ArrayList<String>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
        fieldNames.add("id");
        fieldNames.add("phone_number");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    public ArrayList<Object[]> processInputRecord(String id) {
        ArrayList<Object[]> result = new ArrayList<Object[]>();

        // ignore null or empty input
        if (id == null || id.isEmpty()) {
            return result;
        }

        String[] tokens = id.split("\\s+");
        if (tokens.length == 2) {
            // id followed by one phone number -> one output row
            result.add(new Object[] { tokens[0], tokens[1] });
        } else if (tokens.length == 3) {
            // id followed by two phone numbers -> two output rows
            result.add(new Object[] { tokens[0], tokens[1] });
            result.add(new Object[] { tokens[0], tokens[2] });
        }
        return result;
    }
Initialize()
Hive calls the initialize() method to notify the UDTF of the argument types to expect. The UDTF must then return an object inspector corresponding to the row objects it will generate.
    @Override
    public void process(Object[] record) throws HiveException {
        final String id = stringOI.getPrimitiveJavaObject(record[0]).toString();
        ArrayList<Object[]> results = processInputRecord(id);

        Iterator<Object[]> it = results.iterator();
        while (it.hasNext()) {
            Object[] r = it.next();
            forward(r);
        }
    }
Process()
Once initialize() has been called, Hive passes rows to the UDTF through the process() method. Inside process(), the UDTF produces output rows and forwards them to downstream operators by calling forward().
    @Override
    public void close() throws HiveException {
        // do nothing
    }
}
Close()
Finally, Hive calls the close() method once all rows have been passed to the UDTF. This allows for any cleanup that is necessary before the User Defined Table-Generating Function returns. Note that no rows can be forwarded from close().
In our example there is no state that needs cleaning up, so close() does nothing. We can now execute the example program.
Steps for Executing Hive UDTF:
Step 1: After writing the above code in Eclipse, add the required dependency jars (such as hive-exec and hadoop-common from your cluster) to the project's build path, then export the program as a jar file into the Hadoop environment.
Step 2: Create a table named ‘phone’ with a single column named ‘id’.
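A minimal sketch of that DDL:
CREATE TABLE phone (id STRING);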
Step 3: Load the input data set phn_num contents into the table phone.
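Assuming phn_num is a local text file with one record per line (the path below is illustrative):
LOAD DATA LOCAL INPATH '/home/user/phn_num' INTO TABLE phone;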
Step 4: Check whether the data has been loaded, using a SELECT statement.
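For example:
SELECT * FROM phone;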
Step 5: Add the jar file to the Hive session, using the complete path of the jar built above.
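The path and file name below are placeholders for wherever the jar was exported:
ADD JAR /home/user/hiveudtf.jar;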
Step 6: Create a temporary function as shown below.
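The function name here is arbitrary; the class name must match the class defined in the source code above:
CREATE TEMPORARY FUNCTION myudtf AS 'com.Myhiveudtf.Myudtf';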
Step 7: Use a SELECT statement with the new function to expand each record into one row per id/phone-number pair.
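Using the temporary function registered above (the output columns are named id and phone_number, as defined in initialize()):
SELECT myudtf(id) FROM phone;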
The output shows that each unique id is now paired with each of its phone numbers, one value per row.
We hope this post has been helpful in understanding Hive’s User Defined Table Generating Function (UDTF). In case of any queries, please comment below and we will get back to you at the earliest.
Keep visiting our website for more posts on Big Data and other technologies.