Hive is a data warehouse solution built on top of Hadoop for managing large volumes of data stored in HDFS. It provides an SQL-like query language called HiveQL.
At times, the query a user wants to write cannot be expressed easily, or at all, with Hive's built-in functions. In such scenarios, Hive allows users to write their own code, known as UDFs, to process data and return results through Hive queries.
There are three types of UDFs (User Defined Functions) in Hive:
- Regular UDF
- UDAF – User Defined Aggregation Function
- UDTF – User Defined Table-Generating Function
In this post, we will discuss how to implement a Hive UDTF that populates a table containing multiple values in a single column, keyed by a primary/unique id.
Some basic knowledge of Hive is necessary to understand the concepts below, so it is best to brush up on the following topics beforehand.
Differences Between UDF, UDAF and UDTF:
UDF:
UDF is a user-defined function that takes a single input value and produces a single output value. When used in a query, it is invoked once for each row in the result set.
Example:
input.toString().toUpperCase();
input.toString().toLowerCase();
The above calls convert a string to uppercase and to lowercase, respectively.
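As a minimal sketch of how such a UDF is packaged, assuming the classic UDF API (the class name ToUpper is illustrative, not from this post):
package com.Myhiveudtf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative example: returns the upper-cased form of its single string argument.
public class ToUpper extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}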
UDAF:
UDAF is a user-defined aggregate function that accepts a group of values and returns a single value. Users can implement UDAFs to summarize and condense sets of rows in the same style as the built-in COUNT, MAX, SUM, and AVG functions.
Example:
You can refer to our post on Hive UDAF to know more.
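As a quick illustration of the aggregate style (the table name phone_calls is hypothetical), a built-in UDAF condenses many rows into one per group:
-- one output row per id, summarizing all of that id's rows
SELECT id, COUNT(*) AS call_count
FROM phone_calls
GROUP BY id;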
UDTF:
UDTF is a User Defined Table-Generating Function that operates on a single row and produces multiple rows (a table) as output.
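Hive's built-in explode() is a familiar example of this; on Hive versions that allow SELECT without FROM, you can try it directly:
-- one input value (an array), three output rows: a, b, c
SELECT explode(array('a', 'b', 'c')) AS item;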
Problem Statement:
Let’s look at how a Hive UDTF works with the help of the example below. Each input record holds a unique key followed by one or more values, and the UDTF will emit one output row per key-value pair.
Data Set Used as Input in the Example:
Each input record is a single whitespace-separated string containing a unique id followed by one or two phone numbers.
Data Set Description:
- The unique id of a local resident.
- Phone number 1 of that resident.
- Phone number 2 of that resident (if present).
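For illustration, with made-up values, here are two input records and the rows the UDTF should produce from them:
Input (one whitespace-separated record per line):
101 9876543210 9123456780
102 9998887776
Expected output (id, phone_number):
101 9876543210
101 9123456780
102 9998887776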
Source Code:
We can create a custom Hive UDTF by extending the GenericUDTF abstract class and implementing the initialize(), process(), and close() methods.
package com.Myhiveudtf;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class Myudtf extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("Myudtf() takes exactly one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("Myudtf() takes a string as a parameter");
        }

        // input inspector
        stringOI = (PrimitiveObjectInspector) args[0];

        // output inspectors -- a struct with two fields: id and phone_number
        List<String> fieldNames = new ArrayList<String>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
        fieldNames.add("id");
        fieldNames.add("phone_number");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    public ArrayList<Object[]> processInputRecord(String id) {
        ArrayList<Object[]> result = new ArrayList<Object[]>();

        // ignore null or empty input
        if (id == null || id.isEmpty()) {
            return result;
        }

        String[] tokens = id.split("\\s+");
        if (tokens.length == 2) {
            // id followed by one phone number -> one output row
            result.add(new Object[] { tokens[0], tokens[1] });
        } else if (tokens.length == 3) {
            // id followed by two phone numbers -> two output rows
            result.add(new Object[] { tokens[0], tokens[1] });
            result.add(new Object[] { tokens[0], tokens[2] });
        }
        return result;
    }
Initialize()
Hive calls the initialize() method to notify the UDTF of the argument types to expect. The UDTF must then return an object inspector corresponding to the row objects it will generate.
    @Override
    public void process(Object[] record) throws HiveException {
        final String id = stringOI.getPrimitiveJavaObject(record[0]).toString();
        ArrayList<Object[]> results = processInputRecord(id);

        Iterator<Object[]> it = results.iterator();
        while (it.hasNext()) {
            Object[] r = it.next();
            forward(r);
        }
    }
Process()
Once initialize() has been called, Hive passes rows to the UDTF through the process() method. Inside process(), the UDTF produces output rows and forwards them to downstream operators by calling forward().
    @Override
    public void close() throws HiveException {
        // do nothing
    }
}
Close()
Finally, Hive calls the close() method once all rows have been passed to the UDTF. This allows for any cleanup that is necessary before the User Defined Table-Generating Function returns. Note that no rows can be forwarded from close().
In our example there is no state that needs cleaning up, so close() does nothing. We can now execute the example program.
Steps for Executing Hive UDTF:
Step 1: After writing the above code in Eclipse, add the required dependency jars (such as hive-exec and hadoop-common from your cluster) to the project's build path, then export the program as a jar file into the Hadoop environment.
Step 2: Create a table named ‘phone’ with a single column named ‘id’.
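A minimal sketch of that DDL:
CREATE TABLE phone (id STRING);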
Step 3: Load the input data set phn_num contents into the table phone.
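Assuming phn_num is a local text file with one record per line (the path below is illustrative):
LOAD DATA LOCAL INPATH '/home/user/phn_num' INTO TABLE phone;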
Step 4: Check whether the data has been loaded, using a SELECT statement.
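For example:
SELECT * FROM phone;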
Step 5: Add the jar file to the Hive session, using the complete path of the jar built above.
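The path and file name below are placeholders for wherever the jar was exported:
ADD JAR /home/user/hiveudtf.jar;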
Step 6: Create a temporary function as shown below.
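The function name here is arbitrary; the class name must match the class defined in the source code above:
CREATE TEMPORARY FUNCTION myudtf AS 'com.Myhiveudtf.Myudtf';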
Step 7: Use a SELECT statement with the new function to expand each record into one row per id/phone-number pair.
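Using the temporary function registered above (the output columns are named id and phone_number, as defined in initialize()):
SELECT myudtf(id) FROM phone;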
The output shows that each unique id is now paired with each of its phone numbers, one value per row.
We hope this post has been helpful in understanding Hive’s User Defined Table Generating Function (UDTF). In case of any queries, please comment below and we will get back to you at the earliest.
Keep visiting our website for more posts on Big Data and other technologies.