Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Before going deep into this topic, readers are recommended to go through these blogs:
Getting ready
The prerequisites for following this blog are:
>>hadoop single node cluster running
>>mysql inside hadoop ecosystem
>>hive-0.7.0 or higher running
We will first create a table acad with the schema below. We will then dynamically create a second table in which the column sessionid is replaced by a new column, weblength.
create table acad(webpage string, sessionid int, sessionin string, sessionout string) row format delimited fields terminated by ',' lines terminated by '\n';
On checking for data inside the sample test data we see:
Load some sample data using the command syntax below. In this case we are using the dataset shown in the screenshot above.
load data local inpath '<filename>' into table <tablename>;
Eg:
load data local inpath '/home/hadoop/Desktop/test2' into table acad;
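For readers without a dataset at hand, the sketch below writes a small comma-delimited file that matches the acad schema and then loads it. The file path, the three rows, and the time format are invented for illustration; they are not the dataset used in this blog. The load step is skipped when the hive client is not on the PATH.

```shell
#!/bin/sh
# Hypothetical sample file; the rows below are invented for illustration only.
DATA_FILE="${DATA_FILE:-/tmp/acad_sample.csv}"

# Columns: webpage,sessionid,sessionin,sessionout (same order as the acad schema).
cat > "$DATA_FILE" <<'EOF'
/home,101,09:00:00,09:05:12
/courses/hive,102,09:01:30,09:20:45
/blog,103,09:10:00,09:11:03
EOF

# Load the file only if the Hive client is installed.
if command -v hive >/dev/null 2>&1; then
    hive -e "load data local inpath '$DATA_FILE' into table acad;"
else
    echo "hive client not found; skipping load of $DATA_FILE"
fi
```

Each line has four comma-separated values, so it lines up with the "fields terminated by ','" clause in the table definition.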
Enter the following command in the Hive shell:
describe acad;
You will see the following response:
OK
webpage(string)
sessionid(int)
sessionin(string)
sessionout(string)
Using Hive to dynamically create tables
This blog presents a technique for creating a table inline, at the time a query is executed. Creating every table definition up front is impractical and does not scale for large ETL jobs. Dynamically defining tables is very useful for complex analytics with multiple staging points.
As discussed above, we will create a column weblength dynamically, and it will take the place of the column sessionid from the table acad.
The new table contains three fields from acad:
webpage(string)
sessionin(string)
sessionout(string)
In addition to these, we will define a new field called weblength(int).
Carry out the following steps to create an inline table definition using an alias:
- Open a text editor of your choice.
- Add the following inline creation syntax:
create table <tablename> as select <columnname1>, <columnname2>, …, <expression> as <new_column_name> from acad;
Eg:
create table acad_with_length as select webpage,sessionin,sessionout,length(webpage) as weblength from acad;
- Save the script as <filename>.hql (here, acad_createtable_as.hql) in the current working directory.
- Run the script from the operating system shell by supplying the -f option to the Hive client, as follows: hive -f acad_createtable_as.hql
- To verify that the table was created successfully, issue the following command to the Hive client directly, using the -e option: hive -e "describe acad_with_length"
- You should see a table with three string fields and a fourth int field holding the length of the webpage value:
OK
webpage(string)
sessionin(string)
sessionout(string)
weblength(int)
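The steps above can also be scripted end to end. This is a minimal sketch, assuming the hive client is on the PATH and the acad table already exists. The "drop table if exists" line is an addition that is not part of the steps above; it is included only so the script can be rerun without failing on an existing table.

```shell
#!/bin/sh
# Write the create-table-as-select statement to a script in the current directory.
SCRIPT="acad_createtable_as.hql"

cat > "$SCRIPT" <<'EOF'
-- Assumption: dropping first makes the script rerunnable; not part of the original steps.
drop table if exists acad_with_length;
create table acad_with_length as
select webpage, sessionin, sessionout, length(webpage) as weblength
from acad;
EOF

# Run the script and describe the new table only if the Hive client is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f "$SCRIPT" && hive -e "describe acad_with_length;"
else
    echo "hive client not found; wrote $SCRIPT but did not run it"
fi
```

Remove the drop statement if the existing acad_with_length table must be preserved.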
Explanation of the scripts:
CREATE TABLE acad_with_length AS
The above statement first defines a new table named acad_with_length:
SELECT webpage,sessionin,sessionout , length(webpage) as weblength FROM acad;
We then define the body of this table as the result set of a SELECT statement. In this case, our SELECT statement simply takes the webpage, sessionin, and sessionout fields from each entry in the acad table. The field names are copied as the field names of our new table acad_with_length. We also define an additional field, aliased as weblength, that is calculated for each selected record. It stores an int value that represents the number of characters in the record's webpage field.
In one simple statement, we created a table with a subset of fields from our starting table, as well as a new derived field.
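Because the result is an ordinary table, the same pattern can be chained to build a further staging point. The sketch below is illustrative only: the table name acad_long_pages and the threshold of 10 characters are hypothetical, chosen to show a second create-table-as-select step that reads from acad_with_length.

```shell
#!/bin/sh
# Hypothetical second staging step; table name and threshold are illustrative.
STAGE_SCRIPT="acad_long_pages.hql"

cat > "$STAGE_SCRIPT" <<'EOF'
create table acad_long_pages as
select webpage, weblength
from acad_with_length
where weblength > 10;
EOF

# Run the staging script only if the Hive client is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f "$STAGE_SCRIPT"
else
    echo "hive client not found; wrote $STAGE_SCRIPT but did not run it"
fi
```

The new table takes its column names and types from the SELECT list, so no separate schema definition is needed for this stage either.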