08 March 2016

MySQL Metastore Integration With Hive

Are you fed up with Hive jobs that never complete? Have you lost output? Or are you waiting too long to run another job?

Well then, the answer to all these questions and many more is this blog post, which will help you set up MySQL database connectivity with Hive and give you the flexibility to run multiple Hive jobs at the same time.

Before going into depth, let us first understand how Hive handles its metadata.

Hive stores the metadata related to its tables and databases in a relational database (RDBMS) such as Apache Derby or MySQL. This store is called the metastore.

Now let us understand what the terms metastore service and database mean.

The metastore service provides the interface to Hive.
The database stores the data definitions and mappings to the data.

The metastore (which consists of the service and the database) can be configured in different ways. Embedded Apache Derby is the default Hive metastore. This configuration is called the embedded metastore and is good for development and unit testing, but it won’t scale to a production environment, because only a single user can connect to the Derby database at any instant. Starting a second instance of the Hive driver will throw an error.
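To make the default concrete, here is a minimal Python sketch that records the connection settings Hive uses for the embedded Derby metastore and classifies a JDBC URL by backend. The two property names and their Derby values are the ones found in Hive's default configuration; the helper functions metastore_backend and is_embedded_derby are hypothetical names introduced only for this illustration.

```python
# Connection settings Hive uses by default for the embedded Derby metastore.
DERBY_DEFAULTS = {
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=metastore_db;create=true",
    "javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.EmbeddedDriver",
}


def metastore_backend(jdbc_url: str) -> str:
    """Return the database part of a JDBC URL shaped like jdbc:<database>:<rest>."""
    parts = jdbc_url.split(":", 2)
    if len(parts) < 3 or parts[0] != "jdbc":
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    return parts[1]


def is_embedded_derby(jdbc_url: str) -> bool:
    """Hypothetical check: embedded Derby URLs have no //host part after jdbc:derby:."""
    rest = jdbc_url.split(":", 2)[2] if jdbc_url.count(":") >= 2 else ""
    return metastore_backend(jdbc_url) == "derby" and not rest.startswith("//")


if __name__ == "__main__":
    url = DERBY_DEFAULTS["javax.jdo.option.ConnectionURL"]
    print(metastore_backend(url), is_embedded_derby(url))
```

A URL for which is_embedded_derby returns True points at the single-user setup described above; a jdbc:mysql://... URL points at a database server that several Hive instances can share.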

So what is Apache Derby?

Apache Derby, an Apache DB subproject, is an open source relational database implemented entirely in Java. Some key features include:

Derby is based on the Java, JDBC, and SQL standards.
Derby provides an embedded JDBC driver that lets you embed Derby in any Java-based solution.
Derby also supports the more familiar client/server mode with the Derby Network Client JDBC driver and Derby Network Server.
Derby is easy to install, deploy, and use.

Most importantly, embedded Derby is a single-instance database, which means only one user can access the Derby instance at a time. This is the main motivation for replacing it with MySQL as the metastore.

Advantages of using MySQL as a metastore in Hive:

It is stable.
It keeps track of metadata.
It can support multiple instances of Hive.

Prerequisite: Hive should be installed, with the Hadoop daemons running, on the CentOS operating system.

To change the metastore from the default Derby to MySQL, we need to change the connection properties in hive-site.xml.

Since Hive 0.10, we get only hive-default.xml. We need to explicitly create hive-site.xml to override the default properties that contain the Apache Derby configuration.
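As a rough sketch of what such an override file could contain, the Python snippet below builds a hive-site.xml document with the four JDBC connection properties pointing at MySQL. The property names are Hive's standard ones, and com.mysql.jdbc.Driver is the driver class of MySQL Connector/J 5.x. The host, port, database name, user name and password are placeholders assumed for illustration, and build_hive_site is a hypothetical helper, not part of Hive.

```python
import xml.etree.ElementTree as ET

# Placeholder values: adjust host, database, user and password to your setup.
MYSQL_METASTORE = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hiveuser",
    "javax.jdo.option.ConnectionPassword": "hivepassword",
}


def build_hive_site(properties: dict) -> str:
    """Render properties as <configuration><property>...</property></configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the document; redirect it into hive-site.xml in Hive's conf directory.
    print('<?xml version="1.0"?>')
    print(build_hive_site(MYSQL_METASTORE))
```

Generating the file from a dictionary keeps the overrides in one place; writing the same four property elements by hand works equally well.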