Are you fed up of using HIVE with lot of incomplete jobs? OR have lost output? OR Waiting too long to run another job ?
Well then the answer to all these questions and many more is this Blog which will help you in setting up MYSQL database connectivity with Hive and give you the flexibility required to use multiple Hive jobs running at the same time.
Before I go in depth let us first understand what Hive does.
Hive stores the metadata related to tables and databases into the external RDBMS like Apache Derby or MYSQL and metadata.
Now let us understand what the terms – metastore and database mean?
- The metastore service provides the interface to the Hive.
- The database stores the data definitions and mappings to the data.
The metastore (which consists of services and database) can be configured in different ways. Embedded Apache Derby is used as the default Hive metastore in the Hive configuration. This configuration is called embedded metastore and is good for the sake of development and unit testing, but won’t scale to a production environment as only a single user can connect to the derby database at any instant of time. Starting second instance of the Hive driver will throw an error message.
So what is Apache Derby?
Apache Derby, an Apache DB subproject, is an open source relational database implemented entirely in Java. Some key features include:
- Derby is based on the Java, JDBC, and SQL standards.
- Derby provides an embedded JDBC driver that lets you embed Derby in any Java-based solution.
- Derby also supports the more familiar client/server mode with the Derby Network Client JDBC driver and Derby Network Server.
- Derby is easy to install, deploy, and use.
Most importantly Derby is single instance database, which means only one user can access the derby instance at one time and this had been a motivational factor to include Mysql as the default metastore.
Advantages Of using Mysql as a metastore in Hive-
- It is Stable
- It keeps a track of metadata.
- It can support multiple instances of Hive.
Prerequisite:- Hive should be installed with Hadoop daemons running on Centos operating system.
In order to change the default metastore from Derby to Mysql we need to change the property in Hive-site.xml.
Since Hive-0.10, we get only hive-default.xml. We need to explicitly create Hive-site.xml to override the default property containing the configuration of Apache Derby.
- Use the below link to download Hive-site.xml and MYSQL-connector-java-5.1.2.jar.
https://drive.google.com/drive/folders/0B1QaXx7tpw3SZE0wekRzZnB1Unc
- Inside $HIVE_HOME/conf directory you can directly download Hive-site.xml.
Hive-site.xml
- MYSQL driver to connect with Hive, should be placed inside $HIVE_HOME/lib directory.
MYSQL-connector-java-5.1.2.jar
- Once the Hadoop daemons are installed and running give the following command
1 |
sudo service mysqld start |
- Log-in to MYSQL with root user and create an account.
1 |
mysql -u root |
NOTE: no password for root//by default it is null
This will give privileges for root user to be safe(security purpose).
create user acadgilduser@localhost identified by ‘password’;
- Giving permissions to user using command.
grant all on *.* to acadgilduser@localhost identified by ‘password’ with grant option;
- Followed by
1 |
flush privileges; |
This command simply reloads the grant tables in your MySQL database by enabling the changes to take effect without stopping and restarting MySQL.
and then
1 |
exit; |
- Log-in by the user created just now.
1 |
mysql -u acadgilduser -p |
Give password: password
- On another terminal open Hive, and create database by command.
1 |
Create database AcadGild1; |
You can check whether the Hive is in synchronization with MYSQL.
10. Log into root of MYSQL and go to metastore in databases.
11. Inside metastore database, list all tables.
See all contents inside DBS. This is where you can find metadata created by Hive.
Now your Hive and MYSQL is completely synchronized.
And You can use multiple Hive instance in system.
Leave a Reply