Before understanding what HBase is, we need to understand why HBase was introduced in the first place.
Prior to HBase, we had Relational Database Management Systems (RDBMS) from the late 1970s, and they helped a lot of companies implement solutions for their problems that are still in use today.
Even today there are many use cases where an RDBMS is the perfect tool, e.g., handling transactions.
Yet there are some problems, like handling big data, which cannot be solved with an RDBMS.
The Age of Big Data
We live in an era in which petabytes of data are generated daily from sources like social media and e-commerce. Because of this, companies are focusing on delivering more targeted information, such as recommendations or online ads, which influences their success as a business. With the emergence of new machine learning algorithms, the need to collect data has increased drastically, and technologies like Hadoop can process the collected data with ease.
In the past, due to the cost of storing data, companies used to ignore historical data: they retained only the last N days of data and kept everything else as backups on tape drives.
Because analytics was performed on such limited data, the resulting models were not effective.
A few companies, such as Google and Amazon, realized the importance of this data and started developing solutions for big data problems. These ideas were then implemented outside of Google as part of the open source Hadoop project: HDFS and MapReduce.
But Hadoop was mainly introduced for batch processing, and companies also needed a database that could serve real-time responses.
So Google came up with Bigtable, a column-oriented database, to address real-time queries.
Before going deep into HBase and its operations, let's first understand column-oriented databases.
Column-oriented databases differ from traditional row-oriented databases, where entire rows are stored contiguously.
In a column-oriented database, data is grouped by column, and successive values of a column are stored contiguously on disk.
Storing values on a per-column basis increases efficiency whenever not all of a row's values are needed.
In a column-oriented database, the values of one column tend to be very similar in nature, or even to vary only slightly between logical rows, which makes them much better candidates for compression than the heterogeneous values of row-oriented record structures.
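To make the distinction concrete, here is a small Python sketch (purely illustrative; this is not how HBase lays out bytes internally) contrasting a row-oriented and a column-oriented layout of the same records, and showing why similar column values compress well:

```python
# Three logical rows of a toy "customer" table.
rows = [
    {"id": 1, "city": "Boston", "amount": 15},
    {"id": 2, "city": "Boston", "amount": 20},
    {"id": 3, "city": "Newyork", "amount": 15},
]

# Row-oriented layout: each record's fields are stored together.
row_store = [(r["id"], r["city"], r["amount"]) for r in rows]

# Column-oriented layout: all values of one column are stored together.
col_store = {
    "id": [r["id"] for r in rows],
    "city": [r["city"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Reading only the "city" column touches one contiguous list
# instead of scanning through every full row.
cities = col_store["city"]

# Similar adjacent values also compress well, e.g. with run-length encoding:
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(run_length_encode(cities))  # [['Boston', 2], ['Newyork', 1]]
```

Here three city values collapse into two runs; in a row store the city values are interleaved with ids and amounts, so the same trick does not apply.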
Introduction to HBase
So far we have covered the background for using HBase and its efficiency in handling big data.
Now let's see what HBase is.
HBase is an open source implementation of Google's Bigtable, with slight modifications. HBase was created in 2007; it was initially a contribution to Hadoop, and it later became a top-level Apache project.
HBase is a distributed, column-oriented database built on top of the Hadoop file system, and it is horizontally scalable, meaning we can add new nodes to HBase as data grows.
It is well suited for sparse data sets, which are common in many big data use cases.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a primary key (the row key), and all access to HBase tables must use this row key.
It is part of the Hadoop ecosystem and provides random, real-time read/write access to data in the Hadoop file system.
The table below summarizes the differences between HBase and an RDBMS.
HBase vs. RDBMS
HBase | RDBMS |
HBase is schema-less: apart from column families, there is no predefined schema for tables. | RDBMS tables have a fixed schema, which describes the whole structure of the tables. |
It is built for wide tables and is horizontally scalable. | It is thin and built for small tables; it is hard to scale. |
There are no transactions in HBase. | An RDBMS is transactional. |
It holds de-normalized data. | It holds normalized data. |
It is good for semi-structured as well as structured data. | It is good for structured data. |
Before proceeding further, we will install HBase. Click here to download the HBase installation document.
We can interact with HBase in two ways:
- through the HBase interactive shell
- through the HBase Java client API
In this blog we will interact with HBase through the HBase shell.
The HBase shell is built on JRuby (JRuby is the Java implementation of Ruby); we can start it with the command below:
$HBASE_HOME/bin/hbase shell
[acadgild@localhost Downloads]$ hbase shell
2015-12-15 10:39:46,050 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.14-hadoop2, r4e4aabb93b52f1b0fef6b66edd06ec8923014dec, Tue Aug 25 22:35:44 PDT 2015
hbase(main):001:0>
The version command displays the version of HBase:
hbase(main):001:0> version
0.98.14-hadoop2, r4e4aabb93b52f1b0fef6b66edd06ec8923014dec, Tue Aug 25 22:35:44 PDT 2015
The list command lists all the tables present in HBase:
hbase(main):002:0> list
TABLE
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2015-12-15 10:40:35,051 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
customer
1 row(s) in 2.8550 seconds
=> ["customer"]
hbase(main):003:0>
Now we will look at the basic HBase commands. Before going to the commands, let's look at the structure of an HBase table:
Column – a single field in a table
Column family – a group of columns
Row key – a mandatory field that serves as the unique identifier for every record
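As a rough mental model (a deliberate simplification that ignores timestamps and physical layout), an HBase table can be pictured as a nested map keyed by row key, then column family, then column qualifier. The sketch below uses hypothetical helper functions to mimic the shell's put and get:

```python
# A toy model of the HBase table structure described above:
# table -> row key -> column family -> column qualifier -> value.
table = {}

def put(table, row_key, column, value):
    """Insert a value; column is given as 'family:qualifier'."""
    family, qualifier = column.split(":")
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

def get(table, row_key, column=None):
    """Fetch a whole row, one column family, or one cell."""
    row = table[row_key]
    if column is None:
        return row                      # whole row
    if ":" in column:
        family, qualifier = column.split(":")
        return row[family][qualifier]   # single cell
    return row[column]                  # whole column family

put(table, "john", "address:city", "Boston")
put(table, "john", "order:number", "ORD-15")

print(get(table, "john", "address:city"))  # Boston
print(get(table, "john", "address"))       # {'city': 'Boston'}
```

The row key is the outermost lookup key, which is why every read and write in HBase starts from it.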
Creating a table in HBase
Syntax: create '<table-name>','<column-family1>','<column-family2>' ...
In the HBase data model, columns are grouped into column families, which must be defined at table creation time. A table must have at least one column family. HBase currently does not do well with more than two or three column families, so keep the number of column families in your schema low.
hbase(main):021:0> create 'customer','address','order'
0 row(s) in 0.4030 seconds
=> Hbase::Table - customer

hbase(main):022:0> list
TABLE
customer
1 row(s) in 0.0110 seconds
=> ["customer"]
Inserting data into HBase
We can insert data into HBase using the put command.
Syntax: put '<table-name>','<row-key>','<column-family>:<column-name>','<value>'
In the example below, customer is the table name and john is the row key, followed by the column family, the column, and its value.
hbase(main):026:0> put 'customer','john','address:city','Boston'
0 row(s) in 0.0290 seconds

hbase(main):027:0> put 'customer','john','address:state','Mashitushes'
0 row(s) in 0.0060 seconds

hbase(main):028:0> put 'customer','john','address:street','street1'
0 row(s) in 0.0130 seconds

hbase(main):029:0> put 'customer','john','order:number','ORD-15'
0 row(s) in 0.0260 seconds

hbase(main):030:0> put 'customer','john','order:amount','15'
0 row(s) in 0.0120 seconds
Inserting a second record:
hbase(main):034:0> put 'customer','Finch','address:city','Newyork'
0 row(s) in 0.0060 seconds

hbase(main):035:0> put 'customer','Finch','address:state','Newyork'
0 row(s) in 0.0060 seconds

hbase(main):036:0> put 'customer','Finch','order:number','ORD-16'
0 row(s) in 0.0090 seconds

hbase(main):037:0> put 'customer','Finch','order:amount','15'
0 row(s) in 0.0080 seconds
Getting a single record from a table
We use the get command to retrieve a single record from an HBase table.
Syntax: get '<table-name>','<row-key>','<column-family>'
hbase(main):043:0> get 'customer','john'
COLUMN          CELL
 address:city   timestamp=1450143157606, value=Boston
 address:state  timestamp=1450143185560, value=Mashitushes
 address:street timestamp=1450143246875, value=street1
 order:amount   timestamp=1450143320786, value=15
 order:number   timestamp=1450143305944, value=ORD-15
5 row(s) in 0.0180 seconds
Using the get command to retrieve the address of john:
hbase(main):044:0> get 'customer','john','address'
COLUMN          CELL
 address:city   timestamp=1450143157606, value=Boston
 address:state  timestamp=1450143185560, value=Mashitushes
 address:street timestamp=1450143246875, value=street1
3 row(s) in 0.0330 seconds
Using the get command to retrieve the city of john:
hbase(main):045:0> get 'customer','john','address:city'
COLUMN        CELL
 address:city timestamp=1450143157606, value=Boston
1 row(s) in 0.0060 seconds
To get all the records from a table, we use the scan command.
Syntax: scan '<table-name>'
hbase(main):041:0> scan 'customer'
ROW    COLUMN+CELL
 Finch column=address:city, timestamp=1450143461624, value=Newyork
 Finch column=address:state, timestamp=1450143466906, value=Newyork
 Finch column=order:amount, timestamp=1450143490833, value=15
 Finch column=order:number, timestamp=1450143479920, value=ORD-16
 john  column=address:city, timestamp=1450143157606, value=Boston
 john  column=address:state, timestamp=1450143185560, value=Mashitushes
 john  column=address:street, timestamp=1450143246875, value=street1
 john  column=order:amount, timestamp=1450143320786, value=15
 john  column=order:number, timestamp=1450143305944, value=ORD-15
2 row(s) in 0.0230 seconds
Deleting records
Deleting an entire record from a table:
delete '<table-name>','<row-key>'

hbase(main):046:0> delete 'customer','Finch'
0 row(s) in 0.0270 seconds
Deleting a specific column from a table:
hbase(main):046:0> delete 'customer','john','address:city'
0 row(s) in 0.0270 seconds
Counting the number of rows in a table:
hbase(main):047:0> count 'customer'
2 row(s) in 0.0320 seconds
Versions in HBase
In a traditional database, updating a table means replacing the previous value with a new one. In HBase, however, rewriting a column value does not overwrite the existing value; instead, HBase stores multiple values per cell, distinguished by timestamp (and qualifier). Excess versions are removed during major compaction. The maximum number of versions may need to be increased or decreased depending on application needs.
The default number of versions in HBase is 1; we can increase or decrease the number of versions to be stored using the alter command:
hbase(main):048:0> alter 'customer',NAME=>'address',VERSIONS=>5
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2290 seconds
hbase(main):049:0> put 'customer','Finch','address:city','Newyork'
0 row(s) in 0.0190 seconds

hbase(main):050:0> put 'customer','Finch','address:city','Detroit'
0 row(s) in 0.0090 seconds

hbase(main):051:0> put 'customer','Finch','address:city','Sanfranscisco'
0 row(s) in 0.0110 seconds

hbase(main):052:0> scan 'customer',{COLUMN=>'address:city',VERSIONS=>2}
ROW    COLUMN+CELL
 Finch column=address:city, timestamp=1450147800933, value=Sanfranscisco
 Finch column=address:city, timestamp=1450147785900, value=Detroit
 john  column=address:city, timestamp=1450143157606, value=Boston
2 row(s) in 0.0170 seconds

hbase(main):053:0> scan 'customer',{COLUMN=>'address:city',VERSIONS=>1}
ROW    COLUMN+CELL
 Finch column=address:city, timestamp=1450147800933, value=Sanfranscisco
 john  column=address:city, timestamp=1450143157606, value=Boston
2 row(s) in 0.0170 seconds

hbase(main):054:0> scan 'customer',{COLUMN=>'address:city',VERSIONS=>3}
ROW    COLUMN+CELL
 Finch column=address:city, timestamp=1450147800933, value=Sanfranscisco
 Finch column=address:city, timestamp=1450147785900, value=Detroit
 Finch column=address:city, timestamp=1450147775468, value=Newyork
 john  column=address:city, timestamp=1450143157606, value=Boston
2 row(s) in 0.0140 seconds
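This versioning behavior can be sketched with a small Python model (a conceptual toy only: real HBase stores versions as timestamped cells and prunes excess versions lazily, during major compaction, rather than on every write):

```python
class VersionedCell:
    """Toy model of an HBase cell that keeps up to max_versions values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, value, timestamp):
        self.versions.insert(0, (timestamp, value))
        # Drop the oldest versions beyond the limit (HBase does this
        # during major compaction rather than immediately).
        self.versions = self.versions[: self.max_versions]

    def get(self, versions=1):
        """Return up to `versions` values, newest first."""
        return [value for _, value in self.versions[:versions]]

# Mirror the shell session: three puts to Finch's address:city.
city = VersionedCell(max_versions=5)
city.put("Newyork", timestamp=1)
city.put("Detroit", timestamp=2)
city.put("Sanfranscisco", timestamp=3)

print(city.get())            # ['Sanfranscisco'] -- a plain read sees the newest value
print(city.get(versions=3))  # ['Sanfranscisco', 'Detroit', 'Newyork']
```

A plain get or scan only returns the newest version, which is why asking for VERSIONS=>3 above was needed to see all three cities.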
Dropping a table
Before dropping a table, we must first disable it.
disable '<table-name>'
disable 'customer'
drop '<table-name>'
drop 'customer'
We hope this blog helped you get a brief overview of HBase and how it fits into Hadoop.