Hadoop is an open-source framework for processing Big Data. These days, there are many Hadoop distributions to choose from, and one of them is the Apache Hadoop distribution, maintained by the Apache Software Foundation. This distribution is free and has a very large community behind it.
If an enterprise wants to deploy an Apache Hadoop distribution alongside its existing applications, it faces difficulties because Hadoop is written in Java and optimized for Linux-based operating systems. This can lead to an impedance mismatch between Hadoop and the current enterprise applications. For this reason, integrating the Hadoop ecosystem with an enterprise's existing components is not straightforward.
To solve this issue, a few companies came up with distribution models for Hadoop. There are three primary kinds of Hadoop distribution flavors:
- Companies that provide paid support and training for the Apache Hadoop distribution (Cloudera, Hortonworks, MapR, IBM, and others).
- Companies that provide a set of supporting tools for the deployment and management of Apache Hadoop as an alternative flavor (Cloudera, Hortonworks, MapR).
- Companies that enhance Apache Hadoop by adding vendor-specific features and code. These are paid enhancements, many of which address particular use cases (Cloudera, Hortonworks, MapR, IBM, and other companies that take up Hadoop projects).
The parent of all these distributions is open-source Apache Hadoop. Companies developing these distributions stay close to the Apache Hadoop project and follow its development.
The advantages of using them are as follows:
- These distributions generally test all features in depth and in a timely manner.
- They provide support, which saves administration and management costs for an organization.
The disadvantage of using a distribution other than Apache Hadoop is vendor lock-in.
The tools and vendor-specific features provided by one vendor might not be available in another distribution, or those features may not be compatible with other third-party tools, bringing in the cost of migration. The cost of migration is not limited to technology shifts alone; it also involves training, capacity planning, and re-architecting costs for the organization.
Now the big question is, “Which Hadoop distribution is right for your organization?”
Selecting a Hadoop distribution can seem pretty difficult, but it all comes down to which distribution best suits your Big Data needs.
Assuming Hadoop solves your problem, how do you decide which Hadoop distribution to use? They all look similar: each bundles more than a dozen open-source software components, runs on commodity hardware, and can handle broadly similar sets of analytical workloads. Yet there are differences in what you get for your money.
When evaluating Hadoop distributions, there are several criteria to consider:
How does it perform?
The Apache Hadoop distribution is written in Java and runs on the Java Virtual Machine (JVM). Though this increases application portability, it comes with some overhead: bytecode must be interpreted or JIT-compiled at run time, and the JVM has to perform garbage collection. As a result, it is not as fast as an application compiled directly for the target hardware.
To address this issue, some vendors optimize their distributions for particular hardware, increasing job performance per node. Features such as compression and decompression can also be optimized for certain hardware types.
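As a concrete illustration of the kind of tuning involved, intermediate map output compression in stock Apache Hadoop 2.x can be enabled per job through configuration. Below is a minimal sketch in Java; it assumes the Snappy native libraries are installed on the cluster nodes, which is exactly the sort of hardware-and-platform detail vendors optimize further.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class CompressionSettings {
    public static Configuration withMapOutputCompression() {
        Configuration conf = new Configuration();
        // Compress intermediate map output: less shuffle I/O, more CPU.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // SnappyCodec delegates to native libraries when they are present
        // (assumption: libsnappy is installed on every node).
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```

Whether Snappy, Gzip, or a vendor-optimized codec wins depends on the CPU-versus-I/O balance of the workload, which is why per-hardware tuning matters.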
Is it scalable?
Distributions should be scalable, i.e., a distribution should be able to expand its resources along both the compute and storage dimensions.
Ideally, scaling out a cluster should amount to adding more disks to existing nodes or adding new nodes to the cluster network. In practice, distributions may differ in the effort and cost required to scale a Hadoop cluster: scaling out carries significant administration and deployment costs, and these depend on the existing architecture and how well it complements and complies with the Hadoop distribution being evaluated.
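Before scaling out, it helps to know how full the existing nodes are. The sketch below uses the HDFS client API to print per-DataNode usage (the same information `hdfs dfsadmin -report` shows); it assumes the Hadoop configuration files on the classpath already point at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // One entry per live DataNode: used space versus raw capacity.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.printf("%s: %.1f%% of %d GB used%n",
                    node.getHostName(),
                    100.0 * node.getDfsUsed() / node.getCapacity(),
                    node.getCapacity() >> 30);
        }
    }
}
```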
Is it reliable?
Any distributed system is subject to partial failures. Failures can stem from hardware, software, or network issues, and the mean time between failures is shorter on commodity hardware.
A major weakness of Hadoop is that the NameNode, which locates and keeps track of all the other nodes holding a given data set, is a single point of failure (SPOF). In other words, if the NameNode fails, the data on the other nodes becomes inaccessible, because it cannot be located without the NameNode.
To overcome this issue, Hadoop 2.x introduced NameNode High Availability. In this approach, Hadoop runs two NameNodes, one active and one standby. If the active NameNode fails, the standby takes over quickly, removing the single point of failure.
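From a client's point of view, an HA cluster is addressed through a logical nameservice instead of a single NameNode host. The sketch below shows the relevant HDFS 2.x client settings in Java (these would normally live in hdfs-site.xml); the nameservice name "mycluster" and the host names are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
    public static FileSystem connect() throws Exception {
        Configuration conf = new Configuration();
        // Clients address the logical nameservice, not a single host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        // Hypothetical hosts for the two NameNodes.
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // The failover proxy provider retries against the standby NameNode.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha."
                        + "ConfiguredFailoverProxyProvider");
        return FileSystem.get(conf);
    }
}
```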
How manageable is it?
Deploying and managing the open-source Apache Hadoop distribution requires an internal understanding of its source code and configuration, which is not a widely available skill in IT administration. In addition, enterprise administrators are caretakers of a wide range of systems, Hadoop being only one of them.
Distributions offer integration with development and debugging tools. Developers and data scientists in an enterprise will already be using a set of tools, and the more overlap between that toolset and the distribution's, the better. The advantage of overlap comes not only in licensing costs but also in a reduced need for training and orientation. It can also increase productivity within the organization, as people are already accustomed to certain tools.
Now, let’s look at available Hadoop distributions in the market.
There are a number of distributions of Hadoop. A comprehensive list can be found at the following URL:
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
We will be examining five of them that are widely used:
- Apache Hadoop Distribution
- Cloudera Distribution of Hadoop (CDH)
- Hortonworks Data Platform (HDP)
- MapR
- Pivotal HD
Apache Hadoop Distribution
The Apache Hadoop distribution is developed by the Apache Software Foundation. It is available free of cost, has a very large community behind it, and serves as the base for all other distributions. The contributions of this community shape the Apache Hadoop distribution.
Deployment and management of the Apache Hadoop distribution within an enterprise requires internal understanding of the source code and configuration.
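Even a basic interaction with the stock distribution hints at why that knowledge matters: nothing works until fs.defaultFS and the client classpath are set up correctly. Below is a minimal sketch that lists the HDFS root directory; the NameNode address is a hypothetical placeholder that would normally come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually configured in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
```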
Cloudera Distribution of Hadoop (CDH)
Cloudera was formed in March 2009 with a primary objective of providing Apache Hadoop software, support, services, and training for enterprise-class deployment of Hadoop and its ecosystem components.
Cloudera brands its distribution as the Cloudera Distribution of Hadoop (CDH). Cloudera is one of the major sponsors of the Apache Software Foundation and pushes almost all upstream enhancements into its distribution. It also provides services around Hadoop deployment.
CDH is currently in its fifth major version (CDH 5) and is considered a mature Hadoop distribution. The paid version of CDH comes with proprietary management software, Cloudera Manager.
Hortonworks Data Platform (HDP)
Hortonworks was formed in June 2011 with objectives similar to Cloudera's. Its distribution is branded as the Hortonworks Data Platform (HDP). Hadoop and the other software in the HDP suite are completely free, with paid support and training. Hortonworks also pushes enhancements upstream, back to Apache Hadoop.
HDP is currently in its second major version and is considered the rising star among Hadoop distributions. It comes with free, open-source management software called Apache Ambari.
MapR
MapR was founded in 2009 with a mission to build an enterprise-grade Hadoop. Its distribution contains significant proprietary code compared with Apache Hadoop, and compatibility with existing Apache Hadoop projects is guaranteed for only a handful of components. The key piece of proprietary code in the MapR distribution is the replacement of HDFS with a POSIX-compliant file system that is accessible over NFS. Another key feature is the capability of taking snapshots.
MapR comes with its own management console. The different grades of the product are named M3, M5, and M7: M3 is a free version without high availability, M5 is the company's standard commercial distribution, and M7 is a paid version with a rewritten HBase API.
Pivotal HD
Greenplum is a marquee parallel data store from EMC. EMC integrated Greenplum with Hadoop, giving rise to an advanced Hadoop distribution called Pivotal HD. This move removed the need to import and export data between stores such as Greenplum and HDFS, bringing down both cost and latency.
HAWQ is a SQL database layer built on top of HDFS. It allows efficient, low-latency query execution on data stored in HDFS.
HAWQ has been reported to deliver up to 100 times better performance on certain MapReduce workloads when compared to Apache Hadoop. It also brings SQL processing to Hadoop, increasing its popularity among users who are already familiar with SQL.
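Because HAWQ is derived from Greenplum, which is itself PostgreSQL-based, a standard PostgreSQL JDBC driver can typically be used to query it. Below is a minimal sketch; the host, port, database, credentials, and table name are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a HAWQ master node.
        String url = "jdbc:postgresql://hawq-master:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM clickstream")) {
            if (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```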
We hope this blog gave you a brief overview of the major Hadoop distribution providers.
Keep visiting our website www.acadgild.com/blog for more blogs on Big Data and other technologies.