When it comes to statistical analysis, R is one of the most preferred options, and by integrating it with Hadoop we can use it for Big Data analytics. In this post, we will walk through the steps for integrating R with Hadoop and perform various operations on HDFS from the R console.
RHadoop is a collection of R packages that enable large-scale data operations from within an R environment. RHadoop consists of three main R packages, each of which offers different Hadoop features:
1. Rhdfs
2. Rmr
3. Rhbase
Rhdfs:
Rhdfs is an R package that provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. The Rhdfs package calls the HDFS API in the backend to operate on the data stored in HDFS. This package should be installed only on the node that will run the R client.
Rmr:
Rmr is an R package that allows R developers to perform statistical analysis in R via Hadoop's MapReduce functionality on a Hadoop cluster. With the help of this package, the job of an R programmer is reduced to dividing the application logic into map and reduce phases and submitting it with the Rmr methods. Rmr then calls the Hadoop streaming MapReduce API with job parameters such as the input directory, output directory, mapper, and reducer to run the R MapReduce job on the Hadoop cluster. This package should be installed on every node in the cluster.
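As a quick illustration, here is a minimal sketch of an rmr2 job (rmr2 is the current name of the Rmr package) that squares a vector of integers; it assumes rmr2 is installed and the HADOOP_CMD and HADOOP_STREAMING environment variables point to your Hadoop installation:
library(rmr2)
# Write a small vector of integers to HDFS
small.ints <- to.dfs(1:10)
# Map-only job: emit each value as the key and its square as the value
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v ^ 2))
# Pull the resulting key/value pairs back into the local R session
from.dfs(result)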
Rhbase:
Rhbase is an R interface for operating on Hadoop's HBase data source, accessed over the network via a Thrift server. The Rhbase package provides methods for initialization, read/write, and table manipulation operations. In this post, we will look into the Rhdfs package, which provides the basic connectivity to the Hadoop Distributed File System. Before delving deeper, let's look at how to set up RHadoop.
Steps for Setting Up RHadoop
The prerequisites for installing RHadoop are Hadoop and R. Assuming they are already installed, let's get started with the setup process.
Installing the Required R Packages
We need to install several R packages in order to connect R with Hadoop. The list of packages is as follows:
- rJava
- RJSONIO
- itertools
- digest
- Rcpp
- httr
- functional
- devtools
- plyr
- reshape2
We will discuss installing all of these packages in two different ways. They are as follows:
1. Using install.packages from R Console:
install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools', 'plyr', 'reshape2'), dependencies=TRUE, repos='http://cran.rstudio.com/')
Note: Before installing rJava, we should set the JAVA_HOME path and log in to R with sudo privileges.
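For example, JAVA_HOME can also be set from within R before installing rJava; the JDK path below is illustrative and depends on your system:
Sys.setenv(JAVA_HOME = '/usr/lib/jvm/java-8-openjdk-amd64')  # illustrative JDK path
install.packages('rJava', repos = 'http://cran.rstudio.com/')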
2. Downloading packages and installing through R CMD:
Download the required packages from the link below.
Link: https://drive.google.com/open?id=0B5dejdhAYHztRkgzbGZOeUdXdVE
After downloading the packages, extract them with the command below:
unzip Rhadoop_packages.zip
To install these packages, we will be using R CMD:
R CMD INSTALL <package name>
Now we will install rJava. Refer to the command below:
sudo R CMD INSTALL rJava_0.9-6.tar.gz
We need to follow the same command to install all the other required packages.
sudo R CMD INSTALL <package>.tar.gz
Note: Before installing rhdfs, we should set the HADOOP_CMD environment variable. Follow the steps below to install rhdfs.
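One way to do this from within R, assuming the rhdfs source tarball is in your working directory (the Hadoop path and the package version below are illustrative), is:
Sys.setenv(HADOOP_CMD = '/usr/local/hadoop/bin/hadoop')  # illustrative path to the hadoop binary
install.packages('rhdfs_1.0.8.tar.gz', repos = NULL, type = 'source')  # version number is illustrative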
To access HDFS, we should start the Hadoop daemons; make sure that all your HDFS daemons are up.
Check the files in HDFS from the command line.
Now we will access HDFS from the R console. The steps are as follows (a consolidated sketch follows the list):
1. Log in to the R console.
2. Set the environment variables.
3. Load the required package, rhdfs.
4. After loading the rhdfs package, initiate the connection using hdfs.init().
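Putting these steps together, a minimal session might look like this (the HADOOP_CMD path is illustrative and depends on your Hadoop installation):
Sys.setenv(HADOOP_CMD = '/usr/local/hadoop/bin/hadoop')  # illustrative path
library(rhdfs)   # load the rhdfs package
hdfs.init()      # initiate the connection to HDFS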
Accessing HDFS through the R console
Listing the files in the HDFS root directory:
hdfs.ls('/')
To get the HDFS default configurations used for this connection, use:
hdfs.defaults('conf')
File manipulation
- hdfs.put: This is used to copy files from the local filesystem to HDFS.
hdfs.put('localfile source', 'hdfs destination')
- hdfs.mkdir: This is used to create a new directory in HDFS.
hdfs.mkdir('/new_dir')
- hdfs.move: This is used to move a file from one HDFS directory to another.
hdfs.move('/test_file', '/new_dir/')
- hdfs.rename: This is used to rename a file stored in HDFS from R.
hdfs.rename('/new_dir/test_file', '/new_dir/test_file1')
- hdfs.chmod: This is used to change the permissions of a file or directory in HDFS.
hdfs.chmod('/Wc.txt', permissions='777')
- hdfs.delete: This is used to delete an HDFS file or directory from R.
hdfs.delete('/RHadoop')
Hope this blog helped you learn how to integrate R with Hadoop. Keep visiting our site for more updates on Big Data and other technologies.