When it comes to statistical analysis, R is one of the most preferred options, and by integrating it with Hadoop we can use it for Big Data analytics. In this post, we will walk through the steps for integrating R with Hadoop and perform various operations on HDFS from the R console.
RHadoop is a collection of R packages that enable large-scale data operations from within an R environment. RHadoop consists of three main R packages, each of which offers different Hadoop features:
1. Rhdfs
2. Rmr
3. Rhbase
Rhdfs:
Rhdfs is an R package that provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. The Rhdfs package calls the HDFS API in the backend to operate on the data stored in HDFS. This package should be installed only on the node that will run the R client.
Rmr:
Rmr is an R package that allows R developers to perform statistical analysis in R via Hadoop's MapReduce functionality on a Hadoop cluster. With the help of this package, the job of an R programmer is reduced to dividing the application logic into map and reduce phases and submitting it with the Rmr methods. Rmr then calls the Hadoop streaming MapReduce API with job parameters such as the input directory, output directory, mapper, and reducer to run the R MapReduce job on the Hadoop cluster. This package should be installed on every node in the cluster.
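As a quick illustration, here is a minimal sketch of an rmr2 job (rmr2 is the current name of the Rmr package) that squares a vector of integers; it assumes rmr2 is installed and the HADOOP_CMD and HADOOP_STREAMING environment variables point to your Hadoop installation:
library(rmr2)
# Write a small vector of integers to HDFS
small.ints <- to.dfs(1:10)
# Map-only job: emit each value as the key and its square as the value
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v ^ 2))
# Pull the resulting key/value pairs back into the local R session
from.dfs(result)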
Rhbase:
Rhbase is an R interface for operating on Hadoop's HBase data source, accessed over the network via a Thrift server. The Rhbase package provides methods for initialization, read/write, and table manipulation operations. In this post, we will look into the Rhdfs package, which provides the basic connectivity to the Hadoop Distributed File System. Before delving deeper, let's look at how to set up RHadoop.
Steps for Setting Up RHadoop
The prerequisites for installing RHadoop are Hadoop and R. Assuming they are already installed, let's get started with the setup process.
Installing the Required R Packages
We need to install several R packages in order to connect R with Hadoop. The list of packages is as follows:
- rJava
- RJSONIO
- itertools
- digest
- Rcpp
- httr
- functional
- devtools
- plyr
- reshape2
We will discuss installing all of these packages in two different ways. They are as follows:
1. Using install.packages from R Console:
install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools', 'plyr', 'reshape2'), dependencies=TRUE, repos='http://cran.rstudio.com/')
Note: Before installing rJava, we should set the JAVA_HOME path and log in to R with sudo privileges.
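For example, JAVA_HOME can also be set from within R before installing rJava; the JDK path below is illustrative and depends on your system:
Sys.setenv(JAVA_HOME = '/usr/lib/jvm/java-8-openjdk-amd64')  # illustrative JDK path
install.packages('rJava', repos = 'http://cran.rstudio.com/')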
2. Downloading packages and installing through R CMD:
Download the required packages from the link below.
Link: https://drive.google.com/open?id=0B5dejdhAYHztRkgzbGZOeUdXdVE
After downloading the packages, extract them with the command below:
unzip Rhadoop_packages.zip
To install these packages, we will be using R CMD:
R CMD INSTALL <package name>
Now we will install rJava. Refer to the command below:
sudo R CMD INSTALL rJava_0.9-6.tar.gz
We need to follow the same command to install all the other required packages.
sudo R CMD INSTALL <package>.tar.gz
Note: Before installing rhdfs, we should set the HADOOP_CMD environment variable. Follow the steps below to install rhdfs.
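One way to do this from within R, assuming the rhdfs source tarball is in your working directory (the Hadoop path and the package version below are illustrative), is:
Sys.setenv(HADOOP_CMD = '/usr/local/hadoop/bin/hadoop')  # illustrative path to the hadoop binary
install.packages('rhdfs_1.0.8.tar.gz', repos = NULL, type = 'source')  # version number is illustrative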
To access HDFS, we should start the Hadoop daemons; make sure that all your HDFS daemons are up.
Check the files in HDFS from the command line.
Now we will access HDFS from the R console. The steps are as follows (a consolidated sketch follows the list):
1. Log in to the R console.
2. Set the environment variables.
3. Load the required package, rhdfs.
4. After loading the rhdfs package, initiate the connection using hdfs.init().
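Putting these steps together, a minimal session might look like this (the HADOOP_CMD path is illustrative and depends on your Hadoop installation):
Sys.setenv(HADOOP_CMD = '/usr/local/hadoop/bin/hadoop')  # illustrative path
library(rhdfs)   # load the rhdfs package
hdfs.init()      # initiate the connection to HDFS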
Accessing HDFS through the R console
Listing the files in the HDFS root directory:
hdfs.ls('/')
To get the HDFS default configurations used for this connection, use:
hdfs.defaults('conf')
File manipulation
- hdfs.put: This is used to copy files from the local filesystem to HDFS.
hdfs.put('localfile source', 'hdfs destination')
- hdfs.mkdir: This is used to create a new directory in HDFS.
hdfs.mkdir('/new_dir')
- hdfs.move: This is used to move a file from one HDFS directory to another.
hdfs.move('/test_file', '/new_dir/')
- hdfs.rename: This is used to rename a file stored in HDFS from R.
hdfs.rename('/new_dir/test_file', '/new_dir/test_file1')
- hdfs.chmod: This is used to change the permissions of a file or directory in HDFS.
hdfs.chmod('/Wc.txt', permissions='777')
- hdfs.delete: This is used to delete an HDFS file or directory from R.
hdfs.delete('/RHadoop')
Hope this blog helped you learn how to integrate R with Hadoop. Keep visiting our site for more updates on Big Data and other technologies.