Install RHadoop on Hortonworks HDP 2.0

RHadoop is a collection of R packages developed by Revolution Analytics. It lets you use a Hadoop environment for large-scale data operations from R. The following packages are available:
  1. rmr2 – allows you to use Hadoop MapReduce
  2. rhdfs – allows R developers to use Hadoop HDFS
  3. rhbase – provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase
  4. plyrmr – enables you to perform data manipulation operations on Hadoop
  5. quickcheck – supports randomized unit testing for R; it can only be used to test the rmr2 package
In this post I will describe how to install RHadoop (rmr2, rhdfs, rhbase and plyrmr) on a Hortonworks Hadoop cluster using the HDP 2.0 sandbox. To avoid version issues, here are the versions of the applications and packages I’ll be using in this tutorial:
R 3.0.2
Hadoop 2.2.0.2.0.6.0-76
rmr2-2.3.0
rhdfs-1.0.8
rhbase-1.2.0
plyrmr-0.1.0

Download and Install Prerequisites

a) Download the Hortonworks HDP 2.0 sandbox.
b) Install R and curl-devel (required for the ‘RCurl’ library):
[root@sandbox ~]# yum install R curl-devel
c) Once R is installed, open R and install the prerequisite packages for RHadoop:
[root@sandbox ~]# R
> install.packages(c('RCurl','rJava','RJSONIO','itertools','digest','Rcpp','httr',
  'functional','devtools','plyr','dplyr','reshape2','R.methodsS3','hydroPSO','caTools','pryr'))
If you get a warning saying ‘package ‘pryr’ is not available (for R version 3.0.2)’, install it from GitHub with devtools instead (newer devtools releases expect the full repository name, "hadley/pryr"):
> library(devtools)
> install_github("pryr")
d) Download rmr2-2.3.0, rhdfs-1.0.8, rhbase-1.2.0 and plyrmr-0.1.0 from RevolutionAnalytics.
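Before moving on, it can help to confirm that all four tarballs actually landed in your working directory. The sketch below is only an illustration: missing_tarballs is a hypothetical helper name, and the file names are assumed to match the ones used by the R CMD INSTALL commands later in this post.

```shell
# Hypothetical helper: print the RHadoop tarballs that are NOT present
# in the given directory (defaults to the current directory).
# File names are assumed to match the R CMD INSTALL steps in this post.
missing_tarballs() {
    dir="${1:-.}"
    for f in rmr2_2.3.0.tar.gz rhdfs_1.0.8.tar.gz plyrmr_0.1.0.tar.gz rhbase_1.2.0.tar.gz; do
        # Print the name only when the file does not exist
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

Running `missing_tarballs .` from the download directory prints nothing when all four files are there.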

Setting environment variables

Set the HADOOP_CMD and HADOOP_STREAMING paths:
export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar
You can also set these from the R console using:
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar")
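The streaming jar's file name includes the exact Hadoop build number, so the path above will not match a sandbox with a different build. As a rough sketch (find_streaming_jar is a hypothetical helper, not part of Hadoop or RHadoop), you could look the jar up instead of typing its name:

```shell
# Hypothetical helper: print the first hadoop-streaming jar found in a
# directory, or return non-zero if there is none.
find_streaming_jar() {
    for jar in "$1"/hadoop-streaming*.jar; do
        # If the glob matched nothing, $jar is the literal pattern and -f fails
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}
```

On the sandbox this could be used as `HADOOP_STREAMING=$(find_streaming_jar /usr/lib/hadoop-mapreduce) && export HADOOP_STREAMING`, assuming the jar lives in the same directory as in this post.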

Install RHadoop

a) rmr2
[root@sandbox rhadoop]# R CMD INSTALL rmr2_2.3.0.tar.gz
b) rhdfs
[root@sandbox rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz
c) plyrmr
[root@sandbox rhadoop]# R CMD INSTALL plyrmr_0.1.0.tar.gz
d) rhbase
  • Installing rhbase requires that you build and install Thrift. The following command installs all the tools and libraries required to build and install the Thrift compiler:
[root@sandbox rhadoop]# sudo yum install automake libtool flex bison pkgconfig gcc-c++ \
  boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel
  • Unpack, build and install Thrift 0.8.0 (the thrift-0.8.0.tar.gz source archive must already be in the current directory):
[root@sandbox rhadoop]# tar -xzf thrift-0.8.0.tar.gz
[root@sandbox rhadoop]# cd thrift-0.8.0
[root@sandbox thrift-0.8.0]# ./configure
[root@sandbox thrift-0.8.0]# make
[root@sandbox thrift-0.8.0]# make install
For advanced installation options, check the Thrift installation documentation.
  • Update PKG_CONFIG_PATH:
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
  • Verify the pkg-config path:
[root@sandbox ~]# pkg-config --cflags thrift
-I/usr/local/include/thrift
  • Copy the Thrift library:
cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
  • Install rhbase
[root@sandbox rhadoop]# R CMD INSTALL rhbase_1.2.0.tar.gz
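A small note on the PKG_CONFIG_PATH step above: the plain export adds the directory again every time it runs and leaves a leading ':' when the variable starts out empty. The sketch below shows one way to avoid both; with_pkgconfig_dir is a hypothetical helper name introduced only for this illustration.

```shell
# Hypothetical helper: print a PKG_CONFIG_PATH value with the directory
# appended, skipping it if already listed and avoiding a leading ':'
# when the current value is empty.
with_pkgconfig_dir() {
    current="$1"
    dir="$2"
    case ":$current:" in
        *":$dir:"*) echo "$current" ;;       # already listed: keep as-is
        "::")       echo "$dir" ;;           # empty: use the directory alone
        *)          echo "$current:$dir" ;;  # otherwise: append
    esac
}
```

For example: `export PKG_CONFIG_PATH=$(with_pkgconfig_dir "$PKG_CONFIG_PATH" /usr/local/lib/pkgconfig)`.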
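Now your cluster is ready to run RHadoop applications.
Note: remember to export the environment variables used above (HADOOP_CMD, HADOOP_STREAMING and PKG_CONFIG_PATH) in every new shell session, or define them in .bashrc.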
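If you go the .bashrc route, it is easy to end up with the same export line several times after re-running the setup. Here is a minimal sketch of an idempotent append; append_once is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: append a line to a file unless an identical
# line is already there.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the line as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop' ~/.bashrc` can be run any number of times and still leaves a single line in the file.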

Test RHadoop

To test the basic word count example using RHadoop, let’s download the complete works of Shakespeare from http://www.gutenberg.org/ebooks/100 and save it as a text file (e.g. complete_works_of_shakespeare.txt), then copy it into HDFS:
[root@sandbox rhadoop]# hadoop fs -mkdir /data
[root@sandbox rhadoop]# hadoop fs -put complete_works_of_shakespeare.txt /data/
Now, run this word count example R script to verify your installation:

## Load the libraries
library('rhdfs')
library('rmr2')
## Initialize rhdfs
hdfs.init()
## Word count MapReduce function
wordcount = function(input, output = NULL, pattern = " ") {
  ## Map: split each line on the pattern and emit (word, 1)
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  ## Reduce: sum the counts for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}
## Call the wordcount function with the path of the input file
wordcount('/data/complete_works_of_shakespeare.txt')
## Copy the temporary output path from your mapreduce output and paste it here
output <- from.dfs('/tmp/RtmpjGGL0K/fileddf53e34bffb')
## Check the word counts (stored under a new name so the wordcount function is not overwritten)
word.counts <- data.frame(output$key, do.call('rbind', lapply(output$val, "[[", 1)))
names(word.counts) <- c("word", "count")
word.counts
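To sanity-check the MapReduce result, you can count the words in the local copy of the file with standard shell tools. This is only a rough cross-check and not part of RHadoop: local_wordcount is a hypothetical helper, and it assumes that splitting on single spaces and ignoring empty tokens is close enough to what the R script does, so small differences in edge cases are possible.

```shell
# Hypothetical cross-check: count space-separated words in a local text
# file and print "word count" pairs sorted by word.
local_wordcount() {
    tr -d '\r' < "$1" |        # drop carriage returns from CRLF files
        tr ' ' '\n' |          # one token per line
        grep -v '^$' |         # ignore empty tokens
        LC_ALL=C sort |        # group identical words together
        uniq -c |              # count each group
        awk '{print $2, $1}'   # reorder to "word count"
}
```

For example, `local_wordcount complete_works_of_shakespeare.txt | grep '^Hamlet '` prints the local count for one word, which you can compare with the same row in the R output.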

2 comments:

  1. ln /usr/lib/rstudio-server/bin/rstudio-server /etc/init.d/rstudio-server #make the link of rstudio-server in /etc/init.d

  2. export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk
