R & hadoop: Install RHadoop on Hortonworks HDP 2.0

RHadoop is a collection of R packages developed by Revolution Analytics that lets you use the Hadoop environment for large-scale data operations from R. The available packages are:

rmr2 – allows you to use Hadoop MapReduce
rhdfs – allows R developers to use Hadoop HDFS
rhbase – provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase
plyrmr – enables you to perform data manipulation operations on Hadoop
quickcheck – helps you perform randomized unit testing in R; currently it can only be used to test the rmr2 package

In this post I will describe how to install RHadoop (rmr2, rhdfs, rhbase and plyrmr) on a Hortonworks Hadoop cluster using the HDP 2.0 sandbox. To avoid version issues, here are the versions of the applications and packages I’ll be using in this tutorial.

R 3.0.2
Hadoop 2.2.0.2.0.6.0-76
rmr2-2.3.0
rhdfs-1.0.8
rhbase-1.2.0
plyrmr-0.1.0

Download and Install Prerequisites

a) Download Hortonworks HDP 2.0 sandbox from here.
b) Install R and curl-devel (required for the ‘RCurl’ library)

[root@sandbox ~]# yum install R curl-devel

c) Once R is installed, open R and install the prerequisite packages for RHadoop

[root@sandbox ~]# R

> install.packages(c('RCurl','rJava','RJSONIO','itertools','digest','Rcpp','httr',
    'functional','devtools','plyr','dplyr','reshape2','R.methodsS3','hydroPSO','caTools','pryr'))

If you get a warning saying ‘package ‘pryr’ is not available (for R version 3.0.2)’, install it from GitHub using devtools:

>library(devtools)
>install_github("pryr")

d) Download rmr2-2.3.0, rhdfs-1.0.8, rhbase-1.2.0 and plyrmr-0.1.0 from RevolutionAnalytics
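
Before moving on, it can help to confirm that all four archives actually landed in your working directory. The sketch below assumes the archive file names match the ones used in the R CMD INSTALL commands in this post; missing_tarballs is a hypothetical helper name, not something shipped with RHadoop.

```shell
# missing_tarballs is a hypothetical helper, not part of RHadoop.
# It prints the name of each expected package archive that is not
# present in the given directory (default: the current directory).
missing_tarballs() {
    dir="${1:-.}"
    for f in rmr2_2.3.0.tar.gz rhdfs_1.0.8.tar.gz \
             plyrmr_0.1.0.tar.gz rhbase_1.2.0.tar.gz; do
        # Assumption: archives keep the names used in the install commands.
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

Calling missing_tarballs with no argument in the download directory prints nothing when every archive is present.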

Setting environment variables

Set the paths for HADOOP_CMD and HADOOP_STREAMING:

export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar

You can also set these from the R console using:

Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop") 
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar")
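
A wrong path in either variable tends to surface only later, when rmr2 or rhdfs is loaded, so a quick check up front can save time. The following is a minimal sketch; check_path_var is a hypothetical helper name, and it only tests that the variable is set and that the path exists.

```shell
# check_path_var is a hypothetical helper, not part of RHadoop or Hadoop.
# Given the NAME of an environment variable, it reports whether that
# variable is set and points at an existing path.
check_path_var() {
    name="$1"
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -n "$value" ] && [ -e "$value" ]; then
        echo "ok: $name=$value"
        return 0
    fi
    echo "missing: $name=${value:-<unset>}"
    return 1
}
```

After exporting both variables, running check_path_var HADOOP_CMD and check_path_var HADOOP_STREAMING should each print an "ok" line; a "missing" line usually means the version suffix of the streaming jar differs on your sandbox.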

Install RHadoop

a) rmr2

[root@sandbox rhadoop]# R CMD INSTALL rmr2_2.3.0.tar.gz

b) rhdfs

[root@sandbox rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz

c) plyrmr

[root@sandbox rhadoop]# R CMD INSTALL plyrmr_0.1.0.tar.gz

d) rhbase

Installing rhbase requires that you first build and install Thrift. The following command installs all the tools and libraries required to build and install the Thrift compiler.

[root@sandbox rhadoop]# sudo yum install automake libtool flex bison pkgconfig gcc-c++ \
    boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel

Download Thrift 0.8.0 from http://archive.apache.org/dist/thrift/0.8.0/ and build it:

[root@sandbox rhadoop]# tar -xzf thrift-0.8.0.tar.gz
[root@sandbox rhadoop]# cd thrift-0.8.0
[root@sandbox thrift-0.8.0]# ./configure
[root@sandbox thrift-0.8.0]# make
[root@sandbox thrift-0.8.0]# make install

For advanced installation options, check the Thrift installation documentation.

Update PKG_CONFIG_PATH:

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/

Verify the pkg-config path

[root@sandbox ~]# pkg-config --cflags thrift
-I/usr/local/include/thrift

Copy Thrift library

cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/

Install rhbase

[root@sandbox rhadoop]# R CMD INSTALL rhbase_1.2.0.tar.gz

Now your cluster is ready to run RHadoop applications.
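Note: remember to export the environment variables above (HADOOP_CMD, HADOOP_STREAMING and PKG_CONFIG_PATH) in each new session, or define them in .bashrc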
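
One way to make the variables permanent is to append the export lines to your shell profile, taking care not to duplicate them on repeated runs. This is only a sketch: add_line_once and persist_rhadoop_env are hypothetical helper names, and the paths are the ones used earlier in this post, so adjust them if your sandbox differs.

```shell
# add_line_once is a hypothetical helper: append a line to a file only
# if that exact line is not already present.
add_line_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# persist_rhadoop_env is also hypothetical. It writes the three export
# lines from this post into the profile file given as its argument.
persist_rhadoop_env() {
    profile="$1"
    add_line_once 'export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop' "$profile"
    add_line_once 'export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar' "$profile"
    add_line_once 'export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/' "$profile"
}
```

Running persist_rhadoop_env ~/.bashrc once and then opening a new shell (or sourcing the file) should make the variables available in every session; running it again leaves the file unchanged.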

Test RHadoop

To test the basic word count example using RHadoop, let’s download the complete works of Shakespeare from http://www.gutenberg.org/ebooks/100 and save it as a text file (e.g. complete_works_of_shakespeare.txt)

[root@sandbox rhadoop]# hadoop fs -mkdir /data
[root@sandbox rhadoop]# hadoop fs -put complete_works_of_shakespeare.txt /data/

Now, run this word count example R script to verify your installation:

## Load the libraries
library('rhdfs')
library('rmr2')

## Initialize rhdfs
hdfs.init()

## Word count MapReduce function
wordcount = function(input, output = NULL, pattern = " ") {
  # Map: split each line on the pattern and emit (word, 1) pairs
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # Reduce: sum the counts for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}

## Call the wordcount function with the path of the input file
wordcount('/data/complete_works_of_shakespeare.txt')

## Copy the temporary output path from your mapreduce output and paste it here
output <- from.dfs('/tmp/RtmpjGGL0K/fileddf53e34bffb')

## Check the word counts
wordcount <- data.frame(output$key, do.call('rbind', lapply(output$val, "[[", 1)))
names(wordcount) <- c("word", "count")
wordcount
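
As a rough sanity check on the MapReduce result, you can count words locally with standard shell tools and compare a few of the top entries. This is a sketch under assumptions: local_wordcount is a hypothetical helper, it splits on single spaces like the default pattern above, and it drops empty tokens, so totals may not match the R output exactly.

```shell
# local_wordcount is a hypothetical helper, unrelated to RHadoop.
# It reads text on stdin, splits it on spaces and line breaks, and
# prints "word<TAB>count" lines, highest count first.
local_wordcount() {
    tr ' ' '\n' \
        | awk 'length($0) > 0 { count[$0]++ }
               END { for (w in count) printf "%s\t%d\n", w, count[w] }' \
        | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

For example, local_wordcount < complete_works_of_shakespeare.txt | head prints the ten most frequent tokens from the local copy of the file.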
