RHadoop is a collection of R packages developed by Revolution Analytics.
It lets you use the Hadoop environment for large-scale data operations
from R. The available packages are:
- rmr2 – allows you to use Hadoop MapReduce
- rhdfs – allows R developers to use Hadoop HDFS
- rhbase – provides basic connectivity to HBase through the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase
- plyrmr – enables you to perform data manipulation operations on Hadoop
- quickcheck – helps you perform randomized unit testing in R; currently it can only be used to test the rmr2 package
In this post I will describe how to install RHadoop (rmr2, rhdfs, rhbase
and plyrmr) on a Hortonworks Hadoop cluster, using the HDP 2.0 sandbox.
To avoid version issues, here are the versions of the applications and
packages I'll be using in this tutorial:
- R 3.0.2
- Hadoop 2.2.0.2.0.6.0-76
- rmr2-2.3.0
- rhdfs-1.0.8
- rhbase-1.2.0
- plyrmr-0.1.0
Download and Install Prerequisites
a) Download the Hortonworks HDP 2.0 sandbox.
b) Install R and curl-devel (required for the 'RCurl' library)
[root@sandbox ~]# yum install R curl-devel
c) Once R is installed, open R and install the prerequisite packages for RHadoop
[root@sandbox ~]# R
> install.packages(c('RCurl','rJava','RJSONIO','itertools','digest','Rcpp','httr',
  'functional','devtools','plyr','dplyr','reshape2','R.methodsS3','hydroPSO','caTools','pryr'))
If you get a warning saying "package 'pryr' is not available (for R version 3.0.2)", install it from GitHub with devtools:
> library(devtools)
> install_github("pryr")
d) Download rmr2-2.3.0, rhdfs-1.0.8, rhbase-1.2.0 and plyrmr-0.1.0 from RevolutionAnalytics
Setting environment variables
Set the paths of HADOOP_CMD and HADOOP_STREAMING:
export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar
You can also set these from the R console using:
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar")
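A wrong path in either variable usually only shows up later, when the first job is launched. The following is a minimal sketch of a sanity check you could run in the shell before installing the packages. The function name check_hadoop_env is hypothetical and not part of RHadoop; it only assumes that both variables should point to existing files.

```shell
#!/bin/sh
# check_hadoop_env: hypothetical helper, not part of RHadoop.
# Returns 0 if HADOOP_CMD and HADOOP_STREAMING are both set and point to
# existing files; otherwise prints what is wrong and returns 1.
check_hadoop_env() {
    status=0
    for name in HADOOP_CMD HADOOP_STREAMING; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -f "$value" ]; then
            echo "$name points to a missing file: $value" >&2
            status=1
        fi
    done
    return "$status"
}
```

After exporting the two variables, calling check_hadoop_env prints nothing when both paths exist and names the offending variable otherwise.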
Install RHadoop
a) rmr2
[root@sandbox rhadoop]# R CMD INSTALL rmr2_2.3.0.tar.gz
b) rhdfs
[root@sandbox rhadoop]# R CMD INSTALL rhdfs_1.0.8.tar.gz
c) plyrmr
[root@sandbox rhadoop]# R CMD INSTALL plyrmr_0.1.0.tar.gz
d) rhbase
- Installing rhbase requires that you first build and install Thrift. The following command installs all the tools and libraries needed to build and install the Thrift compiler.
[root@sandbox rhadoop]# sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel openssl-devel
- Download Thrift 0.8.0 from http://archive.apache.org/dist/thrift/0.8.0/, then unpack, build and install it
[root@sandbox rhadoop]# tar -xzf thrift-0.8.0.tar.gz
[root@sandbox rhadoop]# cd thrift-0.8.0
[root@sandbox thrift-0.8.0]# ./configure
[root@sandbox thrift-0.8.0]# make
[root@sandbox thrift-0.8.0]# make install
For advanced installation options, check out the Thrift documentation.
- Update PKG_CONFIG_PATH:
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
- Verify the pkg-config path; the command should print the Thrift include flag
[root@sandbox ~]# pkg-config --cflags thrift
-I/usr/local/include/thrift
- Copy Thrift library
cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
- Install rhbase
[root@sandbox rhadoop]# R CMD INSTALL rhbase_1.2.0.tar.gz
Now your cluster is ready to run RHadoop applications.
Note: remember to export the environment variables above in every new shell session, or define them in .bashrc
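One way to define the variables in .bashrc without piling up duplicate lines on repeated runs is sketched below. The helper add_export is hypothetical (it is not part of RHadoop or HDP) and assumes that values contain no spaces or shell metacharacters, which holds for the two Hadoop paths used in this post.

```shell
#!/bin/sh
# add_export: hypothetical helper that appends "export NAME=VALUE" to a
# profile file only if that exact line is not already present.
add_export() {
    profile="$1"
    line="export $2=$3"
    touch "$profile"    # make sure the file exists before searching it
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -e "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Example calls (commented out so that sourcing this file changes nothing):
# add_export ~/.bashrc HADOOP_CMD /usr/lib/hadoop/bin/hadoop
# add_export ~/.bashrc HADOOP_STREAMING /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar
```

Because the check matches the whole line, changing a value adds a second export line rather than replacing the first; the later line wins when .bashrc is sourced.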
Test RHadoop
To test the basic word count example using RHadoop, let's download the complete works of Shakespeare from http://www.gutenberg.org/ebooks/100 and save it as a text file (e.g. complete_works_of_shakespeare.txt), then copy it to HDFS:
[root@sandbox rhadoop]# hadoop fs -mkdir /data
[root@sandbox rhadoop]# hadoop fs -put complete_works_of_shakespeare.txt /data/
Now run a word count example R script to verify your installation.
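As a local cross-check for the word count job, the following shell sketch counts words in the same text file without using Hadoop or rmr2. The function name top_words is hypothetical, and the tokenization (split on anything that is not a letter, then lowercase) is an assumption; a MapReduce script that splits words differently will report somewhat different counts.

```shell
#!/bin/sh
# top_words FILE [N]: hypothetical helper that prints "<count> <word>" for
# the N most frequent words in FILE (default 10), most frequent first.
top_words() {
    file="$1"
    n="${2:-10}"
    tr -cs '[:alpha:]' '\n' < "$file" |   # "map": emit one word per line
        tr '[:upper:]' '[:lower:]' |      # normalize case
        grep -v '^$' |                    # drop empty lines
        sort |                            # "shuffle": group identical words
        uniq -c |                         # "reduce": count each group
        sort -k1,1nr -k2,2 |              # highest count first, ties by word
        head -n "$n" |
        awk '{print $1, $2}'              # strip the padding added by uniq -c
}
```

Running top_words complete_works_of_shakespeare.txt 10 on the local copy of the file gives a short list to compare against the most frequent words reported by the Hadoop job.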
ln /usr/lib/rstudio-server/bin/rstudio-server /etc/init.d/rstudio-server #make the link of rstudio-server in /etc/init.d
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk