Hello!
I am trying to get RHadoop working on a Hadoop cluster.
This is a test cluster for development and proof-of-concept work.
It has 5 nodes, virtualized in VMware.
The OS on all nodes is CentOS 6.4.
To install RHadoop on the cluster, I followed these instructions:
https://github.com/RevolutionAnalytics/RHadoop/wiki
I did not encounter any errors during installation.
However, the following example analysis consistently fails with a Java heap space error:
# local sanity check: count how often each value occurs
groups = rbinom(100, n = 500, prob = 0.5)
tapply(groups, groups, length)

# the same count as a MapReduce job via rmr2
require('rmr2')
groups = rbinom(100, n = 500, prob = 0.5)
groups = to.dfs(groups)   # write the vector to HDFS
result = mapreduce(
  input = groups,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, length(vv)))
print(result())   # HDFS location of the result
print(from.dfs(result, to.data.frame = TRUE))
The code above is from this repo:
https://github.com/hortonworks/HDP-Public-Utilities/tree/master/Installation/r
Please find more information here:
https://raw.githubusercontent.com/manuel-at-coursera/mixedStuff/master/RHadoop_bugReport.md
Any help to get this solved would be very much appreciated!
Best,
Manuel
Reply from Manuel, October 1, 2014:
Yes, the issue seems to be resolved (or at least a work-around is available):
https://groups.google.com/forum/#!topic/rhadoop/E1-riwegvD4
Basically, it boils down to the fact that rmr2 uses its own memory settings, independent of what is set in Hadoop itself.
The feedback from Antonio (link above) was very helpful.
The following information might be helpful as well.
Within Hadoop I used these settings:
Number of containers: 2
RAM per container: 2048 MB
Configuration Setting                   Value Calculation               Value (MB)
yarn.nodemanager.resource.memory-mb     containers * RAM-per-container        4096
yarn.scheduler.minimum-allocation-mb    RAM-per-container                     2048
yarn.scheduler.maximum-allocation-mb    containers * RAM-per-container        4096
mapreduce.map.memory.mb                 RAM-per-container                     2048
mapreduce.reduce.memory.mb              2 * RAM-per-container                 4096
mapreduce.map.java.opts                 0.8 * RAM-per-container               1638
mapreduce.reduce.java.opts              0.8 * 2 * RAM-per-container           3277
yarn.app.mapreduce.am.resource.mb       2 * RAM-per-container                 4096
yarn.app.mapreduce.am.command-opts      0.8 * 2 * RAM-per-container           3277
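For reference, a small R snippet (my addition, not part of the original setup) that reproduces the table's arithmetic from the two inputs above; the java.opts values are 80% of the container size so the JVM heap fits inside the YARN container with some headroom:

containers = 2
ram = 2048  # RAM per container, in MB

round(c(
  "yarn.nodemanager.resource.memory-mb"  = containers * ram,   # 4096
  "yarn.scheduler.minimum-allocation-mb" = ram,                # 2048
  "yarn.scheduler.maximum-allocation-mb" = containers * ram,   # 4096
  "mapreduce.map.memory.mb"              = ram,                # 2048
  "mapreduce.reduce.memory.mb"           = 2 * ram,            # 4096
  "mapreduce.map.java.opts"              = 0.8 * ram,          # 1638 (1638.4 rounded)
  "mapreduce.reduce.java.opts"           = 0.8 * 2 * ram,      # 3277 (3276.8 rounded)
  "yarn.app.mapreduce.am.resource.mb"    = 2 * ram,            # 4096
  "yarn.app.mapreduce.am.command-opts"   = 0.8 * 2 * ram       # 3277
))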
For rmr2, I used the following to change all the settings in one go (launch R from the CentOS 6.4 shell with the command R, then run):

library(rmr2)
bp = rmr.options("backend.parameters")
bp$hadoop[1] = "mapreduce.map.java.opts=-Xmx1024M"
bp$hadoop[2] = "mapreduce.reduce.java.opts=-Xmx2048M"
bp$hadoop[3] = "mapreduce.map.memory.mb=1280"
bp$hadoop[4] = "mapreduce.reduce.memory.mb=2560"
rmr.options(backend.parameters = bp)
rmr.options("backend.parameters")   # print the options back to verify
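As a usage note (my sketch, based on my reading of the rmr2 documentation rather than on the thread above): backend.parameters can also be passed directly to mapreduce(), which lets you raise the memory for a single heavy job without changing the global options:

# hypothetical per-job override; the option names mirror the global settings above
result = mapreduce(
  input = groups,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, length(vv)),
  backend.parameters = list(
    hadoop = list(D = "mapreduce.map.java.opts=-Xmx1024M",
                  D = "mapreduce.reduce.java.opts=-Xmx2048M")))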