Wednesday 17 December 2014

RHadoop not working: Java heap space error



    Hello!

    I am trying to get RHadoop working on a Hadoop cluster.

    This is a test cluster for development / proof of concept.
    It has 5 nodes, virtualized in VMware.
    The OS on all nodes is CentOS 6.4.

    To install R on Hadoop, I followed these instructions:

    https://github.com/RevolutionAnalytics/RHadoop/wiki

    I did not encounter any errors during installation.
    However, the following example analysis consistently fails with a Java heap space error:

    # local sanity check, no Hadoop involved
    groups = rbinom(100, n = 500, prob = 0.5)
    tapply(groups, groups, length)
    # same computation via rmr2 on the cluster
    require('rmr2')
    groups = rbinom(100, n = 500, prob = 0.5)
    groups = to.dfs(groups)
    result = mapreduce(
      input = groups,
      map = function(k, v) keyval(v, 1),
      reduce = function(k, vv) keyval(k, length(vv)))
    print(result())
    print(from.dfs(result, to.data.frame = T))

    The code above is from this repo:

    https://github.com/hortonworks/HDP-Public-Utilities/tree/master/Installation/r

    Please find more information here:

    https://raw.githubusercontent.com/manuel-at-coursera/mixedStuff/master/RHadoop_bugReport.md

    Any help to get this solved would be very much appreciated!

    Best,

    Manuel

    October 1, 2014 at 4:17 am #61153

    Manuel
    Participant

    Yes, the issue seems to be resolved (or at least a work-around is available):

    https://groups.google.com/forum/#!topic/rhadoop/E1-riwegvD4

    Basically, it boils down to rmr2 using its own memory settings, independent of what is set in Hadoop itself.
    The feedback from Antonio (link above) was very helpful.

    The following information might be helpful as well.

    Within Hadoop I used these settings:

    Number of containers: 2
    RAM per container: 2048 MB

    Configuration Setting                   Value Calculation               Value (MB)
    yarn.nodemanager.resource.memory-mb     containers * RAM-per-container        4096
    yarn.scheduler.minimum-allocation-mb    RAM-per-container                     2048
    yarn.scheduler.maximum-allocation-mb    containers * RAM-per-container        4096
    mapreduce.map.memory.mb                 RAM-per-container                     2048
    mapreduce.reduce.memory.mb              2 * RAM-per-container                 4096
    mapreduce.map.java.opts                 0.8 * RAM-per-container               1638
    mapreduce.reduce.java.opts              0.8 * 2 * RAM-per-container           3277
    yarn.app.mapreduce.am.resource.mb       2 * RAM-per-container                 4096
    yarn.app.mapreduce.am.command-opts      0.8 * 2 * RAM-per-container           3277
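    For reference, the mapreduce.* rows of the table correspond to entries in mapred-site.xml (the yarn.* rows go in yarn-site.xml). A minimal sketch of the mapreduce.* portion, using the values computed above and the standard Hadoop XML configuration format:

    ```xml
    <!-- mapred-site.xml: container sizes and JVM heaps from the table above -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx1638m</value>  <!-- 0.8 * map container size -->
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx3277m</value>  <!-- 0.8 * reduce container size -->
    </property>
    ```

    The JVM heap (-Xmx) is kept at roughly 80% of the container size so the container has headroom for non-heap memory; setting it equal to the container limit is a common cause of containers being killed by YARN.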

    For rmr2, I used this code to change everything in one go (starting from the shell prompt in CentOS 6.4):

    R   # launch R from the shell
    library(rmr2)
    bp = rmr.options("backend.parameters")
    bp$hadoop[1] = "mapreduce.map.java.opts=-Xmx1024M"
    bp$hadoop[2] = "mapreduce.reduce.java.opts=-Xmx2048M"
    bp$hadoop[3] = "mapreduce.map.memory.mb=1280"
    bp$hadoop[4] = "mapreduce.reduce.memory.mb=2560"
    rmr.options(backend.parameters = bp)
    rmr.options("backend.parameters")   # verify the new values
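    As an alternative to changing the options globally, rmr2 also accepts a backend.parameters argument on the mapreduce() call itself, so the overrides apply to a single job. A sketch under that assumption (the D entries become -D generic options on the underlying hadoop streaming invocation; check your rmr2 version's documentation for the exact form):

    ```r
    library(rmr2)

    # per-job memory overrides instead of a global rmr.options() change
    result = mapreduce(
      input  = groups,
      map    = function(k, v) keyval(v, 1),
      reduce = function(k, vv) keyval(k, length(vv)),
      backend.parameters = list(
        hadoop = list(
          D = "mapreduce.map.java.opts=-Xmx1024M",
          D = "mapreduce.reduce.java.opts=-Xmx2048M",
          D = "mapreduce.map.memory.mb=1280",
          D = "mapreduce.reduce.memory.mb=2560")))
    ```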
