This error happens because the dataset you are working with contains NA (null) values: matrix multiplication, or any other numeric operation, on NA values raises an error.
To overcome this problem you need to use one of the following methods:
i) Remove or ignore the NA values while computing, e.g. pass na.rm = TRUE to functions such as mean() so the null values are dropped (see the sketch below).
ii) Use a dataset that does not contain null values, i.e. a cleaned dataset.
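A minimal sketch of both options, assuming the data has already been read into a data frame called train whose first column is the label (the object and column names here are hypothetical):

# Option (i): ignore NA values inside individual computations
mean(train[, 2], na.rm = TRUE)        # column mean with NA values dropped

# Option (ii): build a cleaned copy with no NA rows at all, so that matrix
# operations such as %*% never see an NA value
train.clean <- na.omit(train)
X <- as.matrix(train.clean[, -1])     # predictors
Y <- train.clean[, 1]                 # labels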
There is no need to go through the document below: it is the original question I followed and have pasted here for future reference.
I noticed this page gets a lot of views, so I am sharing my experience with you.
I am trying the below R script to build a logistic regression model using RHadoop (the rmr2 and rhdfs packages) on an HDFS training file located
at "hdfs://:/somnath/merged_train/part-m-00000", and then testing the
model against a test HDFS file at
"hdfs://:/somnath/merged_test/part-m-00000".
We are using the CDH4 distribution, with Yarn/MR2 running in parallel to MR1
as supported by Hadoop-0.20, and the script uses the hadoop-0.20 mapreduce and hdfs
installations via the Sys.setenv commands shown below.
However, whenever I run the script I hit the error below, with very little
luck bypassing it. I would appreciate it if somebody could point me to the
possible cause of this error, which seems to be due to a wrong lapply call in
R that does not handle NA arguments.
[root@kkws029 logreg_template]# Rscript logreg_test.R
Loading required package: methods
Loading required package: rJava
HADOOP_CMD=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin/hadoop
Be sure to run hdfs.init()
14/08/11 11:59:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
NULL
NULL
[1] "Starting to build logistic regression model..."
Error in FUN(X[[2L]], ...) :
Sorry, parameter type `NA' is ambiguous or not supported.
Calls: logistic.regression ... .jrcall -> ._java_valid_objects_list -> lapply -> FUN
Execution halted
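One quick way to confirm whether NA values in the part file are behind the .jrcall failure above is to copy the file out of HDFS and parse it locally. This is only a sketch of such a check (the local path /tmp/merged_train.csv is an example, not part of the original run):

# Pull the HDFS part file to the local filesystem and look for NA values
system("hadoop fs -get /somnath/merged_train/part-m-00000 /tmp/merged_train.csv")
m <- as.matrix(read.csv("/tmp/merged_train.csv", header = FALSE))
any(is.na(m))     # TRUE means NA (or unparseable) values are present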
Below is my R script:
#!/usr/bin/env Rscript
Sys.setenv(HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin/hadoop")
Sys.setenv(HADOOP_BIN="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/bin");
Sys.setenv(HADOOP_CONF_DIR="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/conf");
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar")
Sys.setenv(LD_LIBRARY_PATH="/usr/lib64/R/library/rJava/jri")
library(rmr2)
library(rhdfs)
.jinit()
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-auth-2.0.0-cdh4.3.0.jar")
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/hadoop-hdfs-2.0.0-cdh4.3.0.jar")
.jaddClassPath("/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar")
hdfs.init()
rmr.options( backend = "hadoop", hdfs.tempdir = "/tmp" )
# Train a logistic regression model by gradient ascent over MapReduce
logistic.regression =
  function(hdfsFilePath, iterations, dims, alpha) {
    r.file <- hdfs.file(hdfsFilePath, "r")
    #hdfsFilePath <- to.dfs(hdfsFilePath)
    # map: per-record contribution to the gradient
    lr.map =
      function(., M) {
        Y = M[, 1]
        X = M[, -1]
        keyval(
          1,
          Y * X *
            g(-Y * as.numeric(X %*% t(plane))))}
    # reduce: sum the contributions under the single key
    lr.reduce =
      function(k, Z)
        keyval(k, t(as.matrix(apply(Z, 2, sum))))
    plane = t(rep(0, dims))
    g = function(z) 1/(1 + exp(-z))   # logistic (sigmoid) function
    for (i in 1:iterations) {
      gradient =
        values(
          from.dfs(
            mapreduce(
              input = as.matrix(hdfs.read.text.file(r.file)),
              #input = from.dfs(hdfsFilePath),
              map = function(., M) {
                Y = M[, 1]
                X = M[, -1]
                keyval(
                  1,
                  Y * X *
                    g(-Y * as.numeric(X %*% t(plane))))},
              reduce = lr.reduce,
              combine = T)))
      plane = plane + alpha * gradient   # gradient-ascent update of the weights
      #trace(print(plane),quote(browser()))
    }
    return(plane) }
#validate logistic regression
# Score the test set with the trained weights, returning predicted probabilities
logistic.regression.test =
  function(hdfsFilePath, weight) {
    r.file <- hdfs.file(hdfsFilePath, "r")
    lr.test.map =
      function(., M) {
        keyval(
          1,
          lapply(as.numeric(M[, -1] %*% t(weight)), function(z) 1/(1 + exp(-z))))}
    probabilities =
      values(
        from.dfs(
          mapreduce(
            input = as.matrix(hdfs.read.text.file(r.file)),
            map = function(., M) {
              keyval(
                1,
                lapply(as.numeric(M[, -1] %*% t(weight)), function(z) 1/(1 + exp(-z))))}
          )))
    return(probabilities) }
out = list()
prob = list()
rmr.options( backend = "hadoop", hdfs.tempdir = "/tmp" )
print("Starting to build logistic regression model...")
out[['hadoop']] =
  ## @knitr logistic.regression-run
  logistic.regression(
    "hdfs://XX.XX.XX.XX:NNNN/somnath/merged_train/part-m-00000", 5, 5, 0.05)
write.csv(as.vector(out[['hadoop']]), "/root/somnath/logreg_data/weights.csv")
print("Building logistic regression model completed.")
prob[['hadoop']] =
  logistic.regression.test(
    "hdfs://XX.XX.XX.XX:NNNN/somnath/merged_test/part-m-00000", out[['hadoop']])
write.csv(as.vector(prob[['hadoop']]), "/root/somnath/logreg_data/probabilities.csv")
stopifnot(
  isTRUE(all.equal(out[['local']], out[['hadoop']], tolerance = 1E-7)))
NOTE: I have set the following Hadoop-related environment variables in root's ~/.bash_profile:
# Hadoop-specific environment and commands
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
export HADOOP2_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
#export HADOOP_CMD=${HADOOP_HOME}/bin/hadoop
#export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
#export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LD_LIBRARY_PATH=${R_HOME}/library/rJava/jri #:${HADOOP_HOME}/../hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
# Add hadoop-common jar to classpath for PlatformName and FsShell classes; Add hadoop-auth and hadoop-hdfs jars
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP2_HOME}/client-0.20/* #:${HADOOP_HOME}/*.jar:${HADOOP_HOME}/lib/*.jar:${HADOOP2_HOME}/hadoop-common-2.0.0-cdh4.3.0.jar:${HADOOP_HOME}/../hadoop-hdfs/hadoop-hdfs-2.0.0-cdh4.3.0.jar:${HADOOP_HOME}/hadoop-auth-2.0.0-cdh4.3.0.jar:$HADOOP_STREAMING
PATH=$PATH:$R_HOME/bin:$JAVA_HOME/bin:$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/mahout:/opt/cloudera/parcels/CDH/lib/hadoop:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce:/var/lib/storm-0.9.0-rc2/lib #:$HADOOP_CMD:$HADOOP_STREAMING:$HADOOP_CONF_DIR
export PATH
SAMPLE TRAIN DATASET
0,-4.418,-2.0658,1.2193,-0.68097,0.90894
0,-2.7466,-2.9374,-0.87562,-0.65177,0.53182
0,-0.98846,0.66962,-0.20736,-0.2895,0.002313
0,-2.277,2.492,0.47936,0.4673,-1.5075
0,-5.4391,1.8447,-1.6843,1.465,-0.71099
0,-0.12843,0.066968,0.02678,-0.040851,0.0075902
0,-2.0796,2.4739,0.23472,0.86423,0.45094
0,-3.1796,-0.15429,1.4814,-0.94316,-0.52754
0,-1.9429,1.3111,0.31921,-1.202,0.8552
0,-2.3768,1.9301,0.096005,-0.51971,-0.17544
0,-2.0336,1.991,0.82029,0.018232,-0.33222
0,-3.6388,-3.2903,-2.1076,0.73341,0.75986
0,-2.9146,0.53163,0.49182,-0.38562,-0.76436
0,-3.3816,1.0954,0.25552,-0.11564,-0.01912
0,-1.7374,-0.63031,-0.6122,0.022664,0.23399
0,-1.312,-0.54935,-0.68508,-0.072985,0.036481
0,-3.991,0.55278,0.38666,-0.56128,-0.6748
....
SAMPLE TEST DATASET
0,-0.66666,0.21439,0.041861,-0.12996,-0.36305
0,-1.3412,-1.1629,-0.029398,-0.13513,0.49758
0,-2.6776,-0.40194,-0.97336,-1.3355,0.73202
0,-6.0203,-0.61477,1.5248,1.9967,2.697
0,-4.5663,-1.6632,-1.2893,-1.7972,1.4367
0,-7.2339,2.4589,0.61349,0.39094,2.19
0,-4.5683,-1.3066,1.1006,-2.8084,0.3172
0,-4.1223,-1.5059,1.3063,-0.18935,1.177
0,-3.7135,-0.26283,1.6961,-1.3499,-0.18553
0,-2.7993,1.2308,-0.42244,-0.50713,-0.3522
0,-3.0541,1.8173,0.96789,-0.25138,-0.36246
0,-1.1798,1.0478,-0.29168,-0.26261,-0.21527
0,-2.6459,2.9387,0.14833,0.24159,-2.4811
0,-3.1672,2.479,-1.2103,-0.48726,0.30974
1,-0.90706,1.0157,0.32953,-0.11648,-0.47386
...
mapreduce(input = as.matrix(hdfs.read.text.file(r.file)), map = function(.,M) { keyval(1, lapply(as.numeric(M[,-1] %*% t(weight)), function(z) 1/(1 + exp(-z))))} )))
where the input to map is a matrix M read from a file stored in HDFS. Most probably the call to lapply is not getting the expected input from the matrix. I have added sample train and test data inputs from the HDFS files to explain better. – somnathchakrabarti Aug 11 at 8:57
traceback(), debug(), debugonce() and browser() can yield insight here. – Roman Luštrik Aug 11 at 10:25
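A minimal sketch of that debugging suggestion, assuming the script above has been sourced so that logistic.regression is defined (the HDFS path is the same placeholder used earlier):

# Stop inside logistic.regression on its next call and step through it interactively
debugonce(logistic.regression)
out.debug <- logistic.regression(
  "hdfs://XX.XX.XX.XX:NNNN/somnath/merged_train/part-m-00000", 5, 5, 0.05)
# If the call fails instead, print the chain of calls that led to the error
traceback()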