Thursday, 30 October 2014

Linear Regression in R MapReduce (RHadoop)

I'm new to RHadoop and to rmr. I have a requirement to write a MapReduce job in R. I have tried writing one, but it gives an error when executed. The job is trying to read the input file from HDFS.

Error:

Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 
   hadoop streaming failed with error code 1

Code:

Sys.setenv(HADOOP_HOME="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/lib/hadoop")
Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/bin/hadoop")

Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
day_file = hdfs.file("/hdfs/bikes_LR/day.csv","r")
day_read = hdfs.read(day_file)
c = rawToChar(day_read)

XtX =
  values(from.dfs(
    mapreduce(
      input = "/hdfs/bikes_LR/day.csv",
      map=
        function(.,Xi){
         yi =c[Xi[,1],]
         Xi = Xi[,-1]
         keyval(1,list(t(Xi)%*%Xi))
       },
  reduce = function(k,v )
  {
    vals =as.numeric(v)
    keyval(k,sum(vals))
  } ,
  combine = TRUE)))[[1]]

XtY =
 values(from.dfs(
    mapreduce(
     input = "/hdfs/bikes_LR/day.csv",
     map=
       function(.,Xi){
         yi =c[Xi[,1],]
         Xi = Xi[,-1]
        keyval(1,list(t(Xi)%*%yi))
       },
     reduce = TRUE ,
     combine = TRUE)))[[1]]
solve(XtX,XtY)



Input:
------------

instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
6,2011-01-06,1,0,1,0,4,1,1,0.204348,0.233209,0.518261,0.0895652,88,1518,1606
7,2011-01-07,1,0,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
8,2011-01-08,1,0,1,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959
9,2011-01-09,1,0,1,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
10,2011-01-10,1,0,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321



Please point out any mistakes.

    
You need to find the log output from your R script, which would indicate the error. "hadoop streaming failed with error code 1" just means "the script failed for some reason" –  Sean Owen Jul 3 at 11:19
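
One way to surface the underlying R error without digging through the streaming logs is to run the same job on rmr2's local backend. This is only a sketch, assuming the local backend is available in your rmr2 installation and that you have a small copy of the data on the local file system (with this backend, paths are read locally rather than from HDFS).

```r
library(rmr2)

# Assumption: rmr2's local backend is usable on this machine. It runs the map
# and reduce functions in the current R session instead of Hadoop streaming.
rmr.options(backend = "local")

# Re-run the failing mapreduce() call against a small local copy of the data;
# R errors raised inside map or reduce then appear directly in the console.
```

Switching back with `rmr.options(backend = "hadoop")` returns to the cluster once the functions work on the sample.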
    
Sometimes the folder where you write must be deleted before writing (if it exists). Check that out. –  adesantos Jul 3 at 11:41
    
Thank you for your answers, but I have checked all the possibilities you mentioned. I suspect there is some problem with the code itself. Can someone please rectify it? –  user3782364 Jul 3 at 15:19
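
Reading the posted code, a few things look likely to fail, though only the job logs can confirm which one triggers the error: `c` is a single character string (the result of `rawToChar`), so `c[Xi[,1],]` cannot index rows; the first reducer calls `as.numeric` on a list of matrices and then `sum`, which would collapse a matrix to a scalar even if the coercion worked; `reduce = TRUE` in the second job is not a function; and no CSV `input.format` is passed, so `mapreduce` assumes its native format. The sketch below reproduces only the arithmetic in base R, with no Hadoop involved, so the map and reduce logic can be checked locally. The choice of `temp`, `hum` and `windspeed` as predictors of `cnt`, and the three-way chunking, are assumptions made for illustration.

```r
# Sample rows copied from the Input section (header plus the ten records).
csv_lines <- c(
  "instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt",
  "1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985",
  "2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801",
  "3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349",
  "4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562",
  "5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600",
  "6,2011-01-06,1,0,1,0,4,1,1,0.204348,0.233209,0.518261,0.0895652,88,1518,1606",
  "7,2011-01-07,1,0,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510",
  "8,2011-01-08,1,0,1,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959",
  "9,2011-01-09,1,0,1,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822",
  "10,2011-01-10,1,0,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321"
)
day <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# Assumption: model cnt from temp, hum and windspeed plus an intercept.
y <- day$cnt
X <- cbind(intercept = 1, as.matrix(day[, c("temp", "hum", "windspeed")]))

# Split the row indices into chunks that stand in for the input splits.
chunks <- split(seq_len(nrow(X)), rep(1:3, length.out = nrow(X)))

# "Map": each chunk emits its partial cross products as matrices.
map_xtx <- lapply(chunks, function(i) crossprod(X[i, , drop = FALSE]))
map_xty <- lapply(chunks, function(i) crossprod(X[i, , drop = FALSE], y[i]))

# "Reduce": add the partial matrices element-wise so the shape is preserved.
XtX <- Reduce(`+`, map_xtx)
XtY <- Reduce(`+`, map_xty)

# Solve the normal equations for the coefficient vector.
beta <- solve(XtX, XtY)
print(beta)
```

In an rmr2 job the same idea would put the chunk cross product inside `keyval(1, list(...))` in the map function and use a reducer built on `Reduce("+", v)`, with the response taken from a column of the chunk rather than from a variable read outside the job.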

1 comment:

  1. While it is theoretically possible to implement linear regression using R and MapReduce, it is generally not recommended, for several reasons:

    Efficiency: Modern distributed computing frameworks like Spark provide more efficient and optimized implementations for linear regression.
    Complexity: Implementing linear regression in R and MapReduce requires a deep understanding of both linear algebra and distributed computing, making it error-prone and time-consuming.
    Scalability: While MapReduce can handle large datasets, it might not be as efficient as specialized frameworks for linear regression, especially when dealing with complex models.

    Alternative Approaches

    If you're dealing with large datasets and need to perform linear regression, consider these alternatives:

    1. Spark:
       Offers built-in linear regression algorithms (LinearRegression, GeneralizedLinearRegression) that are optimized for distributed computing.
       Provides a high-level API, making it easier to implement and maintain.
    2. R with Distributed Computing Libraries:
       Use libraries like parallel or foreach for parallel processing within R.
       While not as efficient as dedicated frameworks, it can be a good starting point for smaller datasets or exploratory analysis.
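
    As a rough illustration of the second alternative, the sketch below uses the parallel package that ships with R to compute the partial cross products chunk by chunk and then combine them. The data is synthetic and the names (n, chunks, cores) are made up for the example; it is a sketch under those assumptions, not a tuned implementation.

```r
library(parallel)

# Synthetic data, purely illustrative: y = 2 - x1 + 0.5 * x2 + noise.
set.seed(1)
n <- 200
X <- cbind(intercept = 1, x1 = runif(n), x2 = runif(n))
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n, sd = 0.1)

# Split the row indices into four chunks of work.
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# mclapply forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker returns the partial X'X and X'y for its chunk.
partials <- mclapply(chunks, function(i) {
  Xi <- X[i, , drop = FALSE]
  list(xtx = crossprod(Xi), xty = crossprod(Xi, y[i]))
}, mc.cores = cores)

# Combine the partial results by element-wise matrix addition.
XtX <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
XtY <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Solve the normal equations for the coefficients.
beta <- solve(XtX, XtY)
print(beta)
```

    For data that fits in memory, a plain lm() call gives the same coefficients with far less code; the chunked form only pays off when the cross products must be accumulated piece by piece.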
