I'm new to RHadoop and to rmr.
I have a requirement to write a MapReduce job in R with rmr2. I have tried writing one, but it fails with an error when I run it.
I am trying to read the input file from HDFS.
Error:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Code :
Sys.setenv(HADOOP_HOME="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/lib/hadoop")
Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-4.7.0-1.cdh4.7.0.p0.40/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
day_file = hdfs.file("/hdfs/bikes_LR/day.csv","r")
day_read = hdfs.read(day_file)
c = rawToChar(day_read)
XtX =
  values(from.dfs(
    mapreduce(
      input = "/hdfs/bikes_LR/day.csv",
      map =
        function(., Xi) {
          yi = c[Xi[,1],]
          Xi = Xi[,-1]
          keyval(1, list(t(Xi) %*% Xi))
        },
      reduce = function(k, v) {
        vals = as.numeric(v)
        keyval(k, sum(vals))
      },
      combine = TRUE)))[[1]]
XtY =
  values(from.dfs(
    mapreduce(
      input = "/hdfs/bikes_LR/day.csv",
      map =
        function(., Xi) {
          yi = c[Xi[,1],]
          Xi = Xi[,-1]
          keyval(1, list(t(Xi) %*% yi))
        },
      reduce = TRUE,
      combine = TRUE)))[[1]]
solve(XtX, XtY)
Input:
------------
instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
6,2011-01-06,1,0,1,0,4,1,1,0.204348,0.233209,0.518261,0.0895652,88,1518,1606
7,2011-01-07,1,0,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
8,2011-01-08,1,0,1,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959
9,2011-01-09,1,0,1,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
10,2011-01-10,1,0,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321
Please point out any mistakes.
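For reference, here is a minimal sketch in plain R (no Hadoop or rmr2) of what the two jobs above appear to be aiming at: the cross-products X'X and X'y are computed per chunk of rows and then added up, after which solve() gives the least-squares coefficients. The ten sample rows shown under Input stand in for day.csv. The choice of temp, hum and windspeed as predictors, cnt as the response, and the two-way row split are assumptions made only for illustration.

```r
# The ten sample rows from the question stand in for day.csv.
csv_text <- "instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
6,2011-01-06,1,0,1,0,4,1,1,0.204348,0.233209,0.518261,0.0895652,88,1518,1606
7,2011-01-07,1,0,1,0,5,1,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
8,2011-01-08,1,0,1,0,6,0,2,0.165,0.162254,0.535833,0.266804,68,891,959
9,2011-01-09,1,0,1,0,0,0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
10,2011-01-10,1,0,1,0,1,1,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321"
day <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Design matrix with an intercept column, and the response vector.
# (Assumed model: cnt ~ temp + hum + windspeed.)
X <- cbind(intercept = 1, as.matrix(day[, c("temp", "hum", "windspeed")]))
y <- day$cnt

# Hypothetical "map inputs": the row indices split into two chunks.
chunks <- split(seq_len(nrow(X)), rep(1:2, length.out = nrow(X)))

# "Map" step: every chunk emits its own cross-products as matrices.
xtx_parts <- lapply(chunks, function(rows) crossprod(X[rows, , drop = FALSE]))
xty_parts <- lapply(chunks, function(rows) crossprod(X[rows, , drop = FALSE], y[rows]))

# "Reduce" step: add the matrices element-wise so their shape is kept.
XtX <- Reduce(`+`, xtx_parts)
XtY <- Reduce(`+`, xty_parts)

# Solve the normal equations (X'X) beta = X'y.
beta <- solve(XtX, XtY)
print(beta)
```

If this matches the intent, two things in the posted job may be worth checking. A reducer would need to add the matrices element-wise (for example with Reduce and `+`), since sum() returns a single number. Also, as far as I know, rmr2 reads its native format unless an input format such as make.input.format("csv", sep = ",") is passed to mapreduce().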
While it is possible to implement linear regression with R and MapReduce, it is generally not recommended, for several reasons:
Efficiency: Distributed computing frameworks such as Spark provide linear regression implementations that are already optimized for distributed execution.
Complexity: Implementing linear regression by hand in R and MapReduce requires a solid understanding of both linear algebra and distributed computing, which makes it error-prone and time-consuming.
Scalability: MapReduce can handle large datasets, but it may be less efficient than frameworks with built-in regression support, especially for more complex models.
Alternative Approaches
If you're dealing with large datasets and need to perform linear regression, consider these alternatives:
1. Spark:
Offers built-in linear regression algorithms (LinearRegression, GeneralizedLinearRegression) that are optimized for distributed computing.
Provides a high-level API, making it easier to implement and maintain.
2. R with Distributed Computing Libraries:
Use libraries like parallel or foreach for parallel processing within R.
While not as efficient as dedicated frameworks, this can be a good starting point for smaller datasets or exploratory analysis.
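The second alternative can be sketched with the parallel package that ships with R. This is only an illustration under stated assumptions: the built-in mtcars data stands in for the real dataset, the model mpg ~ wt + hp and the four-way row split are arbitrary choices, and two local worker processes take the place of a real cluster.

```r
library(parallel)

# mtcars (shipped with R) is a stand-in dataset; the model is an assumption.
X <- cbind(intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg

# Split the rows into four chunks and bundle each chunk's X and y together,
# so every worker receives only the data it needs.
row_chunks <- split(seq_len(nrow(X)), rep(1:4, length.out = nrow(X)))
pieces <- lapply(row_chunks, function(rows) {
  list(X = X[rows, , drop = FALSE], y = y[rows])
})

# Start two local worker processes and compute per-chunk cross-products there.
cl <- makeCluster(2)
partials <- parLapply(cl, pieces, function(p) {
  list(XtX = crossprod(p$X), XtY = crossprod(p$X, p$y))
})
stopCluster(cl)

# Combine the partial results on the main process and solve for coefficients.
XtX <- Reduce(`+`, lapply(partials, `[[`, "XtX"))
XtY <- Reduce(`+`, lapply(partials, `[[`, "XtY"))
beta_parallel <- solve(XtX, XtY)
print(beta_parallel)
```

Only the small cross-product matrices travel back from the workers, which is the same idea a map and reduce pair would use. For data that fits in memory, calling lm() directly is simpler.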