hdfs.read() in rhdfs-1.0.8 cannot load all data from a huge CSV file on HDFS?


Hi,
    I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA
cluster, and I use the following code to read a file from HDFS:
*************************************************************************************************
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
library(rmr2);
library(rhdfs);
library(lubridate);
hdfs.init();
f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
m = hdfs.read(f);
c = rawToChar(m);
data = read.table(textConnection(c), sep = ",");
*************************************************************************************************
When I use dim(data) to verify, it shows me the following:
[1] 1523 7
*************************************************************************************************
     But the row count should actually be 134279407, not 1523.
     I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49
...", and there is a thread in the hadoop-hdfs-user mailing list ("why can
FSDataInputStream.read() only read 2^17 bytes in hadoop2.0?").
Ref.
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E

     Is this a bug in hdfs.read() in rhdfs-1.0.8?

Best Regards,
James Chang
  • James Chang at May 13, 2014 at 8:22 am
    Hi,

         Does anyone else have this problem? Or is there any workaround?
    Thanks in advance!

  • David Champagne at May 15, 2014 at 1:13 am
    rhdfs uses the Java API for reading files stored in HDFS. That API will
    not necessarily read the entire file in one shot; it returns some number
    of bytes for each read, and when it reaches the end of the file it
    returns -1. In the case of rhdfs, end of file is signalled by hdfs.read
    returning NULL. So you need to loop on the hdfs.read call until NULL is
    returned (see the sketch at the end of this thread).
  • James Chang at May 15, 2014 at 2:55 pm
    Hi David,

          Thanks for your suggestion. So, do we need a patch to the rhdfs
    package for R? Or can I handle this in my R program?

    Thanks in advance!

  • Antonio Piccolboni at May 15, 2014 at 11:39 pm
    The loop would be in your R program, no patch required.


    Antonio
  • James Chang at May 16, 2014 at 9:12 am
    Hi Antonio,

         Thanks for your kind reply.
    My test code is as follows:
    *************************************************************************************************
    Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
    Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
    library(rmr2);
    library(rhdfs);
    library(lubridate);
    hdfs.init();
    f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
    m = hdfs.read(f);
    c = rawToChar(m);
    data = read.table(textConnection(c), sep = ",");
    *************************************************************************************************
          I'm a newcomer to R and have no idea how to modify my code.
    Could you please give me a code snippet?

    Thanks in advance.




  • Antonio Piccolboni at May 16, 2014 at 9:59 pm
    Sorry, the basics of the R language are off topic for this group. I would
    advise you to strengthen your knowledge of R (e.g.
    cran.r-project.org/doc/manuals/R-intro.pdf) before moving on to distributed
    computing, RHadoop and the like. We need to walk before we can run.


    Antonio
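
For reference, here is a minimal sketch of the read loop David Champagne describes
above: keep calling hdfs.read() and collecting the raw chunks until it returns NULL,
then parse the combined result. It assumes the same environment and file path as the
original post; the chunk list, the hdfs.close() call, and the do.call(c, ...)
concatenation are illustrative additions, and the whole file must still fit in memory
(and within R's single-string limit), so a 20 GB file would need to be processed
chunk by chunk rather than assembled into one string.
*************************************************************************************************
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()

f = hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)

chunks = list()
repeat {
  m = hdfs.read(f)                  # returns a raw vector, or NULL at end of file
  if (is.null(m)) break
  chunks[[length(chunks) + 1]] = m  # collect this chunk
}
hdfs.close(f)

# Concatenate all raw chunks, convert to text, and parse as CSV
data = read.table(textConnection(rawToChar(do.call(c, chunks))), sep = ",")
*************************************************************************************************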


1 comment:

  1. Try this.

    f = hdfs.file("hdfs://<namenode-host>:9000/Test.csv", "r", buffersize=104857600);
