hdfs.read() in rhdfs-1.0.8 cannot load all data from a huge CSV file on HDFS?


Hi,
    I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA
cluster, and I use the following code to read a file from HDFS:
*************************************************************************************************
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
library(rmr2);
library(rhdfs);
library(lubridate);
hdfs.init();
f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
m = hdfs.read(f);
c = rawToChar(m);
data = read.table(textConnection(c), sep = ",");
*************************************************************************************************
When I use dim(data) to verify, it shows me the following:
[1] 1523 7
*************************************************************************************************
     But the row count should actually be 134279407, not 1523.
     I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49
...", and there is a thread in the hadoop-hdfs-user mailing list ("why can
FSDataInputStream.read() only read 2^17 bytes in hadoop2.0?").
Ref.
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E

     Is this a bug in hdfs.read() in rhdfs-1.0.8?

Best Regards,
James Chang
  • James Chang at May 13, 2014 at 8:22 am
    Hi,

         Does anyone else have this problem? Or is there any workaround?
    Thanks in advance!

  • David Champagne at May 15, 2014 at 1:13 am
    rhdfs uses the Java API for reading files stored in HDFS. That API will
    not necessarily read the entire file in one shot; it returns some number
    of bytes for each read, and when it reaches the end of the file it
    returns -1. In the case of rhdfs, end of file is signalled by hdfs.read
    returning NULL. So you need to loop on the hdfs.read call until NULL is
    returned (see the sketch at the end of this thread).
  • James Chang at May 15, 2014 at 2:55 pm
    Hi David,

          Thanks for your suggestion. So, do we need a patch to the rhdfs
    package for R? Or can I handle this in my R program?

    Thanks in advance!

  • Antonio Piccolboni at May 15, 2014 at 11:39 pm
    The loop would be in your R program, no patch required.


    Antonio
  • James Chang at May 16, 2014 at 9:12 am
    Hi Antonio,

         Thanks for your kind reply.
    My test code is as follows:
    *************************************************************************************************
    Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
    Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
    library(rmr2);
    library(rhdfs);
    library(lubridate);
    hdfs.init();
    f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
    m = hdfs.read(f);
    c = rawToChar(m);
    data = read.table(textConnection(c), sep = ",");
    *************************************************************************************************
          I'm a newcomer to R and have no idea how to modify my code.
    Could you please give me a code snippet?

    Thanks in advance.




  • Antonio Piccolboni at May 16, 2014 at 9:59 pm
    Sorry, the basics of the R language are off topic for this group. I would
    advise you to strengthen your knowledge of R (e.g.
    cran.r-project.org/doc/manuals/R-intro.pdf) before moving on to distributed
    computing, RHadoop and the like. We need to walk before we can run.


    Antonio
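
For reference, here is a minimal sketch of the read loop David Champagne describes
above: keep calling hdfs.read() and collecting the raw chunks until it returns NULL,
then parse the combined result. It assumes the same environment and file path as the
original post; the chunk list, the hdfs.close() call, and the do.call(c, ...)
concatenation are illustrative additions, and the whole file must still fit in memory
(and within R's single-string limit), so a 20 GB file would need to be processed
chunk by chunk rather than assembled into one string.
*************************************************************************************************
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()

f = hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)

chunks = list()
repeat {
  m = hdfs.read(f)                  # returns a raw vector, or NULL at end of file
  if (is.null(m)) break
  chunks[[length(chunks) + 1]] = m  # collect this chunk
}
hdfs.close(f)

# Concatenate all raw chunks, convert to text, and parse as CSV
data = read.table(textConnection(rawToChar(do.call(c, chunks))), sep = ",")
*************************************************************************************************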


1 comment:

  1. Try this.

    f = hdfs.file("hdfs://<namenode-host>:9000/Test.csv", "r", buffersize=104857600);
