RHadoop : Reading CSV using rhdfs


Here is a small code snippet showing how to read CSV data from HDFS using rhdfs (part of RHadoop).

rhdfs uses rJava, so the buffer size is limited by the JVM heap size. By default, rhdfs sets the buffer size to 5 MB. The source code for rhdfs can be found here.
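Since buffersize is given in bytes, it can be easier to compute the value than to type it out. A minimal sketch follows; the helper name mb_to_bytes is hypothetical and not part of rhdfs.

```r
# Hypothetical helper (not part of rhdfs): convert a size in MB
# (1 MB = 1024 * 1024 bytes) to an integer byte count.
mb_to_bytes <- function(mb) {
  as.integer(mb * 1024^2)
}

default_buffer <- mb_to_bytes(5)    # the 5 MB default mentioned above
large_buffer   <- mb_to_bytes(100)  # a 100 MB buffer

print(default_buffer)
print(large_buffer)
```

The second value, 104857600, is the buffersize used in the snippet in this post.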

The HADOOP_CMD environment variable should point to the hadoop executable.

Sys.setenv(HADOOP_CMD="/bin/hadoop")

library(rhdfs)
hdfs.init()

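# Open the file for reading with a 100 MB buffer (buffersize is in bytes)
f = hdfs.file("fulldata.csv","r",buffersize=104857600)
m = hdfs.read(f)    # raw vector; one call returns at most buffersize bytes
hdfs.close(f)       # release the HDFS file handle
c = rawToChar(m)    # raw bytes -> single character string

# Parse the text as comma-separated data
data = read.table(textConnection(c), sep = ",")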
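A single hdfs.read() call returns at most buffersize bytes, so a file larger than the buffer needs repeated reads. The sketch below shows one way to collect the chunks before parsing, and it runs without a cluster: read_all and make_stub_reader are hypothetical helpers, the CSV bytes are made up, and the stub stands in for function() hdfs.read(f). It assumes the read function returns NULL (or an empty vector) at end of file, which is worth checking against your rhdfs version.

```r
# Hypothetical helper: call `read_chunk` until it signals end of file,
# then return all bytes as one raw vector.
read_all <- function(read_chunk) {
  chunks <- list()
  repeat {
    chunk <- read_chunk()
    if (is.null(chunk) || length(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk
  }
  if (length(chunks) == 0) return(raw(0))
  do.call(c, chunks)
}

# Stub that mimics a chunked reader over an in-memory raw vector.
# With rhdfs this would be replaced by: function() hdfs.read(f)
make_stub_reader <- function(bytes, chunk_size) {
  pos <- 0
  function() {
    if (pos >= length(bytes)) return(NULL)
    end <- min(pos + chunk_size, length(bytes))
    chunk <- bytes[(pos + 1):end]
    pos <<- end
    chunk
  }
}

# Made-up CSV content standing in for the file on HDFS.
csv_bytes <- charToRaw("1,alice,3.5\n2,bob,4.0\n")

# Read in 8-byte chunks, then convert and parse once at the end.
m <- read_all(make_stub_reader(csv_bytes, chunk_size = 8))
data <- read.table(text = rawToChar(m), sep = ",", stringsAsFactors = FALSE)
print(data)
```

Concatenating the raw chunks first and calling rawToChar() once avoids splitting a line (or a multi-byte character) across two chunks. Passing the string through the text argument of read.table() is equivalent to wrapping it in textConnection().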

## Alternatively, you can use hdfs.line.reader()

reader = hdfs.line.reader("fulldata.csv")
 
x = reader$read()
typeof(x)
## [1] "character"
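The character vector returned by reader$read() can be parsed into a data frame in the same way. A minimal sketch, assuming each element of the vector holds one line of the file; the sample lines here are made up so that the snippet runs without a cluster.

```r
# Stand-in for the character vector returned by reader$read() (made-up data).
x <- c("1,alice,3.5", "2,bob,4.0")

# read.table() treats each element of `text` as one line of input.
data <- read.table(text = x, sep = ",", stringsAsFactors = FALSE)
str(data)
```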

  1. Could you please give me a hint? My code snippet follows:
    ==========================================================
    library(rmr2);
    library(rhdfs);
    library(lubridate);
    hdfs.init();
    f = hdfs.file("/bigdata/rawdata/201312.csv","r",buffersize=104857600);
    m = hdfs.read(f);
    c = rawToChar(m);
    data = read.table(textConnection(c), sep = ",");
    ==========================================================

    thanks in advance.

4 comments:

  1. I spent almost a day and a half trying to figure out how to connect R with Hadoop. I tried many lines of code, but nothing worked like your four lines. Thanks a lot!
