How to read hdfs file into data frame

RHadoop
How to read hdfs file into data frame
4 posts by 3 authors




Jeff Zhang

5/23/12

Hi ,

I am trying to read hdfs file into data frame using the following

content<-hdfs.read.text.file("/user/jianfezhang/hive/wtx_clickpath/dt=2012-05-20/clickpath_2012-05-20.txt")
clickpath<-read.table(textConnection(content),sep="\001")

But the performance is very bad even when the file is only 30 megabyte. And seems most of time is spent on the second statement.

So is there any other way to read hdfs file into data.frame ? Just like read a local file rather than reading file into memory first.



Click here to Reply


David Champagne

5/24/12

You could try something like the following (see below).  I am sure this is not optimized and the "plyr" package has some better ways of merging dataframes together.   We'll take a look, and see if we can come up with an optimized way of reading in delimited files to data frames

f3<-function() {
  #preallocate a list with a 2000 items
  result<-vector("list", 2000)
  i = 1
   #open file and read 1000 lines at a time
  handle <-hdfs.line.reader("/tmp/testdata/test.csv")
  content <- handle$read()
  while(length(content) != 0) {
        #convert lines from the file and add it to the list
    result[[i]]<-read.table(textConnection(content),sep=",")
        #read the next 1000 lines
        content <-handle$read()
    i <- i + 1
  }
  #create the final data.frame
  out<-do.call("rbind",result)   
  #close the file
  handle$close()

}

system.time(f3())
- show quoted text -


David Champagne

5/25/12

Here's an example of a more efficient way of reading a delimited text file from hdfs into a dataframe using a 'pipe'

out<-read.table(pipe("hadoop dfs -cat '/tmp/testdata/test.csv'"), sep=",", header=TRUE)
- show quoted text -


Hadley Wickham

5/25/12

There are a few things you could do to make the function faster:

library(plyr)
f4 <- function() {
  result <- list()
  i <- 1

  # Grab 10,000 lines at a time - my gut feeling is that 1000 is too
small
  # and your running time will be dominate by communication overhead.
You
  # might try adjusting up even further.
  handle <-hdfs.line.reader("/tmp/testdata/test.csv", 10000)
  on.exit(handle$close())  # ensure handle always closed even if error

  content <- handle$read()
  while(length(content) != 0) {
    # Generally you don't want to convert strings to factors, and it
    # makes the code slightly faster
    result[[i]] <- read.csv(textConnection(content), stringsAsFactors
= FALSE)
    content <- handle$read()
    i <- i + 1
  }
  # Use rbind.fill from plyr - this is a much faster implementation of
  # do.call(rbind)
  rbind.fill(result)
}

system.time(f4())

Also note you don't need to pre-allocate lists - unlike atomic
vectors, they are not copied when you add a new element:

x <- rnorm(1e7)
a <- list(x)
b <- x

# Adding a new element to a list doesn't require a copy
system.time(a$b <- 1)
# Adding a new element to a vector does
system.time(x[1e7 + 1] <- 1)

system.time(a[1e7 + 1] <- 1)

Hadley



How to write data frame to HDFS using rhdfs (grokbase)


I'm using rhdfs and have had success reading newline-delimited text files
using "hdfs.write.text.file". However, for writing to HDFS there is no
equivalent - only the byte-level "hfds.write".

If I have a data frame in R where the columns have simple string
representations (i.e. they are numeric or characters), what's the best way
to write it out to HDFS as a comma-seperated, newline-delimited text file?

Thanks,
Ben.

--

Search Discussions

2 responses

  • David Champagne at Aug 8, 2012 at 10:58 pm
    You could try using a pipe. It may not be the most efficient, but should
    work. Something like the following, where "df" is your dataframe

    write.csv(df, file=pipe("hadoop dfs -put - /tmp/test.csv"))
    - Show quoted text -
  • David Champagne at Aug 8, 2012 at 11:19 pm
    If you want to store the dataframe in HDFS and read it back into R at a
    later time, then serializing the dataframe is the way you want to go.
    This can be accomplished by the following:

    Writing:

    myfile <- hdfs.file("/tmp/myfilename", "w")
    hdfs.write(df, myfile)
    hdfs.close(myfile)

    Reading:

    myfile = hdfs.file("/tmp/myfilename", "r")
    dfserialized <- hdfs.read(myfile)
    df <- unserialize(dfserialized)
    hdfs.close(myfile)

No comments:

Post a Comment