RHadoop ›

How to read hdfs file into data frame 4 posts by 3 authors

Jeff Zhang

5/23/12

Hi ,

I am trying to read hdfs file into data frame using the following

content<-hdfs.read.text.file("/user/jianfezhang/hive/wtx_clickpath/dt=2012-05-20/clickpath_2012-05-20.txt")

clickpath<-read.table(textConnection(content),sep="\001")

But the performance is very bad even when the file is only 30 megabyte. And seems most of time is spent on the second statement.

So is there any other way to read hdfs file into data.frame ? Just like read a local file rather than reading file into memory first.

Click here to Reply

David Champagne

5/24/12

You could try something like the following (see below). I am sure this is not optimized and the "plyr" package has some better ways of merging dataframes together.   We'll take a look, and see if we can come up with an optimized way of reading in delimited files to data frames

f3<-function() {
#preallocate a list with a 2000 items
result<-vector("list", 2000)
i = 1
   #open file and read 1000 lines at a time
handle <-hdfs.line.reader("/tmp/testdata/test.csv")
content <- handle$read()
while(length(content) != 0) {
        #convert lines from the file and add it to the list
    result[[i]]<-read.table(textConnection(content),sep=",")
        #read the next 1000 lines
        content <-handle$read()
    i <- i + 1
}
#create the final data.frame
out<-do.call("rbind",result)
#close the file
handle$close()

}

system.time(f3())

- show quoted text -

David Champagne

5/25/12

Here's an example of a more efficient way of reading a delimited text file from hdfs into a dataframe using a 'pipe'

out<-read.table(pipe("hadoop dfs -cat '/tmp/testdata/test.csv'"), sep=",", header=TRUE)

- show quoted text -

Hadley Wickham

5/25/12

There are a few things you could do to make the function faster:

library(plyr)
f4 <- function() {
result <- list()
i <- 1

# Grab 10,000 lines at a time - my gut feeling is that 1000 is too
small
# and your running time will be dominate by communication overhead.
You
# might try adjusting up even further.
handle <-hdfs.line.reader("/tmp/testdata/test.csv", 10000)
on.exit(handle$close()) # ensure handle always closed even if error

content <- handle$read()
while(length(content) != 0) {

# Generally you don't want to convert strings to factors, and it
# makes the code slightly faster
result[[i]] <- read.csv(textConnection(content), stringsAsFactors
= FALSE)
content <- handle$read()

i <- i + 1
}

# Use rbind.fill from plyr - this is a much faster implementation of
# do.call(rbind)
rbind.fill(result)
}

system.time(f4())

Also note you don't need to pre-allocate lists - unlike atomic
vectors, they are not copied when you add a new element:

x <- rnorm(1e7)
a <- list(x)
b <- x

# Adding a new element to a list doesn't require a copy
system.time(a$b <- 1)
# Adding a new element to a vector does
system.time(x[1e7 + 1] <- 1)

system.time(a[1e7 + 1] <- 1)

Hadley

How to write data frame to HDFS using rhdfs (grokbase)

I'm using rhdfs and have had success reading newline-delimited text files
using "hdfs.write.text.file". However, for writing to HDFS there is no
equivalent - only the byte-level "hfds.write".

If I have a data frame in R where the columns have simple string
representations (i.e. they are numeric or characters), what's the best way
to write it out to HDFS as a comma-seperated, newline-delimited text file?

Thanks,
Ben.

--

Search Discussions

2 responses

Nested Oldest

David Champagne at Aug 8, 2012 at 10:58 pm ⇧

You could try using a pipe. It may not be the most efficient, but should
work. Something like the following, where "df" is your dataframe

write.csv(df, file=pipe("hadoop dfs -put - /tmp/test.csv"))
- Show quoted text -
reply | permalink
David Champagne at Aug 8, 2012 at 11:19 pm ⇧

If you want to store the dataframe in HDFS and read it back into R at a
later time, then serializing the dataframe is the way you want to go.
This can be accomplished by the following:

Writing:

myfile <- hdfs.file("/tmp/myfilename", "w")
hdfs.write(df, myfile)
hdfs.close(myfile)

Reading:

myfile = hdfs.file("/tmp/myfilename", "r")
dfserialized <- hdfs.read(myfile)
df <- unserialize(dfserialized)
hdfs.close(myfile)

R & hadoop

How to read hdfs file into data frame

How to write data frame to HDFS using rhdfs (grokbase)

Search Discussions

2 responses

No comments:

Post a Comment