RHadoop › How to read hdfs file into data frame
4 posts by 3 authors
Jeff Zhang | 5/23/12
Hi,
I am trying to read an HDFS file into a data frame with the following:
content <- hdfs.read.text.file("/user/jianfezhang/hive/wtx_clickpath/dt=2012-05-20/clickpath_2012-05-20.txt")
clickpath <- read.table(textConnection(content), sep="\001")
But the performance is very bad even though the file is only 30 megabytes, and most of the time seems to be spent on the second statement.
So is there any other way to read an HDFS file into a data.frame, more like reading a local file directly, rather than reading the whole file into memory first?
David Champagne | 5/24/12
You could try something like the following (see below). I am sure this is not optimized, and the "plyr" package has some better ways of merging data frames together. We'll take a look and see if we can come up with an optimized way of reading delimited files into data frames.
f3 <- function() {
  # preallocate a list with 2000 items
  result <- vector("list", 2000)
  i <- 1
  # open the file and read 1000 lines at a time
  handle <- hdfs.line.reader("/tmp/testdata/test.csv")
  content <- handle$read()
  while (length(content) != 0) {
    # convert the lines from the file to a data frame and add it to the list
    result[[i]] <- read.table(textConnection(content), sep = ",")
    # read the next 1000 lines
    content <- handle$read()
    i <- i + 1
  }
  # create the final data.frame
  out <- do.call("rbind", result)
  # close the file
  handle$close()
  out
}
system.time(f3())
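The same chunked pattern should carry over to the original "\001"-delimited clickpath file; a rough sketch (untested, path and separator taken from the first post):
handle <- hdfs.line.reader("/user/jianfezhang/hive/wtx_clickpath/dt=2012-05-20/clickpath_2012-05-20.txt")
result <- vector("list", 2000)
i <- 1
content <- handle$read()
while (length(content) != 0) {
  # parse each chunk of lines using the ^A ("\001") field separator that Hive writes
  result[[i]] <- read.table(textConnection(content), sep = "\001")
  content <- handle$read()
  i <- i + 1
}
handle$close()
clickpath <- do.call("rbind", result)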
David Champagne | 5/25/12
Here's an example of a more efficient way of reading a delimited text file from HDFS into a data frame, using a 'pipe':
out<-read.table(pipe("hadoop dfs -cat '/tmp/testdata/test.csv'"), sep=",", header=TRUE)
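If the column types are known up front, passing colClasses to read.table is a common way to speed this up further; a minimal sketch, assuming the same test.csv with one character column and two numeric columns (the column types here are illustrative, not from the original post):
out <- read.table(pipe("hadoop dfs -cat '/tmp/testdata/test.csv'"),
                  sep = ",", header = TRUE,
                  # assumed column types; supplying them lets read.table skip type inference
                  colClasses = c("character", "numeric", "numeric"))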
Hadley Wickham | 5/25/12
There are a few things you could do to make the function faster:
library(plyr)
f4 <- function() {
  result <- list()
  i <- 1
  # Grab 10,000 lines at a time - my gut feeling is that 1000 is too small
  # and your running time will be dominated by communication overhead. You
  # might try adjusting up even further.
  handle <- hdfs.line.reader("/tmp/testdata/test.csv", 10000)
  on.exit(handle$close())  # ensure handle always closed even if error
  content <- handle$read()
  while (length(content) != 0) {
    # Generally you don't want to convert strings to factors, and it
    # makes the code slightly faster
    result[[i]] <- read.csv(textConnection(content), stringsAsFactors = FALSE)
    content <- handle$read()
    i <- i + 1
  }
  # Use rbind.fill from plyr - this is a much faster implementation of
  # do.call(rbind)
  rbind.fill(result)
}
system.time(f4())
Also note you don't need to pre-allocate lists - unlike atomic
vectors, they are not copied when you add a new element:
x <- rnorm(1e7)
a <- list(x)
b <- x
# Adding a new element to a list doesn't require a copy
system.time(a$b <- 1)
# Adding a new element to a vector does
system.time(x[1e7 + 1] <- 1)
system.time(a[1e7 + 1] <- 1)
Hadley
How to write data frame to HDFS using rhdfs (grokbase)
I'm using rhdfs and have had success reading newline-delimited text files
using "hdfs.read.text.file". However, for writing to HDFS there is no
equivalent - only the byte-level "hdfs.write".
If I have a data frame in R where the columns have simple string
representations (i.e. they are numeric or character), what's the best way
to write it out to HDFS as a comma-separated, newline-delimited text file?
Thanks,
Ben.
David Champagne at Aug 8, 2012 at 10:58 pm
You could try using a pipe. It may not be the most efficient, but it should work. Something like the following, where "df" is your data frame:
write.csv(df, file=pipe("hadoop dfs -put - /tmp/test.csv"))
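Note that write.csv includes row names by default; if those aren't wanted in the output file, write.table through the same pipe gives more control. A sketch, assuming the same df and HDFS path as above:
write.table(df, file = pipe("hadoop dfs -put - /tmp/test.csv"),
            sep = ",", row.names = FALSE)  # drop the row-name column from the CSV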
David Champagne at Aug 8, 2012 at 11:19 pm
If you want to store the data frame in HDFS and read it back into R at a later time, then serializing the data frame is the way you want to go. This can be accomplished as follows:
Writing:
myfile <- hdfs.file("/tmp/myfilename", "w")
hdfs.write(df, myfile)
hdfs.close(myfile)
Reading:
myfile <- hdfs.file("/tmp/myfilename", "r")
dfserialized <- hdfs.read(myfile)
df <- unserialize(dfserialized)
hdfs.close(myfile)
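Worth noting as a trade-off: because hdfs.write stores the data frame in R's native serialization format (hence the unserialize call on the read side), the resulting file is only readable from R, whereas the CSV written through the pipe above is plain text that other Hadoop tools can also consume.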