Big Data Analysis using ff and ffbase

Native R stores every object in RAM (for more, see R's documentation on memory management). Depending on the hardware configuration, R objects can take up roughly 2-4 GB of memory; beyond that, R returns "Error: cannot allocate vector of size ……", leaving us unable to work with big data in R.
[Figure: Data storage with a standard R object]
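For instance, trying to allocate a vector larger than the available RAM reproduces the error. A minimal illustration (the vector size, and the figure R reports, are illustrative and depend on your machine):
> x <- numeric(2e10)
Error: cannot allocate vector of size 149.0 Gb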
Thanks to R being open source, a community of scholars continuously strives to create R packages that help us work effectively with big data.
The ff package, developed by Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel and Walter Zucchini, and maintained by Jens Oehlschlägel, is designed to overcome this limitation. Rather than holding data in memory, it stores it as native binary flat files on other media such as a hard disk, CD or DVD, while still letting you work with very large data files. It reads a data file in chunks and writes each chunk out to the external drive.
[Figure: Data storage with ff]
Reading a csv file using the ff package
> options(fftempdir = "[Provide path where you want to store binary files]")
> file_chunks <- read.csv.ffdf(file = "big_data.csv", header = TRUE, sep = ",", VERBOSE = TRUE, next.rows = 500000, colClasses = NA)
This reads the big_data.csv file chunk by chunk, with the chunk size set by next.rows. Each chunk is written as a binary file to the external media, and only a pointer to that file is kept in RAM. The process repeats until no chunks of the csv file are left.
[Figure: Functioning of the ff package]
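Once loaded, the ffdf object can be used much like a data frame even though its data lives on disk. A quick sanity check (output will vary with your file; getOption("fftempdir") shows where the binary chunks are kept):
> class(file_chunks)
[1] "ffdf"
> dim(file_chunks)
> getOption("fftempdir")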
In the same way, we can write csv or other flat files in chunks: ff reads chunk by chunk from the HDD or other external media and writes it out to csv or another supported format.
> write.csv.ffdf(file_chunks, "file_name.csv")
Together with the ffbase package, ff gives us the ability to perform all sorts of operations such as joins, aggregations, and slicing and dicing.
> Merged_data <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), trace = TRUE)
The merge function for ff objects (from the ffbase package) works much like merge for data frames, but it supports inner and left joins only.
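For a left join, pass all.x = TRUE as in base merge. A sketch reusing the ffobject1 and ffobject2 placeholders from above:
> left_joined <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), all.x = TRUE, trace = TRUE)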
> library("doBy")
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$Col1), FUN = function(x) summaryBy(Col3 + Col4 + Col5 ~ Col1, data = x, FUN = sum))
To perform the aggregation, I used the summaryBy function from the doBy package. In the ffdfdply call above, we split the data on the basis of a key column. If the key is a combination of two or more fields, we can generate a key column using the ikey function:
> ffobject$KeyColumn <- ikey(ffobject[c("Col1", "Col2", "Col3")])
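The generated key column can then serve as the split, so that every distinct combination of Col1, Col2 and Col3 is aggregated together. A sketch following the same pattern as above (column names remain the placeholders used throughout):
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$KeyColumn), FUN = function(x) summaryBy(Col4 + Col5 ~ KeyColumn, data = x, FUN = sum))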
For all its advantages, such as working with big data and reduced dependence on RAM, ff has a few limitations:
1. We sometimes have to compromise on speed when performing complex operations on huge data sets.
2. Development is not as easy with ff as with ordinary data frames.
3. We need to keep track of the flat files stored on disk, otherwise the HDD or external media may be left with little or no space (see the cleanup sketch below).
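One way to keep the third point under control is to delete the backing files explicitly once you are done with an ffdf. A minimal sketch using ff's delete() on the file_chunks object created earlier (assuming it is no longer needed):
> delete(file_chunks)   # removes the binary flat files from fftempdir
> rm(file_chunks)       # drops the in-RAM pointer to them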

#####################

Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB, over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:
all <- read.csv("file.csv")
However, it never finished, even after letting it run all night. I also tried reading it into a SQLite database first and reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R using the ff package and the following code:
library("ff")
x <- read.csv.ffdf(file = "file.csv", header = TRUE, VERBOSE = TRUE, first.rows = 10000, next.rows = 50000, colClasses = NA)
Because the file was so messy, I had to turn off column classes (colClasses = NA) so the read would not assign each column a class based on the first 10,000 rows. After reading the first 10,000 rows, the script then reads in chunks of 50,000 so as not to completely overload the RAM in my laptop. I also turned on VERBOSE, because it would have driven me nuts not to be able to follow the progress.
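Since parsing the csv is the slow part, it can be worth persisting the ffdf once it is read in, so later sessions can skip the parse entirely. A sketch using ff's ffsave/ffload (the file stem "file_ffdf" is illustrative):
library("ff")
ffsave(x, file = "file_ffdf")  # writes file_ffdf.RData plus file_ffdf.ffData
# in a later session: ffload("file_ffdf")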
