Big Data Analysis using ff and ffbase

Native R stores every object in RAM (for more, see R's documentation on memory management). Depending on the hardware configuration, R objects can take up roughly 2-4 GB of memory; beyond that, R returns "Error: cannot allocate vector of size ……", leaving us unable to work with big data in R.
[Figure: Data storage with a standard R object]
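For instance, trying to allocate a vector larger than the available RAM reproduces the error. A minimal illustration (the vector size, and the figure R reports, are illustrative and depend on your machine):
> x <- numeric(2e10)
Error: cannot allocate vector of size 149.0 Gb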
Thanks to R being open source, a community of scholars continuously strives to create R packages that help us work effectively with big data.
The ff package, developed by Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel and Walter Zucchini, and maintained by Jens Oehlschlägel, is designed to overcome this limitation. Rather than holding data in memory, it stores it as native binary flat files on other media such as a hard disk, CD or DVD, while still letting you work with very large data files. It reads a data file in chunks and writes each chunk out to the external drive.
[Figure: Data storage with ff]
Reading a csv file using the ff package
> options(fftempdir = "[Provide path where you want to store binary files]")
> file_chunks <- read.csv.ffdf(file = "big_data.csv", header = TRUE, sep = ",", VERBOSE = TRUE, next.rows = 500000, colClasses = NA)
This reads the big_data.csv file chunk by chunk, with the chunk size set by next.rows. Each chunk is written as a binary file to the external media, and only a pointer to that file is kept in RAM. The process repeats until no chunks of the csv file are left.
[Figure: Functioning of the ff package]
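Once loaded, the ffdf object can be used much like a data frame even though its data lives on disk. A quick sanity check (output will vary with your file; getOption("fftempdir") shows where the binary chunks are kept):
> class(file_chunks)
[1] "ffdf"
> dim(file_chunks)
> getOption("fftempdir")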
In the same way, we can write csv or other flat files in chunks: ff reads chunk by chunk from the HDD or other external media and writes it out to csv or another supported format.
> write.csv.ffdf(file_chunks, "file_name.csv")
Together with the ffbase package, ff gives us the ability to perform all sorts of operations such as joins, aggregations, and slicing and dicing.
> Merged_data <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), trace = TRUE)
The merge function for ff objects (from the ffbase package) works much like merge for data frames, but it supports inner and left joins only.
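For a left join, pass all.x = TRUE as in base merge. A sketch reusing the ffobject1 and ffobject2 placeholders from above:
> left_joined <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), all.x = TRUE, trace = TRUE)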
> library("doBy")
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$Col1), FUN = function(x) summaryBy(Col3 + Col4 + Col5 ~ Col1, data = x, FUN = sum))
To perform the aggregation, I used the summaryBy function from the doBy package. In the ffdfdply call above, we split the data on the basis of a key column. If the key is a combination of two or more fields, we can generate a key column using the ikey function:
> ffobject$KeyColumn <- ikey(ffobject[c("Col1", "Col2", "Col3")])
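The generated key column can then serve as the split, so that every distinct combination of Col1, Col2 and Col3 is aggregated together. A sketch following the same pattern as above (column names remain the placeholders used throughout):
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$KeyColumn), FUN = function(x) summaryBy(Col4 + Col5 ~ KeyColumn, data = x, FUN = sum))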
For all its advantages, such as working with big data and reduced dependence on RAM, ff has a few limitations:
1. We sometimes have to compromise on speed when performing complex operations on huge data sets.
2. Development is not as easy with ff as with ordinary data frames.
3. We need to keep track of the flat files stored on disk, otherwise the HDD or external media may be left with little or no space (see the cleanup sketch below).
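One way to keep the third point under control is to delete the backing files explicitly once you are done with an ffdf. A minimal sketch using ff's delete() on the file_chunks object created earlier (assuming it is no longer needed):
> delete(file_chunks)   # removes the binary flat files from fftempdir
> rm(file_chunks)       # drops the in-RAM pointer to them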

#####################

Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB, over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:
all <- read.csv("file.csv")
However, it never finished, even after letting it run all night. I also tried reading it into a SQLite database first and reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R using the ff package and the following code:
library("ff")
x <- read.csv.ffdf(file = "file.csv", header = TRUE, VERBOSE = TRUE, first.rows = 10000, next.rows = 50000, colClasses = NA)
Because the file was so messy, I had to turn off column classes (colClasses = NA) so the read would not assign each column a class based on the first 10,000 rows. After reading the first 10,000 rows, the script then reads in chunks of 50,000 so as not to completely overload the RAM in my laptop. I also turned on VERBOSE, because it would have driven me nuts not to be able to follow the progress.
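Since parsing the csv is the slow part, it can be worth persisting the ffdf once it is read in, so later sessions can skip the parse entirely. A sketch using ff's ffsave/ffload (the file stem "file_ffdf" is illustrative):
library("ff")
ffsave(x, file = "file_ffdf")  # writes file_ffdf.RData plus file_ffdf.ffData
# in a later session: ffload("file_ffdf")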
