R & hadoop: August 2014

It is common to use R language to group and summarize data of files. Sometimes we may find ourselves processing comparatively big files which have smaller computed result and bigger source data. We cannot load them wholly to the memory when we need to compute them. The only solutions could be batch importing and computing as well as result merging. We’ll use an example in the following to illustrate the way of R language to group and summarize data from big text files.
Here is a file, sales.txt, of 1G size, which contains a great number of records of sales orders. We want to group field CLIENT and summarize field AMOUNT. “\t” is used in the file as the column separator. The first rows of data are as follows:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE
1       WVF Vip 5       440.0        2009-02-03
2       UFS Com 13     1863.4      2009-07-05
3       SWFR 2       1813.0      2009-07-08
4       JFS Pep 27     670.8        2009-07-08
5       DSG 15     3730.0      2009-07-09
6       JFE 10     1444.8      2009-07-10
7       OLF 16     625.2        2009-07-11

R’s solution:

1	con <- file("E: \\sales.txt", "r")
2	result=read.table(con,nrows=100000,sep="\t",header=TRUE)
3	result<-aggregate(result[,4],list(result[,2]),sum)
4	while(nrow(databatch<-read.table(con,header=FALSE,nrows=100000,sep="\t",col.names=c("ORDERID","Group.1","SELLERID","x","ORDERDATE")))!=0) {
5	databatch<-databatch[,c(2,4)]
6	result<-rbind(result,databatch)
7	result<-aggregate(result[,2],list(result[,1]),sum)
8	}
9	close(con)

Part of the computed result:

    Group.1         x

1       ARO  17981798

2       BDR  85584558

3       BON  51293129

4       BSF 287908788

5       CHO  23482348

Code interpretation:
The 1^st line: Open the file handle.
The 2^nd ~ 3^rd line: Import the first batch of 100,000 rows of data, group and summarize them and save the result in result.
The 4^th ~ 8^th line: Import data by loop, with 100,000 rows of data per batch, and store them in the variable databatch. Then get the second and fourth field, i.e. “CLIENT” and “AMOUNT”, merge databatch into result, and execute grouping operation.
It can be seen that, at a certain moment, only databatch, which includes 100,000 rows of data, and result, the summarizing result, have memory usage. Usually, the size of the latter is small and will not result in a memory overflow.
The 11^th line: Close the file handle.

Matters needing attention:
Data frame. Because the data frame of R language cannot directly perform the computing of big files, loop statement is necessary to help to do the job in this occasion. The steps are: import a batch of data and merge them into the data frame result; group and summarize result and then import the next batch of data. You can see that this part of code of loop statement is a little complicated.
Column name. As the first row of data is the column name, header=TRUE can be used in the first batch of data to directly set the column name. But the subsequent data hasn’t column names and header=FALSE should be used to import data. The default column names are V1, V2 and so forth when header=FALSE is used. But the default column names are Group.1 and x after grouping and summarizing are executed, and col.names is needed to change the column names in order to maintain structure consistency both before and after grouping and summarizing and set the stage for the subsequent merging. The code about column names is worth our notice because it is easy to get wrong.

Alternative solutions:

Python, esProc and Perl can also perform the same operation. They can execute the grouping and summarizing of data from big text files and the subsequent structured data computing as R language does. We’ll briefly introduce the coding methods used by esProc and Python.
esProc can process data in batches automatically, which requires no manual control from the programmers by loop statement and produces quite simple code:

	A
1	=file("e:/sales.txt").cursor@t()
2	=A2.groups(CLIENT;sum(AMOUNT))

Cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. What’s more, @t option in the code indicates that the first line of the file is the column name. So it is convenient to use the column name directly in subsequent computation.
Python’s code structure, which also requires manual loop control, is similar to that of R language. But Python itself hasn’t the structured data type, like data frame or cursor, so its code is executed in a lower level:

1	from itertools import groupby
2	from operator import itemgetter
3	result = []
4	myfile = open("E:\\sales.txt",'r')
5	BUFSIZE = 10240000
6	myfile.readline()
7	lines = myfile.readlines(BUFSIZE)
8	value=0
9	while lines:
10	for line in lines:
11	record=line.split('\t')
12	result.append([record[1],float(record[3])])
13	result=sorted(result,key=lambda x:(x[0])) # the sorting before grouping is executed
14	batch=[]
15	for key, items in groupby(result, itemgetter(0)): # group using groupBy function
16	value=0
17	for subItem in items:value+=subItem[1]
18	batch.append([key,value]) # finally, merger the summarizing results into a two-dimensional array
19	result=batch
20	lines = myfile.readlines(BUFSIZE)
21	myfile.close()

Except for the above two-dimensional array, Python can execute the operation with the third-party packages. For example, pandas has the structured data object similar to the data frame. pandas simplifies the code in a similar way as R language. But it lacks sufficient ability to perform big file computing, thus loop statement is still needed while programming.

R & hadoop

Thursday, 28 August 2014

A Method of Grouping and Summarizing Data of Big Text Files in R Language