Small File problem in Hadoop

What is the small file problem?

The Hadoop Distributed File System (HDFS) is a distributed file system designed mainly for batch processing of large volumes of data. The default block size of HDFS is 64MB. When data is stored in files significantly smaller than the default block size, performance degrades dramatically. There are mainly two reasons why small files are produced. One reason is that some files are pieces of a larger logical file (e.g. log files). Since HDFS has only recently added support for appends, these unbounded files have been saved by writing them in chunks into HDFS. The other reason is that some files are inherently small and cannot be combined into one larger file, e.g. a large corpus of images where each image is a distinct file.

Problems with small files and HDFS

HDFS cannot efficiently handle large numbers of files that are extremely small compared to the block size. When small files are used there are lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, which is an inefficient data access pattern.
In the namenode's memory every file, directory and block in HDFS is represented as an object, and each of these objects occupies about 150 bytes. If we consider 10 million files, each small enough to use a single block, then each file accounts for two objects (a file object and a block object), i.e. about 300 bytes, so the namenode needs roughly 10,000,000 x 300 bytes, which is about 3 gigabytes of memory, just for the namespace. With the hardware limitations we have, scaling up beyond this level is a problem.
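As a back-of-the-envelope illustration of that estimate (the 150 bytes per object is only a rule of thumb; real namenode usage also depends on file name lengths and replication metadata):

public class NamenodeMemoryEstimate {
    // Rough namenode heap needed for N small files, assuming each file
    // occupies exactly one block and each file/block object costs ~150 bytes.
    static long estimateBytes(long numFiles) {
        final long BYTES_PER_OBJECT = 150;
        final long OBJECTS_PER_FILE = 2; // one file object + one block object
        return numFiles * OBJECTS_PER_FILE * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 10000000L;
        // Prints: 10000000 files -> about 3.0 GB of namenode memory
        System.out.println(files + " files -> about "
                + (estimateBytes(files) / 1e9) + " GB of namenode memory");
    }
}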

Problems with small files and MapReduce

In MapReduce, map tasks usually process one block of input at a time. When the files are very small, each map task gets very little input, and because there are many files there are correspondingly many map tasks, each with its own scheduling and start-up overhead. For example, compare a 1GB file broken into 16 blocks of 64MB with the same amount of data stored as 10,000 files of about 100KB each. The 10,000 files use one map each, so the job can be tens or hundreds of times slower than the equivalent job with a single input file.
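As a rough sketch of that arithmetic (this ignores configurable split sizes and other details of FileInputFormat, and simply assumes one map task per block, with at least one map task per file):

public class MapTaskCountSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB default block size

    // One map task per block of each file, with at least one per file,
    // however small the file is.
    static long mapTasks(long numFiles, long fileSize) {
        long splitsPerFile = Math.max(1, (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
        return numFiles * splitsPerFile;
    }

    public static void main(String[] args) {
        System.out.println("1 x 1GB file   -> " + mapTasks(1, 1024L * 1024 * 1024) + " map tasks");
        System.out.println("10,000 x 100KB -> " + mapTasks(10000, 100L * 1024) + " map tasks");
    }
}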

Hands-on experiment results

Hardware: Pentium(R) Dual-Core CPU 2.00GHz, 32-bit, 3GB of RAM, running Ubuntu 11.10 on the Linux kernel 3.0.0-12-generic.
Hadoop: hadoop-0.20.203.0 with default settings.
Input: 1,000 files of 100 bytes each in the first run, and a single file of 100KB (100,000 bytes) in the second run.
Job: counting the number of occurrences of each distinct word (word count).
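A minimal version of such a word-count job (a sketch along the lines of the stock Hadoop WordCount example, not the exact code used in this experiment) looks like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the 1,000 small input files this driver launches 1,000 map tasks, while with the single 100KB file it launches just one, which is what the timings below reflect.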

Results:

With 1,000 small files of 100 bytes each:
Time taken = 23:23:30 - 22:32:43 = 00:50:47 = 50 minutes and 47 seconds

With 1 big file of 100,000 bytes (100KB):
Time taken = 01:05:12 - 01:04:44 = 00:00:28 = 28 seconds

Existing solutions

  1. Consolidator - Consolidator takes a set of files containing records belonging to the same logical file and merges them into larger files. It would be possible to merge everything into one large file, but that is not practical, as the result could be a terabyte-sized file and running over such a huge file would take a very long time. Instead, Consolidator has a "desired file size" parameter with which the user can define the maximum size of a merged file. In this way Consolidator balances its speed against the desired size of the output files. The "desired file size" can be set to some multiple of the HDFS block size so that the input splits are larger, optimizing for locality.
  2. HAR files - Hadoop Archives (HAR files) reduce the number of files stored in HDFS, lowering the pressure on the namenode's memory while remaining transparent to applications accessing the original files. However, reading through a HAR may be slower than reading plain HDFS, since each HAR file access requires two 'index' file reads in addition to the data file read. A technique to regain the lost speed would be to take advantage of the improved locality of files in HARs, but this is not implemented yet.
  3. Using HBase storage - HBase stores data in MapFiles (indexed SequenceFiles) and is therefore a good choice when you need to do MapReduce-style streaming analyses with the occasional random lookup. For small-file problems where a large number of files are generated, a different type of storage such as HBase can be much more appropriate, depending on the access pattern.
  4. Sequence Files - A sequence file is a Hadoop-specific archive file format, similar in spirit to tar and zip. The idea is to merge the file set using key-value pairs; the resulting files are known as 'Hadoop Sequence Files'. In this method the file name is used as the key and the file content as the value (see the sketch after this list).
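A minimal sketch of the Sequence File approach is given below. It reads small files from a local directory and writes them into a single SequenceFile on HDFS, using each file name as the key and the raw bytes as the value; the directory and output paths are made-up examples.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical locations: a local directory full of small files
        // and the consolidated sequence file to create on HDFS.
        File inputDir = new File("/tmp/small-files");
        Path output = new Path("/user/hadoop/consolidated.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (File f : inputDir.listFiles()) {
                byte[] content = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try {
                    in.readFully(content);
                } finally {
                    in.close();
                }
                // Key = original file name, value = file contents.
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}

The consolidated file can then be processed with SequenceFileInputFormat, so a MapReduce job sees a few large splits instead of thousands of tiny ones.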

