Storing Apache Hadoop Data on the Cloud - HDFS vs. S3
Ken and Ryu are both the best of friends and the greatest
of rivals in the Street Fighter game series. When it comes to Hadoop
data storage on the cloud though, the rivalry lies between Hadoop
Distributed File System (HDFS) and Amazon’s Simple Storage Service (S3).
Although Apache Hadoop traditionally works with HDFS, it can also use
S3 since it meets Hadoop’s file system requirements. Netflix's Hadoop data warehouse
utilizes this feature and stores data on S3 rather than HDFS. Why did
Netflix choose this data architecture? To understand their motives, let's see how HDFS and S3 do... in battle!
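Before the bell rings, here's a minimal sketch of what "using S3 as Hadoop's file system" looks like in practice: Hadoop's standard FileSystem API pointed at an s3a:// URI (older Hadoop releases used s3n:// instead). The bucket name, credentials, and paths below are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;

    public class S3AsHadoopFileSystem {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the S3A connector (placeholders).
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            // The same FileSystem API works whether the URI points at HDFS or S3.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/warehouse/"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }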
Round 1 - Scalability
HDFS relies on local storage that scales horizontally.
Increasing storage space means either adding larger hard drives to
existing nodes or adding more machines to the cluster. This is feasible,
but it is more costly and more complicated than scaling with S3.
S3 scales automatically with the amount of data stored; nothing needs to be added or reconfigured. The available space is also practically unlimited (at least, Amazon imposes no limit).

First round goes to S3
Round 2 - Durability
A statistical model for HDFS data durability
suggests that the probability of losing a block of data (64MB by
default) on a large 4,000 node cluster (16 PB total storage, 250,736,598
block replicas) is 5.7x10^-7 in the next 24 hours and 2.1x10^-4 in the
next 365 days. However, for most clusters, which contain only a few
dozen instances, the probability of losing data can be much higher.
S3 provides 99.999999999% durability of objects over a given year, which means that if you store 10,000 objects, you can expect to lose a single one of them about once every 10,000,000 years (see the S3 FAQ).
It gets even better: about a year and a half ago, one of my colleagues at Xplenty attended an AWS workshop at Amazon, and the representatives there claimed that they had not actually lost a single object in standard S3 storage (a cheaper Reduced Redundancy Storage (RRS) option is available with a durability of only 99.99%).

Large clusters may have excellent durability, but in most cases S3 is more durable than HDFS.
S3 wins again
Round 3 - Persistence
Data doesn’t persist when stopping EC2 or EMR instances. Nonetheless, costly EBS volumes can be used to persist the data on EC2.
On S3, the data always persists.

S3 wins
Round 4 - Price
To maintain data integrity, HDFS stores three copies of each block of data by default. This means that the storage space HDFS needs is three times the size of the actual data, and it costs accordingly. Although the replication factor is configurable, storing just one copy would eliminate the durability of HDFS and could result in loss of data.
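To put that 3x multiplier in concrete terms, here is a tiny back-of-the-envelope sketch; dfs.replication is the standard HDFS setting, and the data volume is just an illustrative number.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsStorageCost {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // dfs.replication defaults to 3: every block is stored three times.
            int replication = conf.getInt("dfs.replication", 3);

            double dataTb = 100.0;                // illustrative data set size
            double rawTb = dataTb * replication;  // raw disk capacity HDFS needs
            System.out.printf("%.0f TB of data needs %.0f TB of HDFS capacity%n",
                    dataTb, rawTb);
        }
    }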
Amazon takes care of backing up the data on S3 behind the scenes, so 100% of the space you pay for holds your actual data. S3 also supports storing compressed files, which considerably reduces both the space needed and the bill.

Winner - S3
Round 5 - Performance
HDFS performance is great. The data is stored and processed on the same machines, which improves access and processing speed.
Unfortunately S3 doesn’t perform as well as HDFS. The
latency is obviously higher and the data throughput is lower. However,
jobs on Hadoop are usually made of chains of map-reduce jobs and
intermediate data is stored into HDFS and the local file system so other
than reading from/writing to Amazon S3 you get the throughput of the
nodes' local disks.
We ran some tests recently with
TestDFSIO, a read/write test utility for Hadoop, on a cluster of
m1.xlarge instances with four ephemeral disk devices per node. The
results confirm that HDFS performs better:

|  | HDFS on Ephemeral Storage | Amazon S3 |
|---|---|---|
| Read | 350 mbps/node | 120 mbps/node |
| Write | 200 mbps/node | 100 mbps/node |
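If you want a rough feel for the difference on your own cluster, the sketch below is a much-simplified stand-in for TestDFSIO (not the real utility): it times a sequential 1GB write through the FileSystem API against whatever URI you pass in, hdfs:// or s3a://. The file name and size are arbitrary.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;

    public class SimpleWriteThroughput {
        public static void main(String[] args) throws Exception {
            String baseUri = args[0];  // e.g. hdfs://namenode:8020/tmp or s3a://my-bucket/tmp
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(baseUri), conf);

            byte[] buffer = new byte[1024 * 1024];    // 1 MB of zeros
            long totalBytes = 1024L * buffer.length;  // 1 GB in total

            long start = System.nanoTime();
            try (FSDataOutputStream out = fs.create(new Path(baseUri, "throughput-test.bin"))) {
                for (long written = 0; written < totalBytes; written += buffer.length) {
                    out.write(buffer);
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("Wrote %d MB in %.1f s (%.1f MB/s)%n",
                    totalBytes >> 20, seconds, (totalBytes >> 20) / seconds);
        }
    }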
Round 6 - Security
Some people think that HDFS is not secure, but that’s not
true. Hadoop provides user authentication via Kerberos and authorization
via file system permissions. Hadoop 2 takes it even further with HDFS federation, which divides a cluster into several namespaces and prevents users from accessing data that does not belong to them. Data can also be uploaded to the Amazon instances securely via SSL.
S3 has built-in security. It supports user authentication to control who has access to the data; at first, only the bucket and object owners have access. Further permissions can be granted to users and groups via bucket policies and Access Control Lists (ACLs). Data can be encrypted and uploaded securely via SSL.
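For the curious, here is a minimal sketch of such an upload with the AWS SDK for Java (v1-style classes; bucket, key, and file names are placeholders): the transfer goes over SSL, the object is encrypted at rest with SSE-S3, and the canned ACL keeps it private to the owner until further grants are made.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CannedAccessControlList;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    import java.io.File;

    public class SecureS3Upload {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            ObjectMetadata metadata = new ObjectMetadata();
            // Ask S3 to encrypt the object at rest with AES-256 (SSE-S3).
            metadata.setSSEAlgorithm(ObjectMetadata.AES_256_SERVER_SIDE_ENCRYPTION);

            PutObjectRequest request = new PutObjectRequest(
                    "my-bucket", "warehouse/part-00000.gz", new File("part-00000.gz"))
                    .withMetadata(metadata)
                    // Only the owner can access the object until further grants are made.
                    .withCannedAcl(CannedAccessControlList.Private);

            s3.putObject(request);
        }
    }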
It's a tie
Round 7 - Limitations
Even though HDFS can store files of any size, it has issues storing really small files (they should be concatenated or combined into Hadoop Archives). Also, data saved on a given cluster is only available to the machines on that cluster and cannot be used by instances outside it.
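The concatenation approach can be as simple as the sketch below, which merges every file in a directory into one large file through the FileSystem API (paths are placeholders; the hadoop archive tool is the other common route).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ConcatSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path inputDir = new Path("/data/small-files");        // placeholder paths
            Path merged = new Path("/data/merged/combined.dat");

            // Write all the small files into one big file, so the NameNode tracks
            // a single entry instead of thousands of tiny ones.
            try (FSDataOutputStream out = fs.create(merged)) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (!status.isFile()) {
                        continue;
                    }
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }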
That’s not the case with S3. The
data is independent of Hadoop clusters and can be processed by any
number of clusters simultaneously. However, files on S3 have several
limitations. They can have a maximum size of only 5GB, and certain
Hadoop storage formats, such as Parquet or ORC,
cannot be used on S3. That’s because Hadoop needs to access particular
bytes in these files, an ability that's not provided by S3.

Another tie
Tonight’s Winner
With better scalability, built-in persistence, and lower prices, S3 is tonight's winner! Nonetheless, for better performance and no limitations on file sizes or storage formats, HDFS is the way to go.