AWS Cost Saving Tip 11: How elastic thinking saves costs in Amazon EMR clusters


Introduction: Amazon Elastic MapReduce (EMR) is a web service that helps customers with big data processing using the Hadoop framework on EC2 and S3. Amazon Elastic MapReduce lets customers focus on crunching data instead of worrying about the time-consuming set-up, management or tuning of Hadoop clusters or the EC2 capacity on which they run. This built-in automation provided by AWS already saves customers significant labor costs.

What does the word Elastic mean in the Hadoop/EMR context? Simply put, you can dynamically increase or decrease the number of processing nodes depending upon the volume/velocity of the data. Adding or removing servers takes minutes, which is much faster than making similar changes to clusters running on physical servers. Let us explore this in detail and analyse how it helps you save costs in AWS big data processing.

Components:
Before getting into the savings part, let's understand the composition of an Amazon EMR cluster. An Amazon EMR cluster consists of the following server components:
Master Node: This node manages the cluster; it coordinates the distribution of the MapReduce executable and subsets of the raw data to the core and task nodes. There is only one master node in a cluster, and you cannot add or remove master nodes in an EMR cluster.
Core Node(s): A core node is an EC2 instance that runs Hadoop map/reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. You can add more core nodes to a running cluster, but you cannot remove them, because they store data and removing them risks data loss.
Task Node(s): As the name suggests, these nodes run tasks; they are the equivalent of Hadoop slave nodes. Task nodes are optional and are managed by the master node. While a cluster is running you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a cluster, you can use task nodes to manage the EC2 instance capacity of your cluster, increasing capacity to handle peak loads and decreasing it later when there is no load.
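To make the three roles concrete, here is a minimal sketch using the boto3 Python SDK that lists the instance groups of a running cluster; boto3 and the cluster ID are assumptions for illustration, not the tooling used in the original setup.

```python
# List the instance groups (MASTER / CORE / TASK) of a running EMR cluster.
# Minimal sketch: boto3 and the cluster ID are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def show_instance_groups(cluster_id):
    """Print each instance group's role, instance type and requested size."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    for g in groups:
        print(g["InstanceGroupType"], g["InstanceType"], g["RequestedInstanceCount"])

show_instance_groups("j-XXXXXXXXXXXXX")  # hypothetical cluster ID
```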

Analysis:
Imagine the log volume flow is not constant and varies every hour: some hours you receive a few hundred GBs of logs for processing, other hours only a few GBs. For peak hours your use case needs around 192 mappers/72 reducers, and in normal hours you need ~64 mappers/24 reducers or less. The peak and normal numbers can be arrived at from analysis of past data. This elasticity in log volume is a usual occurrence in many big data projects and is a source of cost leakage. The simple approach many architects take is to always run their cluster infrastructure at peak capacity, since the operation is time sensitive, but this might not be optimal in the Amazon cloud big data world. Since you can elastically increase/decrease the number of nodes in an Amazon EMR cluster, it is better to size the number of nodes dynamically every hour. Since you pay by usage in the Amazon cloud, building this elasticity into your architecture will save costs.

Based on the number of mappers/reducers required, we have chosen m1.xlarge EC2 instances for the processing nodes (an m1.xlarge provides roughly 8 map and 3 reduce slots, so 192/72 maps to 24 nodes and 64/24 to 8 nodes). So during peak hours you will need 24 processing nodes, and in normal (average) hours this reduces to 8 processing nodes.

Elastic Approach-1: Vary the Task Nodes: 
In this approach, the number of master and core nodes is kept constant: 1 master node and 4 core nodes always handle processing and data storage. The task nodes are scaled between 4 and 20 every hour depending upon the log volume flow. Since the data resides on the core nodes and only tasks/jobs are assigned to the task nodes, adding/removing task nodes does not cause problems. You can engineer a custom job manager using the AWS APIs and manage this entire cluster easily (a sketch of such a resize routine follows the cost table below). Doing the simple math, on average only 8 processing nodes are needed (4 core + 4 task), while peak hours need 24 (4 core + 20 task); with this approach you can save ~60% by not ALWAYS running your cluster at peak capacity. This model is a recommended approach for many elastic big data use cases in AWS. Refer to the table below for the cost savings:



Scenario     | No. of processing nodes | Hourly rate (USD) | Node type  | Monthly cost (USD, ~744 hrs)
Peak hours   | 24                      | 0.48              | m1.xlarge  | ~8570.88
Normal hours | 8                       | 0.48              | m1.xlarge  | ~2856.96
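As a concrete illustration of the hourly resize step of such a custom job manager, here is a minimal sketch using the boto3 Python SDK; the cluster ID, the 100 GB threshold and the 4/20 node targets are illustrative assumptions, not a definitive implementation.

```python
# Hourly resize step of a custom job manager for Elastic Approach-1.
# Minimal sketch: boto3, the cluster ID and the sizing thresholds are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def resize_task_group(cluster_id, incoming_gb):
    """Scale the TASK instance group between 4 and 20 nodes based on log volume."""
    target = 20 if incoming_gb > 100 else 4          # threshold is an assumption
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": task_group["Id"],
                         "InstanceCount": target}],
    )

resize_task_group("j-XXXXXXXXXXXXX", incoming_gb=250)   # hypothetical values
```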

Elastic Approach-2: Vary both the Core and Task Nodes:
In this approach, the number of both core and task nodes is varied dynamically. Since core nodes can only be added and not removed from a running cluster (removal could lead to data loss), this approach is recommended only for advanced use cases. Since the entire data set is stored in S3 (and is therefore reproducible) and can be moved to the EMR cluster every hour, the custom job manager can create an entire cluster (even every hour) sized to the log data volume (GBs). Example: in the first hour, 4 core + 10 task nodes are used for processing; in the second hour the data volume increases and the cluster grows to 4 core + 20 task nodes; in the third/fourth hour there are only a few GBs of data and only 8 mappers/3 reducers are needed, so instead of running the previous hour's 4 core + 20 task nodes, a new EMR cluster can be created with just 1 master and 1-2 core nodes (a sketch follows below). This approach requires engineering a custom job manager using the AWS APIs to manage the cluster. Though it is a little more complex to engineer, it saves more cost than Approach-1 over the medium to long term.
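Below is a minimal sketch of the "fresh, right-sized cluster every hour" idea, again assuming boto3; the release label, IAM roles, sizing thresholds and the all On-Demand market choice are placeholders to adapt, not the author's actual job manager.

```python
# Create a new EMR cluster each hour, sized to that hour's log volume.
# Minimal sketch: boto3, release label, IAM roles and thresholds are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def launch_cluster_for_volume(incoming_gb):
    """Launch a cluster whose core/task counts depend on this hour's log volume."""
    if incoming_gb < 10:          # only a few GBs: tiny cluster
        core, task = 2, 0
    elif incoming_gb < 100:       # normal hour
        core, task = 4, 10
    else:                         # peak hour
        core, task = 4, 20

    groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m1.xlarge", "InstanceCount": 1},   # type kept from the article;
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m1.xlarge", "InstanceCount": core},  # use one supported by your release
    ]
    if task:
        groups.append({"Name": "Task", "InstanceRole": "TASK", "Market": "ON_DEMAND",
                       "InstanceType": "m1.xlarge", "InstanceCount": task})

    resp = emr.run_job_flow(
        Name="hourly-log-processing",
        ReleaseLabel="emr-5.36.0",                     # placeholder release
        Instances={"InstanceGroups": groups,
                   "KeepJobFlowAliveWhenNoSteps": False},
        JobFlowRole="EMR_EC2_DefaultRole",             # default roles assumed
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```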

Note: The approaches illustrated are not theoretical. I have put both of the above techniques into production use for some customers, and they are already seeing huge cost savings.

Coming Soon - Adding Spot to this equation gives brutal savings ... 


AWS Cost Saving Tip 12: Add Spot Instances with Amazon EMR

In continuation of my post "How elastic thinking saves costs in Amazon EMR clusters", this post explores how we can exploit Amazon EMR further by introducing Spot EC2 instances into the cluster and achieving more cost savings.

Most of us know that Amazon Spot EC2 instances are usually a good choice for time-flexible and interruption-tolerant tasks. These instances are traded at a fluctuating Spot market price, and you can set your bid price using the AWS APIs or the AWS Console. Once Spot EC2 capacity is available at your bid price, AWS allots the instances to your account. Spot instances are usually far cheaper than On-Demand EC2 instances. Example: the On-Demand m1.xlarge price is 0.48 USD per hour, while on the Spot market you can sometimes find it at 0.052 USD per hour, roughly 9x cheaper than On-Demand. Even if you bid competitively and get hold of Spot EC2 at around 0.24 USD most of the time, you are saving 50% off the On-Demand price straight away. Big data use cases usually need lots of EC2 nodes for processing, so adopting such techniques can make a vast difference to your infrastructure cost and operations in the long term. I am sharing my experience on this subject as tips and techniques you can adopt to save costs while using EMR clusters on Amazon for big data problems.
Note: While dealing with Spot, you can be sure that you will never pay more than your maximum bid price per hour.

To know more about a real implementation of these tips, read the following case study: Lock, Stock and X Smoking EC2's.

Tip 1: Make the right choice (Spot vs On-Demand) for the cluster components
Data-critical workloads: For workloads that cannot afford to lose data, run the master + core nodes on On-Demand EC2 and your task nodes on Spot EC2. This is the most common pattern when combining Spot and On-Demand in an Amazon EMR cluster (a sketch follows below). Since the task nodes run at Spot prices, depending upon your bidding strategy you can save ~50% compared to running them On-Demand. You can save further (if you are lucky) by reserving your core and master nodes, but you will then be tied to an AZ. In my view this is not a good or common technique, because some AZs can be very noisy with high Spot prices.
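A minimal sketch of this pattern with boto3: the master and core groups stay On-Demand and untouched, while a Spot task group is attached to a running cluster. The cluster ID, node count and bid price are assumptions for illustration.

```python
# Attach a Spot task instance group to a running EMR cluster.
# Minimal sketch: boto3, cluster ID, count and bid price are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Master and core stay On-Demand; only the task group bids on the Spot market.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",                 # hypothetical cluster ID
    InstanceGroups=[{
        "Name": "Spot task nodes",
        "InstanceRole": "TASK",
        "InstanceType": "m1.xlarge",
        "InstanceCount": 20,
        "Market": "SPOT",
        "BidPrice": "0.24",                      # ~50% of the 0.48 On-Demand rate
    }],
)
```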
Cost-driven workloads: When solving big data problems, you sometimes face scenarios where cost matters far more than time. Example: you are processing archives of old logs as low-priority jobs, where the cost of processing is very important and there is usually abundant time left. In such cases you can run all of master + core + task on Spot EC2 to gain further savings beyond the data-critical approach. Since all the nodes run at Spot prices, depending upon your bidding strategy you can save ~60% or more compared to running them On-Demand. AWS also publishes a table indicating the Amazon EMR + Spot combinations that are widely used.

Tip 2: There is a free lunch sometimes
Spot instances can be interrupted by AWS when the Spot market price rises to or above your bid price; AWS can then pull back the Spot EC2 instances assigned to your account. If your Spot task nodes are interrupted, you are not charged for the partial hour of usage, i.e. if you started the instance at 10:05 am and it is interrupted by Spot price fluctuations at 10:45 am, you are not charged for that partial hour. If your processing is totally time-insensitive, you can keep your bid close to the Spot price, where instances are easily interrupted by AWS, and exploit this partial-hour behaviour. Theoretically you can get most of the processing done by your task nodes for free* using this strategy.

Tip 3: Use the AZ wisely when it comes to Spot
Different AZs inside an Amazon EC2 region have different Spot prices for the same instance type. Observe this pattern for a while, build some intelligence around the collected price data, and rebuild your cluster in the AZ with the lowest price (a sketch follows below). Since the master + core + task nodes need to run in the same AZ for better latency, it is advisable to architect your EMR clusters so that they can be switched (i.e. recreated) to a different AZ according to Spot prices. If you build this flexibility into your architecture, you can save costs by leveraging inter-AZ price fluctuations. Comparing Spot price charts for two AZs in the same region over the same time period makes these variations obvious. Make your choice wisely from time to time.
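A minimal sketch of the price-watching part, assuming boto3; the region, look-back window and instance type are illustrative, and a real implementation would also handle result pagination.

```python
# Collect recent Spot price history per AZ and pick the cheapest one,
# so the cluster can be rebuilt there. Minimal sketch with illustrative values.
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-east-1")

def cheapest_az(instance_type="m1.xlarge", hours=24):
    """Return the AZ with the lowest average Spot price over the last `hours`."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=hours),
    )["SpotPriceHistory"]

    prices_by_az = {}
    for record in history:
        prices_by_az.setdefault(record["AvailabilityZone"], []).append(
            float(record["SpotPrice"]))
    averages = {az: sum(p) / len(p) for az, p in prices_by_az.items()}
    return min(averages, key=averages.get)

print(cheapest_az())   # e.g. 'us-east-1b', depending on current prices
```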


Tip 4: Keep your Job logic small and store intermediate outputs in S3
Break down your complex processing logic into small jobs, and design your jobs and tasks in the EMR cluster so that they run for a very short period of time (for example, a few minutes). Store all intermediate job outputs in Amazon S3. This approach is helpful in the EMR world and gives you the following benefits (a sketch follows the list):

  • When your core + task nodes are interrupted frequently, you can still continue from the intermediate checkpoints, reading the data back from S3.
  • You have the flexibility to recreate the EMR cluster in a different AZ depending upon Spot price fluctuations.
  • You can decide the number of nodes needed for your EMR cluster (even every hour) depending upon the data volume, density and velocity.
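As a sketch of what "small steps with S3 intermediate outputs" can look like, the snippet below submits two short streaming steps to an existing cluster using boto3; the bucket, script locations and cluster ID are placeholders, not the author's actual jobs.

```python
# Submit small, chained streaming steps whose intermediate output lands in S3.
# Minimal sketch: boto3, bucket names, scripts and cluster ID are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def streaming_step(name, input_path, output_path):
    """One short-lived step; its S3 output becomes the next step's input."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",   # an interrupted step can be resubmitted
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", input_path,
                     "-output", output_path,
                     "-mapper", "s3://my-bucket/scripts/mapper.py",    # placeholder
                     "-reducer", "s3://my-bucket/scripts/reducer.py"], # placeholder
        },
    }

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # hypothetical cluster ID
    Steps=[
        streaming_step("clean-logs", "s3://my-bucket/raw/hour-10/",
                       "s3://my-bucket/intermediate/cleaned/"),
        streaming_step("aggregate", "s3://my-bucket/intermediate/cleaned/",
                       "s3://my-bucket/output/hourly-aggregates/"),
    ],
)
```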

All three of the above points, when implemented, contribute to the elasticity of your architecture and thereby help you save costs in the Amazon cloud. This recommendation is not suitable for all jobs; it has to be carefully mapped to the right use cases by the architects.

To know more about a real implementation of the above tips, read the following case study: Lock, Stock and X Smoking EC2's.

