Import Hadoop Source Project Into Eclipse And Build Hadoop-2.2.0 On Mac OS X
The Official Building Document is
where I started. There you can find all the steps and cautions you need
to build a binary version of Hadoop from source. The most essential URLs
are as follows:
Mac OS X-10.9.4
Protocol Buffer-2.5.0
After deploying all the items above, add the relevant environment variables to '~/.profile'.
Remember to apply the changes with 'source ~/.profile' after editing. You can double-check by issuing the following command. If all the versions print out normally, just move on!
Another prerequisite is required by BUILDING.txt, in which it says "A one-time manual step is required to enable building Hadoop OS X with Java 7 every time the JDK is updated":
Then we are going to git clone the hadoop source project to our local filesystem:
When it is done, we can list all the remote branches of the Hadoop project by issuing 'git branch -r' in the root path of the project. Switch to the 2.2.0 branch via 'git checkout branch-2.2.0'. Then open pom.xml in the root path of the project to make sure it has changed to branch-2.2.0:
Still in the root path of the project, execute commands as below:
Possible Problem #1:
Possible Problem #2:
Finally, install 'm2e' in Eclipse and import the Hadoop source project.
After the project is imported, don't be surprised by the many errors in almost all Hadoop sub-projects (at least for me, there were a great many red crosses on my projects). The most common one is "Plugin execution not covered by lifecycle configuration ... Maven Project Build Lifecycle Mapping Problem", which is caused by the m2e Eclipse plugin and Maven itself being developed out of sync. So far I have not found a good solution to this problem. If anyone has a better idea, please leave a message, big thanks! Anyway, we can still browse, read and revise the source code in Eclipse before building the project from the command line.
Building Hadoop is a lot easier than I expected. There are detailed instructions in BUILDING.txt, too. The most essential part is as follows:
As stated in the Native Libraries Guide for Hadoop-2.2.0, the native Hadoop library is supported on *nix platforms only. The library does not work with Cygwin or the Mac OS X platform. Consequently, we can build Hadoop with the command:
- Read-only version of hadoop source provided by Apache
- The latest version of BUILDING.txt (You can find the corresponding BUILDING.txt from the root path of the hadoop source project)
- Native Libraries Guide for Hadoop-2.2.0
- Working with Hadoop under Eclipse
Wednesday, October 29, 2014
InputFormat In Hive And The Way To Customize CombineHiveInputFormat
Part.1 InputFormat In Hive
There are two places where we can specify an InputFormat in Hive: when creating a table, and before executing an HQL statement.
For the first case, we can specify InputFormat and OutputFormat when creating hive table, just like:
CREATE TABLE example_tbl (
  id int,
  name string
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT '';
We could check out the specified InputFormat and OutputFormat for a table by:
hive> DESC FORMATTED example_tbl;
...
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.TextInputFormat
OutputFormat:
Compressed:     No
...
In this case, the InputFormat and OutputFormat are responsible for storing
data into as well as retrieving data out of HDFS, so the storage format is
transparent to Hive itself. For instance, suppose some text content is saved
in a binary format in HDFS and mapped to a particular Hive table. When we
invoke a Hive task on this table, it loads the data via the table's
InputFormat to get the 'decoded' text content. After executing the HQL, the
Hive task writes the result to whatever the destination is (HDFS, local file
system, screen, etc.) via its OutputFormat.
For the second case, we can set 'hive.input.format' before invoking an HQL statement:
hive> set;
hive> select * from example_tbl where id > 10000;
If we set this parameter in hive-site.xml, it becomes the default Hive InputFormat whenever 'hive.input.format' is not set explicitly before the HQL statement.
The InputFormat in this scenario serves a different function from the
former one. First, let's take a glance at
'org.apache.hadoop.mapred.FileInputFormat', which is the base class for
all file-based InputFormats. There are three essential methods in this
class:
boolean isSplitable(FileSystem fs, Path filename)
InputSplit[] getSplits(JobConf job, int numSplits)
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
'isSplitable' is self-explanatory: it returns whether the given file is
splittable. This method takes effect in plain MapReduce programs; when it
comes to Hive, we can alternatively set
'mapreduce.input.fileinputformat.split.minsize' in hive-site.xml to a
very large value to achieve the same effect.
'getSplits' returns an array of InputSplit objects, whose length
corresponds to the number of mappers for this HQL task. Every
InputSplit contains one or more file chunks in the current file system;
the details will be discussed later.
'getRecordReader' returns an 'org.apache.hadoop.mapred.RecordReader'
object, whose function is to read data record by record from the underlying
file system. The main methods are as follows:
K createKey()
V createValue()
boolean next(K key, V value)
float getProgress()
'createKey', 'createValue' and 'getProgress' are self-explanatory.
'next' fills in the key and value parameters from the current read
position and returns true; at EOF, it returns false.
In the former case mentioned above, only the 'getRecordReader' method
is used, whereas in the latter case, only the 'getSplits' method is
used.
Part.2 Customize CombineHiveInputFormat
In my daily work, I needed to rewrite the CombineHiveInputFormat class.
Our data in HDFS is partitioned by yyyyMMdd, and the files in every
partition follow the same naming pattern:
/user/supertool/hiveTest/20140901/part-0
/user/supertool/hiveTest/20140901/part-1
/user/supertool/hiveTest/20140901/part-2
/user/supertool/hiveTest/20140902/part-0
/user/supertool/hiveTest/20140902/part-1
/user/supertool/hiveTest/20140902/part-2
/user/supertool/hiveTest/20140903/part-0
/user/supertool/hiveTest/20140903/part-1
/user/supertool/hiveTest/20140903/part-2
This experimental hive table is created by:
CREATE EXTERNAL table hive_combine_test (id string, rdm string)
  PARTITIONED BY (dateid string)
  row format delimited fields terminated by '\t'
  stored as textfile;
ALTER TABLE hive_combine_test ADD PARTITION (dateid='20140901') location '/user/supertool/zhudi/hiveTest/20140901';
ALTER TABLE hive_combine_test ADD PARTITION (dateid='20140902') location '/user/supertool/zhudi/hiveTest/20140902';
ALTER TABLE hive_combine_test ADD PARTITION (dateid='20140903') location '/user/supertool/zhudi/hiveTest/20140903';
What we intend to do is to pack all the files with the same index i
(for example, part-0 from every date) from different partitions into one
InputSplit, so that they are handled by a single mapper. Overall, there
should be 64 mappers no matter how many days (partitions) are involved in my HQL.
The way to customize CombineHiveInputFormat in Eclipse is as follows:
In eclipse, File-->New-->Other-->Maven Project-->Create a simple project.
Revise pom.xml according to your own hadoop and hive version:
<dependencies>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.13.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-serde</artifactId>
    <version>0.13.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>0.13.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId></groupId>
    <artifactId></artifactId>
    <version>1.7.0_25</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
  </dependency>
</dependencies>
At the same time, we should insert maven-assembly-plugin in pom.xml in order to package:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id> <!-- this is used for inheritance merges -->
          <phase>package</phase> <!-- bind to the packaging phase -->
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
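After all the peripheral settings, we can create a new class
derived from CombineHiveInputFormat. What we intend to do is to
reconstruct the array of InputSplit returned from 'getSplits':
import java.io.IOException;

import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class JudCombineHiveInputFormatOld<K extends WritableComparable, V extends Writable>
        extends CombineHiveInputFormat<WritableComparable, Writable> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Splits as computed by the stock CombineHiveInputFormat.
        InputSplit[] iss = super.getSplits(job, numSplits);
        //TODO: Reconstruct the iss to what we want.
        return null; // placeholder until the reconstruction is implemented
    }
}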
Consequently, it is time to learn a bit about InputSplit. In
CombineHiveInputFormat, the implementation class for InputSplit is
CombineHiveInputSplit, which contains an InputSplitShim
implementation class. The constructor for
'org.apache.hadoop.hive.shims.HadoopShimsSecure.InputSplitShim' needs an
'org.apache.hadoop.mapred.lib.CombineFileSplit' object, whose
constructor looks like:
CombineFileSplit(JobConf job, Path[] files, long[] start, long[] lengths, String[] locations)
As shown, the parameters correspond to the InputSplit in MapReduce: the
JobConf, the file paths, the start position of each file chunk, the size
of each file chunk, and the hosts (locations) associated with the split,
respectively.
After getting familiar with the structure of the InputSplit class, we can
simply rearrange all the files in the InputSplits according to the file name.
Just one more thing: CombineHiveInputSplit has a field named
'inputFormatClassName', which is the name of the InputFormat configured when
creating the Hive table (the former case stated above). While a Hive task
executes, files may come from different sources with different InputFormats
(some come from the Hive table's source data, some from Hive temporary
data). Thus, files should also be grouped by inputFormatClassName when we
rearrange the InputSplits.
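The grouping rule described above can be sketched independently of the Hive classes. This is only an illustration in Python, not the Java implementation: the helper names `slice_id` and `group_chunks` are made up for this sketch, and it assumes that file names end in `part-<i>` as in the listing earlier in this post.

```python
import re
from collections import defaultdict

# Assumption: files are named "part-<i>", as in the directory listing above.
PART_RE = re.compile(r"part-(\d+)$")


def slice_id(path):
    """Return the trailing index i of a path that ends in part-<i>."""
    match = PART_RE.search(path)
    if match is None:
        raise ValueError("unexpected file name: " + path)
    return int(match.group(1))


def group_chunks(chunks):
    """Group (input_format_class_name, path) pairs by class name and slice id.

    Each resulting group corresponds to the files of one rebuilt InputSplit.
    """
    groups = defaultdict(list)
    for input_format, path in chunks:
        groups[(input_format, slice_id(path))].append(path)
    return dict(groups)


TEXT = "org.apache.hadoop.mapred.TextInputFormat"
chunks = [
    (TEXT, "/user/supertool/hiveTest/20140901/part-0"),
    (TEXT, "/user/supertool/hiveTest/20140901/part-1"),
    (TEXT, "/user/supertool/hiveTest/20140902/part-0"),
    (TEXT, "/user/supertool/hiveTest/20140902/part-1"),
]
groups = group_chunks(chunks)
for key in sorted(groups):
    print(key, groups[key])
```

Using the pair (class name, index) as the key keeps files with different InputFormats in separate groups even when they share the same index.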
Here's a code snippet for reconstruction of CombineHiveInputFormat:
Path[] files = new Path[curSplitInfos.size()];
long[] starts = new long[curSplitInfos.size()];
long[] lengths = new long[curSplitInfos.size()];
for (int i = 0; i < curSplitInfos.size(); ++i) {
    SplitInfo si = curSplitInfos.get(i);
    files[i] = si.getFile();
    starts[i] = si.getStart();
    lengths[i] = si.getLength();
}
String[] locations = new String[1];
locations[0] = slice2host.get(sliceid);
org.apache.hadoop.mapred.lib.CombineFileSplit cfs =
        new org.apache.hadoop.mapred.lib.CombineFileSplit(
                job, files, starts, lengths, locations);
org.apache.hadoop.hive.shims.HadoopShimsSecure.InputSplitShim iqo =
        new org.apache.hadoop.hive.shims.HadoopShimsSecure.InputSplitShim(cfs);
CombineHiveInputSplit chis = new CombineHiveInputSplit(job, iqo);
chis.setInputFormatClassName(curInputFormatClassName);
After implementing it, we can simply run 'mvn clean package -Dmaven.test.skip=true', then copy the '*jar-with-dependencies*.jar' from the project's target folder to $HIVE_HOME/lib as well as $HADOOP_HOME/share/hadoop/common/lib on every node of the Hive cluster.
At last, we can set hive.input.format to our own version by 'set hive.input.format=com.judking.hive.inputformat.JudCombineHiveInputFormat;' before invoking an HQL statement.
If debugging is needed, we can print with System.out in our InputFormat
class, in which case the info is printed to the screen. Alternatively, we
can use 'LogFactory.getLog()' to retrieve a Log object, whose output
goes to '/tmp/(current_user)/hive.log'.
Tuesday, October 28, 2014
Find And Replace Specific Keyword In All Files From A Directory Recursively
If we intend to replace keyword '<abc>' with '<def>' in all
*.java files from $ROOT_PATH, we can simply achieve this by:
The first part 'find . -name "*.java" -print' prints all the relative paths of files that match the pattern "*.java", just like:
Then all the output lines will be passed to sed command via xargs, thus the equivalent of the latter part is as below:
'-i' stands for:
which will output the replaced file to a file.
But here's a tricky part. As mentioned above, the argument of '-i' is optional on Ubuntu, whereas on Mac OS X it is effectively mandatory to give '-i' a value, or an error like 'invalid command code .' may be thrown. Consequently, on Mac OS X it is recommended that we add '-i ""' to our command, although it's a little bit cumbersome :)
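If the difference between the two sed flavours gets in the way, the same recursive replacement can be sketched in Python, which behaves the same on both systems. This is just an alternative sketch, not a replacement for the command above: the function name `replace_in_tree` is made up here, and unlike sed it does a literal (non-regex) replacement.

```python
import tempfile
from pathlib import Path


def replace_in_tree(root, pattern, old, new):
    """Replace `old` with `new` in every file under `root` matching `pattern`.

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed


# Small demo in a throwaway directory: only the *.java file is rewritten.
with tempfile.TemporaryDirectory() as tmp:
    java_file = Path(tmp) / "pkg" / "Demo.java"
    java_file.parent.mkdir()
    java_file.write_text("class Demo { /* <abc> */ }", encoding="utf-8")
    text_file = Path(tmp) / "notes.txt"
    text_file.write_text("<abc>", encoding="utf-8")

    changed = replace_in_tree(tmp, "*.java", "<abc>", "<def>")
    changed_names = [p.name for p in changed]
    java_text = java_file.read_text(encoding="utf-8")
    txt_text = text_file.read_text(encoding="utf-8")

print(changed_names, java_text, txt_text)
```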
Saturday, October 25, 2014
How To Set The Queue Where A MapReduce Task Or Hive Task To Run
There is always a need for us to specify the queue for our MR or hive task. Here's the way:
An example for MapReduce task:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 10000
For a Hive task, insert the following statement before invoking the real HQL task:
To generalize, we can safely conclude that most Hadoop or Hive configurations can be set in the forms shown above, respectively. 'Most' here means that some configurations cannot be changed at runtime, or are declared 'final'.
Friday, October 24, 2014
Linux: Kill Processes With Specific Keyword
When we want to kill a process with a specific keyword, we can simply
find the PID (process ID) with the following command (the PID is always
shown in the second column of the result):
Then the process can be forcibly killed by:
This method works fine when there are only one or a few processes to terminate. As the number of processes grows, it becomes too painful to do manually. Consequently, we have to deal with it in some other way.
Note: All the solutions below will kill the processes that are listed on the screen by the command 'ps aux | grep "keyword"'.
Solution 1:
Solution 2:
Solution 3:
Solution 4:
Pick one that you're most comfortable with :)
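For completeness, here is the same idea as a small Python sketch: take the text printed by 'ps aux', keep the lines that contain the keyword, and read the PID from the second column. The function names and the sample output are made up for illustration; `kill_all` is defined but deliberately not called in the demo.

```python
import os
import signal


def pids_matching(ps_output, keyword):
    """Return the PIDs (second column) of `ps aux` lines containing keyword."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        if keyword in line:
            pids.append(int(line.split()[1]))
    return pids


def kill_all(pids):
    """Forcibly kill every PID in the list (same effect as `kill -9`)."""
    for pid in pids:
        os.kill(pid, signal.SIGKILL)


# Hypothetical `ps aux` output, only used to demonstrate the parsing.
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hadoop    4242  0.1  1.2 123456  7890 ?        Sl   10:00   0:01 java -jar worker.jar
hadoop    4243  0.0  0.1  23456   890 pts/0    S    10:01   0:00 vim notes.txt
hadoop    4250  0.3  1.1 123456  7000 ?        Sl   10:02   0:02 java -jar worker.jar --id 2
"""
print(pids_matching(sample, "worker.jar"))
```

On a real machine, the text would come from something like `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`; remember that the keyword is matched against the whole line, just as grep does.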
VCore Configuration In Hadoop
Just like memory, vcores (short for virtual cores) are another type of
resource in a Hadoop cluster. They are an abstraction of CPU capacity.
We can see from class 'org.apache.hadoop.yarn.api.records.Resource', which is the abstraction of resource in YARN, that memory and vcores are equally treated in Hadoop:
Here are all the related parameters for vcores in the Hadoop configuration files:
The first three parameters are configured in mapred-site.xml; the rest are in yarn-site.xml. The per-task parameters, such as mapreduce.reduce.cpu.vcores, are easy to understand: they represent the number of vcores provided to each map or reduce task, respectively. The remaining one is the number of vcores for the MapReduce ApplicationMaster. (MR ApplicationMaster: a new framework for MR on YARN, which cooperates with the NodeManager using the resources retrieved from the ResourceManager.)
The last three parameters are well explained in their description sections. As for yarn.nodemanager.resource.cpu-vcores, it should be set to the number of processors in a single node in most cases (one virtual core should correspond to one physical processor). At the same time, it is recommended that yarn.scheduler.maximum-allocation-vcores be set no higher than yarn.nodemanager.resource.cpu-vcores.
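The two rules of thumb above are easy to check mechanically. The following Python sketch is only a sanity check with a made-up function name (`check_vcores`); it does not read any Hadoop configuration file, the values are passed in by hand.

```python
def check_vcores(node_processors, nodemanager_vcores, max_allocation_vcores):
    """Return warnings for the two vcore guidelines described above."""
    warnings = []
    if nodemanager_vcores != node_processors:
        warnings.append(
            "yarn.nodemanager.resource.cpu-vcores (%d) differs from the "
            "number of processors (%d)" % (nodemanager_vcores, node_processors)
        )
    if max_allocation_vcores > nodemanager_vcores:
        warnings.append(
            "yarn.scheduler.maximum-allocation-vcores (%d) exceeds "
            "yarn.nodemanager.resource.cpu-vcores (%d)"
            % (max_allocation_vcores, nodemanager_vcores)
        )
    return warnings


print(check_vcores(24, 24, 24))  # both guidelines satisfied
print(check_vcores(24, 24, 32))  # maximum allocation is too high
```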
For instance, if a single node has 24 processors, then an appropriate set of params can be set as follows:
Related Posts:
· Memory Configuration In Hadoop
Thursday, October 23, 2014
Hadoop_Troubleshooting: fair-scheduler.xml Does Not Take Effect After Revising
When using the Fair Scheduler in YARN, we don't need to restart the Hadoop cluster when fair-scheduler.xml is altered, as stated in the official document:
The Fair Scheduler contains configuration in two places -- algorithm parameters are set in HADOOP_CONF_DIR/mapred-site.xml, while a separate XML file called the allocation file, located by default in HADOOP_CONF_DIR/fair-scheduler.xml, is used to configure pools, minimum shares, running job limits and preemption timeouts. The allocation file is reloaded periodically at runtime, allowing you to change pool settings without restarting your Hadoop cluster.
However, there are times when changes to fair-scheduler.xml don't take effect and we have no idea what's going wrong. Here's the way to find out!
Firstly, go into the directory '$HADOOP_HOME/logs'.
Then open the file 'yarn-hadoop-resourcemanager-*.log'.
In it, search for 'ERROR' from the bottom, looking for records like:
From the above, we can see that the exception is thrown at line 460 in class 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager'.
Tracing into the source code, we can find the corresponding exception content:
Wednesday, October 22, 2014
Keep the Original Format Of Text When Pasting to Vim (A.K.A. Turning Off Auto-Indent Feature Of Vim)
Sometimes when we paste some text into vim, we want to retain the
original format of the text. But by default, vim adds indentation to our
text automatically, just like:

Solution 1:
Vim provides the ‘paste’ option to keep the pasted text unmodified:
Solution 2:
Alternatively, vim offers the ‘pastetoggle’ option to turn ‘paste’ on and off by pressing a key. All we need to do is append the following code to ~/.vimrc. Then we can just press the 'F2' key to switch between ‘paste on’ and ‘paste off’.
Generate Public Key From A Private Key
When we've got the private key, say id_rsa, we can regenerate the public key by:
It is well-explained in ‘man ssh-keygen’:
Hadoop_Troubleshooting: Removing queues from fair-scheduler.xml will not take effect in the YARN monitoring webpage.
In fair-scheduler.xml, there's a queue named "test_queue", whose configuration is as below:
After I deleted these settings, the queue was not removed from the YARN monitoring webpage, even though all the parameters (Min Resources, Max Resources, Fair Share) under "test_queue" were blank. I'm sure that fair-scheduler.xml was reloaded correctly.
Then I checked it with the following command; the queue state was running, just like all the other queues.
Curiously enough, I tested whether I could still assign my Hive task to this queue, and I STILL COULD!

As you can see, the MaxResources of this queue swells to 100% of the total resources after it is deleted from fair-scheduler.xml, so anyone can "escape" the ACL and use the resources arbitrarily. Consequently, attention should be paid to this scenario.
The solution is to restart the YARN service, at the price of causing all the ongoing tasks to fail. (If anyone has a better solution, please let me know by leaving a message!)
Tuesday, October 21, 2014
Hadoop_Troubleshooting: 'AclSubmitApps' In fair-scheduler.xml Is Not Working.
After configuring "AclSubmitApps"
for a specific queue in fair-scheduler.xml, I could still submit a Hive
task as a user who is not in the "AclSubmitApps" list, which is not the
behavior described in the official document.
The configuration of my testmonitor queue:
The monitoring status for my hive task:
The solution is simple and easy: add "aclAdministerApps" to the queue correspondingly:
Then we can check the ACL settings of the queues via 'hadoop queue -showacls'. This time, the ACL of queue `root.testmonitor` has neither SUBMIT_APPLICATIONS nor ADMINISTER_QUEUE.
When I submit a task as user monitor, the following exception is thrown as expected:
Honestly, I don't know exactly why we have to add "aclAdministerApps" to make it work, and the official document says nothing about it, either. If anyone knows the underlying reason for this solution, please leave a message, I'd really appreciate it :)
Hadoop_Troubleshooting: Job hangs at "map 0% reduce 0%" with logs "Reduce slow start threshold not met"
I submitted an example Hadoop task as below:
After some googling and experimenting, I found that a Hadoop job most likely hangs at "Reduce slow start threshold not met" when there are not enough resources, such as memory or vcores.
In my case, I rechecked $HADOOP_HOME/etc/hadoop/fair-scheduler.xml, and found that the vcores of the root.supertool queue had accidentally been set to zero:
When I set vcores back to normal, the stuck job continued to run.
P.S. Besides the condition above, an unreasonable memory or vcore configuration can also lead to this scenario. Please visit Memory Configuration in Hadoop and VCore Configuration in Hadoop for more reference.
P.S. Again: After I installed Hadoop on my Mac and ran "hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 3 100000000", I found that it hung at "Reduce slow start threshold not met. completedMapsForReduceSlowstart 1" again. I can assure you that I had configured memory and vcores as instructed in the links above. Then I found that the maxResources for queue root.test in fair-scheduler.xml was set to 500MB, but 'yarn.scheduler.minimum-allocation-mb', '' and 'mapreduce.reduce.memory.mb' were all above 500MB; that is to say, not even a single mapper or reducer could be allocated in this queue. Consequently, we should be aware that the maxResources for a specific queue should be greater than all three parameters above.
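The check from the last P.S. can be written down as a tiny Python sketch. The function name `blocking_settings` is made up, and the two values of 1024 MB and 2048 MB are hypothetical (the post only says they were above 500MB); the rule itself, maxResources greater than each memory setting, is the one stated above.

```python
def blocking_settings(queue_max_mb, settings_mb):
    """Return the settings that the queue's maxResources is not greater than."""
    return sorted(
        name for name, value_mb in settings_mb.items() if queue_max_mb <= value_mb
    )


# Hypothetical values; the queue limit of 500 MB is the one from the post.
settings = {
    "yarn.scheduler.minimum-allocation-mb": 1024,
    "mapreduce.reduce.memory.mb": 2048,
}
print(blocking_settings(500, settings))   # both settings block the queue
print(blocking_settings(4096, settings))  # nothing blocks the queue
```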
Memory Configuration In Hadoop
In this post, there are some recommendations on how to configure YARN
and MapReduce memory allocation settings based on the node hardware.
YARN takes into account all of the available compute resources on each
machine in the cluster. Based on the available resources, YARN
negotiates resource requests from applications (such as MapReduce)
running in the cluster. YARN then provides processing capacity to each
application by allocating Containers. A Container is the basic unit of
processing capacity in YARN, and is an encapsulation of resource
elements (memory, cpu etc.).
In a Hadoop cluster, it is vital to balance the usage of memory (RAM),
processors (CPU cores) and disks so that processing is not constrained
by any one of these cluster resources. As a general recommendation,
allowing for two Containers per disk and per core gives the best balance
for cluster utilization.
When determining the appropriate YARN and MapReduce memory
configurations for a cluster node, start with the available hardware
resources. Specifically, note the following values on each node:
- RAM (Amount of memory)
- CORES (Number of CPU cores)
- DISKS (Number of disks)
The total available RAM for YARN and MapReduce should take into account
the Reserved Memory. Reserved Memory is the RAM needed by system
processes and other Hadoop processes (such as HBase).
Reserved Memory = Reserved for stack memory + Reserved for HBase Memory (If HBase is on the same node)
Use the following table to determine the Reserved Memory per node.
Reserved Memory Recommendations
Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory |
4 GB | 1 GB | 1 GB |
8 GB | 2 GB | 1 GB |
16 GB | 2 GB | 2 GB |
24 GB | 4 GB | 4 GB |
48 GB | 6 GB | 8 GB |
64 GB | 8 GB | 8 GB |
72 GB | 8 GB | 8 GB |
96 GB | 12 GB | 16 GB |
128 GB | 24 GB | 24 GB |
256 GB | 32 GB | 32 GB |
512 GB | 64 GB | 64 GB |
The next calculation is to determine the maximum number of containers allowed per node. The following formula can be used:
# of containers = min (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
Where MIN_CONTAINER_SIZE is the minimum container size (in RAM). This
value is dependent on the amount of RAM available -- in smaller memory
nodes, the minimum container size should also be smaller. The following
table outlines the recommended values:
Total RAM per Node | Recommended Minimum Container Size |
Less than 4 GB | 256 MB |
Between 4 GB and 8 GB | 512 MB |
Between 8 GB and 24 GB | 1024 MB |
Above 24 GB | 2048 MB |
The final calculation is to determine the amount of RAM per container:
RAM-per-container = max(MIN_CONTAINER_SIZE, (Total available RAM) / containers)
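The two formulas above translate directly into code. Below is a minimal Python sketch with made-up function names; rounding the number of containers down to a whole number is my own assumption, since the formula does not say how to round. The sample values (12 cores, 12 disks, 48 GB RAM, 6 GB reserved for the system, 8 GB for HBase, 2 GB minimum container size) are the ones used in the worked example of this post.

```python
import math


def containers_per_node(cores, disks, available_ram_gb, min_container_gb):
    """# of containers = min(2*CORES, 1.8*DISKS, available RAM / MIN_CONTAINER_SIZE)."""
    limit = min(2 * cores, 1.8 * disks, available_ram_gb / min_container_gb)
    return math.floor(limit)  # assumption: round down to a whole container


def ram_per_container_gb(available_ram_gb, containers, min_container_gb):
    """RAM-per-container = max(MIN_CONTAINER_SIZE, available RAM / containers)."""
    return max(min_container_gb, available_ram_gb / containers)


# Without HBase: 48 GB total minus 6 GB reserved for the system.
n = containers_per_node(12, 12, 48 - 6, 2)
print(n, ram_per_container_gb(48 - 6, n, 2))

# With HBase: a further 8 GB is reserved.
m = containers_per_node(12, 12, 48 - 6 - 8, 2)
print(m, ram_per_container_gb(48 - 6 - 8, m, 2))
```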
With these calculations, the YARN and MapReduce configurations can be set:
Configuration File | Configuration Setting | Value Calculation |
yarn-site.xml | yarn.nodemanager.resource.memory-mb | = containers * RAM-per-container |
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | = RAM-per-container |
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | = containers * RAM-per-container |
mapred-site.xml | | = RAM-per-container |
mapred-site.xml | mapreduce.reduce.memory.mb | = 2 * RAM-per-container |
mapred-site.xml | | = 0.8 * RAM-per-container |
mapred-site.xml | | = 0.8 * 2 * RAM-per-container |
yarn-site.xml (check) | | = 2 * RAM-per-container |
yarn-site.xml (check) | | = 0.8 * 2 * RAM-per-container |
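As a rough sketch, the rows of the table whose setting names are shown can be computed like this in Python. The function name `memory_settings_mb` is made up, only the four settings named in the table are covered, and the values are returned in MB.

```python
def memory_settings_mb(containers, ram_per_container_gb):
    """Compute the named settings of the table above, in MB."""
    ram_mb = int(ram_per_container_gb * 1024)
    return {
        "yarn.nodemanager.resource.memory-mb": containers * ram_mb,
        "yarn.scheduler.minimum-allocation-mb": ram_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_mb,
        "mapreduce.reduce.memory.mb": 2 * ram_mb,
    }


# 21 containers of 2 GB each, as in the example without HBase.
for name, value in memory_settings_mb(21, 2).items():
    print(name, "=", value)
```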
Note: After installation, both yarn-site.xml
and mapred-site.xml
are located in the /etc/hadoop/conf directory.
Cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks.
Reserved Memory = 6 GB reserved for system memory + (if HBase) 8 GB for HBase
Min container size = 2 GB
If there is no HBase:
# of containers = min (2*12, 1.8* 12, (48-6)/2) = min (24, 21.6, 21) = 21
RAM-per-container = max (2, (48-6)/21) = max (2, 2) = 2
Configuration | Value Calculation |
yarn.nodemanager.resource.memory-mb | = 21 * 2 = 42*1024 MB |
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB |
yarn.scheduler.maximum-allocation-mb | = 21 * 2 = 42*1024 MB |
 | = 2*1024 MB |
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB |
 | = 0.8 * 2 = 1.6*1024 MB |
 | = 0.8 * 2 * 2 = 3.2*1024 MB |
 | = 2 * 2 = 4*1024 MB |
 | = 0.8 * 2 * 2 = 3.2*1024 MB |
If HBase is included:
# of containers = min (2*12, 1.8* 12, (48-6-8)/2) = min (24, 21.6, 17) = 17
RAM-per-container = max (2, (48-6-8)/17) = max (2, 2) = 2
Configuration | Value Calculation |
yarn.nodemanager.resource.memory-mb | = 17 * 2 = 34*1024 MB |
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB |
yarn.scheduler.maximum-allocation-mb | = 17 * 2 = 34*1024 MB |
 | = 2*1024 MB |
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB |
 | = 0.8 * 2 = 1.6*1024 MB |
 | = 0.8 * 2 * 2 = 3.2*1024 MB |
 | = 2 * 2 = 4*1024 MB |
 | = 0.8 * 2 * 2 = 3.2*1024 MB |
