Monday, 29 December 2014

Error

14/12/29 16:32:45 INFO mapreduce.Job: Job job_1419844357101_0002 failed with state

KILLED due to: MAP capability required is more than the supported max container

capability in the cluster. Killing the Job. mapResourceReqt: 4096

maxContainerCapability:1848
Job received Kill while in RUNNING state.
REDUCE capability required is more than the supported max container capability in

the cluster.Killing the Job. reduceResourceReqt: 4096 maxContainerCapability:1848
14/12/29 16:32:45 INFO mapreduce.Job: Counters: 2
 Job Counters 
  Total time spent by all maps in occupied slots (ms)=0
  Total time spent by all reduces in occupied slots (ms)=0
14/12/29 16:32:45 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 
  hadoop streaming failed with error code 1

/usr/lib/rstudio-server/bin/rserver: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory

/usr/lib/rstudio-server/bin/rserver: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
Solution:
rpm -ivh --nodeps rstudio-server-0.98.1049-x86_64.rpm
More than that, after installation and start I got errors:
/usr/lib/rstudio-server/bin/rserver: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
/usr/lib/rstudio-server/bin/rserver: error while loading shared libraries: libcrypto.so.6: cannot open shared object file: No such file or directory
So:
ln -s /usr/lib64/libssl.so.10 /usr/lib64/libssl.so.6
ln -s /usr/lib64/libcrypto.so.10 /usr/lib64/libcrypto.so.6
At that RStudio server seems to work.
Additionally:
yum gives
** Found 5 pre-existing rpmdb problem(s), 'yum check' output follows:
rstudio-server-0.98.1049-1.x8664 has missing requires of libcrypto.so.6()(64bit)
rstudio-server-0.98.1049-1.x8664 has missing requires of libffi.so.5()(64bit)
rstudio-server-0.98.1049-1.x8664 has missing requires of libgmp.so.3()(64bit)
rstudio-server-0.98.1049-1.x8664 has missing requires of libssl.so.6()(64bit)
rstudio-server-0.98.1049-1.x86_64 has missing requires of psmisc
To solve the last line:
yum install psmisc

Thursday, 25 December 2014

How to configure queues using YARN capacity-scheduler.xml ?

In this article, we would go through important aspects involved while setting up queues using YARN Capacity Scheduler.
Before, we setup the queue, let's first see how we could configure the amount of maximum memory to be utilized by YARN node managers. In order to configure a PHD cluster to utilize a specific amount of memory for YARN node managers, we could edit the parameter "yarn.nodemanager.resource.memory-mb" in yarn configuration file "/etc/gphd/hadoop/conf/yarn-site.xml". After a desired value has been defined, it needs a restart of YARN services for the change to take place.

yarn-site.xml
 <property>
     <name>yarn.nodemanager.resource.memory-mb</name>
     <value>16384</value>
 </property>

In the above example, we have assigned 16 GB of memory for utilization by YARN node managers per server.
We may now define multiple queues depending on the requirement for the operations which needs to be performed and give them a share of the cluster resources defined. However, in order to allow YARN to use capacity scheduler, we need to have parameter "yarn.resourcemanager.scheduler.class" defined in yarn-site.xml to use CapacityScheduler. In PHD, by default, the value is set to use CapacityScheduler, so we could straight away define the queues.

 yarn-site.xml
 <property>
     <name>yarn.resourcemanager.scheduler.class</name>
 <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
 </property>

Setting up the queues:
CapacityScheduler has a pre-defined queue called root. All queueus in the system are children of the root queue. In capacity-scheduler.xml, parameter "yarn.scheduler.capacity.root.queues" could be used to define child queues. For example, In order to create 3 queues, we could specify the name of the queues in a comma separated list.

<property>
     <name>yarn.scheduler.capacity.root.queues</name>
     <value>alpha,beta,default</value>
     <description>
       The queues at the this level (root is the root queue).
     </description>
 </property>

With the above change being done, now we could proceed further to specify the queue specific parameters. Parameters denoting queue specific properties follow a standard set of naming convention & they include the name of the queue for which they are relevant.
Here is an example of general syntax: yarn.scheduler.capacity.<queue-path>.<parameter>
where :
<queue-path> : identifies the name of the queue.
<parameter> : identifies the parameter whose value is being set.

Please refer to Apache Yarn Capacity Scheduler documentation for the complete list of configurable parameter.
Link: http://hadoop.apache.org/docs/r2.0.5-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

Let's move on to set some major parameters
1) Queue Resource Allocation related queue parameter:
a) yarn.scheduler.capacity.<queue-path>.capacity
To set the percentage of cluster resource which must be allocated to these resources, we need to edit the value of the parameter yarn.scheduler.capacity.<queue-path>.capacity in capacity-scheduler.xml accordingly. In the below example, we set the queues to use 50%, 30% and 20% of the allocated cluster resources which was earlier set by "yarn.nodemanager.resource.memory-mb" per nodemanager.

Example below:

 <property>
     <name>yarn.scheduler.capacity.root.alpha.capacity</name>
     <value>50</value>
     <description>Default queue target capacity.</description>
   </property>
 
 <property>
     <name>yarn.scheduler.capacity.root.beta.capacity</name>
     <value>30</value>
     <description>Default queue target capacity.</description>
   </property>
 
 <property>
     <name>yarn.scheduler.capacity.root.default.capacity</name>
     <value>20</value>
     <description>Default queue target capacity.</description>
   </property>

There are few other parameters as well which defines resource allocation and could be set, you could refer them on the Apache Yarn Capacity Scheduler documentation.

2) Queue Administration & Permissions related parameter:
a) yarn.scheduler.capacity.<queue-path>.state
To enable the queue to allow jobs / application to be submitted via them, the state of queue must be RUNNING, else you may receive error message stating that queue is STOPPED. RUNNING & STOPPED are the permissible values for this parameter.

Example below:

  <property>
     <name>yarn.scheduler.capacity.root.alpha.state</name>
     <value>RUNNING</value>
     <description>
       The state of the default queue. State can be one of RUNNING or STOPPED.
     </description>
   </property>
 
 <property>
    <name>yarn.scheduler.capacity.root.beta.state</name>
     <value>RUNNING</value>
     <description>
       The state of the default queue. State can be one of RUNNING or STOPPED.
     </description>
   </property>
 
 <property>
     <name>yarn.scheduler.capacity.root.default.state</name>
     <value>RUNNING</value>
     <description>
       The state of the default queue. State can be one of RUNNING or STOPPED.
     </description>
   </property>

b) yarn.scheduler.capacity.root.<queue-path>.acl_submit_applications
To enable a particular user to submit a job / application to a specific queue, we must define the username / group in a comma separated list. A special value of * allows all the users to submit jobs / application to the queue.

Example format for specifying the list of users:

1) <value>user1,user2</value> : This indicates that user1 and user2 are allowed.
2) <value>user1,user2 group1,group2</value> : This indicates that user1, user2 and all the users from group1 & group2 are allowed.
3) <value>group1,group2</value>: This indicates that all the users from group1 & group2 are allowed.

Under this parameter, first thing you must define is the value for the parameter as "hadoop,yarn,mapped,hdfs" for non-leaf root queue, it ensures that only the special users could use all the queues. Since the child queues inherit permissions of their root queue, and by default its "*", thus if you don't restrict the list at root queue, all the user may still be able to run jobs on any of the queues. By specifying "hadoop,yarn,mapped,hdfs" for non-leaf root queue, you could control user access based on specific child queues.
Non-Leaf Root queue :
Example below:

 <property>
    <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
     <value>hadoop,yarn,mapred,hdfs</value>
     <description>
       The ACL of who can submit jobs to the root queue.
     </description>
   </property>

Child Queue under root queue / Leaf child queue:
Example below:

<property>
   <name>yarn.scheduler.capacity.root.alpha.acl_submit_applications</name>
   <value>sap_user hadoopusers</value>
     <description>
      The ACL of who can submit jobs to the alpha queue.
     </description>
   </property>
 
 <property>
     <name>yarn.scheduler.capacity.root.beta.acl_submit_applications</name>
     <value>bi_user,etl_user failgroup</value>
     <description>
       The ACL of who can submit jobs to the beta queue.
     </description>
   </property>
 
   <property>
     <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
     <value>adhoc_user hadoopusers</value>
     <description>
       The ACL of who can submit jobs to the default queue.
     </description>
   </property>

c) yarn.scheduler.capacity.<queue-path>.acl_administer_queue
To set the list of administrator who could manage an application on a queue, you may set the username in a comma separated list for this parameter. A special value of * allows all the users to administrator an application running on a queue.
You may define the below properties as we defined for acl_submit_applications. Same syntax is followed.

Example below:

 <property>
     <name>yarn.scheduler.capacity.root.alpha.acl_administer_queue</name>
     <value>sap_user</value>
     <description>
       The ACL of who can administer jobs on the default queue.
     </description>
   </property>
 
  <property>
     <name>yarn.scheduler.capacity.root.beta.acl_administer_queue</name>
     <value>bi_user,etl_user</value>
     <description>
       The ACL of who can administer jobs on the default queue.
     </description>
   </property>
 
  <property>
     <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
     <value>adhoc_user</value>
     <description>
       The ACL of who can administer jobs on the default queue.
     </description>
   </property>

3) There are "Running and Pending Application Limits" related other queue parameters, which could also be defined but we have not covered in this article.
Bringing the queues in effect:
Once you have defined the required parameters in capacity-scheduler.xml file, now run the below command to bring the changes in effect.
yarn rmadmin -refreshQueues

After successful completion of the above command, you may verify if the queues are setup using below 2 options:

1) hadoop queue -list

[root@phd11-nn ~]# hadoop queue -list
 DEPRECATED: Use of this script to execute mapred command is deprecated.
 Instead use the mapred command for it.
 
 14/01/16 22:10:25 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
 14/01/16 22:10:25 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
 ======================
 Queue Name : alpha
 Queue State : running
 Scheduling Info : Capacity: 50.0, MaximumCapacity: 1.0, CurrentCapacity: 0.0
 ======================
 Queue Name : beta
 Queue State : running
 Scheduling Info : Capacity: 30.0, MaximumCapacity: 1.0, CurrentCapacity: 0.0
 ======================
 Queue Name : default
 Queue State : running
 Scheduling Info : Capacity: 20.0, MaximumCapacity: 1.0, CurrentCapacity: 0.0

2) By opening YARN resourcemanager GUI, and navigating to the scheduler tab. Link for resroucemanager GUI is : http://<Resouremanager-hostname>:8088.
where 8088 is the default port & replace <Resouremanager-hostname> with the hostname as per your PHD cluster. Below is an example for the same depicting one of the queue created "alpha"

Get ready to execute a job by submitting it to a specific queue:
Before, you execute any hadoop job, use the below command to identify the queue names on which you could submit your jobs.

[fail_user@phd11-nn ~]$ id
 uid=507(fail_user) gid=507(failgroup) groups=507(failgroup)
 
 [fail_user@phd11-nn ~]$ hadoop queue -showacls
 Queue acls for user :  fail_user
 
 Queue  Operations
 =====================
 root  ADMINISTER_QUEUE
 alpha  ADMINISTER_QUEUE
 beta  ADMINISTER_QUEUE,SUBMIT_APPLICATIONS
 default  ADMINISTER_QUEUE

If you see the above output, fail_user could submit application only on beta queue, since its part of "failgroup" and have been assigned only to beta queue in capacity-scheduler.xml as described earlier.

Let's move on a bit closer to running our first job in this article. In order to submit an application, you have to use the parameter -Dmapred.job.queue.name=<queue-name> or -Dmapred.job.queuename=<queue-name>

The below examples illustrates how to run a job on a specific queue.

[fail_user@phd11-nn ~]$ yarn jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.5-alpha-gphd-2.1.1.0.jar wordcount -D mapreduce.job.queuename=beta /tmp/test_input /user/fail_user/test_output
14/01/17 23:15:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
 14/01/17 23:15:31 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
 14/01/17 23:15:31 INFO input.FileInputFormat: Total input paths to process : 1
 14/01/17 23:15:31 INFO mapreduce.JobSubmitter: number of splits:1
 In DefaultPathResolver.java. Path = hdfs://phda2/user/fail_user/test_output
 14/01/17 23:15:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1390019915506_0001
 14/01/17 23:15:33 INFO client.YarnClientImpl: Submitted application application_1390019915506_0001 to ResourceManager at phd11-nn.saturn.local/10.110.127.195:8032
 14/01/17 23:15:33 INFO mapreduce.Job: The url to track the job: http://phd11-nn.saturn.local:8088/proxy/application_1390019915506_0001/
 14/01/17 23:15:33 INFO mapreduce.Job: Running job: job_1390019915506_0001
 2014-01-17T23:15:40.702-0800: 11.670: [GC2014-01-17T23:15:40.702-0800: 11.670: [ParNew: 272640K->18064K(306688K), 0.0653230 secs] 272640K->18064K(989952K), 0.0654490 secs] [Times: user=0.06 sys=0.04, real=0.06 secs]
 14/01/17 23:15:41 INFO mapreduce.Job: Job job_1390019915506_0001 running in uber mode : false
 14/01/17 23:15:41 INFO mapreduce.Job:  map 0% reduce 0%
 14/01/17 23:15:51 INFO mapreduce.Job:  map 100% reduce 0%
 14/01/17 23:15:58 INFO mapreduce.Job:  map 100% reduce 100%
 14/01/17 23:15:58 INFO mapreduce.Job: Job job_1390019915506_0001 completed successfully

While the job is executing, you may also monitor resource manger GUI to see on what queue is the job submitted. Here is a snapshot of the name. In the snapshot below, green color indicates the queue which is being used by the above word count application.

Now, let's see what happens when another queue is used on which fail_user is not allowed to submit applications. This must fail.

 [fail_user@phd11-nn ~]$ yarn jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.5-alpha-gphd-2.1.1.0.jar wordcount -D mapreduce.job.queuename=alpha /tmp/test_input /user/fail_user/test_output_alpha
14/01/17 23:20:07 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
 14/01/17 23:20:07 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
 14/01/17 23:20:07 INFO input.FileInputFormat: Total input paths to process : 1
 14/01/17 23:20:07 INFO mapreduce.JobSubmitter: number of splits:1
 In DefaultPathResolver.java. Path = hdfs://phda2/user/fail_user/test_output_alpha
 14/01/17 23:20:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1390019915506_0002
 14/01/17 23:20:08 INFO client.YarnClientImpl: Submitted application application_1390019915506_0002 to ResourceManager at phd11-nn.saturn.local/10.110.127.195:8032
 14/01/17 23:20:08 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/fail_user/.staging/job_1390019915506_0002
 14/01/17 23:20:08 ERROR security.UserGroupInformation: PriviledgedActionException as:fail_user (auth:SIMPLE) cause:java.io.IOException: Failed to run job : org.apache.hadoop.security.AccessControlException: User fail_user cannot submit applications to queue root.alpha
 java.io.IOException: Failed to run job : org.apache.hadoop.security.AccessControlException: User fail_user cannot submit applications to queue root.alpha
      at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:307)
      at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:395)
      at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
      at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1215)
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
      at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
      at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Comments

Please sign in to leave a comment.

Hadoop Wiki

Hadoop FAQ

Contents

1. General

1.1. What is Hadoop?

Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. For some details, see HadoopMapReduce.

1.2. What platforms and Java versions does Hadoop run on?

Java 1.6.x or higher, preferably from Sun -see HadoopJavaVersions
Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and OpenSolaris are known to work. (Windows requires the installation of Cygwin).

1.3. How well does Hadoop scale?

Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and improving using these non-default configuration values:

dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072

Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration being:

mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m

1.4. What kind of hardware scales best for Hadoop?

The short answer is dual processor/dual core machines with 4-8GB of RAM using ECC memory, depending upon workflow needs. Machines should be moderately high-end commodity machines to be most cost-effective and typically cost 1/2 - 2/3 the cost of normal production application servers but are not desktop-class machines. This cost tends to be $2-5K. For a more detailed discussion, see MachineScaling page.

1.5. I have a new node I want to add to a running Hadoop cluster; how do I start services on just one node?

This also applies to the case where a machine has crashed and rebooted, etc, and you need to get it to rejoin the cluster. You do not need to shutdown and/or restart the entire cluster in this case.

First, add the new node's DNS name to the conf/slaves file on the master node.

Then log in to the new slave node and execute:

$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker

If you are using the dfs.include/mapred.include functionality, you will need to additionally add the node to the dfs.include/mapred.include file, then issue hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that the NameNode and JobTracker know of the additional node that has been added.

1.6. Is there an easy way to see the status and health of a cluster?

There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.

The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks. The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.

You can also see some basic HDFS cluster health data by running:

$ bin/hadoop dfsadmin -report

1.7. How much network bandwidth might I need between racks in a medium size (40-80 node) Hadoop cluster?

The true answer depends on the types of jobs you're running. As a back of the envelope calculation one might figure something like this:

60 nodes total on 2 racks = 30 nodes per rack Each node might process about 100MB/sec of data In the case of a sort job where the intermediate data is the same size as the input data, that means each node needs to shuffle 100MB/sec of data In aggregate, each rack is then producing about 3GB/sec of data However, given even reducer spread across the racks, each rack will need to send 1.5GB/sec to reducers running on the other rack. Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job. So that's 12Gbps.

However, the above calculations are probably somewhat of an upper bound. A large number of jobs have significant data reduction during the map phase, either by some kind of filtering/selection going on in the Mapper itself, or by good usage of Combiners. Additionally, intermediate data compression can cut the intermediate data transfer by a significant factor. Lastly, although your disks can probably provide 100MB sustained throughput, it's rare to see a MR job which can sustain disk speed IO through the entire pipeline. So, I'd say my estimate is at least a factor of 2 too high.

So, the simple answer is that 4-6Gbps is most likely just fine for most practical jobs. If you want to be extra safe, many inexpensive switches can operate in a "stacked" configuration where the bandwidth between them is essentially backplane speed. That should scale you to 96 nodes with plenty of headroom. Many inexpensive gigabit switches also have one or two 10GigE ports which can be used effectively to connect to each other or to a 10GE core.

1.8. How can I help to make Hadoop better?

If you have trouble figuring how to use Hadoop, then, once you've figured something out (perhaps with the help of the mailing lists), pass that knowledge on to others by adding something to this wiki.

If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.

1.9. I am seeing connection refused in the logs. How do I troubleshoot this?

See ConnectionRefused

1.10. Why is the 'hadoop.tmp.dir' config default user.name dependent?

We need a directory that a user can write and also not to interfere with other users. If we didn't include the username, then different users would share the same tmp directory. This can cause authorization problems, if folks' default umask doesn't permit write by others. It can also result in folks stomping on each other, when they're, e.g., playing with HDFS and re-format their filesystem.

1.11. Does Hadoop require SSH?

Hadoop provided scripts (e.g., start-mapred.sh and start-dfs.sh) use ssh in order to start and stop the various daemons and some other utilities. The Hadoop framework in itself does not require ssh. Daemons (e.g. TaskTracker and DataNode) can also be started manually on each node without the script's help.

1.12. What mailing lists are available for more help?

A description of all the mailing lists are on the http://hadoop.apache.org/mailing_lists.html page. In general:

general is for people interested in the administrivia of Hadoop (e.g., new release discussion).
user@hadoop.apache.org is for people using the various components of the framework.
-dev mailing lists are for people who are changing the source code of the framework. For example, if you are implementing a new file system and want to know about the FileSystem API, hdfs-dev would be the appropriate mailing list.

1.13. What does "NFS: Cannot create lock on (some dir)" mean?

This actually is not a problem with Hadoop, but represents a problem with the setup of the environment it is operating.

Usually, this error means that the NFS server to which the process is writing does not support file system locks. NFS prior to v4 requires a locking service daemon to run (typically rpc.lockd) in order to provide this functionality. NFSv4 has file system locks built into the protocol.

In some (rarer) instances, it might represent a problem with certain Linux kernels that did not implement the flock() system call properly.

It is highly recommended that the only NFS connection in a Hadoop setup be the place where the NameNode writes a secondary or tertiary copy of the fsimage and edits log. All other users of NFS are not recommended for optimal performance.

2. MapReduce

2.1. Do I have to write my job in Java?

No. There are several ways to incorporate non-Java code.

HadoopStreaming permits any shell command to be used as a map or reduce function.
libhdfs, a JNI-based C API for talking to hdfs (only).
Hadoop Pipes, a SWIG-compatible C++ API (non-JNI) to write map-reduce jobs.

2.2. How do I submit extra content (jars, static files, etc) for my job to use during runtime?

The distributed cache feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.

For streaming, see the HadoopStreaming wiki for more information.

Copying content into lib is not recommended and highly discouraged. Changes in that directory will require Hadoop services to be restarted.

2.3. How do I get my MapReduce Java Program to read the Cluster's set configuration and not just defaults?

The configuration property files ({core|mapred|hdfs}-site.xml) that are available in the various conf/ directories of your Hadoop installation needs to be on the CLASSPATH of your Java application for it to get found and applied. Another way of ensuring that no set configuration gets overridden by any Job is to set those properties as final; for example:

<name>mapreduce.task.io.sort.mb</name>
<value>400</value>
<final>true</final>

Setting configuration properties as final is a common thing Administrators do, as is noted in the Configuration API docs.

A better alternative would be to have a service serve up the Cluster's configuration to you upon request, in code. https://issues.apache.org/jira/browse/HADOOP-5670 may be of some interest in this regard, perhaps.

2.4. Can I write create/write-to hdfs files directly from map/reduce tasks?

Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by OutputCollector.)

Caveats:

${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).

${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0), a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).

With speculative-execution on, one could face issues with 2 instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the app-writer will have to pick unique names (e.g. using the complete taskid i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)

To get around this the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each reduce task-attempt on hdfs where the output of the reduce task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.

Fine-print: the value of ${mapred.output.dir} during execution of a particular reduce task-attempt is actually ${mapred.output.dir}/_{$taskid}, not the value set by JobConf.setOutputPath. So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.

For map task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files created there will be dealt with as described above.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to hdfs.

2.5. How do I get each of a job's maps to work on one complete input-file and not allow the framework to split-up the files?

Essentially a job's input is represented by the InputFormat(interface)/FileInputFormat(base class).

For this purpose one would need a 'non-splittable' FileInputFormat i.e. an input-format which essentially tells the map-reduce framework that it cannot be split-up and processed. To do this you need your particular input-format to return false for the isSplittable call.

E.g. org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat in src/test/org/apache/hadoop/mapred/SortValidator.java

In addition to implementing the InputFormat interface and having isSplitable(...) returning false, it is also necessary to implement the RecordReader interface for returning the whole content of the input file. (default is LineRecordReader, which splits the file into separate lines)

The other, quick-fix option, is to set mapred.min.split.size to large enough value.

2.6. Why I do see broken images in jobdetails.jsp page?

In hadoop-0.15, Map / Reduce task completion graphics are added. The graphs are produced as SVG(Scalable Vector Graphics) images, which are basically xml files, embedded in html content. The graphics are tested successfully in Firefox 2 on Ubuntu and MAC OS. However for other browsers, one should install an additional plugin to the browser to see the SVG images. Adobe's SVG Viewer can be found at http://www.adobe.com/svg/viewer/install/.

2.7. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker, how do I increase that?

Use the configuration knob: mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to control the number of maps/reduces spawned simultaneously on a TaskTracker. By default, it is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker.

You can set those on a per-tasktracker basis to accurately reflect your hardware (i.e. set those to higher nos. on a beefier tasktracker etc.).

2.8. Submitting map/reduce jobs as a different user doesn't work.

The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single node systems, but not for "real" clusters. I like to use:

<property>
   <name>mapred.system.dir</name>
   <value>/hadoop/mapred/system</value>
   <description>The shared directory where MapReduce stores control files.
   </description>
</property>

Note that this directory is in your default file system and must be accessible from both the client and server machines and is typically in HDFS.

2.9. How do Map/Reduce InputSplit's handle record boundaries correctly?

It is the responsibility of the InputSplit's RecordReader to start and end at a record boundary. For SequenceFile's every 2k bytes has a 20 bytes sync mark between the records. These sync marks allow the RecordReader to seek to the start of the InputSplit, which contains a file, offset and length and find the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. The first split of each file naturally starts immediately and not after the first sync mark. In this way, it is guaranteed that each record will be processed by exactly one mapper.

Text files are handled similarly, using newlines instead of sync marks.

2.10. How do I change final output file name with the desired name rather than in partitions like part-00000, part-00001?

You can subclass the OutputFormat.java class and write your own. You can locate and browse the code of TextOutputFormat, MultipleOutputFormat.java, etc. for reference. It might be the case that you only need to do minor changes to any of the existing Output Format classes. To do that you can just subclass that class and override the methods you need to change.

2.11. When writing a New InputFormat, what is the format for the array of string returned by InputSplit\#getLocations()?

It appears that DatanodeID.getHost() is the standard place to retrieve this name, and the machineName variable, populated in DataNode.java\#startDataNode, is where the name is first set. The first method attempted is to get "slave.host.name" from the configuration; if that is not available, DNS.getDefaultHost is used instead.

2.12. How do you gracefully stop a running job?

hadoop job -kill JOBID

2.13. How do I limit (or increase) the number of concurrent tasks a job may have running total at a time?

2.14. How do I limit (or increase) the number of concurrent tasks running on a node?

For both answers, see LimitingTaskSlotUsage.

3. HDFS

3.1. If I add new DataNodes to the cluster will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?

No, HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes.

There are several ways to rebalance the cluster manually.

Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
Yet another way to re-balance blocks is to turn off the data-node, which is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so you really get them rebalanced not just removed from the current node.
Finally, you can use the bin/start-balancer.sh command to run a balancing process to move blocks around the cluster automatically. See

3.2. What is the purpose of the secondary name-node?

The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node. See User Guide.

So if the name-node fails and you can restart it on the same physical node then there is no need to shutdown data-nodes, just the name-node need to be restarted. If you cannot use the old node anymore you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure if available; or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is the most recent name space modifications may be missing there. You will also need to restart the whole cluster in this case.

3.3. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

No. During safe mode replication of blocks is prohibited. The name-node awaits when all or majority of data-nodes report their blocks.

Depending on how safe mode parameters are configured the name-node will stay in safe mode until a specific percentage of blocks of the system is minimally replicated dfs.replication.min. If the safe mode threshold dfs.safemode.threshold.pct is set to 1 then all blocks of all files should be minimally replicated.

Minimal replication does not mean full replication. Some replicas may be missing and in order to replicate them the name-node needs to leave safe mode.

Learn more about safe mode in the HDFS Users' Guide.

3.4. How do I set up a hadoop node to use multiple volumes?

Data-nodes can store blocks in multiple directories typically allocated on different local disk drives. In order to setup multiple directories one needs to specify a comma separated list of pathnames as a value of the configuration parameter dfs.datanode.data.dir. Data-nodes will attempt to place equal amount of data in each of the directories.

The name-node also supports multiple directories, which in the case store the name space image and the edits log. The directories are specified via the dfs.namenode.name.dir configuration parameter. The name-node directories are used for the name space data replication so that the image and the log could be restored from the remaining volumes if one of them fails.

3.5. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?

Starting with release hadoop-0.15, a file will appear in the name space as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.

3.6. I want to make a large cluster smaller by taking out a bunch of nodes simultaneously. How can this be done?

On a large cluster removing one or two data-nodes will not lead to any data loss, because name-node will replicate their blocks as long as it will detect that the nodes are dead. With a large number of nodes getting removed or dying the probability of losing data is higher.

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude. This file should have been specified during namenode startup. It could be a zero length file. You must use the full hostname, ip or ip:port format in this file. (Note that some users have trouble using the host name. If your namenode shows some nodes in "Live" and "Dead" but not decommission, try using the full ip:port.) Then the shell command

bin/hadoop dfsadmin -refreshNodes

should be called, which forces the name-node to re-read the exclude file and start the decommission process.

Decommission is not instant since it requires replication of potentially a large number of blocks and we do not want the cluster to be overwhelmed with just this one job. The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated the node will be in "Decommission In Progress" state. When decommission is done the state will change to "Decommissioned". The nodes can be removed whenever decommission is finished.

The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command.

3.7. Wildcard characters doesn't work correctly in FsShell.

When you issue a command in FsShell, you may want to apply that command to more than one file. FsShell provides a wildcard character to help you do so. The * (asterisk) character can be used to take the place of any set of characters. For example, if you would like to list all the files in your account which begin with the letter x, you could use the ls command with the * wildcard:

bin/hadoop dfs -ls x*

Sometimes, the native OS wildcard support causes unexpected results. To avoid this problem, Enclose the expression in Single or Double quotes and it should work correctly.

bin/hadoop dfs -ls 'in*'

3.8. Can I have multiple files in HDFS use different block sizes?

Yes. HDFS provides api to specify block size when you create a file.
See FileSystem.create(Path, overwrite, bufferSize, replication, blockSize, progress)

3.9. Does HDFS make block boundaries between records?

No, HDFS does not provide record-oriented API and therefore is not aware of records and boundaries between them.

3.10. What happens when two clients try to write into the same HDFS file?

HDFS supports exclusive writes only.
When the first client contacts the name-node to open the file for writing, the name-node grants a lease to the client to create this file. When the second client tries to open the same file for writing, the name-node will see that the lease for the file is already granted to another client, and will reject the open request for the second client.

3.11. How to limit Data node's disk usage?

Use dfs.datanode.du.reserved configuration value in $HADOOP_HOME/conf/hdfs-site.xml for limiting disk usage.

  <property>
    <name>dfs.datanode.du.reserved</name>
    <!-- cluster variant -->
    <value>182400</value>
    <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
    </description>
  </property>

3.12. On an individual data node, how do you balance the blocks on the disk?

Hadoop currently does not have a method by which to do this automatically. To do this manually:

Shutdown the DataNode involved
Use the UNIX mv command to move the individual block replica and meta pairs from one directory to another on the selected host. On releases which have HDFS-6482 (Apache Hadoop 2.6.0+) you also need to ensure the subdir-named directory structure remains exactly the same when moving the blocks across the disks. For example, if the block replica and its meta pair were under /data/1/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/, and you wanted to move it to /data/5/ disk, then it MUST be moved into the same subdirectory structure underneath that, i.e. /data/5/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/. If this is not maintained, the DN will no longer be able to locate the replicas after the move.
Restart the DataNode.

3.13. What does "file could only be replicated to 0 nodes, instead of 1" mean?

The NameNode does not have any available DataNodes. This can be caused by a wide variety of reasons. Check the DataNode logs, the NameNode logs, network connectivity, ... Please see the page: CouldOnlyBeReplicatedTo

3.14. If the NameNode loses its only copy of the fsimage file, can the file system be recovered from the DataNodes?

No. This is why it is very important to configure dfs.namenode.name.dir to write to two filesystems on different physical hosts, use the SecondaryNameNode, etc.

3.15. I got a warning on the NameNode web UI "WARNING : There are about 32 missing blocks. Please check the log or run fsck." What does it mean?

This means that 32 blocks in your HDFS installation don’t have a single replica on any of the live DataNodes.
Block replica files can be found on a DataNode in storage directories specified by configuration parameter dfs.datanode.data.dir. If the parameter is not set in the DataNode’s hdfs-site.xml, then the default location /tmp will be used. This default is intended to be used only for testing. In a production system this is an easy way to lose actual data, as local OS may enforce recycling policies on /tmp. Thus the parameter must be overridden.
If dfs.datanode.data.dir correctly specifies storage directories on all DataNodes, then you might have a real data loss, which can be a result of faulty hardware or software bugs. If the file(s) containing missing blocks represent transient data or can be recovered from an external source, then the easiest way is to remove (and potentially restore) them. Run fsck in order to determine which files have missing blocks. If you would like (highly appreciated) to further investigate the cause of data loss, then you can dig into NameNode and DataNode logs. From the logs one can track the entire life cycle of a particular block and its replicas.

3.16. If a block size of 64MB is used and a file is written that uses less than 64MB, will 64MB of disk space be consumed?

Short answer: No.

Longer answer: Since HFDS does not do raw disk block storage, there are two block sizes in use when writing a file in HDFS: the HDFS blocks size and the underlying file system's block size. HDFS will create files up to the size of the HDFS block size as well as a meta file that contains CRC32 checksums for that block. The underlying file system store that file as increments of its block size on the actual raw disk, just as it would any other file.

4. Platform Specific

4.1. General

4.1.1. Problems building the C/C++ Code

While most of Hadoop is built using Java, a larger and growing portion is being rewritten in C and C++. As a result, the code portability between platforms is going down. Part of the problem is the lack of access to platforms other than Linux and our tendency to use specific BSD, GNU, or System V functionality in places where the POSIX-usage is non-existent, difficult, or non-performant.

That said, the biggest loss of native compiled code will be mostly performance of the system and the security features present in newer releases of Hadoop. The other Hadoop features usually have Java analogs that work albeit slower than their C cousins. The exception to this is security, which absolutely requires compiled code.

4.2. Mac OS X

4.2.1. Building on Mac OS X 10.6

Be aware that Apache Hadoop 0.22 and earlier require Apache Forrest to build the documentation. As of Snow Leopard, Apple no longer ships Java 1.5 which Apache Forrest requires. This can be accomplished by either copying /System/Library/Frameworks/JavaVM.Framework/Versions/1.5 and 1.5.0 from a 10.5 machine or using a utility like Pacifist to install from an official Apple package. http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard provides some step-by-step directions.

4.3. Solaris

4.3.1. Why do files and directories show up as DrWho and/or user names are missing/weird?

Prior to 0.22, Hadoop uses the 'whoami' and id commands to determine the user and groups of the running process. whoami ships as part of the BSD compatibility package and is normally not in the path. The id command's output is System V-style whereas Hadoop expects POSIX. Two changes to the environment are required to fix this:

Make sure /usr/ucb/whoami is installed and in the path, either by including /usr/ucb at the tail end of the PATH environment or symlinking /usr/ucb/whoami directly.
In hadoop-env.sh, change the HADOOP_IDENT_STRING thusly:

export HADOOP_IDENT_STRING=`/usr/xpg4/bin/id -u -n`

4.3.2. Reported disk capacities are wrong

Hadoop uses du and df to determine disk space used. On pooled storage systems that report total capacity of the entire pool (such as ZFS) rather than the filesystem, Hadoop gets easily confused. Users have reported that using fixed quota sizes for HDFS and MapReduce directories helps eliminate a lot of this confusion.

4.4. Windows

4.4.1. Building / Testing Hadoop on Windows

The Hadoop build on Windows can be run from inside a Windows (not cygwin) command prompt window.

Whether you set environment variables in a batch file or in System->Properties->Advanced->Environment Variables, the following environment variables need to be set:

set ANT_HOME=c:\apache-ant-1.7.1
set JAVA_HOME=c:\jdk1.6.0.4
set PATH=%PATH%;%ANT_HOME%\bin

then open a command prompt window, cd to your workspace directory (in my case it is c:\workspace\hadoop) and run ant. Since I am interested in running the contrib test cases I do the following:

ant -l build.log -Dtest.output=yes test-contrib

other targets work similarly. I just wanted to document this because I spent some time trying to figure out why the ant build would not run from a cygwin command prompt window. If you are building/testing on Windows, and haven't figured it out yet, this should get you started.

FAQ (last edited 2014-10-02 20:14:52 by QwertyManiac)