Collaborator
irskep
commented
EMR jobs should be sized to fit the largest single step of your job flow. It is intentionally designed to run steps in serial. If you want to run multiple jobs at once, use multiple job flows! I doubt that running multiple jobs on one flow would actually save you much (if any) money.
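For concreteness, a minimal sketch of the one-job-flow-per-job approach; the script name, bucket paths, and use of subprocess are illustrative assumptions, not anything prescribed in this thread:

```python
# Launch two independent mrjob runs; each "-r emr" invocation spins up
# (and later tears down) its own EMR job flow, so the jobs run
# concurrently without sharing a cluster.
import subprocess

inputs = ["s3://my-bucket/urls-part-1/", "s3://my-bucket/urls-part-2/"]

procs = [
    subprocess.Popen(
        ["python", "mr_fetch_pages.py", "-r", "emr",  # hypothetical job script
         "--output-dir", "s3://my-bucket/out-%d/" % i, path]
    )
    for i, path in enumerate(inputs)
]
for p in procs:
    p.wait()
```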
Additionally, by circumventing EMR's infrastructure to add extra jobs, you lose the ability to monitor the job, which is one of EMR's major features.
computableinsights
commented
I so wish that you were right!
Some jobs, such as fetching web pages from a large number of remote servers, have a long-tailed distribution of task durations. Given inputs of the same length (e.g. same number of URLs to fetch from each remote host), some can finish in seconds and others minutes. Combine that with random node failures in a 200+ node cluster, and you can easily get >90% completion in a couple minutes and >20min to reach 100%. That means tons of wasted capacity.
After tuning batch sizes and timeouts aggressively, I can get it down to about 60% wasted, which some might consider tolerable. Still, it would be much better to use the Fair Scheduler.
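To make the arithmetic concrete, here is a toy simulation of that effect; the distribution and the numbers are illustrative assumptions, not measurements from this thread:

```python
# Toy model: one long-tailed task per slot; the job's wall-clock time is
# set by the slowest task, so most slot-time ends up idle.
import random

random.seed(0)
SLOTS = 200                      # task slots in the cluster
durations = sorted(random.paretovariate(1.5) for _ in range(SLOTS))

makespan = durations[-1]         # the job ends when the slowest task does
busy = sum(durations)            # total slot-time actually doing work
wasted = 1 - busy / (SLOTS * makespan)
print("wasted capacity: %.0f%%" % (100 * wasted))
# With a heavy tail, one straggler dominates the makespan: the same
# effect as >90% done in minutes but 20+ minutes to reach 100%.
```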
I'm about to try using whirr to set up a raw hadoop cluster in EC2 to see if I can get mrjob to work on that. Too bad that we cannot just submit directly to the hadoop cluster underlying an EMR Job Flow -- that would let us use all the tools for managing a job flow and also use the Fair Scheduler.
Also, circumventing the EMR infrastructure to add extra jobs does NOT lose the ability to monitor the jobs. Just tunnel to the jobtracker and scheduler web pages and all the jobs show up individually. Or did you mean something else?
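For reference, the tunneling step might look something like this; the hostname is hypothetical, and 9100 is the JobTracker web UI port on the Hadoop 0.20-era EMR AMIs:

```python
# Open an SSH tunnel to the JobTracker web UI on the EMR master node.
import subprocess

MASTER = "hadoop@ec2-1-2-3-4.compute-1.amazonaws.com"  # hypothetical host
tunnel = subprocess.Popen(
    ["ssh", "-N",                     # no remote command, just forward
     "-L", "9100:localhost:9100",     # JobTracker web UI
     MASTER]
)
# Now browse http://localhost:9100/ to watch every job individually,
# including ones submitted straight to the JobTracker.
# Call tunnel.terminate() when done.
```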
Collaborator
irskep
commented
> Also, circumventing the EMR infrastructure to add extra jobs does NOT lose the ability to monitor the jobs. Just tunnel to the jobtracker and scheduler web pages and all the jobs show up individually --- or did you mean something else?

I meant that when you
emr --describe j-BLAH
your job won't show up.

Additionally, if you start step A with the EMR interface and step X with raw Hadoop, and step A completes before step X, Amazon may terminate your job flow and you lose step X entirely.
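One way to sidestep the termination problem is to start the job flow in keep-alive mode so it waits around after its EMR steps finish. Below is a sketch using the boto API of that era (all parameters illustrative); mrjob's mrjob.tools.emr.create_job_flow tool, mentioned later in this thread, does essentially this:

```python
# Start a persistent (keep-alive) job flow with no steps; it will sit in
# the WAITING state instead of terminating, so jobs submitted directly to
# the underlying Hadoop cluster can't be killed by a finishing EMR step.
from boto.emr.connection import EmrConnection

conn = EmrConnection()  # assumes AWS credentials in your environment
jobflow_id = conn.run_jobflow(
    name="persistent-cluster",
    log_uri="s3://my-bucket/logs/",   # hypothetical logging bucket
    num_instances=10,
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
    keep_alive=True,                  # don't terminate after steps finish
)
print(jobflow_id)
```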
computableinsights
commented
True, emr --describe j-BLAH is insufficient for working with many concurrent jobs.
Your second point is also true: one would create the Job Flow to get the cluster running and then never submit a job through the Job Flow, only through the hadoop job client.
I think this could be valuable functionality, especially because schedulers are such a key part of hadoop's existing infrastructure.
Collaborator
irskep
commented
I have to disagree. The EMR runner's function is to work with the EMR APIs, which are allowed to define a different scope from the larger Hadoop system. If you want to use an EMR cluster as a vanilla Hadoop cluster, then wrap the Hadoop runner to point at it or run on it.
Collaborator
davidmarin
commented
Thanks for the writeup. I agree: it sucks that your job is still technically running while most of the nodes are idle.

My impression of EMR is that FIFO execution is a pretty strong assumption, and that trying to get EMR to run multiple jobs simultaneously might be somewhat quixotic. Still, if you can hack it and make it work, please let us know, because that would be really useful. :)
Collaborator
irskep
commented
Close if no plans to implement. Given the time I spent getting mrjob to suck down logs over SSH, I'm convinced that this would be a huge pain and an inappropriately large amount of code to implement.
Collaborator
davidmarin
commented
Yeah, I think this is what EC2/Cloudera is for (or any number of other services that let you set up a Hadoop cluster in the cloud). And for that, you can just use the hadoop runner.
davidmarin
closed this
tarnfeld
commented
Just wanted to add my two cents here, since i've been doing a lot of research/work into this recently. I've been using mrjob with EMR (> 256 job runs, more flexibility) by spinning up a persistent EMR cluster directly using mrjob.tools.

A heads up for anyone attempting to use the hadoop runner to fire jobs directly at the master node (by modifying your local hadoop config files): unless you're running a patched version of hadoop, you won't be able to submit jobs. The fork Amazon uses in all of their AMIs fixes a bug with how HDFS forces URIs (it'll error telling you to send to the hostname of the jobtracker instead of the IP, and vice versa indefinitely). However, if you ssh into the master and run jobs from there you'll be all good.

A note also about the scheduler: by default, yes, EMR uses FIFO. However, using an EMR bootstrap action you can change the scheduler to org.apache.hadoop.mapred.FairScheduler fairly easily at cluster boot time.

> Additionally, by circumventing EMR's infrastructure to add extra jobs, you lose the ability to monitor the job, which is one of EMR's major features.

This is partly true: the cluster will constantly be in the "WAITING" state, but EMR will still continue to dump log files for each job/task/attempt in your logging bucket, which is handy. It will also still correctly show slot count stats and all the other metrics they provide via CloudWatch.
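For concreteness, here is one way the bootstrap-action approach might look with boto and EMR's stock configure-hadoop script; this is a hedged sketch, and the cluster sizing, names, and use of keep_alive are illustrative assumptions rather than details from the thread:

```python
# Boot an EMR cluster whose JobTracker uses the Fair Scheduler, by setting
# mapred.jobtracker.taskScheduler via EMR's configure-hadoop bootstrap action.
from boto.emr.connection import EmrConnection
from boto.emr.bootstrap_action import BootstrapAction

fair_scheduler = BootstrapAction(
    "switch to fair scheduler",
    "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
    ["-m",  # -m sets a mapred-site.xml key=value
     "mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler"],
)

conn = EmrConnection()
jobflow_id = conn.run_jobflow(
    name="fair-scheduled-cluster",    # hypothetical name
    num_instances=10,
    keep_alive=True,
    bootstrap_actions=[fair_scheduler],
)
```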
Collaborator
davidmarin
commented
Huh! Well, thanks for letting us know this works and is a real use case.
It sounds like if you used the hadoop runner with --hadoop-binary='ssh hadoop@<master node IP> hadoop', things would work unless you tried to copy things from the local filesystem to HDFS. Have you tried this? (A rough sketch of the idea is below.)

Maybe what we need is:

- some sort of --ssh option that tells us that we have to run hadoop over ssh, and that can scp local files
- an informative message from mrjob.tools.emr.create_job_flow that tells us the option(s) to pass to the hadoop runner if we'd rather do Hadoop-over-ssh than use the emr runner
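For illustration, a rough sketch of what that Hadoop-over-ssh invocation might look like from Python. This is untested: MRFetchPages is a hypothetical job module, the flag is spelled --hadoop-bin here (the hadoop runner's binary option), and whether it tolerates a multi-word ssh command is an assumption:

```python
# Run a job through the hadoop runner, but execute the hadoop binary on
# the EMR master node over ssh instead of locally.
from mr_fetch_pages import MRFetchPages  # hypothetical MRJob script

job = MRFetchPages(args=[
    "-r", "hadoop",
    "--hadoop-bin", "ssh hadoop@10.1.2.3 hadoop",  # hypothetical master IP
    "hdfs:///input/urls",  # input must already be reachable by the cluster,
])                         # since copying local files to HDFS won't work
with job.make_runner() as runner:
    runner.run()
```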
tarnfeld
commented
Heh, that's a neat idea. It's hard to explain the use case without getting into the specifics, but i'd love to discuss implementations.

For production, EMR works well even when invoking mrjob from the master node (with a bit of tooling around the whole thing): using the master node saves us from needing somewhere else to invoke mrjob from, and when using the hadoop runner it already has hadoop configured correctly.

> some sort of --ssh option that tells us that we have to run hadoop over ssh, and that can scp local files

I'd be a big fan of this. As described above, the workflow works well for production, but when you're developing your jobs it creates a whole new problem of getting code onto the master node while you're constantly making changes. This approach might be an elegant solution to that.

> an informative message from mrjob.tools.emr.create_job_flow that tells us the option(s) to pass to the hadoop runner if we'd rather do Hadoop-over-ssh than use the emr runner

I think the only option that would actually need to be sent to the job invocation would be the EMR job flow ID, which is already output from the mrjob.tools.emr.create_job_flow script.

It'd be interesting to see how others feel about this sort of implementation; having this feature would also mean you're no longer restricted to invoking hadoop on the machine you're invoking mrjob from, which is quite useful in other cases.
tarnfeld
commented
Also @irskep:

> Given the time I spent getting mrjob to suck down logs over SSH, I'm convinced that this would be a huge pain and an inappropriately large amount of code to implement.

With the fixes I pushed to support pulling counters from the _logs subdirectory of the output directory, this won't be an issue (I don't think?)... though come to think of it, i'm not sure I ever sent those fixes upstream, for fear of breaking the initial implementation.

Update: Nope, I never did; see duedil-ltd/mrjob#4 + duedil-ltd/mrjob#8 if you're interested.
Collaborator
davidmarin
commented
> I think the only option that would actually need to be sent to the job invocation would be the EMR job flow ID - which is already output from the mrjob.tools.emr.create_job_flow script

I'd really rather not send the EMR job flow ID or anything else AWS-specific to the hadoop runner, as it would make it less clear which runner is intended for use on EMR and which is intended for general use. So I'd rather treat this use case as supporting a hadoop cluster which can only be accessed over SSH (and which just happens to usually be on EMR). Of course, it's fair game for mrjob.tools.emr to print out flags which might be useful to the hadoop runner.

Thanks for the patches! Yeah, if you have a generally useful patch, it's just common courtesy to submit it back to the main branch. If it breaks something, it's really the maintainers' fault. :)
tarnfeld
commented
That's a fair point; from my point of view, just specifying the required arguments for the SSH connection would suffice.

> Yeah, if you have a generally useful patch, it's just common courtesy to submit it back to the main branch.

I've sent a few, and I re-applied all of my patches as pull requests on duedil-ltd/mrjob to make it clear what they were all for, but haven't yet got round to sending them upstream. I'll try to do that shortly. :)
Is there a way to submit a job directly to the JobTracker of the hadoop cluster created by a jobflow?
I bet there is a way to do this now -- albeit probably without the ease of use typical of other mrjob actions :-)
This would allow running of multiple concurrent jobs, which appears to be impossible through the EMR Job Flow interface, because it is a sequence of one-job-sized steps. Or perhaps there is another way to take advantage of the Fair Scheduler in AWS EMR?
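One plausible way to do this from the master node, sketched with hadoop streaming; the jar path, pool name, and file paths are assumptions, not details confirmed in this thread:

```python
# From the EMR master, submit a streaming job straight to the JobTracker
# and tag it with a Fair Scheduler pool (requires the Fair Scheduler to
# have been enabled at boot, e.g. via a bootstrap action).
import subprocess

subprocess.check_call([
    "hadoop", "jar",
    "/home/hadoop/contrib/streaming/hadoop-streaming.jar",  # assumed path
    "-D", "mapred.fairscheduler.pool=adhoc",   # hypothetical pool name
    "-input", "hdfs:///input/urls",
    "-output", "hdfs:///output/run1",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-file", "mapper.py",
    "-file", "reducer.py",
])
```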