In AWS EMR: submit a job directly to the JobTracker of the Hadoop cluster created by a job flow?

Based on #385
Is there a way to submit a job directly to the JobTracker of the Hadoop cluster created by a job flow?
I bet there is a way to do this now -- albeit probably without the ease of use typical of other mrjob actions :-)
This would allow running multiple concurrent jobs, which appears to be impossible through the EMR Job Flow interface, because it is a sequence of one-job-sized steps. Or perhaps there is another way to take advantage of the Fair Scheduler in AWS EMR?
Steve Johnson
Collaborator
EMR clusters should be sized to fit the largest single step of your job flow; EMR is intentionally designed to run steps serially. If you want to run multiple jobs at once, use multiple job flows! I doubt that running multiple jobs on one flow would actually save you much (if any) money.
Additionally, by circumventing EMR's infrastructure to add extra jobs, you lose the ability to monitor the job, which is one of EMR's major features.
computableinsights
I so wish that you were right!
Some jobs, such as fetching web pages from a large number of remote servers, have a long-tailed distribution of task durations. Given inputs of the same length (e.g. the same number of URLs to fetch from each remote host), some tasks can finish in seconds while others take minutes. Combine that with random node failures in a 200+ node cluster, and you can easily get >90% completion in a couple of minutes but >20 minutes to reach 100%. That means tons of wasted capacity.
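To make the waste concrete, here's a rough back-of-the-envelope sketch in Python (the 90%/2-minute/20-minute figures are just the illustrative numbers above, and it assumes each slot runs roughly one wave of tasks):

    # Rough illustration of slot-time wasted by a long-tailed step under the
    # default FIFO scheduling. All numbers are illustrative assumptions.
    fast_fraction, fast_minutes = 0.90, 2.0   # ~90% of tasks finish in ~2 min
    slow_fraction, slow_minutes = 0.10, 20.0  # stragglers hold the step ~20 min

    # Each slot is reserved until the slowest task finishes, but on average
    # it is busy for far less than that.
    busy_minutes = fast_fraction * fast_minutes + slow_fraction * slow_minutes
    reserved_minutes = slow_minutes
    print("approx. wasted capacity: %.0f%%" % (100 * (1 - busy_minutes / reserved_minutes)))
    # -> approx. wasted capacity: 81%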
After tuning batch sizes and timeouts aggressively, I can get it down to about 60% wasted, which some might consider tolerable. Still, it would be much better to use the Fair Scheduler.
I'm about to try using whirr to set up a raw Hadoop cluster in EC2 to see if I can get mrjob to work on that. Too bad that we cannot just submit directly to the Hadoop cluster underlying an EMR Job Flow -- that would let us use all the tools for managing a job flow and also use the Fair Scheduler.
Also, circumventing the EMR infrastructure to add extra jobs does NOT lose the ability to monitor the jobs. Just tunnel to the jobtracker and scheduler web pages and all the jobs show up individually --- or did you mean something else?
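Concretely, the tunneling I have in mind is just local port forwarding over SSH; a minimal sketch (the key path, master hostname, and port 9100, which I believe is where the JobTracker UI lives on the EMR AMIs, are all assumptions to adjust):

    # Sketch: forward a local port to the JobTracker web UI on the EMR master
    # node so individual jobs can be watched in a browser. Key path, hostname,
    # and port are placeholders/assumptions.
    import subprocess

    tunnel = subprocess.Popen([
        "ssh", "-N",
        "-i", "/path/to/EMR.pem",       # your EC2 key pair
        "-L", "9100:localhost:9100",    # may need the master's private
                                        # hostname instead of localhost
        "hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com",  # master (placeholder)
    ])
    # Browse http://localhost:9100/jobtracker.jsp, then tunnel.terminate().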
Steve Johnson
Collaborator
Also, circumventing the EMR infrastructure to add extra jobs does NOT lose the ability to monitor the jobs. Just tunnel to the jobtracker and scheduler web pages and all the jobs show up individually --- or did you mean something else?
I meant that when you run emr --describe j-BLAH, your job won't show up.
Additionally, if you start step A with the EMR interface and step X with raw Hadoop, and step A completes before step X, Amazon may terminate your job flow and you lose step X entirely.
computableinsights
True, emr --describe j-BLAH is insufficient for working with many concurrent jobs.
Your second point is also true: one would create the Job Flow to get the cluster running and then never submit a job through the Job Flow, only through the Hadoop job client.
I think this could be valuable functionality, especially because schedulers are such a key part of Hadoop's existing infrastructure.
Steve Johnson
Collaborator
I have to disagree. The EMR runner's function is to work with the EMR APIs, which are allowed to define a different scope from the larger Hadoop system. If you want to use an EMR cluster as a vanilla Hadoop cluster, then wrap the Hadoop runner to point at it or run on it.
David Marin
Collaborator
Thanks for the writeup. I agree, it sucks that your job is still technically running while most of the nodes are idle.
My impression of EMR is that FIFO execution is a pretty strong assumption, and that trying to get EMR to run multiple jobs simultaneously might be somewhat quixotic. Still, if you can hack it and make it work, please let us know, because that would be really useful. :)
Steve Johnson
Collaborator
Close this if there are no plans to implement it. Given the time I spent getting mrjob to suck down logs over SSH, I'm convinced that this would be a huge pain and an inappropriately large amount of code to implement.
David Marin
Collaborator
Yeah, I think this is what EC2/Cloudera is for (or any number of other services that let you set up a Hadoop cluster in the cloud). And for that, you can just use the hadoop runner.
Tom Arnfeld
Just wanted to add my two cents here, since I've been doing a lot of research/work into this recently. I've been using mrjob with EMR (>256 job runs, more flexibility) by spinning up a persistent EMR cluster directly using mrjob.tools.
A heads up for anyone attempting to use the hadoop runner to fire jobs directly at the master node (by modifying your local Hadoop config files): unless you're running a patched version of Hadoop, you won't be able to submit jobs.
The fork Amazon uses in all of their AMIs fixes a bug in how HDFS validates URIs (it will error telling you to submit to the hostname of the JobTracker instead of the IP, and vice versa, indefinitely). However, if you SSH into the master and run jobs from there, you'll be all good.
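For what it's worth, here's a minimal sketch of what running from the master looks like with mrjob's hadoop runner (the job class and input path are placeholders, and it assumes mrjob is installed on the master, where hadoop is already configured):

    # Sketch: run a job from the EMR master node itself with the hadoop runner,
    # which sidesteps the URI problem above. Job class and input path are
    # placeholders; mrjob is assumed to be installed on the master.
    import sys
    from my_job import MRMyJob  # hypothetical MRJob subclass

    job = MRMyJob(args=["-r", "hadoop", "hdfs:///tmp/input"])
    with job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():  # raw output lines from HDFS
            sys.stdout.write(line)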
A note also about the scheduler: by default, yes, EMR uses FIFO. However, using an EMR bootstrap action you can change the scheduler to org.apache.hadoop.mapred.FairScheduler fairly easily at cluster boot time.
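For anyone curious, here's roughly how I'd wire that up with boto when creating the job flow; the bootstrap-action path and the mapred property name are what I'd expect for the 0.20-era AMIs, so treat them as assumptions to verify against the EMR docs:

    # Sketch: switch to the Fair Scheduler at cluster boot time via EMR's
    # configure-hadoop bootstrap action. Script path, property name, and
    # instance sizes are assumptions; check the EMR docs for your AMI.
    import boto
    from boto.emr.bootstrap_action import BootstrapAction

    fair_scheduler = BootstrapAction(
        "Enable Fair Scheduler",
        "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
        ["-m", "mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler"],
    )

    conn = boto.connect_emr()
    jobflow_id = conn.run_jobflow(
        name="persistent cluster with fair scheduler",
        keep_alive=True,            # keep the cluster up between jobs
        num_instances=3,
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
        bootstrap_actions=[fair_scheduler],
    )
    print(jobflow_id)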
Additionally, by circumventing EMR's infrastructure to add extra jobs, you lose the ability to monitor the job, which is one of EMR's major features.
This is partly true: the cluster will constantly be in the "WAITING" state; however, EMR will still continue to dump log files for each job/task/attempt into your logging bucket, which is handy. It will also still correctly show slot count stats and all the other metrics they provide via CloudWatch.
David Marin
Collaborator
Huh! Well, thanks for letting us know this works and is a real use case.
It sounds like if you used the hadoop runner with --hadoop-binary='ssh hadoop@<master node IP> hadoop', things would work unless you tried to copy things from the local filesystem to HDFS. Have you tried this?
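Roughly what I'm imagining, as a sketch only (the job class, user, and IP are placeholders, and the exact switch name and whether it tolerates a multi-word ssh wrapper may depend on the mrjob version):

    # Sketch of Hadoop-over-ssh: point the hadoop runner's hadoop binary at an
    # ssh wrapper so every hadoop invocation runs on the EMR master node.
    # Job class, user, and IP are placeholders; switch name may vary by version.
    from my_job import MRMyJob  # hypothetical MRJob subclass

    job = MRMyJob(args=[
        "-r", "hadoop",
        "--hadoop-bin", "ssh hadoop@10.1.2.3 hadoop",  # assumed ssh wrapper
        "hdfs:///tmp/input",  # input already on HDFS/S3; local files would need scp
    ])
    with job.make_runner() as runner:
        runner.run()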
Maybe what we need is:
  • some sort of --ssh option that tells us that we have to run hadoop over ssh, that can scp local files.
  • an informative message from mrjob.tools.emr.create_job_flow that tells us the option(s) to pass to the hadoop runner if we'd rather do Hadoop-over-ssh than use the emr runner.
I'll probably re-open this as a new ticket; let me know what you think.
Tom Arnfeld
Heh, that's a neat idea. It's hard to explain the use case without getting into the specifics, but I'd love to discuss implementations.
For production, EMR works well even when invoking mrjob from the master node (with a bit of tooling around the whole thing), since the master node saves us from needing somewhere else to invoke mrjob from, and, when using the hadoop runner, it already has Hadoop configured correctly.
some sort of --ssh option that tells us that we have to run hadoop over ssh, that can scp local files.
I'd be a big fan of this. As described above, the workflow works well for production, but when developing your jobs it creates a whole new problem of getting code onto the master node while you're constantly making changes. This approach might be an elegant solution to that.
an informative message from mrjob.tools.emr.create_job_flow that tells us the option(s) to pass to the hadoop runner if we'd rather do Hadoop-over-ssh than use the emr runner.
I think the only option that would actually need to be sent to the job invocation would be the EMR job flow ID - which is already output from the mrjob.tools.emr.create_job_flow script :+1:
It'd be interesting to see how others feel about this sort of implementation. Having this feature would also mean you're no longer restricted to invoking hadoop on the same machine you're invoking mrjob from, which is quite useful in other cases.
Tom Arnfeld
Also @irskep
Given the time I spent getting mrjob to suck down logs over SSH, I'm convinced that this would be a huge pain and an inappropriately large amount of code to implement.
With the fixes I pushed to support pulling counters from the _logs subdirectory of the output directory, this won't be an issue (I don't think?)... though come to think of it, I'm not sure I ever sent those fixes upstream, for fear of breaking the initial implementation.
Update: Nope, I never did; see duedil-ltd/mrjob#4 and duedil-ltd/mrjob#8 if you're interested.
David Marin
Collaborator
I think the only option that would actually need to be sent to the job invocation would be the EMR job flow ID - which is already output from the mrjob.tools.emr.create_job_flow script :+1:
I'd really rather not send the EMR job flow ID or anything else AWS-specific to the hadoop runner as it would make it less clear which runner is intended for use on EMR, and which is intended for general use. So I'd rather treat this use case as supporting a hadoop cluster which can only be accessed over SSH (which just happens to usually be on EMR). Of course it's fair game for mrjob.tools.emr to print out flags which might be useful to the hadoop runner.
Thanks for the patches! Yeah, if you have a generally useful patch, it's just common courtesy to submit it back to the main branch. If it breaks something, it's really the maintainers' fault. :)
Tom Arnfeld
That's a fair point; from my point of view, just specifying the required arguments for the SSH connection would suffice.
Yeah, if you have a generally useful patch, it's just common courtesy to submit it back to the main branch.
I've sent a few, and I re-applied all of my patches to make it clear what they were all for via pull requests on duedil-ltd/mrjob but haven't yet got round to sending them upstream. I'll try to do that shortly. :)
