# Running open-source R on EMR with RHadoop packages and RStudio
More documentation can be found in the AWS Big Data Blog post. Please copy all files to an S3 bucket.
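A minimal sketch of the upload, assuming the scripts from this repository are in your current directory and `<YOUR_BUCKET>` is your bucket name:

```sh
# Copy the bootstrap and step scripts to your S3 bucket
aws s3 cp emR_bootstrap.sh s3://<YOUR_BUCKET>/
aws s3 cp hdfs_permission.sh s3://<YOUR_BUCKET>/
```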
With the following command you can start an EMR cluster with R, RHadoop, and RStudio installed. Please replace each `<YOUR-X>` placeholder with your own values:

```sh
bucket="<YOUR_BUCKET>"
region="<YOUR_REGION>"
keypair="<YOUR_KEYPAIR>"
```
```sh
aws emr create-cluster --name emR-example \
  --ami-version 3.2.1 \
  --region $region \
  --ec2-attributes KeyName=$keypair \
  --no-auto-terminate \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
  --bootstrap-actions \
    Name=emR_bootstrap,Path="s3://$bucket/emR_bootstrap.sh",Args=[--rstudio,--rhdfs,--plyrmr,--rexamples] \
  --steps \
    Name=HDFS_tmp_permission,Jar="s3://elasticmapreduce/libs/script-runner/script-runner.jar",Args="s3://$bucket/hdfs_permission.sh"
```
## File documentation
- `emR_bootstrap.sh`: installs RStudio and the RHadoop packages on all EMR instances, depending on the provided arguments (see the example invocation after this list)
  - `--rstudio`: installs rstudio-server, default false
  - `--rexamples`: adds R examples to the user home directory, default false
  - `--rhdfs`: installs the rhdfs package, default false
  - `--plyrmr`: installs the plyrmr package, default false
  - `--updater`: installs the latest R version, default false
  - `--user`: sets the user for RStudio, default "rstudio"
  - `--user-pw`: sets the password for user USER, default "rstudio"
  - `--rstudio-port`: sets the RStudio port, default 80
- `hdfs_permission.sh`: fixes `/tmp` permissions in HDFS to provide temporary storage for R streaming jobs
- `rmr2_example.R`: simple example of MapReduce jobs with R (a minimal sketch of such a job follows below)
- `biganalyses_example.R`: bigger script with some larger analyses using the plyrmr package
- `change_pw.R`: simple script to change the Unix password of the rstudio user from an R session
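For example, the bootstrap action can combine the user and port options with the package flags. A sketch of an alternative `--bootstrap-actions` clause; the user name, password, and port below are illustrative, and passing option values as separate comma-separated entries in the `Args` list is an assumption about how the script parses them:

```sh
--bootstrap-actions \
  Name=emR_bootstrap,Path="s3://$bucket/emR_bootstrap.sh",\
Args=[--rstudio,--rexamples,--user,analyst,--user-pw,s3cret,--rstudio-port,8787]
```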
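To give a feel for what an rmr2 MapReduce job looks like, here is a minimal sketch (not necessarily the contents of `rmr2_example.R`):

```r
library(rmr2)

# Write a small numeric vector to HDFS
small.ints <- to.dfs(1:1000)

# MapReduce job that squares each value; the map emits (value, value^2) pairs
result <- mapreduce(
  input = small.ints,
  map = function(k, v) keyval(v, v^2)
)

# Read the output back from HDFS into the R session
head(from.dfs(result)$val)
```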