How to Move Data between Amazon S3 and HDFS in EMR

Summary


When using an Amazon Elastic MapReduce (EMR) cluster, any data stored in the HDFS file system is temporary and ceases to exist once the cluster is terminated. Amazon Simple Storage Service (Amazon S3) provides permanent storage for data such as input files, log files, and output files written to HDFS.

The s3distcp utility, an S3-aware extension of the open-source DistCp tool, can be used to move data between S3 and HDFS. In DMExpress, the command can be invoked from a custom task within a job that includes the MapReduce job as a subjob.
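For example, a typical copy from S3 into HDFS looks like the following; the jar location, bucket name, and directories are placeholders that will vary by environment:
hadoop jar /path/to/s3distcp.jar --src s3n://your-bucket/input/ --dest hdfs:///user/hadoop/input/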

Note that the instructions provided here apply to Syncsort Ironcluster™ Release 1.

Resolution

While data stored in the HDFS file system of an Amazon EMR cluster is lost once the cluster is terminated, Amazon S3 can be used to store and retrieve data that you want to keep permanently.
The utility s3distcp can be used to move data from Amazon S3 to an HDFS file system and back. In the attached example, 46622_FileCDC_ExtractAndLoadToS3_7.9.zip, the MRJ_FileCDC_ExtractAndLoadToS3.dxj job (compatible with DMExpress 7.9 or later) contains the following components:
  • a custom task, ExtractFromS3, that extracts data from S3 to HDFS using s3distcp
  • a subjob that runs the FileCDC use case accelerator
  • another custom task, LoadToS3, that loads data from HDFS back to S3 using s3distcp
Note that the included use case accelerators do not themselves require ExtractFromS3 and LoadToS3, because their data is loaded into HDFS from the local file system of the Linux ETL Server instance by the prep_dmx_example.sh script.
The following steps show how to create and run this example job.

Create a New Job
Open the DMExpress Job Editor on the Windows instance, click on File->Save Job As… to create a new job, give it a name, and save it to the desired folder on the ETL Server instance.

Create a Custom Task to Move Data from S3 to HDFS
Select Edit->Add Custom Task… to create a custom task to run the following s3distcp command, which copies data from Amazon S3 to the source directory in the HDFS file system used by the DMX-h ETL MapReduce job:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src s3n://$YOUR_S3_BUCKET/$S3_SOURCE_DIR --dest $HDFS_SOURCE_DIR
Populate the dialog as shown below, click OK, then click to place the custom task on the job canvas.
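If you want to verify the extract step independently of the job, you can run the same s3distcp command manually on the ETL Server instance and then list the HDFS source directory; this is just a quick check, and it assumes the HDFS_SOURCE_DIR environment variable described later in this article is set in your shell:
hadoop fs -ls $HDFS_SOURCE_DIR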

Add the DMExpress MapReduce Job as a Subjob
Select Edit->Add DMExpress Job…, navigate to the desired .dxj job file, select it, click OK, and click to place the job on the canvas to the right of the custom task.
Click on the Sequence toolbar button and draw an arrow from the custom task to the subjob as shown below:

Create a Custom Task to Move Data from HDFS to S3
Select Edit->Add Custom Task… to create a custom task to run the following s3distcp command, which copies data from the HDFS target directory used by the DMX-h ETL MapReduce job to Amazon S3:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src $HDFS_TARGET_DIR --dest s3n://$YOUR_S3_BUCKET/$S3_TARGET_DIR
Populate the dialog as shown below, click OK, then click to place the custom task on the job canvas to the right of the subjob.

Click on the Sequence toolbar button and draw an arrow from the subjob to this second custom task as shown below:
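As a sanity check after a run, you can list both the HDFS target directory and the S3 target location from the ETL Server instance; this assumes the environment variables described in the next section are set in your shell and the S3 credentials have been added to core-site.xml as described below:
hadoop fs -ls $HDFS_TARGET_DIR
hadoop fs -ls s3n://$YOUR_S3_BUCKET/$S3_TARGET_DIR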

Prepare the Environment and Run the Job
DMExpress jobs can be run from the command line (via dmxjob) or the GUI (via the Run button in the DMExpress Job Editor).
  • When running from the command line, any environment variables must be exported in the shell, with no spaces around the equals sign:
    export VARIABLE_NAME=variable_value
  • When running from the GUI, they must be specified in the Environment variables tab of the DMExpress Server dialog (raised by clicking on the Status toolbar button, with the Server set to the ETL Server instance and the user set to hadoop) in the form:
    VARIABLE_NAME = variable_value
To prepare the environment and run the job, do the following:
  1. Specify the environment variables that were used in the s3distcp invocations as follows:

    YOUR_S3_BUCKET = <name of your S3 bucket>
    S3_SOURCE_DIR = <directory in S3 where the source files are located>
    HDFS_SOURCE_DIR = <directory where you want to store the source files in HDFS>
    HDFS_TARGET_DIR = <directory in HDFS where the MapReduce job writes the target files>
    S3_TARGET_DIR = <directory in S3 where you want to write the target files>

    The variables HDFS_SOURCE_DIR and HDFS_TARGET_DIR must be passed to Hadoop via the /EXPORT option to dmxjob when running from the command line. See "Running a job from the command prompt" in the DMExpress Help for details. For example:
    dmxjob /RUN job.dxj /HADOOP /EXPORT HDFS_SOURCE_DIR HDFS_TARGET_DIR
  2. Get the AWS account Access Key ID and Secret Access Key from your Amazon account administrator.
  3. Modify the $HADOOP_HOME/core-site.xml file on the ETL node to include the following entries, where $HADOOP_HOME is the Hadoop configuration directory of your installation, such as /etc/hadoop-2.2.0/etc/hadoop/, and YOUR_ACCESSKEY and YOUR_SECRETKEY are the actual Access Key ID and Secret Access Key retrieved in the previous step:

    <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_ACCESSKEY</value>
    </property>
    <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_SECRETKEY</value>
    </property>
    <property>
        <name>fs.s3.awsAccessKeyId</name>
        <value>YOUR_ACCESSKEY</value>
    </property>
    <property>
        <name>fs.s3.awsSecretAccessKey</name>
        <value>YOUR_SECRETKEY</value>
    </property>


  4. Modify your HADOOP_CLASSPATH environment variable (setting it in the same way as the other environment variables) to include the $HADOOP_HOME/core-site.xml file, the AWS SDK for Java jar file, and its Apache HttpComponents (httpcore and httpclient) dependencies, as follows:

    HADOOP_CLASSPATH=${HADOOP_HOME}/core-site.xml:/etc/hadoop-2.2.0/share/hadoop/tools/lib/aws-java-sdk-1.6.4.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpcore-4.2.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpclient-4.2.3.jar:$HADOOP_CLASSPATH
  5. Run the job from the GUI if the environment variables were set in the DMExpress Server dialog, or from the command line (via dmxjob) if they were exported in the shell. A complete command-line sequence is sketched below.
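
Putting the command-line path together, a complete session on the ETL Server instance might look like the following sketch; the bucket and directory values are hypothetical, and the jar paths should be adjusted to match your installation:

    # hypothetical values for illustration only
    export YOUR_S3_BUCKET=my-emr-bucket
    export S3_SOURCE_DIR=filecdc/source
    export HDFS_SOURCE_DIR=/user/hadoop/filecdc/source
    export HDFS_TARGET_DIR=/user/hadoop/filecdc/target
    export S3_TARGET_DIR=filecdc/target
    # HADOOP_HOME is assumed to already point to the Hadoop configuration directory
    export HADOOP_CLASSPATH=${HADOOP_HOME}/core-site.xml:/etc/hadoop-2.2.0/share/hadoop/tools/lib/aws-java-sdk-1.6.4.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpcore-4.2.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpclient-4.2.3.jar:$HADOOP_CLASSPATH
    dmxjob /RUN MRJ_FileCDC_ExtractAndLoadToS3.dxj /HADOOP /EXPORT HDFS_SOURCE_DIR HDFS_TARGET_DIR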

Additional Information


For details on S3 in general, see aws.amazon.com/s3.
For step-by-step guidelines on how to start using Amazon S3, how to create your own bucket, and how to load data into your bucket, see docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html.
For additional information on running jobs, see "Running DMX-h ETL Jobs" in the DMExpress Help.
For details on Ironcluster (DMX-h on Amazon's Elastic MapReduce platform), see Syncsort Ironcluster™ Hadoop ETL, Amazon EMR Edition.

Applies To

DMX-h ETL Edition 7.*

Cross References

Syncsort Ironcluster™ Hadoop ETL, Amazon EMR Edition (46621-2)

Keywords

Config, HowTo, Hadoop, Linux
