Summary
When using an Amazon Elastic MapReduce (EMR) cluster, any data stored in the HDFS file system is temporary and ceases to exist once the cluster is terminated. Amazon Simple Storage Service (Amazon S3) provides permanent storage for data such as input files, log files, and output files written to HDFS.
The open-source utility s3distcp can be used to move data between S3 and HDFS. This utility can be invoked from a custom task as part of a DMExpress job that includes a MapReduce job as a subjob.
Note that the instructions provided here apply to Syncsort Ironcluster™ Release 1.
Resolution
While data stored in the HDFS file system of an Amazon EMR cluster is lost once the cluster is terminated, Amazon S3 can be used to store and retrieve data that you want to keep permanently.
The utility s3distcp can be used to move data from Amazon S3 to an HDFS file system and back. In the attached example, 46622_FileCDC_ExtractAndLoadToS3_7.9.zip, the MRJ_FileCDC_ExtractAndLoadToS3.dxj job (compatible with DMExpress 7.9 or later) contains the following components:
- a custom task, ExtractFromS3, that extracts data from S3 to HDFS using s3distcp
- a subjob that runs the FileCDC use case accelerator
- another custom task, LoadToS3, that loads data from HDFS back to S3 using s3distcp
Note that the included use case accelerators do not require ExtractFromS3 and LoadToS3, as the data for these is loaded into HDFS from the local file system on the Linux ETL Server instance by running the prep_dmx_example.sh script.
The following steps describe how to create and run this example job.
Create a New Job
Open the DMExpress Job Editor on the Windows instance, click on File->Save Job As… to create a new job, give it a name, and save it to the desired folder on the ETL Server instance.
Create a Custom Task to Move Data from S3 to HDFS
Select Edit->Add Custom Task… to create a custom task to run the following s3distcp command, which copies data from Amazon S3 to the source directory in the HDFS file system used by the DMX-h ETL MapReduce job:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src s3n://$YOUR_S3_BUCKET/$S3_SOURCE_DIR --dest $HDFS_SOURCE_DIR
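For example, with a hypothetical bucket named my-dmx-bucket, an S3 source directory of fileCDC/source, and an HDFS source directory of /user/hadoop/fileCDC/source, the resolved command would look like this:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src s3n://my-dmx-bucket/fileCDC/source --dest /user/hadoop/fileCDC/source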
Populate the dialog as shown below, click OK, then click to place the custom task on the job canvas.
Add the DMExpress MapReduce Job as a Subjob
Select Edit->Add DMExpress Job…, navigate to the desired .dxj job file, select it, click OK, and click to place the job on the canvas to the right of the custom task.
Click on the Sequence toolbar button and draw an arrow from the custom task to the subjob as shown below:
Create a Custom Task to Move Data from HDFS to S3
Select Edit->Add Custom Task… to create a custom task to run the following s3distcp command, which copies data from the HDFS target directory used by the DMX-h ETL MapReduce job to Amazon S3:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src $HDFS_TARGET_DIR --dest s3n://$YOUR_S3_BUCKET/$S3_TARGET_DIR
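For example, with the same hypothetical bucket, an HDFS target directory of /user/hadoop/fileCDC/target, and an S3 target directory of fileCDC/target, the resolved command would look like this:
hadoop jar /etc/hadoop-2.2.0/share/hadoop/tools/lib/s3distcp.jar --src /user/hadoop/fileCDC/target --dest s3n://my-dmx-bucket/fileCDC/target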
Populate the dialog as shown below, click OK, then click to place the custom task on the job canvas to the right of the subjob.
Click on the Sequence toolbar button and draw an arrow from the subjob to this second custom task as shown below:
Prepare the Environment and Run the Job
DMExpress jobs can be run from the command line (via dmxjob) or from the GUI (via the Run button in the DMExpress Job Editor).
- When running from the command line, any environment variables must be specified in the form:
export VARIABLE_NAME=variable_value
- When running from the GUI, they must be specified in the Environment variables tab of the DMExpress Server dialog (raised by clicking on the Status toolbar button, with the Server set to the ETL Server instance and the user set to hadoop) in the form:
VARIABLE_NAME = variable_value
To prepare the environment and run the job, do the following:
- Specify the environment variables that were used in the s3distcp invocations as follows (for command-line runs, see the sketch after these steps):
YOUR_S3_BUCKET = <name of your S3 bucket>
S3_SOURCE_DIR = <directory in S3 where the source files are located>
HDFS_SOURCE_DIR = <directory in HDFS where you want to store the source files>
HDFS_TARGET_DIR = <directory in HDFS where the target files are located>
S3_TARGET_DIR = <directory in S3 where you want to write the target files>
The variables HDFS_SOURCE_DIR and HDFS_TARGET_DIR must be passed to Hadoop via the /EXPORT option to dmxjob when running from the command line. See "Running a job from the command prompt" in the DMExpress Help for details. For example:
dmxjob /RUN job.dxj /HADOOP /EXPORT HDFS_SOURCE_DIR HDFS_TARGET_DIR
- Get the AWS account Access Key ID and Secret Access Key from your Amazon account administrator.
- Modify the $HADOOP_HOME/core-site.xml file on the ETL node to include the following entries, where $HADOOP_HOME is the directory where you have Hadoop installed, such as /etc/hadoop-2.2.0/etc/hadoop/, and YOUR_ACCESSKEY and YOUR_SECRETKEY are the actual Access Key ID and Secret Access Key retrieved in the previous step:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRETKEY</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>YOUR_SECRETKEY</value>
</property>
- Modify your HADOOP_CLASSPATH environment variable (setting it in the same way as the other environment variables) to include the $HADOOP_HOME/core-site.xml file and the AWS SDK for Java jar files, as follows:
HADOOP_CLASSPATH=${HADOOP_HOME}/core-site.xml:/etc/hadoop-2.2.0/share/hadoop/tools/lib/aws-java-sdk-1.6.4.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpcore-4.2.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpclient-4.2.3.jar:$HADOOP_CLASSPATH
- Run the job from the GUI if the environment variables were set in the DMExpress Server dialog, or from the command line if they were exported there.
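As an illustration, a complete command-line run might look like the following minimal sketch. The bucket name, directory paths, and saved job name (my-dmx-bucket, /user/hadoop/fileCDC/..., FileCDC_S3_Example.dxj) are hypothetical placeholders, and the HADOOP_HOME path and jar versions must match your installation:
# Environment variables used by the s3distcp custom tasks (hypothetical values)
export YOUR_S3_BUCKET=my-dmx-bucket
export S3_SOURCE_DIR=fileCDC/source
export HDFS_SOURCE_DIR=/user/hadoop/fileCDC/source
export HDFS_TARGET_DIR=/user/hadoop/fileCDC/target
export S3_TARGET_DIR=fileCDC/target
# HADOOP_HOME as described in the steps above; adjust to your installation
export HADOOP_HOME=/etc/hadoop-2.2.0/etc/hadoop
export HADOOP_CLASSPATH=${HADOOP_HOME}/core-site.xml:/etc/hadoop-2.2.0/share/hadoop/tools/lib/aws-java-sdk-1.6.4.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpcore-4.2.jar:/etc/hadoop-2.2.0/share/hadoop/tools/lib/httpclient-4.2.3.jar:$HADOOP_CLASSPATH
# Optional check that Hadoop can reach the S3 bucket with the credentials from core-site.xml
hadoop fs -ls s3n://$YOUR_S3_BUCKET/$S3_SOURCE_DIR
# Run the job saved in the Job Editor, passing the HDFS directories to Hadoop via /EXPORT
dmxjob /RUN FileCDC_S3_Example.dxj /HADOOP /EXPORT HDFS_SOURCE_DIR HDFS_TARGET_DIR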
Additional Information
For details on S3 in general, see aws.amazon.com/s3.
For step-by-step guidelines on how to start using Amazon S3, how to create your own bucket, and how to load data into your bucket, see docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html.
For additional information on running jobs, see "Running DMX-h ETL Jobs" in the DMExpress Help.
For details on Ironcluster (DMX-h on Amazon's Elastic MapReduce platform), see Syncsort Ironcluster™ Hadoop ETL, Amazon EMR Edition.
Applies To
DMX-h ETL Edition 7.*
Cross References
Syncsort Ironcluster™ Hadoop ETL, Amazon EMR Edition (46621-2)
Keywords
Config, HowTo, Hadoop, Linux