Amazon Elastic MapReduce

Based on Hadoop, Elastic MapReduce equips users with potent distributed data-processing tools

A review step follows. Once you approve your configuration, the job launches and you return to the Job Flows page, where you can monitor its progress. When the job completes, your output data is stored in the S3 bucket you specified.

Users repelled by Web-based graphical management consoles (such as the AWS Management Console) will be happy to discover that Elastic MapReduce can also be driven from a command-line interface. This interface runs on the Ruby programming language (a free download) and provides a single command that sports a battalion of parameters. You can create job flows, define inputs, specify map and reduce functions, and generally do anything covered in the AWS Management Console.
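For readers who have never written the map and reduce functions a job flow asks for, here is a minimal word-count sketch in Python. It assumes the functions are supplied as Hadoop-streaming-style scripts that read lines of text and emit tab-separated key/value records; that wiring, and the names map_words and reduce_counts, are illustrative assumptions rather than anything Amazon prescribes.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts for each word.

    Assumes records arrive sorted by key, which is what Hadoop's
    shuffle phase provides between the map and reduce stages.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the shuffle between the two stages.
sample = ["the cat", "The dog"]
for record in reduce_counts(sorted(map_words(sample))):
    print(record)
```

To turn either function into a standalone script, a thin wrapper that feeds it sys.stdin and prints each record it yields should suffice. Running the pair locally on a small sample like this is a cheap way to catch mistakes before paying for a cluster.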

Personal distributed computing

Setting up an Amazon Elastic MapReduce job flow is remarkably easy. New users should run one of the supplied example applications to familiarize themselves with the complete process. I would also recommend setting the optional parameter for generating log files. The resulting logs are comprehensive and can be confusing if you're new to Hadoop, but they helped me track down repeated failures in my first attempts.

Amazon claims to have tweaked the behavior of its implementation of Hadoop to work optimally with S3. Amazon was guarded about the details of this tweaking, so we'll have to take the company at its word as to the benefits of the optimizations. Nevertheless, if you have a large-scale distributed processing problem but a small-scale budget, you should familiarize yourself with Hadoop, then take Amazon's Elastic MapReduce for a spin.