Large Data Set Analysis in the Cloud: Hadoop Gets a Boost


The changing nature of IT, as well as the rapid evolution of business processes, means that you'll likely face the need for an analytical tool like Hadoop in the very near future.

The advantage of this approach is that very large sets of data can be managed and processed in parallel across the pool of machines that Hadoop manages. The map/reduce approach is sometimes criticized as inefficient, since the entire data pool is processed each time a new analysis is run. While it's true that repeated processing is typical, it's also true that in today's world it's impossible to know beforehand what kinds of analyses will be needed in the future. Optimizing to reduce the "inefficient" repeated processing would therefore also limit the ability to explore unforeseen analytical requirements as they arise. Moreover, processing is significantly cheaper than data storage, relatively speaking. A general rule of thumb is to optimize for the least efficient, most costly resource; with Internet-scale data, that means "wasting" processing to optimize storage.
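To make the map/reduce idea concrete, here is a minimal sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a local thread pool stands in for the machine pool that Hadoop would manage. The word-count task is only an illustration of the pattern.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool standing in for a cluster


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine every value collected for one key."""
    return key, sum(values)


def run_job(documents, workers=4):
    """Run the map step in parallel, then group and reduce the results."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input; a real job would read from distributed storage.
documents = ["the cloud the data", "data in the cloud"]
counts = run_job(documents)
print(counts)
```

Note that asking a different question of the same data means writing new map and reduce functions and rerunning the whole job over the input, which is exactly the repeated processing discussed above.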

While Hadoop may seem like a product you have no need for, don't dismiss it. The changing nature of IT, as well as the rapid evolution of business processes, means that you'll likely face the need for this kind of analytical tool in the very near future.

The power of Hadoop can be seen in how the New York Times used it to convert a 4 TB collection of its pages from one format to another. The programmer assigned the task uploaded the data to a number of Amazon EC2 instances and then ran a Hadoop map/reduce transformation on the pages. Two days later, all of the pages had been converted, sped up by the parallel processing across 20 machines that Hadoop made possible. Attempting the same conversion on one machine would have taken well over a month; attempting the parallel processing without Hadoop would have required building a large, complex grid fabric. Hadoop abstracted all the "plumbing" (i.e., spreading the data across machines, coordinating the parallel processing, ensuring that the job was executing properly, etc.) away from the programmer, enabling him to focus on the actual task: creating the document conversion routine executed in the reduce phase of the process.
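The "well over a month" figure can be checked with a back-of-the-envelope calculation. The sketch below uses only the numbers quoted above (4 TB, 20 machines, two days) and assumes perfectly linear scaling with no coordination overhead, which is an idealization; the function names are hypothetical.

```python
def single_machine_days(cluster_days, machines):
    """Estimate single-machine run time, assuming perfectly linear scaling."""
    return cluster_days * machines


def gb_per_machine_day(dataset_tb, cluster_days, machines):
    """Average data converted per machine per day (1 TB taken as 1000 GB)."""
    return dataset_tb * 1000 / single_machine_days(cluster_days, machines)


estimate = single_machine_days(cluster_days=2, machines=20)
throughput = gb_per_machine_day(dataset_tb=4, cluster_days=2, machines=20)

print(estimate)    # 40 machine-days, i.e. about 40 days on one machine
print(throughput)  # 100.0 GB per machine per day
```

Under that assumption the job represents roughly 40 machine-days of work, which is consistent with a single machine needing well over a month.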

Hadoop has attracted vendor attention. A new startup, Cloudera, offers certified releases and regular updates, as well as technical support. (Incidentally, Cloudera offers a set of online Hadoop courses, which are excellent.) And just last week, Amazon announced it will offer Hadoop support in its Elastic MapReduce, making Hadoop available in the cloud.

The interesting thing about the two Hadoop offerings is that they both bring something unique to the table. Amazon removes the need for a Hadoop user to locate spare computing resources, always a tough task in a typical corporate data center. After all, who has 15 or 20 machines sitting around idle, just waiting to be used for Hadoop? On the other hand, Cloudera's offering avoids the need to upload large amounts of data to Amazon, a challenge given the limited bandwidth available to most companies. Using Amazon's offering also imposes data movement costs, since Amazon charges for data transferred in and out of AWS.
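To get a feel for the bandwidth constraint, the following sketch estimates how long an upload would take. The 4 TB size is borrowed from the New York Times example; the link speeds are hypothetical values chosen for illustration, not measurements, and the calculation assumes the link is fully and continuously used.

```python
SECONDS_PER_DAY = 86_400


def upload_days(dataset_tb, link_mbps):
    """Days needed to push dataset_tb terabytes over a link_mbps link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s)
    and a link that is saturated the whole time.
    """
    bits = dataset_tb * 10**12 * 8
    seconds = bits / (link_mbps * 10**6)
    return seconds / SECONDS_PER_DAY


# Hypothetical uplink speeds, for illustration only.
for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbps: {upload_days(4, mbps):.1f} days")
```

At the assumed 100 Mbps, a 4 TB upload would take nearly four days before any processing could start, which is why keeping the data in-house can be attractive for some projects.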

I suspect that both offerings will prove popular going forward. Each will be used by companies grappling with the need to analyze Internet-scale data. Depending upon the particular project or company constraints, one solution or the other will end up being preferred. In fact, I would not be surprised to see many companies embrace both approaches to Hadoop, once they begin to understand its power.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.
