Tackling big data challenges with Hadoop

Storage costs and disparate systems common issues when doing big data and analytics

Big data and analytics sound nice in theory, but in practice they are no mean feat. This is partly due to limitations in technology and challenges in storage costs and disparate systems, said Eva Andreasson, pioneer of deterministic garbage collection and product manager at Cloudera.

Andreasson spoke at the YOW! Developer Conference in Sydney this week about the various tools within Hadoop to help tackle these challenges, providing real world examples from the health and retail industries.

Andreasson worked with a children's hospital in the United States that needed to create actionable insights from its data.

Its monitoring systems constantly generate data on a patient's respiratory system, heart rate, blood pressure and so on. However, the hospital could only afford to store this vital data on patients for three days post surgery or treatment.

"In their intensive care unit, they have a lot of babies and patients who don't know how to communicate. So all they know is what comes out of the monitoring systems. Imagine being a patient in that hospital and unable to communicate on day four if something feels wrong," Andreasson said.

Another problem for the hospital was that its research data sat in disparate systems, making it difficult to exploit and often resulting in long wait times for clinicians seeking particular information.

"They knew the most common reason for children in emergency was asthma related, and they had a lot of different sources of research, alongside 20 years of other research data externally. But they had no system to be able to correlate the datasets efficiently in their research group; it was always in different systems," Andreasson said.

The hospital invested in a Hadoop platform to deal with these challenges, and was able to make some savings. Andreasson said the whole platform, from the hardware to all the software licenses, cost the hospital less than three of the processors used to run its previous traditional data management system.

The hospital can handle 50GB of monitoring data per week, and has 2TB of capacity to host all of its research data in the same, accessible cluster. Apache Sqoop was used for transferring data between relational databases and Hadoop.

Solr, an open source tool for full-text search, is being used by staff to further explore the hospital's various datasets and documents. Impala, a query tool, is being used to run analytics on real-time monitoring data and investigate how a patient's health is going.

"Just within weeks, they were able to change their processes so that nurses could stay longer with their patients; they noticed that when a nurse stays for a few hours longer with their patient they were much more likely to recover better," Andreasson said. "Within months of installing this system, they decreased the number of asthma related illnesses."

Another example Andreasson gave was an online retailer which, like the children's hospital, previously stored only a limited amount of data: six months' worth of customer-related transactions.

"What if there are patterns in 12 months, two years or five years that show which categories of customers drop off after a while or stop purchasing their products?" she asked.

"Hadoop can do this for either six months of data, a year or five years - it doesn't matter - because it is a linear scalable platform."
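The point about longer retention can be illustrated with a minimal sketch in plain Python. It is not the retailer's implementation and does not run on Hadoop; the function name `lapsed_customers`, the 180-day threshold and the sample transactions are all assumptions made up for illustration. It shows how a customer who stopped buying more than six months ago is invisible when only six months of data are kept.

```python
from datetime import date

def lapsed_customers(transactions, as_of, inactive_days=180):
    """Return customers whose latest purchase is more than
    inactive_days before as_of (hypothetical definition of 'dropped off')."""
    last_purchase = {}
    for customer, day in transactions:
        # Keep only the most recent purchase date per customer.
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
    return sorted(
        customer
        for customer, day in last_purchase.items()
        if (as_of - day).days > inactive_days
    )

# Hypothetical sample data: (customer, purchase date).
transactions = [
    ("alice", date(2012, 1, 10)),
    ("alice", date(2013, 11, 2)),
    ("bob", date(2012, 3, 5)),
    ("bob", date(2012, 9, 20)),
    ("carol", date(2013, 10, 15)),
]
as_of = date(2013, 12, 1)

# With the full history, bob's drop-off is visible.
full_history = lapsed_customers(transactions, as_of)

# With only the last six months retained, bob never appears at all.
six_months = [t for t in transactions if t[1] >= date(2013, 6, 1)]
short_history = lapsed_customers(six_months, as_of)

print(full_history, short_history)
```

On a real cluster the same question would more likely be asked as a query over tables holding the full transaction history; the logic, not the tooling, is what this sketch is meant to show.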

The retailer is also recording every click a visitor makes on the website, including what he/she clicked on and how long was spent on certain sections of the site, and is making correlations between these different types of datasets.

Using Flume, each log event is consumed as it is generated and posted to the Hadoop Distributed File System (HDFS). Tables are then created to query which products are clicked on most compared with which are purchased most.

"From my example, there's one product missing in the 'most sold' that pops up here in the click stream data. What does that mean? You gain new insight by adding datasets to serve your business questions.

"So the new insight is a kid's football is somehow very popular but it doesn't show up in our purchase list. So what can we do to gain more revenue around it? Maybe it's mispriced, people leave when they hit the price information button - whatever it may be."
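The comparison Andreasson describes, most clicked versus most sold, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the retailer's actual queries: the function names `top_products` and `clicked_but_not_sold`, the product names and the counts are all hypothetical, and in practice the comparison would run as queries over tables built from the click logs and purchase records.

```python
from collections import Counter

def top_products(events, n):
    """Return the n most frequent product names, most frequent first."""
    return [product for product, _ in Counter(events).most_common(n)]

def clicked_but_not_sold(clicks, purchases, n=3):
    """Products in the top-n clicked list that are missing
    from the top-n purchased list."""
    top_sold = set(top_products(purchases, n))
    return [p for p in top_products(clicks, n) if p not in top_sold]

# Hypothetical sample data: one entry per click or per purchase.
clicks = (
    ["kids football"] * 5
    + ["running shoes"] * 4
    + ["tennis racket"] * 3
    + ["water bottle"] * 1
)
purchases = (
    ["running shoes"] * 4
    + ["tennis racket"] * 3
    + ["water bottle"] * 2
    + ["kids football"] * 1
)

missing = clicked_but_not_sold(clicks, purchases)
print(missing)
```

A product that surfaces here is a candidate for follow-up questions such as pricing or page design; the sketch only flags the gap, it does not explain it.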