
Avoiding the perils of 'rogue analytics' with a new approach to data blending

Big data value is found by combining data from an assortment of both new and established sources.

This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

A plethora of technological developments makes it easier than ever to gather information from new kinds of data sources and machines, including websites, applications, servers, networks, sensors, mobile devices and social networks. But the real value of big data lies not in the data itself, but in the whole new world of insights that emerges when data from an assortment of new and established sources is combined.

Organizations making decisions based purely on established relational data sources will likely lose market share to companies that can capitalize on all their available data sources. Put another way, it's the ability to blend data from operational business applications with data from social networks, sensors and weblogs that really opens up competitive advantage.

Blending different types of data sources helps companies better understand their customers and, in turn, design targeted services and experiences, resulting in greater service levels, loyalty and profitability.


Data blending offers huge potential for established players. They have mountains of historical data and constantly collect large amounts of new data through various channels and systems. However, it can be tough to adapt traditional information structures, which have enterprise data warehouses (EDW) at their center, to this brave new world.

Why? Because it simply doesn't make sense to move big data into an EDW and analyze it the same way we would relational data. The structural variety and sheer volume make that approach impractical and time-consuming, defeating any possibility of gaining insights in anything near real time. Furthermore, at these volumes the economics are prohibitive.

As a result, we are rapidly moving into an era of distributed data architectures, where data remains housed in the type of store best suited to its volume and variety. In this distributed approach, the traditional EDW continues to host data from enterprise applications such as CRM and ERP systems, and sits alongside more agile big data infrastructures such as Hadoop and NoSQL stores.

Tackle the need for speed

In this big data era, the pace of business analysis and response is speeding up. While it may still make sense to load data from enterprise applications into the EDW on a daily or weekly basis, businesses often cannot afford to wait for data to be extracted, merged, cleansed, transformed and stored before it can be analyzed.

As a consequence, new information architectures that manage the flow of data are being constructed differently than before. Agile analytics requires data to be accessible at the source, ensuring that analysis is based on the most up-to-date information possible. Data must be blended where it lives, across disparate stores and structures, while preserving performance and ease of use across the breadth of analytics, from historical to operational to predictive.
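
To make that concrete, here is a minimal sketch of a blend executed where the data lives. Two in-memory SQLite databases (via Python's built-in sqlite3 module) stand in for two disparate live stores, a hypothetical CRM system and a hypothetical weblog store; the second store is attached to the first connection and the blend is a single cross-store join, with no staging copy. A real distributed architecture would use a federation or virtualization layer, but the principle is the same.

```python
# A minimal sketch: two sqlite databases stand in for disparate live
# stores; ATTACH lets one query join across them with no staging copy.
# All names and data here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")          # stands in for the CRM store
conn.execute("ATTACH ':memory:' AS logs")   # stands in for the weblog store

conn.execute("CREATE TABLE main.customers (customer TEXT, segment TEXT)")
conn.execute("CREATE TABLE logs.weblog (customer TEXT, page_views INTEGER)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [("ACME", "enterprise"), ("Globex", "smb")])
conn.executemany("INSERT INTO logs.weblog VALUES (?, ?)",
                 [("ACME", 42), ("Globex", 7)])

# The blend is a single query across both stores, evaluated on demand,
# so the result always reflects the current state of each source.
for row in conn.execute("""
        SELECT c.customer, c.segment, w.page_views
        FROM main.customers c
        JOIN logs.weblog w ON w.customer = c.customer"""):
    print(row)
```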

When it comes to working with relational data, speed, quality and integrity need to be guaranteed. Otherwise, flawed data stays flawed, and the resulting analytics can land businesses in perilous situations. As such, it makes sense to execute data blending at the origin, where data integration takes place, rather than during analysis by the end user "at the glass".

Letting end users or analysts do their own data blends at the glass comes with three significant disadvantages. First, the data isn't captured at the source, so it's dated by definition and unsuitable for decisions that involve reacting to critical events as they arise. Second, end users don't usually understand the underlying data semantics, so their blends can undermine data governance and security. Third, inaccurate results with disastrous business consequences become more likely.

Here's one example of what can go wrong. An analyst blending two sources at the glass joins records on the field "customer" and matches two fields that are both named "revenue". Because the blend is executed purely on identical field names, the analyst does not realize that one figure is a monthly sum and the other is a daily total, and adds the two together to calculate the day's total revenue from that customer.

Adding the monthly figure into each day's total greatly distorts the actual revenue generated from that customer. The business might then decide to target that customer as highly profitable and offer significant discounts to maintain interest. Not only has it targeted the wrong customer (and potentially ignored the genuinely profitable ones), it has also given undeserved discounts. The net result is that this customer becomes less profitable, and other more discount-deserving customers lose out.
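
The trap is easy to reproduce. Below is a minimal sketch in Python with invented figures: the naive at-the-glass blend adds the two "revenue" fields simply because their names match, so the monthly aggregate inflates every daily total.

```python
# A minimal sketch of the "identical field names" trap, with invented
# data. One "revenue" is a daily figure, the other a monthly sum.

daily_sales = [  # from the operational system: one row per day
    {"customer": "ACME", "date": "2014-06-01", "revenue": 500},
    {"customer": "ACME", "date": "2014-06-02", "revenue": 700},
]
crm_summary = [  # from the CRM: "revenue" here is a monthly total
    {"customer": "ACME", "revenue": 15000},
]

# Naive at-the-glass blend: match records on "customer" and add the
# two "revenue" fields because their names are identical.
monthly = {row["customer"]: row["revenue"] for row in crm_summary}
for row in daily_sales:
    blended = row["revenue"] + monthly[row["customer"]]
    print(row["date"], blended)  # 15500 and 15700 -- wildly inflated
    # The correct daily revenue is just row["revenue"]: 500 and 700.
```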

A data blending fairy tale?

Data blending at the source is the most efficient and reliable way to combine different data sources. Data is blended during the transformation phase of the Extract-Transform-Load (ETL) process. With an advanced data integration tool, the transformation can be executed in an automated SQL environment rather than through manually designed, per-source transformations, even though the data comes from different sources such as NoSQL, Excel, XML or web services.

There are several advantages to this. First, it's easier to blend data without having to be familiar with all the different source programming languages. Second, most business analytics tools use SQL by default, which means data captured from the different sources and blended in a SQL transformation can be made accessible to any front end for reporting or analysis.
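
As a rough illustration of blending in the transformation phase, the sketch below uses Python's built-in sqlite3 module as a stand-in for an integration tool's SQL environment. Two extracts in different formats (a CSV-style feed and a JSON payload, both with invented data) are landed as tables, and the blend itself is plain SQL exposed as a view, which any SQL-speaking front end could then query.

```python
# A minimal sketch of blending during the ETL transformation phase.
# Source formats differ, but the blend is ordinary, declarative SQL.
import csv, io, json, sqlite3

csv_extract = "customer,daily_revenue\nACME,500\nGlobex,120\n"  # e.g. a flat file
json_extract = ('[{"customer": "ACME", "segment": "enterprise"},'
                ' {"customer": "Globex", "segment": "smb"}]')   # e.g. a web service

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, daily_revenue REAL)")
conn.execute("CREATE TABLE crm (customer TEXT, segment TEXT)")

# Extract + load: each source needs only a thin reader, not deep
# knowledge of its native query or programming language.
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(r["customer"], float(r["daily_revenue"]))
                  for r in csv.DictReader(io.StringIO(csv_extract))])
conn.executemany("INSERT INTO crm VALUES (?, ?)",
                 [(r["customer"], r["segment"])
                  for r in json.loads(json_extract)])

# Transform: the blend is a SQL view that any reporting front end
# can consume.
conn.execute("""CREATE VIEW revenue_by_segment AS
                SELECT c.segment, SUM(s.daily_revenue) AS revenue
                FROM crm c JOIN sales s ON s.customer = c.customer
                GROUP BY c.segment""")
print(conn.execute("SELECT * FROM revenue_by_segment").fetchall())
```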

Blending at the source during the data integration process also ensures that the underlying semantics remain intact and that the blend is not only available but also trackable. Statistics from the blends can be logged and viewed in the same way as any other ETL query. This offers insight into which data has been blended and how often, avoiding a blending nightmare, or what we refer to as 'rogue analytics'. This is particularly important for organizations with high standards for data governance and security, and for those with real-time information needs.
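
What "trackable" might look like in practice: the sketch below (with illustrative table and column names, not drawn from any particular product) writes one audit row per blend run, so questions like which sources were blended, and how often, reduce to an ordinary query.

```python
# A minimal sketch of logging blend statistics so blends stay auditable.
# Table and column names are hypothetical.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE blend_log (
                    run_at   TEXT,
                    sources  TEXT,
                    rows_out INTEGER)""")

def log_blend(conn, sources, rows_out):
    """Record one audit row per blend run."""
    conn.execute("INSERT INTO blend_log VALUES (?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(),
                  ",".join(sorted(sources)), rows_out))

log_blend(conn, ["crm", "weblog"], rows_out=2)
log_blend(conn, ["crm", "weblog"], rows_out=3)

# "Which data has been blended, and how often?" is now a plain query.
print(conn.execute("""SELECT sources, COUNT(*) AS runs
                      FROM blend_log GROUP BY sources""").fetchall())
```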

So while big data analytics is not a fairy tale answer to all enterprise obstacles, there are new, practical and proven ways to manage and ideally profit from its enormous potential.
