CIO

Sensible behaviours for nonsensical data

Data quality is critical to the success of any enterprise application. Systems from business intelligence to customer relationship management are destined to fail without high-quality data - the "garbage in, garbage out" principle in action.

George Preston scratched his head as he contemplated the road accident data from NSW. Something about it just did not make sense...

A director of software company Prometheus Information, Preston was preparing some educational and promotional material for the HealthWiz product his company produces for the Commonwealth Department of Health and Ageing. The subject of road accidents seemed to offer fertile ground, so he had started with an age-standardized comparison of hospitalization rates across all states. Next, with no particularly surprising comparisons standing out, he had moved on to prepare an age breakdown, expecting - as all anecdotal evidence leads us to believe - to find a peak in rates for young drivers. Yet while all other states were living up to those expectations, NSW was a distinctly different case.

Could there be something about the NSW P-plate system that explained why drivers in their 20s were hospitalized at half the rate of their counterparts in other states? Perhaps, but this would not explain why elderly people and infants had hospitalization rates three to five times higher. Puzzled, Preston then turned to the geographic distribution of the rates for males and females, finding that the low hospitalization rates were a NSW-wide phenomenon applying to both sexes.

Concluding there must be something systematically different about how the data is collected or compiled in NSW, Preston tried further analysis, hoping it would reveal some obvious reason for the difference. It did not, so he now plans to seek guidance from someone in NSW Health over whether they can explain it.

"I'm working on trying to determine how much I want to rely on this information, so I'm looking for a measure of confidence of some kind," Preston says. "The data has got to have an internal consistency, and it should line up with known information, so that there's kind of external validity as well.

"How one uses that information really depends on whether you've got some kind of coherent explanation for how the information is actually being generated. If you don't have some mental model in your head, it's generally pretty hard to use information, I think. The really important point is that usually you're looking for some kind of signal in the midst of noise. You need to try and get a handle on what the difference is between the noise and the signal."

In some cases a statistical test can help distinguish signal from noise; in others the organization can adjust the data for known causal factors, so that all of it is put on a comparable basis.
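One such adjustment is the age standardization Preston started with. A minimal sketch of the idea follows: each age group's rate is weighted by a standard population so that states with different age mixes can be compared on the same basis. The weights and rates below are invented for illustration, not HealthWiz figures.

```python
# Minimal sketch of direct age standardization: weight each age group's
# hospitalization rate by a standard population so that states with
# different age structures can be compared on the same basis.
# All figures below are made up for illustration.

STANDARD_POPULATION = {   # share of a reference population in each age group
    "0-14": 0.20, "15-24": 0.14, "25-64": 0.52, "65+": 0.14,
}

def age_standardized_rate(rates_by_age: dict[str, float]) -> float:
    """Return the rate expected if the state had the standard age mix."""
    return sum(STANDARD_POPULATION[age] * rate
               for age, rate in rates_by_age.items())

# Crude rates per 100,000 for two hypothetical states
nsw = {"0-14": 90.0, "15-24": 180.0, "25-64": 120.0, "65+": 300.0}
vic = {"0-14": 60.0, "15-24": 350.0, "25-64": 130.0, "65+": 220.0}

print(age_standardized_rate(nsw))   # comparable figure for NSW
print(age_standardized_rate(vic))   # comparable figure for VIC
```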

Ultimately, it is a matter of applying common sense, Preston says. "The important thing is to actually pick up on the signal - the presence of the signal - and then you can work harder to try and get a better handle on what the signal is so you can focus your efforts around that aspect of the data quality, without worrying about the rest of it."

To alert users to the varying reliability of its data, the HealthWiz team has developed a warning system that can be set to trigger for any specific value, variable or category in the data. These warnings are authored in consultation with the data custodian who supplied the collection in question.
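Prometheus has not published how the warning system is built, but the idea - caveats keyed to a variable, category or value and surfaced whenever that data is viewed - can be sketched roughly as follows. The rule structure, field names and messages are assumptions, not HealthWiz code.

```python
# Rough sketch of a data-reliability warning system: each caveat is keyed
# to a variable, an optional category and an optional value, and carries
# text agreed with the data custodian. Structure and names are illustrative.
from dataclasses import dataclass

@dataclass
class DataCaveat:
    variable: str
    category: str | None    # e.g. a state the caveat applies to, or None for all
    value: object | None    # a specific value, or None for any value
    message: str

CAVEATS = [
    DataCaveat("hospitalisation_rate", "NSW", None,
               "NSW data is compiled differently and may not be comparable."),
]

def warnings_for(variable: str, category: str, value) -> list[str]:
    """Return the caveat messages that apply to the data point being viewed."""
    return [c.message for c in CAVEATS
            if c.variable == variable
            and (c.category is None or c.category == category)
            and (c.value is None or c.value == value)]

print(warnings_for("hospitalisation_rate", "NSW", 42.0))   # caveat fires
print(warnings_for("hospitalisation_rate", "VIC", 42.0))   # no caveat applies
```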

"I think in general you should be able to alert users to parts of the data that are stronger and weaker," Preston says.


Limited Confidence

Few organizations can assert full confidence in the data they currently capture and maintain. PricewaterhouseCoopers's Global Data Management Survey 2004 found that only 34 percent of survey respondents (representing international Fortune 500 companies) were very confident regarding the quality of their data.

Everyone knows that relying on data of dismal quality can lead to all manner of harmful and unintended consequences, from the poorly filled customer orders that can cost an organization business to the weak financial record keeping that can end up sending directors to gaol. The danger for companies that cannot rate their level of data quality, and hence cannot decide whether to trust that data, is that they will fluctuate between blindly accepting data that leads them into flawed business decisions and - having identified a serious data quality issue - refusing to accept the validity of the data ever again, reverting to gut feel and experience for all business decisions.

Luckily, perfection is not always necessary. High-quality data is sometimes simply unobtainable and sometimes too expensive to be worth obtaining, yet for many planning activities in the enterprise, vital decisions can and must be made with imperfect data.

"I think in most decision making, certainly in business decision making, many of the strategic decisions are made on less than perfect data from the operational level," comments Andy Koronios, a professor with the School of Computer and Information Science at the University of South Australia.

In a recent piece called "You Can Make Good Decisions From Less-Than-Perfect Data", AMR Research vice president of research Bill Swanton points out that while no one disputes that high-quality data is essential for transaction systems, the picture gets murkier once the organization moves beyond transactions. In these cases, Swanton argues, high-quality data is neither obtainable nor even necessary for many planning activities in the enterprise.

"Valuable decisions can and must be made with less-than-perfect data, and the sensitivity of important applications and decision tools to data quality must be well understood," Swanton writes.

Measuring data quality can be as much art form as science, Preston admits - there is a creative, interpretive component to such work as well as a quantitative one.

Need to Integrate

One problem that repeatedly confounds organizations is a failure to integrate data well enough for informed decision making, with too many strategic decisions being based on imperfect data from the operations area. For instance, a manufacturing company might know the life expectancy of a particular machine but still need to decide whether to run the machine at a particular level (be that 100 percent or somewhat less) given the risk of degrading its life expectancy.

Koronios says data from embedded systems should not only be used in identifying and forecasting the health of the asset (whether the asset is going to fail or not) but also in informing the design of the asset's future replacement: providing historical data about failure rates in the name of improved design, for instance. Further, Koronios says much of the data coming in from the manufacturing and asset management areas - along with that collected from embedded systems - may or may not be dumped into a data warehouse. Either way the organization is faced with having pools of data available that do little to inform decision making at either the business or design level.

"There is so much data actually being pumped into systems these days that at the management level the bandwidth of the individual cannot actually handle the amount of data unless it's processed in the right way: highly-processed data, highly-processed information so that in fact only the relevant data - the data for management requirements - actually reaches the strategic decision makers," Koronios says. "The operational data remains with the operational people.

"Very little of that is happening in organizations. At the moment we are doing some work with large organizations in Australia, and they have exactly that particular problem. In other words, yes, lots of information comes from the shop floor, lots of information is pumped into data warehouses, lots of information is pumped into ERP systems, yet there's a disconnect between that and the people actually making the strategic decisions about the organization."

The answer, according to Koronios, is better integration and finding enhanced ways to summarize, synthesize, analyze and visualize data, although he admits that while much work has been done in systems integration there is still much more to be done. To this end the University of South Australia is a member of a Cooperative Research Centre (CRC) joint effort in conjunction with major organizations in Queensland and NSW designed to address the issues. The main complaint of the organizations, he says, is that they lack the correct data - the precise data - needed to make informed decisions.

"It appears to me that many of the CEOs and below at the strategic level are actually making decisions on imperfect data every day," Koronios says.


Where the CIO Comes In

In Koronios' mind, the CIO has a critical responsibility to ensure the organization has the infrastructure required to enable people to ask the right questions and to come up with meaningful answers.

For instance, when executives look upon the data warehouse as a panacea, as many are wont to do, it is the CIO's role to take the perspective of the strategists and instead focus on what the organization needs by way of data, what it wants to achieve with that data, and on identifying the data required to support those aims. In other words, it is up to the CIO to ensure design of the data warehouse is business-driven, rather than technology-driven.

It can also be the CIO's role to quantify a confidence level for the data. He or she must be able either to give the strategist confidence that the data is accurate, or else to keep them fully informed about the data's flaws, Koronios says. "In other words, for the strategist, the CIO needs to know how good the data is; needs to have some measure of confidence about the data on which they're making a decision. So that if they still wish to make that decision on a 50:50 chance, at least it's an informed kind of decision that they're making about the data that they have in front of them."

In this role as quality control/quality assurance agent, it is up to the CIO to be alert to "stuff ups" as data makes its way through the organization.

"The CIO needs to be alert to the kind of things that can go wrong, and have processes in place to have quality assured that the data that's reaching the analyst is a faithful representation of the data in the warehouse," Preston says. "When you don't know whether it's the data that's the problem or what you've done to it that's the problem, that can be a little bit tricky and I think one needs a range of strategies to deal with that."

One is to reference back to the warehouse, to attempt some kind of validation process. Another is to conduct face value analysis: does the data look right, based on knowledge of the business?

Finally comes analysis, to check for likely causes of error and to work out where data collection may have failed.
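The first two strategies lend themselves to simple automation. The sketch below reconciles an analyst's extract against the warehouse it came from and applies a face-value range check; the tolerance, ranges and figures are illustrative assumptions rather than any particular organization's rules.

```python
# Sketch of two checks described above: (1) reconcile an analyst's extract
# against the warehouse it came from, (2) a face-value test that a figure is
# within a plausible range. Thresholds and figures are illustrative only.

def reconciles_with_warehouse(extract_total: float,
                              warehouse_total: float,
                              tolerance: float = 0.01) -> bool:
    """True if the extract's total is within `tolerance` of the warehouse's."""
    if warehouse_total == 0:
        return extract_total == 0
    return abs(extract_total - warehouse_total) / abs(warehouse_total) <= tolerance

def looks_plausible(value: float, expected_low: float, expected_high: float) -> bool:
    """Face-value check: does the figure sit in the range the business expects?"""
    return expected_low <= value <= expected_high

print(reconciles_with_warehouse(10_050, 10_000))          # True: within 1%
print(looks_plausible(value=45.0, expected_low=100.0,
                      expected_high=400.0))                # False: worth investigating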

The Time to Rework

Steve Neilson, an independent consultant who has been involved in building data warehouses in the public sector for the past 10 years, says if there is bad data in the system, at some stage it will be up to the CIO to rework it.

"You either rework the decision you've made because the decision was bad, and that's very expensive, or you rework the data collection process, which is just plain expensive and double handling of everything. And that introduces all sorts of extra costs into your overall process, which is dead silly. And there are engineering analogies, because what you're saying is: 'We know this machine is producing crappy output, or the operator is producing crappy output, we're prepared to put up with it simply because it's data.'"

Bad data costs real money and the starting point to addressing that fact is statistical analysis, or data quality metrics, Neilson says.

"First you've got to measure it. You've got to know whether it's 5 percent bad, or 10 percent bad, or 100 percent bad. Is 5 percent less bad than 10 percent? Well, it all depends. In some data fields in databases it doesn't matter if you have 50 percent errors because no one cares about them. In other fields - say fields that you're using to feed a Balanced Scorecard or something - you need to know what the error rate is when you get the Balanced Scorecard. And when a manager makes a decision based on that data he needs to know there is an error level in there. So if it was 5 percent error, I think most managers would be comfortable with making decisions on that. If it was a 50 percent error, I think they would be most uncomfortable with it and might have to take some action to do some statistical analysis, work out where the process is going wrong, and fix it."

Some data is more important than other data, with each organization having its own focus. For some organizations, date of birth may need to be absolutely accurate, for others it might be postal address, and others might need special operations data that they can rely on 100 percent. Who is the best person to make these decisions? The businessperson, not the technical people, Neilson says. The data quality analyst cannot say how many errors are acceptable; only the businessperson can.

A good place for the CIO to start is by using metrics to divide data into three categories: missing, invalid and valid. Under this measure, anything identified as neither missing nor invalid is considered by default to be valid. This allows the business to identify quickly those data items that it can afford to "allow" to have missing values. By encouraging the business to insist that the meaning of data be identified, the CIO can thus introduce the concept of metadata into the equation, Neilson says.
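The three-way split is simple enough to sketch directly. In the example below, anything neither missing nor invalid is counted as valid by default; the validity rule (a four-digit postcode) is only an example of the kind of definition the business would supply.

```python
# Sketch of the missing / invalid / valid split described above. Whatever is
# neither missing nor invalid is counted as valid by default. The validity
# rule here (a four-digit postcode) is just an example.

def classify(value) -> str:
    if value is None or str(value).strip() == "":
        return "missing"
    if not (str(value).isdigit() and len(str(value)) == 4):
        return "invalid"
    return "valid"

values = ["2000", "", None, "ABCD", "2600", "26000"]
counts = {"missing": 0, "invalid": 0, "valid": 0}
for v in values:
    counts[classify(v)] += 1

print(counts)   # {'missing': 2, 'invalid': 2, 'valid': 2}
```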


Once the metadata starts flowing from the analysis, it can be gathered and stored in a metadata repository purpose-built to capture and reuse such information. Neilson says valid values analysis will also harvest more metadata. The CIO can then analyze the valid data by dividing it into two categories: right values and wrong values. This encourages the organization to formulate further business rules delineating which values are right and wrong, which in turn creates even more highly useful metadata.
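A rough sketch of that second split follows: syntactically valid values are judged right or wrong against a business rule, and the rule's definition is captured as metadata as a side effect. The rule, the repository shape and the field names are assumptions for illustration.

```python
# Sketch of the right/wrong split: valid values are further judged against
# business rules, and the rules themselves are harvested as metadata.
from datetime import date

metadata_repository = []   # stands in for a purpose-built metadata store

def register_rule(field: str, description: str, test):
    """Record the rule's definition as metadata and return the test itself."""
    metadata_repository.append({"field": field, "rule": description})
    return test

# Business rule agreed with the data owner (illustrative): a discharge date
# that precedes the admission date is a wrong value, even though both dates
# are individually valid.
is_right_episode = register_rule(
    "discharge_date",
    "discharge_date must be on or after admission_date",
    lambda admission, discharge: discharge >= admission,
)

episodes = [(date(2004, 3, 1), date(2004, 3, 5)),
            (date(2004, 3, 9), date(2004, 3, 2))]
for admission, discharge in episodes:
    print(admission, discharge,
          "right" if is_right_episode(admission, discharge) else "wrong")

print(metadata_repository)   # the harvested rule, ready for the repository
```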

"But it's all very well focusing on the data; the data is innocent," Neilson points out. "It's the process that causes the problem. Data is just a number that someone has keyed in. It's the processes that surround that that count. Some of the data quality exercises that we have done have revealed errors where people have gone in and fixed the data. You know: their IT people say: 'There's a problem here, we'll write a program to fix the data', and of course they really fix it - well and truly fix it. So that's one source of errors. Now the obvious other source of errors is just plain keying errors - you know, key in someone's birth date: there's almost no check you can do on that except reasonableness checks - it doesn't know whether it's accurate or not."

Neilson offers an example of the importance of process. Take a situation where the organization collects data at the beginning of the process, where a 10 percent error rate is deemed perfectly acceptable, yet by the time that data progresses through the organization and ends up on a Balanced Scorecard, the manager considers a 10 percent error rate atrocious. What is good enough for the start of the process might not be good enough for the end of the process, Neilson points out.

Yields of processes vary from day to day, introducing potential inventory inaccuracies. Measurements of product attributes, such as thicknesses or weights, vary within a normal distribution, and techniques like statistical process control (SPC) are needed to determine whether the process is shifting out of control. "Nervous" schedules - a classic problem in supply chain planning, where small variations in timing or yield create large changes in the proposed schedule - cause uncertainty. In all of these situations, Swanton says, treating the numbers as fixed constants that can be used in calculations is dangerous unless the nature of the variability is understood and accounted for.
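The SPC idea Swanton mentions can be sketched in a few lines: measurements are compared against control limits set three standard deviations either side of the historical mean, the conventional signal that a process may be drifting out of control. The data below is illustrative.

```python
# Minimal sketch of statistical process control: flag measurements that fall
# outside three standard deviations of the historical mean. Data is invented.
from statistics import mean, stdev

history = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9]   # in-control thicknesses
centre = mean(history)
sigma = stdev(history)
upper, lower = centre + 3 * sigma, centre - 3 * sigma

for measurement in [10.05, 9.97, 10.8]:
    state = "in control" if lower <= measurement <= upper else "OUT OF CONTROL"
    print(f"{measurement}: {state}")
```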

"Resilient, self-correcting algorithms are necessary to handle noisy data smoothly," Swanton writes. "All data isn't created equally, and can't be simply dumped in a data warehouse and used without understanding its characteristics."

According to Swanton some approaches that will help include the following:

• measure the quality of your data sources to understand the accuracy, coverage and normal variation to be expected

• for factual transaction data, use performance management techniques to improve processes and personnel to raise quality to a very high level; this is the minimum to meet customer satisfaction goals and statutory obligations

• find and characterize other useful data sources; understand that these are different from transactional data and must always be used with an understanding of how variations from the different sources interact to amplify or average out

• investigate the use of simple, self-correcting algorithms for most planning activities; calculating an optimum based on noisy, incomplete data is foolhardy.


SIDEBAR: Shape It Up Before You Ship It Out

George Kahkedjian, chief information officer of Eastern Connecticut State University, has served as a vice president of a large college, provided leadership for Institutional Research (Columbus State Community College, 1997-2002) and developed processes and strategies for data collection, storage, maintenance and retention.

Kahkedjian says the best way to approach the data quality discussion is from the broader context of an information hierarchy, where the bottom of the pyramid is the data that comes from multiple sources (medical, financial, education, personal and so on), the middle is information (so if the data is not correct, or not correctly related, there is a possibility of working with the wrong information) and the top is knowledge.

Most of the current discussion about information and knowledge relies on the data quality and integrity that support those higher levels, because data quality has context: for us to make the correct decisions, the data has to be correct, timely, available and maintained, otherwise it is not useful and undermines our information and knowledge.

If the data is not accurate, Kahkedjian says it creates major problems in information systems. How can the CIO ensure that the data is accurate?

First, input the right data into the right system

Example: if the wrong blood type is recorded for me, it is a problem.

Second, maintain the data by updating it at appropriate intervals

Example: if I develop an allergy condition and my medical records are not updated, it is a problem.

Third, remove unnecessary data from the system (retention)

Example: perhaps credit card transaction data older than one year should be removed (a short sketch of such checks follows this list).

Fourth, use the correct data for the correct task

Example: use my medical records for medical purposes, not employment.

Fifth, make the data available when appropriate on a timely basis

Example: if there is an emergency but the right individuals cannot access the data, it is a problem.
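Two of these practices - keeping data current and removing it when its retention period expires - are concrete enough to sketch. The 12-month windows and field names below are assumptions, not Kahkedjian's policies.

```python
# Sketch of two of the practices above: flag records that have not been
# updated within an agreed interval, and identify records past their
# retention period. The 12-month figures and field names are assumptions.
from datetime import date, timedelta

TODAY = date(2004, 6, 30)
UPDATE_INTERVAL = timedelta(days=365)
RETENTION_PERIOD = timedelta(days=365)

records = [
    {"id": 1, "last_updated": date(2004, 5, 1), "transaction_date": date(2004, 2, 1)},
    {"id": 2, "last_updated": date(2002, 1, 1), "transaction_date": date(2002, 1, 1)},
]

stale = [r["id"] for r in records if TODAY - r["last_updated"] > UPDATE_INTERVAL]
expired = [r["id"] for r in records if TODAY - r["transaction_date"] > RETENTION_PERIOD]

print("needs review (stale):", stale)      # [2]
print("candidates for removal:", expired)  # [2]
```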

SIDEBAR: The Right Stuff

by Mary Brandel

Data stewards work with the IT and business groups to improve data quality and standardization

A customer is a customer is a customer, right? Actually, it's not that simple. Just ask Emerson Process Management, an Emerson Electric unit in Austin that supplies process automation products. Four years ago, the company attempted to build a data warehouse to store customer information from over 85 countries. The effort failed in large part because the structure of the warehouse couldn't accommodate the many variations on customers' names.

For instance, different users in different parts of the world might identify Exxon as Exxon, Mobil, Esso or ExxonMobil, to name a few variations. The warehouse would see them as separate customers, and that would lead to inaccurate results when business users performed queries.
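The article does not describe Emerson's eventual mechanics, but the underlying problem - many spellings resolving to one customer - is commonly handled with an alias table that data stewards maintain. The mapping below is a simplified, hypothetical stand-in.

```python
# Illustration of the problem: without a shared alias table, each spelling of
# a customer looks like a different company to the warehouse. The mapping
# here is a simplified stand-in for what data stewards maintain.

ALIASES = {
    "exxon": "ExxonMobil",
    "mobil": "ExxonMobil",
    "esso": "ExxonMobil",
    "exxonmobil": "ExxonMobil",
}

def canonical_customer(name: str) -> str:
    """Map a raw customer name to its standard form (or flag it for review)."""
    key = name.strip().lower().replace(".", "").replace(",", "")
    return ALIASES.get(key, f"UNRESOLVED: {name}")

orders = ["Exxon", "ESSO", "ExxonMobil", "Exxon Mobil Corp."]
for raw in orders:
    print(raw, "->", canonical_customer(raw))
# The last entry comes back unresolved -- exactly the kind of case a data
# steward researches and then adds to the alias table.
```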

That's when the company hired Nancy Rybeck as data administrator. Rybeck is now leading a renewed data warehouse project that ensures not only the standardization of customer names but also the quality and accuracy of customer data, including postal addresses, shipping addresses and province codes.

To accomplish this, Emerson has done something unusual: It has started to build a department with six to 10 full-time "data stewards" dedicated to establishing and maintaining the quality of data entered into the operational systems that feed the data warehouse.

The practice of having formal data stewards is uncommon. Most companies recognize the importance of data quality, but many treat it as a "find-and-fix" effort, to be conducted at the end of a project by someone in IT. Others casually assign the job to the business users who deal with the data head-on. Still others may throw resources at improving data only when a major problem occurs.


Creating a data quality team requires gathering people with an unusual mix of business, technology and diplomatic skills. It's even difficult to agree on a job title. In Rybeck's department, they're called "data analysts", but titles at other companies include "data quality control supervisor", "data coordinator" or "data quality manager".

At Emerson, data analysts in each business unit review data and correct errors before it's put into the operational systems. They also research customer relationships, locations and corporate hierarchies; train overseas workers to fix data in their native languages; and serve as the main contact with the data administrator and database architect for new requirements and bug fixes.

The analysts have their work cut out for them. Bringing together customer records from the 75 business units yielded a 75 percent duplication rate, misspellings and fields with incorrect or missing data.

"Most of the divisions would have sworn they had great processes and standards and place," Rybeck says. "But when you show them they entered the customer name 17 different ways, or someone had entered, 'Loading dock open 8:00-4:00' into the address field, they realize it's not as clean as they thought."

Although the data steward may report to IT, it's not a job for someone steeped in technical knowledge. Yet it's not right for a businessperson who's a technophobe, either.

What you need is someone who's familiar with both disciplines. Data stewards should have business knowledge because they need to make frequent judgment calls. Indeed, judgment is a big part of the data steward's job - including the ability to determine where you don't need 100 percent perfection.

Data stewards also need to be politically astute, diplomatic and good at conflict resolution - in part because the environment isn't always friendly.

There are many political traps, as well. Take the issue of defining "customer address". If data comes from a variety of sources, you're likely to get different types of coding schemes, some of which overlap. "Everyone thinks theirs is the best approach, and you need someone to facilitate," says Robert Seiner, president and principal of KIK Consulting & Educational Services in Pittsburgh.

People may also argue about how data should be produced, he says. Should field representatives enter it from their laptops? Or should it first be independently checked for quality? Should it be uploaded hourly or weekly? If you have to deal with issues like that and "you're argumentative and confrontational, that would indicate you're not an appropriate steward", Seiner says.

Most of all, data stewards need to understand that data quality is a journey, not a destination. "It's not a one-shot deal - it's ongoing," Rybeck says. "You can't quit after the first task."

SIDEBAR: When Good Data Goes Bad

by Barry Solomon

IT professionals are increasingly instituting processes and procedures to ease the task of maintaining high-quality data. Some of the most effective ideas include the following:

Build in change management rules. A centralized enterprise system allows all users to contribute to the database. However, not all enterprise data should be blindly updated, and not all users are careful about their changes. As a result, IT professionals should establish submission and review processes that let them filter which user changes are saved to the centralized repository.

Increasingly, "data stewards" - users tasked with maintaining data quality - are establishing change management rules by which they can discern good changes from bad ones. User changes to data are routed to the data steward in the form of a "ticket"; the data steward can then evaluate the ticket to determine whether to accept the modification. For instance, if a low-level CRM user changes the corporate name of the company's top customer, this should serve as a red flag to the data steward to double-check this edit prior to accepting the change.

Without a submittal and review process, the only way to prevent end users from directly updating the central database is to lock certain fields. Two significant problems can result. First, end users become frustrated when they aren't able to make changes that they need and that they know should be made to contact information. Second, those managing the central database miss out on getting update information from those end users in the best position to know that information has changed.

Establish workflow processes. If change management rules are instituted, workflow processes should also be thought through so as not to bog down the data steward with unimportant change requests. For instance, some changes are less prone to error and therefore less suspect - such as changes to e-mail address information. A data steward might establish a rule enabling such changes to be saved immediately to the centralized repository. Other changes, such as those to company names and titles, are critical and might be rejected pending verification of the change. The workflow process should empower the data steward to take appropriate action and move on to the next change ticket.
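The change management rules and workflow routing described above can be sketched together: every user edit becomes a ticket; low-risk fields are applied immediately, sensitive fields are held for the steward, and a name change to a top customer is red-flagged. The field lists, customer tiers and rules below are assumptions for illustration.

```python
# Sketch of change management plus workflow routing: each edit becomes a
# ticket, low-risk fields save immediately, sensitive fields wait for the
# data steward, and a name change to a top customer is red-flagged.
# Field lists, customer tiers and the rules themselves are assumptions.
from dataclasses import dataclass

AUTO_APPROVE_FIELDS = {"email"}
SENSITIVE_FIELDS = {"company_name", "title"}
TOP_CUSTOMERS = {"ExxonMobil"}

@dataclass
class ChangeTicket:
    customer: str
    field: str
    new_value: str
    submitted_by: str

def route(ticket: ChangeTicket) -> str:
    if ticket.field in AUTO_APPROVE_FIELDS:
        return "apply immediately"
    if ticket.field in SENSITIVE_FIELDS and ticket.customer in TOP_CUSTOMERS:
        return "red flag: steward must verify before accepting"
    if ticket.field in SENSITIVE_FIELDS:
        return "hold for steward review"
    return "apply, review later in bulk"

print(route(ChangeTicket("Acme", "email", "a@acme.example", "rep01")))
print(route(ChangeTicket("ExxonMobil", "company_name", "Exxon", "rep02")))
```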

Categorize and prioritize data. The frequency and quantity of changes to data that occur within an enterprise system are staggering. Even with rules and processes in place to manage modifications, a data steward could still be easily overwhelmed with change tickets.

To prevent this, companies must have a system to categorize and prioritize which changes warrant the oversight of the data steward and which don't. Not all changes should be submitted to a data steward for processing. Depending on the importance of the data and the type of change, it's perfectly acceptable to allow certain changes to be completed and to perform the data quality review later and en masse.

For instance, in a CRM system, changes made to contacts at a company's top customers have much greater effect on the organization than changes to contacts of noncustomer vendors. Also, changes made by certain trusted users may not need to be reviewed by a data steward.
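That prioritization can be folded into the same workflow as a rough scoring of which tickets deserve the steward's attention first. The weights, customer tiers and trusted-user list below are assumptions, not a prescribed scheme.

```python
# Rough sketch of categorizing and prioritizing change tickets: score each
# ticket by the importance of the account and the trust placed in the user,
# then review the highest scores first. Weights and lists are assumptions.

TOP_CUSTOMERS = {"ExxonMobil"}
TRUSTED_USERS = {"steward01", "keyaccounts_mgr"}

def review_priority(customer: str, submitted_by: str) -> int:
    score = 0
    if customer in TOP_CUSTOMERS:
        score += 2            # changes to key accounts matter most
    if submitted_by not in TRUSTED_USERS:
        score += 1            # edits from unvetted users warrant a closer look
    return score              # 0 = review later en masse, 3 = review first

queue = [("ExxonMobil", "rep02"), ("Acme", "rep07"), ("ExxonMobil", "keyaccounts_mgr")]
for customer, user in sorted(queue, key=lambda t: -review_priority(*t)):
    print(customer, user, "priority", review_priority(customer, user))
```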