“Data Cleansing: A Dichotomy of Data Warehousing?”
“Dirty data” is such a pervasive problem in every company, in every industry! But we have lived with “dirty data” for decades, so why is it such a problem now? Is it because we promise to deliver data warehouses with “clean, integrated, historical data in a short time frame for low cost”, yet we are unable to deal with the preponderance of “dirty data” within the framework of this promise? Some data warehouses are failing because the promised “clean”, “integrated”, “historical” data could not be delivered. Others are failing because the promised “short” time frame and “low” cost were exceeded in the attempt to clean up the data. In other words, they are failing because of the dichotomy of our promise.
How could we, the conscientious IT professionals, have allowed “dirty data” to happen in the first place? How could the users, the owners of the data, have allowed it to happen? Or, are there legitimate, sometimes even justifiable, reasons for the existence of “dirty data” in our legacy systems? And what should we do with it now?
The first blame for this “disgraceful” condition is usually put on the inadequate and sometimes non-existent edit checks of our operational systems. And the second blame immediately follows, pointing to sloppy data entry habits. Admittedly, a large percentage of “dirty data” is generated inadvertently and not caught by the edit checks. But an equally large percentage ends up in our legacy files quite purposefully entered.
How many times have we seen dummy values entered, such as a social security number of 999-99-9999, or customer age as 999, or zip code as 99999? In these cases the existence of edit checks seems to be the culprit, because the data entry staff feels compelled to make up values where none are known. These values are easy to detect, but what about some of the more creative solutions to get around mandatory fields? What if the data entry person always used his or her own social security number or age or zip code?
In some instances we find dummy values that actually have meaning, for example a social security number of 888-88-8888 to indicate a non-resident alien status of a customer, or a monthly income of $99,999.99 to indicate that the customer is an employee. If you had to calculate the average monthly income of all your customers, the results would be flawed.
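Checks for such dummy values can be automated during data profiling. A minimal sketch in Python (the field names and sentinel values are hypothetical, chosen to match the examples above; a real profiling pass would derive its candidate sentinels from value-frequency analysis rather than a hand-built list):

```python
# Hypothetical sentinel values matching the examples above; a real
# profiling pass would derive candidates from value-frequency analysis.
SENTINELS = {
    "ssn": {"999-99-9999", "888-88-8888"},
    "age": {"999"},
    "zip": {"99999"},
    "monthly_income": {"99999.99"},
}

def flag_sentinels(record):
    """Return the field names in `record` whose values match a known sentinel."""
    return [field for field, bad in SENTINELS.items()
            if str(record.get(field, "")) in bad]

# A record with a dummy SSN and a dummy zip code is flagged on both fields:
flagged = flag_sentinels({"ssn": "999-99-9999", "age": "34", "zip": "99999"})
```

Note that this catches only the known, fixed dummy values; the “creative” cases, such as a data entry clerk reusing his or her own social security number, can only be found by frequency analysis (one value appearing implausibly often).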
Absence of data
One common reason for “dirty data” is the absence of data. This is not always attributable to lazy data entry habits, but to the fact that different business units may have different needs for existence of certain data values in order to run their operations. The department which originates mortgage loans may have a federal reporting requirement to capture the sex and ethnicity of a customer, whereas the department which originates consumer loans does not. We must remember that the primary purpose for operational systems is to enable and to ease the day-to-day operation of the business, not to provide information for enhancing or enlarging the overall strategic business direction.
To make things even more interesting for the data warehouse developers, the two loan departments in the above example probably share the same operational systems for originating or servicing their loans. This is not uncommon, especially when the operational systems are purchased packages and 80% of the data needs are common to both types of loans. It is in the use of the 20% of non-common fields where the creativity of an IT staff can often be measured.
Here we find the data fields that are used for multiple purposes, where the same data value in a field ends up meaning many different things depending on (1) which department entered it and (2) on specific values in one, or better yet, in several other data fields. Many of us have seen data groups or even entire records redefined and redefined and redefined again, as often as 25 times, with no documentation left to trace the history of the redefinitions.
Upon closer examination, the size and content of such records can be anything from a string of dates, redefined as a string of amounts, redefined as a mixture of text and numbers, redefined as … How does one end up initializing such “kitchen sink” records? With Low Values of course, or how about with Spaces. “No matter”, you may say, as long as the redefined fields are always populated properly according to the rules of the record type.
“Yes it does matter”, I say, because they never are!
But let’s not stop here. The reality is that, once we are up to 25 redefines, many of the earliest redefined data values are no longer applicable, and in many cases no one is left who even remembers what they once meant. So you launch a “clean-up campaign”, advising all users of the system that you will delete these bad values at some specified point in time. Much to your surprise, you get 2 panic phone calls and 1 puzzled e-mail response from people you did not even know were using this system. You have just discovered two things: somebody had been using this field for yet another purpose to perform their business function without informing anyone, and removing the “bad values” will all but shut them down; and somebody else had been misinformed about the 25 different meanings of the field and had been misreporting critical financial information based on these “bad values”.
We all know, of course, that there is more. As we were analyzing the data values in our operational file, we also discovered many “kitchen sink” fields in our “kitchen sink” record. Those infamous fields that are used for many purposes are also cryptic beyond recognition. This happened over the span of many years: as the meaning of the fields expanded, so did their values.
For example, let’s say that originally our hypothetical loan system tracked whether or not taxes were impounded on a mortgage loan. The code “I” may have been used to mean No, there is No [tax] Impound account. Let’s say that years later we started to impound for insurance. Since the code “I” was already being used for No Tax Impounds, “T” seemed like a “logical alternative” for No Insurance Impounds – all we’d have to remember is that they are both negative and switched. After a few more years we wanted to track certain FHA loans, which required no insurance. The code “T” was already being used to mean No, there is no insurance impound, but it had a different meaning from the new requirement, so the new code “F” was added. And, since we seem to be tracking “exceptions to payment processing” in this field, let’s add “E” as a catch all indicator for any and all exceptions. Oops, there are loans for which we need to know that neither Tax nor Insurance is impounded. Although we have not run out of the alphabet yet, let’s keep the field blank and write logic to interrogate the Impound Balance field, and if there is a value greater than ZERO, let’s check to see if an Impound trailer record exists, otherwise the Impound Balance field has a completely different meaning. Another few years go by, and … we all get the picture. WHY you may ask? Most certainly there was a logical reason at the time, but only those who worked on it 20-30 years ago know what it was.
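A decoding routine for a field like this ends up encoding that entire institutional history. A sketch, assuming the (hypothetical) code values narrated above; `has_trailer` stands in for the Impound trailer record lookup:

```python
def decode_impound_flag(flag, impound_balance, has_trailer):
    """Map the overloaded one-character impound code to explicit facts.
    Code meanings follow the (hypothetical) history narrated above;
    anything the code does not state is left out of the result."""
    if flag == "I":                      # No tax impound account
        return {"tax_impound": False}
    if flag == "T":                      # No insurance impound
        return {"insurance_impound": False}
    if flag == "F":                      # FHA loan: no insurance required
        return {"insurance_impound": False, "fha_loan": True}
    if flag == "E":                      # catch-all "exception" indicator
        return {"payment_exception": True}
    if flag.strip() == "":
        # Blank: neither tax nor insurance impounded -- unless the
        # Impound Balance and a trailer record say otherwise.
        if impound_balance > 0 and has_trailer:
            return {"tax_impound": True, "insurance_impound": True}
        return {"tax_impound": False, "insurance_impound": False}
    return {}  # unknown code: route to an exception report
```

The point is not the elegance of the routine; it is that every one of these branches has to be rediscovered by archaeology before the extract process can be written.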
As the weeks fly by and we are really getting into this “dirty data” analysis, we discover contradicting information between two or more fields. For example, a property address on our file shows a Texas zip code and New York as the city and state address. By this time, we don’t trust anything we see and decide to check out the street address. Lo and behold, there is no such street in New York! A more serious example might be that the property is encoded as a Single Family Residence and was reported to the authorities as such, but we show that rental income from 10 separate units was used during the appraisal.
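Contradictions like these can be caught with cross-field validation against reference data. A sketch, assuming a small, entirely hypothetical lookup table of zip-code prefixes by state; a real check would use a complete postal reference file:

```python
# Hypothetical slice of a zip-prefix-to-state reference table.
ZIP_PREFIX_STATE = {"75": "TX", "76": "TX", "10": "NY", "11": "NY"}

def zip_matches_state(zip_code, state):
    """Return True when the zip prefix and state agree, False when they
    contradict, and None when the prefix is not in the reference table."""
    expected = ZIP_PREFIX_STATE.get(zip_code[:2])
    if expected is None:
        return None
    return expected == state

# A Texas zip code paired with a New York state code is flagged:
result = zip_matches_state("75201", "NY")
```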
Inappropriate use of address lines
Before we leave the subject of addresses: we have all seen addresses constructed as address line 1, address line 2, address line 3, etc. Address line 1 was originally intended to hold a personal first, middle, and last name, or a company name, possibly preceded by an asterisk to indicate a business address (bad idea!); address line 2 was meant for the street number, direction, name, etc. This is challenging enough to parse into separate data attributes for the data warehouse without running into the following gem:
Address line 1: ROSENTAL, LEVITZ, AT
Address line 2: TORNEYS
Address line 3: 10 MARKET, SAN FRANC
Address line 4: ISCO, CA 95111
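Before such an address can be parsed at all, the arbitrarily wrapped lines have to be re-joined. A sketch (the 20-character wrap width is inferred from the example above):

```python
def rejoin_address_lines(lines, width=20):
    """Undo hard wrapping: a line filled to the full wrap width is
    assumed to continue mid-word onto the next physical line."""
    logical, buffer = [], ""
    for line in lines:
        buffer += line
        if len(line) < width:      # a short line ends the logical line
            logical.append(buffer)
            buffer = ""
    if buffer:
        logical.append(buffer)
    return logical

lines = ["ROSENTAL, LEVITZ, AT", "TORNEYS",
         "10 MARKET, SAN FRANC", "ISCO, CA 95111"]
rejoined = rejoin_address_lines(lines)
# -> ["ROSENTAL, LEVITZ, ATTORNEYS", "10 MARKET, SAN FRANCISCO, CA 95111"]
```

Note the obvious limitation of the heuristic: a logical line that happens to be exactly the wrap width long will be wrongly glued to its successor, so the results still need review.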
Violation of business rules
And how about all the data that violates the business rules? For example, an adjustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the maximum interest rate, or where multiple interest rates are captured for fixed rate loans which only have one interest rate for the life of the loan.
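Rules like these are straightforward to express as automated checks during the extract process. A sketch with hypothetical field names, implementing the two rules from the example above:

```python
def validate_loan(loan):
    """Return a list of business-rule violations for one loan record.
    Field names are hypothetical; the rules follow the examples above."""
    violations = []
    if loan["min_rate"] > loan["max_rate"]:
        violations.append("minimum interest rate exceeds maximum")
    if loan["rate_type"] == "FIXED" and len(loan["rates"]) > 1:
        violations.append("multiple interest rates on a fixed rate loan")
    return violations

# This record violates both rules:
bad = validate_loan({"min_rate": 9.5, "max_rate": 6.0,
                     "rate_type": "FIXED", "rates": [6.0, 6.5]})
```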
Reused primary keys
One of the most critical “dirty data” issues the data warehouse developers encounter is with reused primary keys. Since operational systems rarely store history beyond 90 or 180 days, primary key values are often reused. For example, a bank branch (we’ll call it branch number 84) which originated and is servicing over 1,000 mortgage loans totaling $800,000,000 closes, and transfers the servicing of these loans to branch 207. Within one year, branch 84’s number is reassigned to a new branch, which is located in a less affluent area and is starting to experience a high rate of foreclosures on its originated loans. Typically, the affected 1,000 loans still show branch number 84 as the originating branch, and all of these loans are falsely credited to the performance of the new branch 84. At first glance, and without further detailed analysis, it is not evident that the new branch 84 is not profitable.
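The usual data warehouse remedy is to replace the reused operational key with a surrogate key resolved by effective date. A minimal sketch of the branch 84 story, with entirely hypothetical dates and surrogate key values:

```python
from datetime import date

# Crosswalk from (branch number, effective date range) to a surrogate key.
# The dates and surrogate values are hypothetical.
BRANCH_KEYS = [
    ("84", date(1980, 1, 1), date(1996, 6, 30), 1001),   # original branch 84
    ("84", date(1997, 7, 1), date(9999, 12, 31), 1002),  # reassigned branch 84
]

def surrogate_branch_key(branch_no, as_of):
    """Resolve a reused branch number to a unique surrogate key,
    using the date the record was effective."""
    for number, start, end, key in BRANCH_KEYS:
        if number == branch_no and start <= as_of <= end:
            return key
    return None  # no mapping: route the record to an exception report

# A loan originated in 1990 belongs to the old branch 84, not the new one:
key = surrogate_branch_key("84", date(1990, 5, 1))
```

With the loans keyed this way, the old branch’s portfolio can no longer be falsely credited to the new branch 84.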
An even more interesting twist exists when a physical branch has 2 or more identifiers. For example, a branch located at 10 Main Street may be identified as branch number 65 on the loan system and the investor system, but as branch number 389 on the checking system and the savings system. Another example is the case where one customer is identified by many different customer numbers. Many of us have experienced this situation with our bank or our telephone company where we get multiple mailings.
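Integrating such data requires a crosswalk that maps each system’s local identifier to a single warehouse-wide identifier. A sketch of the branch example, with hypothetical system names and warehouse identifiers:

```python
# Hypothetical crosswalk: each operational system's local branch number
# maps to one warehouse-wide identifier for the physical branch.
SYSTEM_CROSSWALK = {
    ("loan", "65"): "BR-0010",
    ("investor", "65"): "BR-0010",
    ("checking", "389"): "BR-0010",
    ("savings", "389"): "BR-0010",
}

def warehouse_branch_id(system, local_id):
    """Map a (system, local id) pair to the single warehouse identifier
    for the physical branch; None routes the record to manual review."""
    return SYSTEM_CROSSWALK.get((system, local_id))
```

The same pattern applies to the customer example: one customer with many customer numbers needs one crosswalk row per (system, number) pair, which is exactly what eliminates the multiple mailings.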
Data integration problems
Last but not least, we run into integration problems due to “dirty data”. These problems come in two flavors: Data that should be related, but cannot be, and data that is inadvertently related but should not be. The latter happens most frequently when fields or records are being reused for multiple purposes. For example, banks customarily buy mortgage loans from other banks, and banks sell their own mortgage loans to various investors. If one operational loan sales system is used for both purchases and sales, and if the primary key of loan sellers can have the same values as the primary key of investors, and if sales are distinguished from purchases only through various flags or switches or certain value ranges in certain fields, it will require “jumping through some hoops”, i.e. extensive programming logic, to avoid associating a loan seller or an investor with the wrong type of transaction.
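That “extensive programming logic” usually boils down to centralizing the flag interpretations in one routine, so that every extract applies the same rules. A sketch with entirely hypothetical flag names and value ranges, standing in for the “various flags or switches” described above:

```python
def classify_transaction(rec):
    """Decide whether a shared loan-sales record is a sale or a purchase,
    and therefore whether its counterparty key identifies an investor or
    a loan seller. Flag name and ID range are hypothetical stand-ins."""
    if rec.get("txn_flag") == "S" or 7000 <= rec.get("counterparty_id", 0) <= 7999:
        return ("sale", "investor")
    return ("purchase", "seller")
```

However crude, putting this decision in one place at least prevents two extracts from associating the same counterparty key with two different party types.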
The far more serious problem is when data that should be related cannot be related. This is sometimes linked to the previous problem of non-unique primary keys, but more often it is due to the absence of any keys. A common example of this can be found in banks. All banks assign unique account numbers, but few banks assign unique customer numbers. For decades the customers have been associated with their accounts through a customer name field on the account record. When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name; sometimes the customer is recorded under an alias or a maiden name; occasionally 2 or 3 customers have a joint account, and all of their names are squeezed into one name field with rather cryptic results. Unless all this information is tracked manually, which it seldom is, especially if the customer has accounts in multiple branches, analyzing customer profitability is all but impossible.
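Without a unique customer number, matching has to fall back on normalizing and comparing the name strings themselves. A sketch of the normalization step only, with a hypothetical abbreviation table; production matching would add address, phone, and probabilistic scoring on top of this:

```python
import re

# Hypothetical abbreviation and title/suffix table; a real one runs to
# hundreds of entries. Empty string means "drop the token".
ABBREVIATIONS = {"WM": "WILLIAM", "CHAS": "CHARLES", "JR": "", "MRS": "", "MR": ""}

def normalize_name(raw):
    """Crude name normalization: uppercase, strip punctuation,
    expand known abbreviations, drop titles and suffixes."""
    tokens = re.sub(r"[^A-Za-z ]", " ", raw.upper()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in tokens if t)

# Two spellings of the same customer collapse to one normalized form:
same = normalize_name("Wm. Smith, Jr.") == normalize_name("WILLIAM SMITH")
```

Even this much lets account records be grouped by normalized name for human review; it does not, by itself, solve aliases, maiden names, or joint accounts squeezed into one field.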
Now that we have exposed these “dirty data” nightmares, let’s hold back on our urge to lynch the guilty, and let’s instead be fair to the users and the IT staff who creatively created this mess. In most cases this type of creativity was, and under the prevailing circumstances maybe should have been, rewarded. Seriously, what are the obvious alternatives? Change the files and the programs, of course! However, if the system is a purchased package, or a home-grown system over 30 years old, maybe even written in Assembler, this “simple” solution may be neither simple nor cost effective at all.
To Cleanse Or Not To Cleanse …
The first question we must ask is: CAN it be cleansed? The answer is often NO. There truly are situations where data does not exist and cannot be recreated regardless of the amount of manual or automated effort. There are other situations where values are so convoluted or found in so many disparate places with seemingly different and opposing meanings to the same fact, that any attempt to decipher such data may produce even more erroneous results, and it may be best to just leave it alone.
The next question is more difficult: SHOULD it be cleansed? Here too, the answer is often NO. For most of us in data management this answer smacks of heresy. What purpose would it serve to suck “dirty data” out of the operational systems and plunk it into a data warehouse as-is? Obviously, none. Clearly some data cleansing must occur. However, we all have to face the business reality of today, and the business expectations to deliver value added information in a relatively short time frame for a low cost. Although “short time frame” and “low cost” are subjective in each company and can represent a wide range of interpretation, it should be obvious that the time and effort it would take to create and test the long and complex logic to fix some of our worst “dirty data” cases would fall outside that range.
Once we decide which data should be cleansed, the question is: WHERE do we cleanse it? Do we clean up our operational data on the operational systems? Or, do we perform the cleansing transformations in our extract and load processes for the data warehouse? Usually our first reaction is to clean up our operational systems, and in some cases that can and should be done. However, all too often those who are using the operational systems for operational purposes do not need the data any cleaner than it is, and they resist any attempts to change it. And in many cases it may be legitimately too labor intensive, not cost effective, or simply impossible to do, and so we end up putting the burden of cleansing on our extract and load processes.
The final question of course is: HOW do we cleanse what can reasonably be cleansed? Can the data cleansing products on the market today handle a lot of the common data quality problems shared by most organizations? Quite obviously, the answer is YES. But are the cleansing products on the market capable of resolving all of these very complicated and very customized “dirty data” situations? Quite obviously the answer is NO. If you are truly serious about creating value added information and knowledge above and beyond the condition of your operational data, will you have to bite the bullet and write procedural code? The answer is definitely YES.
What are the steps?
So what is the right answer for your company? How do you decide which “dirty data” to cleanse and why? Who makes these decisions? Let’s start with more fundamental questions and see if the answers become apparent: Why are we building a data warehouse in the first place? What specific business questions are we trying to answer? Why are we not able to answer them today?
The answers to these fundamental questions must come from the users, not from IT. Certainly, the expertise and involvement of the IT staff will help the users identify, quantify, document, and analyze their business needs, but it is the users who decide why a data warehouse should be built, not IT.
Once the goals and objectives for the data warehouse have been established, and it is clearly understood what business questions cannot be answered today and why, it is IT’s responsibility to analyze the existing operational files and to locate, document, and report the discovered “dirty data” to the users.
Now the painful task begins in determining where to get the “biggest bang for the buck”. Users and IT together, must evaluate the tangible and intangible benefits for each business question that could be answered by a data warehouse that cannot be answered today. Together, they must understand how each “dirty data” case would prevent these business questions from being answered, and they must understand the effort involved to cleanse it.
Now the compromising begins.
If the benefits outweigh the costs of the effort, the data should definitely be cleansed. Then the decision must be made whether or not to make the necessary changes to the operational systems to (a) clean up the existing data and (b) prevent future “dirty data” from being entered. All efforts should be made to improve the operational systems, unless that effort is unreasonably high (for instance, in the case of multi-volume historical tapes), or it simply cannot be done because the original data sources no longer exist. The reality is that, more often than not, the majority of cleansing ends up being done in the extract and load processes.
If the costs outweigh the benefits, another painful decision must be made: Should the “dirty data” go into the data warehouse as-is, or should it be left out? Again, users and IT together, must weigh any possible benefits that could be derived from including this data, dirty as it is, with any possible harm it could do, such as skewing results of an important trend analysis thereby rendering it useless, or worse, providing wrong information which leads to bad business decisions.
This brings us back to the very essence, and also to the dichotomy, of data warehousing: the promise to deliver “clean”, “integrated”, “historical” data in a “short” time frame for “low” cost. After we fully comprehend the “realistic” time frame and “realistic” cost required to achieve the first part of the promise, we realize that we have been misled by the second part of the promise.
About the Author
Larissa Moss, founder and president of Method Focus, Incorporated, has been consulting and lecturing worldwide on the subjects of data management and data warehousing.
She can be reached at email@example.com