Yet Another Piece On Data Quality
By Sid Adelman
All right, you’re tired of reading about and listening to lectures on the importance of data quality. But yet, your organization hasn’t done much about it. This column is another shot that will allow you to take the message to management because without, first their understanding and then their commitment, nothing will happen of any significance. I’ve tossed in a couple of points on the cost of poor quality that should capture their attention. OK, here goes.
What Is Data Quality?
There are a number of indicators of quality data.
- The Data Is Accurate – This means a customer’ name is spelled correctly and the address is correct. If the Marketing Department doesn’t have the correct profile for the customer, Marketing will attempt to sell them the wrong products and present a disorganized image of the organization. When data on a company vehicle is entered into the system, it may be valid (a vehicle number that is in the database), but it may be inaccurate (the wrong vehicle number).
- The Data Is Stored According To Its Data Types
- The Data Has Integrity – The data will not be accidentally destroyed or altered. Updates will not be lost due to conflicts among concurrent users. Much of this is the responsibility of the DBMS, but proper implementation of the DBMS should not be assumed. Robust backup and recovery procedures as implemented by the installation are needed to maintain some levels of integrity. In addition, operational procedures that restrict a batch update from being run twice are also necessary.
- The Data Is Consistent – The form and content of the data should be consistent. This allows for data to be integrated and to be shared by multiple departments across multiple applications and multiple platforms.
- The Databases Are Well Designed – A well designed database will perform satisfactorily for its intended applications, it is extendible, and it exploits the integrity capabilities of its DBMS.
- The Data Is Not Redundant – In actual practice, no organization has ever totally eliminated redundant data. In most data warehouse implementations, the data warehouse data is partially redundant with operational data. For certain performance reasons, and in some distributed environments, an organization may correctly choose to maintain data in more than one place and also maintain the data in more than one form.
The redundant data to be minimized is the data that has been duplicated for none of the reasons stated above but because:
- The creator of the redundant data was unaware of the existence of available data.
- The redundant data was created because the availability or performance characteristics of the primary data were unacceptable to the new system. This may be a legitimate reason or it may also be that the performance problem could have been successfully addressed with a new index or a minor tuning effort and that availability could have been improved by better operating procedures.
- The owner of the primary data would not allow the new developer to view or update the data.
- The lack of control mechanisms for data update indicated the need for a new version of the data.
- The lack of security controls dictated the need for a redundant subset of the primary data.
In these cases, redundant data is only the symptom and not the cause of the problem. Only managerial vision, direction, and a robust data strategy would lead to an environment with less redundant data.
- The Data Follows Business Rules – As an example, a loan balance may never be a negative number. This rule comes from the business side and IT is required to establish the edits to be sure the rule is not violated.
- The Data Corresponds To Established Domains – These domains are specified by the owners or users of the data. The domain would be the set of allowable values or a specified range of values. In a Human Resource System, the domain of sex is limited to “Male” and “Female.” “Biyearly” may be accurate but still not an allowable value.
- The Data Is Timely – Timeliness is subjective and can only be determined by the users of the data. The users will specify that monthly, weekly, daily, or real-time data is required. Real-time data is often a requirement of production systems with on-line-transaction processing (OLTP). If monthly is all that is required and monthly is delivered, the data is timely.
- The Data Is Well Understood – It does no good to have accurate and timely data if the users don’t know what it means. Naming standards are a necessary (but not sufficient) condition for well-understood data.
Data can be documented in the Data Dictionary/Repository, but the creation and validation of the definitions is a time consuming and tedious process. This is, however, time and effort well spent. Without clear definitions and understanding, the organization will exhaust countless hours trying to determine the meaning of their reports or draw incorrect conclusions from the data displayed on the screens.
- The Data Is Integrated – An insurance company needs both agent data and policyholder data. These are typically two files, databases, or tables that may have no IT connection. If the data is integrated, meaningful business information can be readily generated from a combination of both the agent and policyholder data.
Database integration generally requires the use of a common DBMS. There is an expectation (often unfulfilled) that all applications using the DBMS will be able to easily access any data residing on the DBMS. An integrated database would be accessible from a number of applications. Many different programs in multiple systems could access and, in a controlled manner, update the database.
Database integration requires the knowledge of the characteristics of the data, what the data means, and where the data resides. This information would be kept in the Data Dictionary/Repository.
An integrated database would have the following potential benefits:
- Less redundant data
- Fewer possibilities for data inconsistency
- Fewer interface programs (a major resource consumer)
- Fewer problems with timing discrepancies
- More timely data
- The Data Satisfies The Needs Of The Business – The data has value to the enterprise. High quality data is useless if it’s not the data needed to run the business. Marketing needs data on customers and demographic data, Accounts Payable needs data on vendors and product information.
- The User Is Satisfied With The Quality Of The Data And The Information Derived From That Data – While this is a subjective measure, it is, arguably, the most important indicator of all. If the data is of high quality, but the user is still dissatisfied, you or your boss will be out of a job.
- The Data Is Complete –
- There Are No Duplicate Records – A mailing list would carry a subscriber, potential buyer, or charity benefactor only once. You will only receive one letter that gives you the good news that “You may already be a winner!”
- Data Anomalies – From the perspective of IT, this may be the worst type of data contamination. A data anomaly occurs when a data field defined for one purpose is used for another. For example, a currently unused, but defined field is used for some purpose totally unrelated to its original intent. A clever programmer may put a negative value in this field (which is always supposed to be positive) as a switch.
An important set of information to be included in design reviews is the requisite quality of the data under consideration and the actual state of the data. The basic question to be asked is “How clean, timely, etc. must the data be?” In the design review, the team members would consider the data source, the process of update and delete, and the quality controls imposed on those accessing the data.
The Design Review would review and validate that standards are being followed. The review process may make recommendations to clean up the data, establish strict controls on shared updating, and assure sufficient training for users who would query the data.
Assessment Of Existing Data Quality
As people overestimate the intelligence of their grandchildren and the sweet nature of their dogs, organizations overestimate the quality of their own data. A reality check is generally needed. Poor quality data can be detected in a number of ways:
- Programs that abnormally terminate with data exceptions.
- Clients who experience errors or anomalies in their reports and transactions and/or don’t trust their reports or don’t trust the data displayed on their screens.
- Clients who don’t know or are confused about what the data actually means.
- On-line inquiry transactions and reports that are useless because the data is old.
- Data that can not be shared across departments due to lack of data integration.
- Difficulty for clients to get consolidated reports because the data is not integrated.
- Programs that don’t balance.
- In the consolidation of two systems, the merged data causes the system to fail.
Quality may be free but data quality does require an initial investment. It takes people and resources to bring data to the desired pristine state. If data is allowed to remain in its current (dirty) state, there may be a substantial cost and disruption to the organization. Very few organizations understand the costs and exposures of poor quality data.
Impact Of Poor Quality Data
Data is an asset but it can only be an asset if the data is of high quality. Data can also be a liability if it is inaccurate, untimely, improperly defined, etc. An organization may be better off not having certain data than having inaccurate data, especially if those relying on the data do not know of its inaccuracy. A hospital would be better off not knowing a patient’s blood type than believing and trusting it to be “O+.”
Which Data Should Be Improved?
It should be obvious that it’s impossible to improve the quality of all the data in an installation. The prioritization is much like triage. The energy should be spent on data where the quality improvement will bring an important benefit to the business. Other criteria that would suggest data improvement is data that can be fixed and kept clean. Unimportant data can be ignored. Data that will become obsolete can also be bypassed. Examples are:
- The business will be bought
- The data will be converted because of a new application
- A reengineering of the business will cause the certain data to be retired
If the Marketing Department is reviewing the demographics of their customers, the zip code (as part of the address) is important while the rest of the address is less critical.
There will be wide variations in the costs to clean different files and databases. This will enter into any decision about which data to purify. The cost of perfectly accurate data may be prohibitive and may not be cost effective. Based on the source of the data, accuracy may also be impossibility.
Users of data may be willing to settle for less than totally accurate data. Even so, it is important that the users know the level of quality they are getting. A greeting card company asked their retailers to measure the number of linear feet devoted to that company’s card products. Those who analyzed the data knew the data to be inaccurate but preferred inaccurate data to no data at all. A large computer manufacturer asked their marketing representatives and technical engineers to report on how they spent their time. It was well known that the respondents were not keeping very good records themselves and their reports reflected the lack of their concern for accuracy. Those who analyzed the data knew of the inaccuracies but were looking for trends and significant changes to indicate shifts in how jobs were being performed. The inaccurate data, in both of these cases, was acceptable.
To clean up the data, the following steps should be followed:
- Determine the importance of data quality to the organization.
- Assign responsibility for data quality.
- Identify the enterprise’s most important data.
- Evaluate the quality of the enterprise’s most important data.
- Determine users’ and owners’ perception of data quality. – Users will convey their understanding of the data’s quality and will often indicate why the data has problems.
- Prioritize which data to purify first.
- Assemble and train a team to clean the data.
- Select tools to aid in the purification process.
- Review data standards.
- Incorporate standards in the application development process to ensure that new systems deliver high quality data.
- Provide feedback and promote the concept of data quality throughout the organization.
Roles And Responsibilities
The creation and maintenance of quality data is not the sole province of any one department. The responsibility touches Application Developers, Database Administrators, Data Administrators, Quality Assurance, Data Stewards, Internal Auditors, Project Managers, and most importantly, senior management. The importance of quality data must be understood by senior management and expressly communicated throughout the organization. Words are not as important as deeds. When quality measures appear in performance plans, reviews, and bonuses, people finally believe that quality is important. It is equally important that time and resources be allocated to development schedules to support management’s commitment to quality.
Impact Of Data Quality On The Data Warehouse
Bad data should never be allowed into the data warehouse unless the problems are recognized and acknowledged by those who will use the data. Whenever possible, the data should be validated and purified prior to extraction. If bad data enters the data warehouse, it may have the effect of undermining the confidence of those who access the data. Clients and IT must be able to rely on the data, regardless of whether it is detailed, summarized, or derived.
The effort to clean up data once it is in the data warehouse becomes a major and never-ending task. It should not be the responsibility of those administering the data warehouse to clean up bad data. The cleanliness standard puts an additional burden on the stewards of the data to perform validations of the source data.
Assessing The Costs Of Poor Quality Data
It will be difficult to assign real dollars to most of these categories. If estimates in real dollars are possible, conservative numbers should always be used. When an organization has experience with any of the following problems and if the costs of fixing those problems have been calculated, those figures can be assigned.
- Bad decisions due to incorrect data.
- Lost opportunities because the required data was either unavailable or was not credible.
- Time and effort to restart and rerun jobs that abnormally terminated due to bad data.
- In a buyout situation, accepting too low a price for your business because you cannot properly demonstrate your business potential, or your business seems to be in disarray because your reports are inconsistent.
- Fines imposed by regulating authorities for non-compliance or violating a governmental regulation as a result of bad data.
- Time and resources to fix a problem identified in an audit.
- Hardware, software, and programmer/analyst costs as a result of redundant data.
- The costs and repercussions of bad public relations due to bad or inconsistent data. (Ex. A public agency unable to answer questions from the press or from their Board of Directors.)
- Time wasted by managers arguing and discussing inconsistent reports which are the result of bad data.
- Poor relations with business partners, suppliers, customers, etc. due to overcharging, underpayment, incorrect correspondence, shipping the wrong product, etc.
- The time spent correcting inaccurate data. These corrections may be performed by line personnel or by IT.
- The costs of lost business in operational systems because of poor quality data (data was wrong or non existent). An example is the lost marketing opportunity for an insurance company that does not have accurate information about a client and thus loses the opportunity to market an appropriate insurance product.
Data Quality Feedback To Senior Management
Unlike measurements of performance and availability, the quality of data will not be changing daily. Quality can, however, be quickly compromised by operating procedures that cause improper batch updates. Those responsible for data will want to make periodic checks to determine trends and progress in improving data quality. The results should be reported to IT management and to the departments that own the data.
The quality of data can be measured, but before any measurement takes place, the following questions should be answered:
- Why is the quality of the data being measured? – The classic answer is that without measurement, management of the data is impossible.
- What is being measured? – Some possibilities include:
- trends, i.e. is the data getting cleaner or dirtier?,
- user satisfaction with the quality of the data
- What will be done with the measurements? – Some possibilities include: 1) focus on the data that needs to be purified, 2) provide a basis for cost justifying the purification effort, and 3) give information for prioritizing the cleanup process.
This column identified various categories of data quality, discussed how to identify data quality problems and how to address those problems. The column gave suggestions for incorporating data quality topics in design reviews. Roles and responsibilities were discussed. Also addressed was data quality as it impacts the data warehouse and the necessity of bringing senior management into the picture.
Data is a critical asset for every enterprise. The quality of the data must be maintained if the enterprise is to make effective use of this most important asset. Improvements in data quality do not just happen; they are the result of a diligent and on-going process of improvement.
About the Author
Sid Adelman is a principal consultant with Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses, performing data warehouse and BI assessments, and in establishing effective data strategies. He is a regular speaker at “The Data Warehouse Institute” and IBM’s “DB2 and Data Warehouse Conference”. Sid chairs the “Ask the Experts” column on www.dmreview.com, and has had a bi-monthly column in DMReview. He is a frequent contributor to journals that focus on data warehousing. He co-authored one of the initial works in data warehousing, Data Warehousing, Practical Advice from the Experts, and is co-author of Data Warehouse Project Management with Larissa Moss. He is the principal author of Impossible Data Warehouse Situations with Solutions from the Experts and his newest book, Data Strategy, was co-authored by Larissa Moss and Majid Abai. He can be reached at 818 783 9634and firstname.lastname@example.org. His web site is www.sidadelman.com.