A Foundation for Data Quality – Guest Author: James Funk

By Richard Wang

Last month we continued our discussion by examining the characteristics of data that affect its overall quality. This month we will look at the characteristic that is widely considered the foundation for good data quality and the traditional starting point for any continuous improvement program: the data quality dimension commonly termed “accuracy”. Accuracy is one of the dimensions categorized as intrinsic to the data itself. A piece of data is accurate if it reflects the true nature of the object, event, or concept it represents. All customers want the data they need and use to be correct. They want the data values they use to agree with the event that has occurred or is occurring, to properly reflect the physical characteristics of an object such as its height, width, length, and weight, or to properly record the financial condition of the organization. Executives expect the data being used by the organization to be accurate.

People have long advocated that the best way to measure data quality is to measure its accuracy. In some instances this is easy to accomplish. Consider the example of shipping goods from a manufacturer to a retailer. The size and weight of the goods being shipped are important because of transportation regulations and taxes, storage space restrictions, and the correct execution of self-scanning check-out processes within retail stores. In this case it is relatively straightforward for all parties to check the accuracy of the data: each can independently measure one or more instances of the physical object to determine whether the data they and their computers are using is correct. It is also easy to quantify the inaccuracy and to communicate occurrences of poor data quality. Another example is an employee’s date of birth. While no direct measurement can be applied to this data value, the employee can review the data and indicate whether it is recorded correctly.
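To make this kind of check concrete, the sketch below compares recorded shipment weights against independently measured reference values and quantifies the error rate. It is only an illustration: the tolerance, field names, and sample figures are assumptions, not details from the examples above.

    # Minimal sketch: verify recorded values against independent physical
    # measurements, as in the shipping example. All names and figures are
    # illustrative assumptions.

    TOLERANCE = 0.01  # accept up to 1% relative deviation from the measurement

    def is_accurate(recorded: float, measured: float, tol: float = TOLERANCE) -> bool:
        """A recorded value is accurate if it agrees with the physical measurement."""
        return abs(recorded - measured) <= tol * abs(measured)

    # (recorded_weight_kg, independently_measured_weight_kg) for sampled shipments
    samples = [(12.0, 12.0), (8.42, 8.40), (30.0, 3.0)]  # last pair: misplaced decimal

    errors = [(r, m) for r, m in samples if not is_accurate(r, m)]
    accuracy_rate = 1 - len(errors) / len(samples)

    print(f"Accuracy: {accuracy_rate:.0%}")  # Accuracy: 67%
    for r, m in errors:
        print(f"Recorded {r} kg disagrees with measured {m} kg")

Quantified this way, inaccuracy becomes something all parties can measure, communicate, and track over time.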

There are instances where determining the accuracy of the data is more difficult. If the data represents historical facts, there may be no physical evidence with which the data can be corroborated. In some instances, it may even be possible to have more than one correct value. Consider the term “net profit”. There are currently two standards for determining net profit: the United States uses Generally Accepted Accounting Principles (GAAP), while the rest of the world uses the International Financial Reporting Standards (IFRS). Differences between the two accounting systems can make it difficult for investors to compare companies, even firms in the same industry. Under U.S. GAAP, for example, research and development costs are generally expensed when they occur. Under the international standards, once a project reaches the development stage, costs are spread out over time. The upshot is that a company could show different operating income and net profit depending on which system it uses. Currently, foreign organizations filing financial reports in the United States must reconcile the IFRS data with GAAP and highlight the differences. Under the current rules and reporting restrictions, there are two different values for “net profit” that are both correct. The Securities and Exchange Commission in the United States is proposing to allow non-U.S. companies to file their financials using IFRS rules rather than GAAP, and it is also considering allowing U.S. companies to file their reports using the IFRS rules. This change would have a profound impact on the quality of the data being used. The situation or condition represented by the term “net profit” could actually be two different conditions. If an organization were marginally profitable, the difference could mean reporting a profit or a loss. Both representations would be accurate, yet interpretations of the results could differ vastly and lead to different actions by users of the data.
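A small, hypothetical calculation makes the divergence concrete. The figures, the amortization period, and the assumption that all development costs qualify for capitalization under IFRS are invented for illustration; they are not drawn from any real filing.

    # Hypothetical illustration of the GAAP/IFRS difference described above.
    # Under U.S. GAAP, development costs are generally expensed as incurred;
    # under IFRS, qualifying development costs are capitalized and amortized.
    # All figures are invented.

    operating_income_before_rd = 10.0  # $ millions
    development_cost = 12.0            # $ millions, incurred this year
    amortization_years = 5             # assumed IFRS amortization period

    gaap_result = operating_income_before_rd - development_cost
    ifrs_result = operating_income_before_rd - development_cost / amortization_years

    print(f"GAAP net result: {gaap_result:+.1f}M")  # -2.0M, a loss
    print(f"IFRS net result: {ifrs_result:+.1f}M")  # +7.6M, a profit

Both values are accurate under their respective standards, yet this marginally profitable firm reports a loss under one and a profit under the other.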

Another example of the challenges associated with accuracy involves the difference between accounting data and data used to record specific activity within an organization. When a manufacturing firm was developing an addition to its global data warehouse covering the purchase of raw materials on a global basis, it tried to verify the data collected against previous financial statements and their supporting documents. The data for the financial statements was collected in each subsidiary and sent to the corporate offices for preparation of the corporate financial statements; it was gathered on spreadsheets completed in each subsidiary and processed in the corporate accounting system. The data for the global data warehouse, whose purpose was to develop information about the purchase of raw materials from vendors worldwide, was captured directly from the existing computer systems used by the subsidiaries. When checking the data, the firm found that an error had been made while transforming the data from the spreadsheet into the corporate accounting system: a decimal point had been misplaced, resulting in a difference in the millions of dollars. The error occurred because the base unit of measurement differed between the two sources. The operational systems recorded monetary values in actual currency units, while the spreadsheet recorded the data in thousands of dollars. The decision was made to record the data in the global data warehouse as collected from the operational systems and not to restate the financial statements, because the difference was not material.

This leads to an interesting data quality question about accuracy. There were now two sets of data in the global data warehouse, each purporting to be accurate, for the same activity in the organization. In this instance, how important was it to the organization that the two numbers be identical? Was complete accuracy necessary for the successful functioning of the organization? Two people using these two sets of data to analyze and recommend a course of action for the organization could reach different conclusions. Was the organization comfortable with this situation?
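A simple cross-source reconciliation check of the kind the firm performed can be sketched as follows. The scale factor, the materiality threshold, and the sample figures are illustrative assumptions; the actual systems and amounts were not described in that detail.

    # Minimal sketch of the reconciliation described above: the operational
    # systems record monetary values in actual currency units, while the
    # spreadsheet feed records them in thousands. Names, the threshold, and
    # the figures are illustrative assumptions.

    SPREADSHEET_SCALE = 1_000        # spreadsheet values are in thousands
    MATERIALITY_THRESHOLD = 0.05     # flag relative differences above 5%

    def reconcile(operational_value: float, spreadsheet_value: float) -> str:
        restated = spreadsheet_value * SPREADSHEET_SCALE
        if operational_value == 0:
            return "MISMATCH: no operational value to reconcile against"
        rel_diff = abs(operational_value - restated) / operational_value
        if rel_diff > MATERIALITY_THRESHOLD:
            return f"MISMATCH: operational {operational_value:,.0f} vs restated {restated:,.0f}"
        return "OK (within materiality threshold)"

    # A misplaced decimal in the spreadsheet (4,250 entered as 425.0) yields
    # a difference in the millions once the scale factor is applied.
    print(reconcile(4_250_000, 425.0))    # MISMATCH: operational 4,250,000 vs restated 425,000
    print(reconcile(4_250_000, 4_250.0))  # OK (within materiality threshold)

Note that the check flags the discrepancy but cannot say which source is correct; deciding that, as the firm discovered, is a business judgment about how much accuracy the organization actually requires.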

While data accuracy has to be the foundation for high data quality within an organization, it is not the only characteristic that determines the overall level of data quality. We will continue our discussion over the next few articles by examining other data characteristics that impact data quality and the interactions between these different data quality dimensions.

We look forward to our continuing conversations about information quality and wish you success in your information quality journey. If you have questions about what we have discussed or want more clarity about what we have said, contact us at either jimfunk@mit.edu or rwang@mit.edu, or visit http://mitiq.mit.edu.

About the Author

Richard Y. Wang is Director of the MIT Information Quality (MITIQ) Program at the Massachusetts Institute of Technology. He also holds an appointment as University Professor of Information Quality at the University of Arkansas at Little Rock. Before heading the MITIQ program, Dr. Wang served as a professor at MIT for a decade. He also served on the faculty of the University of Arizona and Boston University. Dr. Wang received a Ph.D. in Information Technology from MIT. Wang has put the term Information Quality on the intellectual map with myriad publications. In 1996, Prof. Wang organized the premier International Conference on Information Quality, for which he has served as general conference chair and currently serves as Chairman of the Board. Wang’s books on information quality include Quality Information and Knowledge (Prentice Hall, 1999), Data Quality (Kluwer Academic, 2001), Introduction to Information Quality (MITIQ Publications, 2005), and Journey to Data Quality (MIT Press, 2006). Prof. Wang has been instrumental in the establishment of the Master of Science in Information Quality degree program at the University of Arkansas at Little Rock (25 students enrolled in the first offering in September 2005), the Stuart Madnick IQ Best Paper Award for the International Conference on Information Quality (first awarded in 2006), the comprehensive IQ Ph.D. dissertations website, and the Donald Ballou & Harry Pazer IQ Ph.D. Dissertation Award. Wang’s current research focuses on extending information quality to enterprise issues such as architecture, governance, and data sharing. Additionally, he heads a U.S. Government project on Leadership in Enterprise Architecture Deployment (LEAD). The MITIQ program offers certificate programs and executive courses on information quality. Dr. Wang is the recipient of the 2005 DAMA International Academic Achievement Award (previous recipients of this award include Ted Codd for the relational data model, Peter Chen for the entity-relationship model, and Bill Inmon for data warehouse contributions to the data management field). He has given numerous speeches in the public and private sectors internationally, including a thought-leader presentation to some 25 CIOs at a gathering of the Advanced Practices Council of the Society for Information Management (SIM APC) in 2007. Dr. Wang can be reached at rwang@mit.edu or http://mitiq.mit.edu.