Implementing Data Quality Through Meta Data (part 1 of 2)
By David Marco
A special thanks to Mike Jennings for his invaluable contribution to this month’s column
How are you addressing the single most difficult problem facing data warehouses today? Data Quality. When the quality of data is compromised incorrect interpretation and use of information from your data warehouse can destroy the confidence level of its customers, YOUR users. Once the user’s confidence in your warehouse is eroded it is a question of time before your system will no longer exist.
This data quality quandary often results from system architectures that fail to identify “bad” data before it is loaded into the data warehouse. This missed opportunity leads to a dramatic increase in the time and costs that companies expend to reconcile and audit information in the warehouse. Insertion of technical meta data “tags” directly into the data warehouse’s dimensional data model design and the extraction, transformation and loading (ETL) processes corrects this situation by providing a practical means to measure data quality precisely at a table row level of granularity.
This article is the first portion of a two-part series on implementing data quality through meta data. This installment examines the role meta data can have in the data warehouse model and data acquisition designs for information content and quality. Part two of the series will examine the beneficial technical meta data tags that can be incorporated into an architecture to measure data quality and provide flexibility to the system design.
The warehouse developers use technical meta data as a method to build a tighter relationship between the repository and the data warehouse. This is accomplished by incorporating technical meta data directly into the data warehouse design and ETL processes. This technique is used to extend the design and architecture of the data warehouse to provide increased processing optimizations for data acquisitions, maintenance activities, and data quality measurement opportunities. These technical meta data tags, unlike information stored in a meta data repository, are referenced at a row level of granularity in the data warehouse. This direct association of meta data to each row of information in the data warehouse is a key distinction of extending meta data into the architecture.
To select operators, each row of data is tagged from the source systems during ETL processing with technical meta data. The meta data tags on each row in the warehouse provide a clearer semantic meaning to the data by placing the information in context with the repository. As an example, consider a client dimension table that derives its information from two operational sources. Client information is extracted in priority order from either from a sales force automation application, an enterprise resource planning application (ERP), or both, depending on availability and stability of the data. The absence of technical meta data in the dimension table would require use of the information without consideration of the source system(s) that provided it. Technical meta data tagging allows you to determine the origin(s) of information in dimension table. Information originating from one or more sources can be easily and quickly determined through the technical meta data tag on that row.
A clear, consistent method of tagging the data originating from the operational systems needs to be developed and agreed upon by both the technical and business users of the data warehouse. Any technical meta data tied to the row must be applicable to the entire row of data, not just the majority of columns in the table.
I like to keep technical data tagging to a minimum in a simple dimensional data model design. Schemas that have only one or two fact tables or the case where a single operational system is source to the warehouse are examples of these basic cases. I prefer to increase its use in very complex schema designs, using multiple fact tables with several conformed dimension tables, or when integration of numerous source systems is required. These more complicated cases make mapping of technical meta data from the source systems more challenging.
The data warehouse team is responsible for reviewing the design, construction and maintenance impact resulting from the use of technical meta data to the repository, ETL processes, data model design, database sizing and front-end data access tools. For example, some ROLAP (relational online analytical processing) tools require a very strict adherence in the design of the data warehouse model in order to function properly or fully. This may preclude the use of some, or all, meta data tags on certain tables such as the time dimension. Certain types of technical meta data can require additional ETL processing times to occur that may interfere with tight processing window schedules. For example, careful consideration to ETL processing times needs to occur in the case where a meta data tag is used in the warehouse to indicate whether production keys, or natural keys, are still active or not in the source system.
The benefits of technical meta data use include source system identification, data quality measurement, improved management of ETL processes and database administration. Use of these technical data tags offers warehouse administrators and business users a means for measuring the content quality of the data in the warehouse. Adding these types of meta data tags into the warehouse model and ETL processes can help you to reconcile data quality issues in your environment. Such reconciliation offers an increase in data integrity benefiting both the warehouse technical and business users by increasing the level of confidence in the quality of the information content.