Implementing Data Quality Through Meta Data (Part 2 of 2)
By David Marco
A special thanks to Mike Jennings for his invaluable contribution to this article
This article is the concluding portion of a two-part series on implementing data quality through meta data. The first installment examined the role meta data can have in the data warehouse model and data acquisition designs for information content and quality. This segment will examine real world examples of technical meta data tags that can be incorporated into your designs to facilitate measurement of data quality and promote user confidence in the informational content of the warehouse. This meta data provides a semantic layer of knowledge about the information in your warehouse that is highly valuable to both business users and IT (information technology) development staff.
The most commonly used technical meta data tag is the load date column. It indicates when, date and/or time, a row of information was loaded into the warehouse. This “snapshot” date maintains temporal integrity of the data in the warehouse. The column can be referenced by warehouse administrators to identify candidate rows for archival/purge processing or by users to reconcile/audit information in the data warehouse with the source systems.
Another common technical meta data column is the update date. This column denotes when a row was last updated in the warehouse. Like the load date, this column maintains the historical meaning (temporal integrity) of information in the data warehouse. It is routinely used in dimensional models that implement slowly changing dimensions, (SCD) type 1 or 3 processing methods, to identify when the row was refreshed or updated. For those not familiar with the implementation of SCD, type 1 maintains a single row per production key(s) in the dimension table, which is updated as required over writing any history about the row. Type 3 also maintains one row per production key(s) but doubles the number of columns to keep both a current and previous view of the information. The column is used in administration activities such as archival/purge processing or reconciliation/audit by end users.
Load Cycle Identifier
One more technical meta data tag a data warehouse development team can incorporate is the load cycle identifier. This column is a sequential identifier assigned during each load cycle to the data warehouse regardless of the refresh frequency (e.g., daily, weekly, monthly, etc.). It can be used to easily remove data from a particular load cycle run if data corruption or other data quality issues arise. The load cycle identifier is typically used in conjunction with a meta data repository that describes other operational statistics about the load cycle. Using the repository alone you could determine how many and when load cycles occurred against the warehouse. Now, by tying the repository statistics to the actual warehouse content, you know exactly which rows were loaded and when.
Current Flag Indicator
The current flag indicator tag identifies the latest version of a row in a table. It facilitates quick identification of the latest version of a row as compared to performing date comparisons. This flag is especially useful for managing dimension tables using SCD, type two, where history of a production record is maintained. SCD 2 is used to model a dimension table when changes to relevant columns need to be captured over time. This technique relies on the use of a production key(s) in the dimension table not changing. New surrogate keys are assigned to the dimension table when changes to relevant columns are detected during a batch load cycle. Comparisons are made between the previously loaded production key(s) of the dimension table and the new load cycle data. Changes to relevant columns in the new load cycle data for matching production keys are loaded with new surrogate key assignments. This tag is also very useful in non-star like data model schema designs such as an atomic (most granular) data warehouse where structures tend to conform closer to third-normal form. Instead of querying a table for the latest date field, the ETL process assigns a “Y” to the latest record loaded for a production/natural key(s) while setting any previously loaded record to an “N”.
Operational System Identifier
One of the most useful technical meta data tags is the operational system identifier. This tag is used to track the originating source(s) of a data row in the warehouse. This tag allows identification of each row in a warehouse table to the sources used in its construction. This provides the user, repository architect, and data acquisition developer with a powerful means for identifying and measuring the quality of the data received from an operational source. In the common case where your ETL process is required to extract and merge data from multiple sources, the tag is assigned a value that represents this particular integration. For example, where a row of data is integrated from more than one operational source system (client information from a trouble ticket, order management and billing information) a column value indicating the combination of these systems is assigned.
Active Operational System Flag
This tag is used to indicate whether the production keys (rows of data) in a warehouse table are still active in the originating operational system or systems. The active operational system flag provides an analysis alternative to queries posed to the data warehouse. This column can be used effectively in a variety of analysis activities to identify dormant data or data that should be constrained in reporting.
Confidence Level Indicator
One of the more intriguing technical meta data tags is the confidence level indicator. This column is used to indicate how business rules or assumptions were applied during the ETL processes for a particular row of data through application of a ranking value. This tag provides a means of measure to a user as to the credibility level of a data row based on the transformation processing performed. It is used to identify potential problems with data quality from source systems and to facilitate correcting these issues. For example, data from a stable source like customer were loaded at the highest level. Data more volatile, easy to cleanse or relatively moderate to define, was loaded at the second level. The third level of data was considered more problematic to define, such as planning or forecasting data. The fourth level consisted of data not originating from a internal operational systems but provided by management, in the form of a spreadsheet. The fifth, level was used to tag the data from external sources such as new services or commercial sources.
Using these technical meta data tags allows your corporation the ability to vastly improve the overall data content quality by tying the meta data repository and decision support data models closer together.