Moving Your ETL Process Into Primetime (Part II)
In the last issue, we reviewed the three sets of programs you need to develop in order to populate your analytical database in your data warehouse or business intelligence (BI) application: the initial load, the historical load, and the incremental load programs. In this issue, we will examine some important – but frequently overlooked considerations when designing the actual extract, transform, and load programs.
Designing the Extract Programs
From an operational systems perspective, the most favored way to create extracts might be to just duplicate the entire contents of the operational source files and source databases and to give the duplicates to the BI project team. However, the ETL developers would have the burden of working with huge files when they only need a subset of the source data.
From the BI project perspective, the most favored way to create extracts might be to sort, filter, cleanse, and aggregate all the required data in one step if possible and to do it right at the source. However, in some organizations that would impact the operational systems to such a degree that operational business functions would have to be suspended for several hours.
The solution is usually a compromise: the extract programs are designed for the most efficient ETL processing, but always with a focus on getting the required source data as quickly as possible. The goal is to get out of the way of operational systems so that the daily business functions are not affected. This is easier said than done, for a number of reasons.
Selecting and merging data from source files and source databases can be challenging because of the high data redundancy in operational systems. The extract programs must know which of the redundant source files or source databases are the systems of record. For example, the same source data element (e.g., Customer Name) usually exists in dozens of source files and source databases. These redundant occurrences have to be sorted out and consolidated, which involves a number of sort and merge steps, driven by a number of lookup tables cross-referencing specific keys and data values.
Another way to produce small and relatively clean extract files is to extract only those source data elements that are needed for the BI application and to resolve only those source data quality problems that pertain to the data domain rules, without attempting to sort out and consolidate redundant occurrences of data. However, even that compromise will not work in many large organizations because the data-cleansing process would slow down the extract process, which in turn would tie up the operational systems longer than is acceptable.
In many large organizations, the BI project team is lucky to get three to four hours of processing time against the operational systems before those operational systems have to “go live” for the operational functions of the next business day.
Designing the Transformation Programs
Using the 80/20 rule, 80 percent of ETL work occurs in the “T” (transform) portion when extensive data integration and data cleansing are required, while extracting and loading represent only 20 percent of the ETL development effort. There are three types of transformation rules to consider:
- Technical conversion rules. There are three types of technical rules that must be observed during source to target mapping:
- The data types of the source data elements must be converted to match the data types of the target columns
- The data lengths of the target columns must be adequate to allow the source data elements to be moved, expanded, or truncated
- The logic of the programs manipulating the target columns must be compatible with the domain of the source data elements
- Business data domain rules are rules about the semantics of data content. Examples of these rules are as follows:
- A data element should not have missing values
- A data domain should not include arbitrary default values, for example, age = 999
- A data domain should not contain intelligent default values, for example, a social security number 888-88-8888 indicating a “non-resident alien status”
- A data element should not contain embedded logic, for example, an account number 0923784327 where 0923 is an embedded bank branch number
- A data domain should not be cryptic or overused, for example a code field with values “A, B, C, D, E, F, G” where “A and B” indicate customer type, “C and D” indicate product type, and “E, F, and G” indicate promotion periods.
- A data element should not have multiple meanings or be purposely redefined, for example redefined fields in a COBOL statement
- Data elements should not be concatenated or wrapped around free-form text fields, for example address line1, address line2, address line3, address line4
- Business data integrity rules govern the semantic content among dependent or related data elements, a well as constraints imposed by business rules and business policies. Examples of these rules are as follows:
- Dependent data elements should not contain contradicting or inaccurate data values, for example, Boston, CA (instead of MA)
- A data element should not violate business rules, for example a person’s date of death cannot precede the person’s date of birth
- A primary key value cannot be reused again, for example reassigning the primary key of a retired employee to a new employee just hired
- A primary key must be unique, for example one customer cannot have more than one customer number
- Dependent objects cannot violate referential integrity, for example a task assigned to the employee Joe Drucker, when there is no row for Joe Drucker in the EMPLOYEE database
Many organizations unwisely pay attention only to the technical conversion rules and ignore the business data rules, thereby propagating many data quality problems into the BI databases. It is vital to the success of any BI application that propagation of dirty data be minimized. Therefore, the BI project team must take the time to find violations to business data rules, and the business people must make some hard decisions on what to do with the violations. They can choose to correct the violations, reject the erroneous data, or accept it as is and publish the non-reliability measures so that other business people can make informed decisions when using poor-quality data in the BI databases.
It should not be surprising that the data in the BI databases will look quite different than the data in the operational systems. Some specific examples appear below.
Some of the data will be renamed following new naming standards (synonyms and homonyms should not be propagated into the BI environment). For example, the data element Account Flag may now be called Product_Type_Code.
Some data elements from different operational systems will be combined (merged) into one column in a BI table because they represent the same logical data element. For example, Cust-Name from the CMAST file, Customer_Nm from the CRM_CUST table, and Cust_Acct_Nm from the CACCT table may now be merged into the column Customer_Name in the CUSTOMER table.
Some data elements will be split across different columns in the BI database because they are being used for multiple purposes by the operational systems. For example, the values “A”, “B”, “C”, “L”, “M”, “N”, “X”, “Y”, and “Z” of the source data element Mstr-Code may be used as follows by the operational system: “A,” “B,” and “C” describe customers; “L,” “M,” and “N” describe suppliers; and “X,” “Y,” and “Z” describe regional constraints. As a result, Mstr-Code must now be split into three columns:
Customer_Type_Code in the CUSTOMER table
Supplier_Type_Code in the SUPPLIER table
Regional_Constraint_Code in the ORG_UNIT table
Some code data elements will be translated into mnemonics or will be spelled out. For example:
“A” may be translated to “Corporation”
“B” may be translated to “Partnership”
“C” may be translated to “Individual”
In addition, most of the data will be aggregated and summarized based on required reporting patterns and based on the selected multidimensional database structure (star schema, snowflake). For example, at the end of the month, the source data elements Mortgage-Loan-Balance, Construction-Loan-Balance, and Consumer-Loan-Amount may be added up (aggregated) and summarized by region into the column Monthly_Regional_Portfolio_Amount in the PORTFOLIO fact table.
Designing the Load Programs
The third step in the ETL process is loading the BI databases, which can be accomplished in one of two ways: (1) by inserting the new rows into the tables or (2) by using the DBMS load utility to perform the load. It is much more efficient to use the load utility of the DBMS, and most organizations choose that approach.
Once the extract and transformation steps are accomplished, it should not be too complicated to complete the ETL process with the load step. However, it is still necessary to make design decisions about referential integrity and indexing.
Because of the huge volumes of data, many organizations prefer to turn off RI to speed up the load process. However, in that case the ETL programs must perform the necessary RI checks; otherwise, the BI databases can become corrupt within a few months or even weeks. Acting on the idea that RI checking is not needed for BI applications (because no new data relationships are created and only existing operational data is loaded) does not prevent database corruption! Corruption of BI databases often does occur, mainly because operational data is often not properly related in the first place, especially when the operational data is not in a relational database. Even if the operational data comes from a relational database, there is no guarantee of properly enforced RI because too many relational database designs are no more than unrelated flat files in tables.
When RI is turned off during the ETL load process (as it should be, for performance reasons), it is recommended to turn it back on again after the load process has completed in order to let the DBMS determine any RI violations between dependent tables.
Poorly performing databases are often the result of poorly performing indexing schemes. Since BI databases are accessed in a variety of ways, they usually have to be heavily indexed. However, building index entries while loading the BI tables slows down the ETL load process. Thus, it is advisable to drop all indices before the ETL load process, load the BI databases, and then recreate the indices after completing the ETL load process and checking RI.
Think of a BI project as a system redesign or a business process improvement project, where you want to change the data in order to improve it. A well-designed ETL process, which includes transformations to enforce business data domain rules and business data integrity rules, is the vehicle to accomplish that. Next time, we will conclude this 3-part series with a discussion of designing the ETL process flow and staging area.