Moving Your ETL Process Into Primetime (Part I)
Moving Your Extract/Transform/Load (ETL) Process Into Primetime is presented in a series of three short articles. In this issue, we will review the three sets of programs you need to develop in order to populate your analytical database in your data warehouse or business intelligence (BI) application.
When listening to ETL vendors describe their tools, you may get the impression that ETL is a relatively easy technical conversion process of source data from operational source files and databases into BI databases. However, project teams quickly discover that the ETL process is not as simple as it looks – with or without ETL tools.
Source data for the BI applications will come from a variety of platforms, which are managed by a variety of operating systems and operational applications. While most source data for the ETL process is current operational data from the operational systems, some of the source data may be archived historical data. If the data requirements include a few years of history to be backfilled from the start, three sets of ETL programs must be designed and developed, as listed in Table 1.
Table 1. Sets of ETL Programs
1. The initial load
2. The historical load
3. The incremental load
The Initial Load
The process of preparing the initial load programs is very similar to a system conversion process, such as the one many organizations performed when they moved their old operational systems to an enterprise resource planning (ERP) product. In general, the first task of a system conversion process is to map selected data elements from the source files or source databases to the most appropriate data elements in the target files or target databases. The second task of a system conversion process is to write the conversion (transformation) programs to transform the source data. These conversion programs must also resolve duplicate records, match the primary keys, and truncate or enlarge the size of the data elements.
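The conversion tasks described above can be sketched in a few lines of code. This is an illustrative sketch only: the field names, key, and target column widths below are invented examples, not part of any actual specification.

```python
# Sketch of an initial-load transformation pass: map source fields to
# the target layout, resolve duplicate records on the primary key, and
# truncate or enlarge data elements to fit the target column widths.
# All field names and widths here are hypothetical.

def transform(source_records):
    seen_keys = set()
    target_rows = []
    for rec in source_records:
        key = rec["cust_id"].strip()
        if key in seen_keys:          # resolve duplicate records
            continue
        seen_keys.add(key)
        target_rows.append({
            "customer_key": key,
            # truncate to an assumed target width of 30 characters
            "customer_name": rec["name"][:30],
            # enlarge: pad a numeric code to an assumed fixed 5-digit width
            "region_code": rec["region"].zfill(5),
        })
    return target_rows

rows = transform([
    {"cust_id": "A1", "name": "Acme Corporation of North America Inc.", "region": "7"},
    {"cust_id": "A1", "name": "Acme Corp", "region": "7"},   # duplicate key
])
```

In practice the mapping would be driven by the data transformation specifications rather than hard-coded, but the shape of the logic is the same.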
Typically missing from traditional conversion programs, and unfortunately, also missing from many ETL processes for BI applications, are data cleansing and reconciliation logic. Organizations repeatedly miss prime opportunities to bring order to their data chaos when they continue to “suck and plunk” the data from source to target as-is. Their only concern is that the receiving database structure does not reject the source data for technical reasons, such as duplicate keys, or data type and length violations. That is not good enough for BI applications because business people expect data quality and data consistency, which means that data must also be transformed for business reasons. We will come back to this topic in the next two issues when we discuss the design considerations for the ETL programs and the ETL process flow.
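To make the distinction concrete: a record can pass every technical check (keys, types, lengths) and still be invalid for the business. The sketch below shows a minimal business-rule cleansing check with a reconciliation log; the rule, field names, and valid-value set are invented for illustration.

```python
# Hypothetical business-rule cleansing step: technically loadable values
# can still violate business rules, so validate, standardize, and log
# rejects for later reconciliation.

VALID_STATUS = {"OPEN", "CLOSED", "PENDING"}

def cleanse(record, errors):
    """Validate and standardize one record; append rejects to errors."""
    status = record.get("status", "").strip().upper()
    if status not in VALID_STATUS:
        errors.append((record.get("id"), f"bad status: {status!r}"))
        return None
    record["status"] = status
    return record

errors = []
good = cleanse({"id": 7, "status": " open "}, errors)
bad = cleanse({"id": 8, "status": "??"}, errors)
```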
The Historical Load
The historical load process could be viewed as an extension of the initial load process, but this type of conversion is slightly different because historical data is static data. In contrast to live operational data, static data has served its operational purpose and has been archived to offline storage devices. This means that, as some old data expires and some new data is added over the years, the record layouts of archived files are usually different from the record layouts of the current operational files. Therefore, the conversion programs written for the current operational files cannot be applied to archived historical files without some changes. For example, in a frequently changing operational system, it is not unusual for five years of archived historical files to have five (or more) slightly different record layouts. Even though the differences in the record layouts may not be drastic, they still have to be reconciled. In addition, the cleanliness of the data is most likely not the same across all archived files. What was once valid in a historical file may no longer be valid. The data transformation specifications have to address these differences and reconcile them. All these factors contribute to making the ETL process lengthy and complicated.
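One common way to handle several archived record layouts is to keep one parser per layout version and dispatch on the archive period. The sketch below assumes hypothetical fixed-width archive files; the field positions and names are invented.

```python
# Hypothetical sketch: reconciling multiple archived record layouts
# into one target layout. Each archive period is parsed with the layout
# that was in effect when the file was written.

LAYOUTS = {
    # layout_version: slice positions of each field in the fixed-width record
    1: {"account": slice(0, 8), "balance": slice(8, 16)},
    2: {"account": slice(0, 10), "balance": slice(10, 18)},  # key widened later
}

def parse_archived(line, layout_version):
    """Apply the layout matching the file's archive period."""
    layout = LAYOUTS[layout_version]
    return {
        "account_id": line[layout["account"]].strip(),
        "balance_cents": int(line[layout["balance"]]),
    }

rec = parse_archived("ACCT000100001234", 1)
```

The same dispatch point is also a natural place to apply period-specific validity rules, since what was once valid in a historical file may no longer be valid.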
The Incremental Load
Once the processes for populating the BI databases with initial and historical data have been devised, another process must be designed for the ongoing incremental load (monthly, weekly, daily, hourly). Incremental loads can be accomplished in two ways, as shown in Table 2. The design of the ETL extract process will differ depending on which option is selected.
Table 2. Incremental Extract Options
1. Extract all records
2. Extract only the delta (the records that changed)
Extracting all records is often not a viable option because of the huge data volumes involved. Therefore, many organizations opt for delta extracts (extracting only records that changed). Designing ETL programs for delta extraction is much easier when the source data resides on relational databases and a reliable timestamp can be used for determining the deltas. But when the data is stored in flat files without a timestamp, the extract process can be significantly more complex. You may have to resort to reading the operational audit trails to determine which records have changed.
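When a timestamp column is available, the delta extract reduces to a query against a high-water mark saved from the prior run. The sketch below uses Python's built-in sqlite3 purely for illustration; the table, columns, and timestamps are invented.

```python
# Minimal sketch of a timestamp-based delta extract from a relational
# source. Table and column names are hypothetical; sqlite3 stands in
# for whatever relational database the source system uses.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T09:00:00"),
    (2, 20.0, "2024-01-02T09:00:00"),
    (3, 30.0, "2024-01-03T09:00:00"),
])

last_extract = "2024-01-01T12:00:00"   # high-water mark from the prior run
delta = conn.execute(
    "SELECT id, amount FROM orders WHERE updated_at > ? ORDER BY id",
    (last_extract,),
).fetchall()
# delta now holds only the rows changed since the last extract
```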
An alternative may be to extract a complete copy of the source file for every load, then compare the new extract to the previous extract to find the records that changed and create your own delta file. Another alternative is to ask the operational systems staff to add a system timestamp to their operational files. Occasionally they may agree to do that if the change to their operational systems is trivial and does not affect many programs. However, in most cases operations managers will not agree to that because any changes to their file structures would also require changes to their data entry and update programs. Additional code would have to be written for those programs to capture the system timestamp. It would not be cost-effective for them to change their mission-critical operational systems and spend a lot of time on regression testing – just for the benefit of a BI application.
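The first alternative above, comparing a new full extract to the previous one, can be sketched simply. This assumes both extracts are keyed dictionaries; the keys and field values are invented examples.

```python
# Sketch of the "compare full extracts" alternative: diff today's full
# extract against yesterday's to derive your own delta file. In a real
# ETL job the extracts would be files keyed on the record's primary key.

def build_delta(previous, current):
    """Return records that are new or whose values changed."""
    return {
        key: row
        for key, row in current.items()
        if previous.get(key) != row
    }

previous = {"A1": ("Acme", 100), "B2": ("Beta", 200)}
current = {"A1": ("Acme", 150), "B2": ("Beta", 200), "C3": ("Gamma", 50)}
delta = build_delta(previous, current)
```

The trade-off is that you must extract and scan the entire source file on every load, which is exactly the data-volume cost that delta extracts are meant to avoid.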
Processing Deleted Records
Another aspect that needs to be carefully considered for incremental loads is that of deleted operational source records. When certain records are logically deleted from the source files and source databases (flagged as deleted but not physically removed), the corresponding rows cannot automatically be deleted from the BI databases. After all, one of the main requirements of BI databases is to store historical data.
The ETL process must follow a set of business rules, which should define when an operational deletion should propagate into the BI databases and when it should not. For example, perhaps an operational record is being deleted because it was previously created in error, or because the record is being archived, or because the operational system stores only “open” transactions and deletes the “closed” ones. Most likely, the business rules would state that you should delete the related row from the BI database only in the case where the record was created in error. Since your BI database stores historical data, the business rules would probably not allow you to delete the related row in the other two instances.
When records are physically deleted from the source files or source databases, you would never know it if you are extracting only deltas. Delta extract programs are designed to extract only those existing records in which one of the data values changed; they cannot extract records that no longer exist. One way to find the physically deleted records is to read the operational audit trails. Another option is to extract a complete copy of the source file, compare the new extract to the previous extract to find the records that were deleted, and then create your own delta files. In either case, once the deleted records are identified, the ETL process has to follow a set of business rules to decide whether or not to physically remove the related rows from the BI databases.
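Detecting physical deletions by comparing extracts is a set difference on the primary keys, after which the business rules decide what actually happens in the BI database. The sketch below is illustrative; the deletion-reason codes and the rule are hypothetical stand-ins for real business rules.

```python
# Sketch: find physically deleted source records by comparing the key
# sets of two full extracts, then apply a (hypothetical) business rule
# to decide whether the related BI rows may be removed.

def find_deleted_keys(previous_keys, current_keys):
    """Keys present in the prior extract but absent from the new one."""
    return previous_keys - current_keys

def should_remove_from_bi(deletion_reason):
    # Per the business rules discussed above: physically remove only
    # rows created in error; keep archived and "closed" history.
    return deletion_reason == "created_in_error"

deleted = find_deleted_keys({"A1", "B2", "C3"}, {"A1", "C3"})
```

Note that delta extract programs alone can never populate `deleted`, because they only see records that still exist.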
A data warehouse or BI application should not be treated like a systems conversion project, where you simply move from one technology platform to another while transferring the data as-is. Therefore, your take-away from this article is to identify and document the business rules that must guide the initial load, the historical load, and the ongoing incremental load processes. Next time, we will examine some very important, but frequently overlooked, considerations when designing the extract, transform, and load programs. In the third part, we will conclude with a discussion of designing the ETL process flow and staging area.
About the Author
Ms. Moss is president of Method Focus Inc., a company specializing in improving the quality of business information systems. She has over 20 years of IT experience. She frequently lectures and presents at conferences in the United States, Canada, Europe, Australia, and Asia on the topics of Data Warehousing, Business Intelligence, Customer Relationship Management, Information Quality, and other information asset management topics, such as data integration and cross-organizational development. She provides consulting services in data warehouse and business intelligence assessment, spiral methodologies, project management, data administration, data modeling, data quality assessment, data transformation and cleansing, as well as meta data capture and utilization.
Ms. Moss is co-author of three books: Data Warehouse Project Management, Addison Wesley 2000; Impossible Data Warehouse Situations, Addison Wesley 2002; and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision Support Applications, Addison Wesley, 2003. Her articles are frequently published in DM Review, TDWI Journal of Data Warehousing, Teradata Magazine, and Cutter IT Journal, and her white papers are available through the Cutter Consortium. She can be reached at