Data Warehouse Goals and Objectives: Part 2
By Larissa Moss and Sid Adelman
This article is excerpted from a book tentatively titled Data Warehouse Project Management (Addison-Wesley, Spring 2000). All material is copyright Addison Wesley Longman, Sid Adelman and Larissa Moss. It is the second in a three-part series on Data Warehouse Goals and Objectives.
In our previous article (DM Review, date) we identified the deficiencies of traditional decision support systems (DSS) and noted the difficulties of data management.
We can now recognize the role a data warehouse plays, or ought to play, in a data management solution. A data warehouse is not just another DSS database. It is an environment of one or more databases designed to deliver consistent and reconciled business intelligence to all business units in the organization.
Short Term Objectives
To avoid the calamity that befell Information Engineering when it tried to correct all data management problems in one Big-Bang approach, you want to separate your data warehouse objectives into two categories: short term objectives and long term objectives.
Short term objectives are those you can realize with every data warehouse iteration. They represent the immediate benefits to the users. Here are some examples of short term objectives:
- Improve quality of data – Since a common DSS deficiency is “dirty data”, it is almost guaranteed that you will have to address the quality of your data during every data warehouse iteration. Data cleansing is a real “sticky” problem in data warehousing. On the one hand a data warehouse is supposed to provide “clean, integrated, consistent, and reconciled data from multiple sources” and on the other hand we are faced with a development schedule of 6-12 months. It is almost impossible to achieve both without making some compromises. The difficulty lies in determining what compromises to make. Here are some guidelines for determining your specific goal to cleanse your source data:
- Never try to cleanse ALL the data – Everyone would like to have all the data perfectly clean, but nobody is willing to pay for the cleansing or to wait for it to get done. To clean it all would simply take too long. The time and cost involved often exceed the benefit.
- Never cleanse NOTHING – In other words, always plan to clean something. After all, one of the reasons for building the data warehouse is to provide cleaner and more reliable data than you have in your existing OLTP or DSS systems.
- Determine the benefit of having clean data – Examine the reasons for building the data warehouse:
- Do you have inconsistent reports?
- What is the cause of these inconsistencies?
- Is the cause dirty data or is it programming errors?
- What dollars are lost due to dirty data?
- Which data is dirty?
- Determine the cost for cleansing the data – Before you make cleansing all the dirty data your goal, you must determine the cleansing cost for each dirty data element. Examine how long it would take to perform the following tasks:
- analyze the data
- determine the correct data values and correction algorithms
- write the data cleansing programs
- correct the old files and databases (if appropriate)
- Compare cost for cleansing to dollars lost by leaving it dirty – Everything in business must be cost-justified. This applies to data cleansing as well. For each data element, compare the cost for cleansing it to the business loss being incurred by leaving it dirty, and decide whether to include it in your data cleansing goal.
- If the dollars lost exceed the cost of cleansing, put the data on the “to be cleansed” list.
- If the cost of cleansing exceeds the dollars lost, do not put the data on the “to be cleansed” list.
- Prioritize the dirty data you considered for your data cleansing goal – A difficult part of compromising is balancing the time you have for the project against the goals you are trying to achieve. Even though you may have been cautious in selecting dirty data for your cleansing goal, you may still have too much dirty data on your “to be cleansed” list. Prioritize your list.
- For each prioritized dirty data item ask: Can it be cleansed? – You may have to do some research to find out whether the “good data” still exists anywhere. Places to search could be other files and databases, old documentation, manual file folders, even desk drawers. Sometimes the data values are so convoluted that you may have to find some “old-timers” who still remember what all the data values meant, and who will help you write the transformation logic. Then there will be times when, after several days of research, you find out that you cannot cleanse a data element even if you want to, and you have to remove the item from your cleansing goal.
As you document your data cleansing goal, you want to include the following information (a brief sketch of this cost-benefit bookkeeping follows the list):
- the degree of current “dirtiness” (either by percentage or number of records)
- the dollars lost due to its “dirtiness”
- the cost for cleansing it
- the degree of “cleanliness” you want to achieve (either by percentage or number of records)
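To make the arithmetic concrete, here is a minimal sketch (in Python) of the cost-benefit rule and the documentation items listed above. The element names, dollar figures, and percentages are hypothetical examples, not recommendations; the only logic encoded is the rule stated above: an element goes on the “to be cleansed” list only when the dollars lost exceed the cost of cleansing, and the surviving items are then prioritized, here by net benefit, which is one possible ordering.

```python
# Minimal sketch of the cleansing cost-benefit rule described above.
# All element names, percentages, and dollar figures are hypothetical.
from dataclasses import dataclass

@dataclass
class DirtyElement:
    name: str
    dirty_pct: float          # degree of current "dirtiness"
    dollars_lost: float       # business loss from leaving it dirty
    cleansing_cost: float     # cost to analyze, correct, and reload
    target_clean_pct: float   # degree of "cleanliness" to achieve

candidates = [
    DirtyElement("customer_birth_date", 0.12, 40_000, 15_000, 0.99),
    DirtyElement("product_weight",      0.30,  5_000, 20_000, 0.95),
    DirtyElement("order_ship_region",   0.08, 75_000, 10_000, 1.00),
]

# Rule from the text: cleanse only where the dollars lost exceed the cost.
to_be_cleansed = [e for e in candidates if e.dollars_lost > e.cleansing_cost]

# Prioritize the surviving items, e.g. by net benefit.
to_be_cleansed.sort(key=lambda e: e.dollars_lost - e.cleansing_cost, reverse=True)

for e in to_be_cleansed:
    print(f"{e.name}: net benefit ${e.dollars_lost - e.cleansing_cost:,.0f}, "
          f"target cleanliness {e.target_clean_pct:.0%}")
```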
- Minimize inconsistent reports – Inconsistent reports are another common complaint about current DSS environments, and minimizing them will most likely become one of your data warehouse goals. Inconsistent reports are mainly caused by misuse of data, and the primary reason for misuse is disagreement about, or misunderstanding of, the meaning or content of the data. Correcting this problem is another predicament in data warehousing, because it requires the interested business units to resolve their disagreements or misunderstandings. This type of effort has more than once torpedoed a data warehouse project because it took too long to resolve the disputes. Ignoring the issue is not a solution either. We suggest the following guidelines:
- Identify all the data in dispute – Examine carefully how the disagreement or misunderstanding of the data contributes to producing inconsistent report totals.
- Determine the impact of the inconsistent report totals
- How seriously are they compromising business decisions?
- What dollars are lost due to bad decisions?
- Are the differences significant?
- How easy are the reports to reconcile?
- Determine the cost for resolving data in dispute – Estimate how long it would take to:
- get the involved business units to commit to the process of resolving their disputes
- analyze the data disputes and model the different user views
- separate the consistent views from the inconsistent views
- come to an understanding on definitions and content of data for the consistent views
- create new data for the inconsistent views
- come to an understanding on definitions and content of the new data
- Compare cost for data resolution to dollars lost by leaving data disputes unresolved – A favorable cost-benefit comparison must be demonstrated before including the resolution of a data dispute in your goal:
- if the dollars lost exceed the cost of resolution, put the data dispute on the “to be resolved” list,
- if the cost of resolution exceeds the dollars lost, do not put the data dispute on the “to be resolved” list.
- Prioritize the data resolutions you consider tackling – Anyone who has ever participated in a data resolution session knows how time consuming it can be. Your project schedule may not allow you to resolve all the data in dispute. You should therefore prioritize your list.
As you did with your data cleansing goal, you want to document the following information for your “minimize inconsistent reports” goal (a brief sketch of this bookkeeping follows the list):
- the degree of impact on business decisions
- the dollars lost due to data disputes
- the cost for resolving the disputes
- the degree of “resolution” you want to achieve – Examples:
- Do all users have to agree or only the two main users?
- Do the totals have to agree 100% or is a 5% variance acceptable?
- If resolution cannot be achieved in X days, can the data be dropped?
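The same bookkeeping can be sketched for this goal. In the illustration below the report names, totals, dollar figures, and the 5% tolerance are all hypothetical; the rule encoded is the one stated above: a dispute goes on the “to be resolved” list only when the totals differ by more than the acceptable variance and the dollars lost exceed the cost of resolution.

```python
# Minimal sketch of the "minimize inconsistent reports" bookkeeping.
# Report names, totals, dollar figures, and the tolerance are hypothetical.

ACCEPTABLE_VARIANCE = 0.05   # e.g. the 5% variance mentioned above

def is_inconsistent(total_a: float, total_b: float,
                    tolerance: float = ACCEPTABLE_VARIANCE) -> bool:
    """Two report totals are inconsistent if they differ by more than the
    agreed variance, measured against the larger of the two."""
    if total_a == total_b:
        return False
    return abs(total_a - total_b) / max(abs(total_a), abs(total_b)) > tolerance

disputes = [
    # (data in dispute, unit A total, unit B total, dollars lost, resolution cost)
    ("monthly_net_revenue",  1_250_000, 1_150_000, 60_000, 25_000),
    ("active_customer_count",   48_200,    48_900, 10_000, 40_000),
]

to_be_resolved = [
    name for name, a, b, dollars_lost, cost in disputes
    if is_inconsistent(a, b) and dollars_lost > cost
]
print(to_be_resolved)   # only disputes worth resolving make the list
```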
- Capture and provide access to meta data – Until now, meta data has been considered the dirty D word: Documentation. However, meta data is indispensable for (a) data sharing and (b) data navigation.
- Data Sharing
As we discussed before, most data is not being shared today for a number of reasons. Besides turf wars, one reason is not understanding the data; another is not trusting the data content. We have already established that in order to correct this problem, users have to discuss their views of the data and discover their commonality and differences. Two major goals of this discussion are commonly agreed upon data definitions and commonly agreed upon domains (valid data values). Because these two goals are often misunderstood and declared unattainable or a waste of time, we must be clear about what we mean by them.
The process of achieving commonly agreed upon definitions and domains does not mean that hundreds of users argue ad infinitum about who is right and who is wrong, and the desired result is not a declaration of victory by the most powerful user who has forced his or her opinion on the others. The process involves a small group of people, usually five or six, consisting of a facilitator, the data owner, and one authoritative representative from each business unit that uses the data for making significant business decisions. When strong disagreements surface about the meaning or content of the data, it is an instant indication that, in all probability, all of the disagreeing parties “are right” and more than one data element exists. This probability is explored within a predefined short time frame, usually no more than a few days, and a new data element is created, named, and defined by the group. If the exploration does not yield a new data element, the data owner makes the final decision on definition and content. Assuming there is no turf war between the disagreeing parties, the definition and content will, to some extent, include any reasonable variations perceived by the other involved business units. The now agreed upon definitions and content for the original and the new data elements are documented in a meta data repository and made available to all other users in the organization.
- Data Navigation
We like to think of meta data as the nice N word: Navigation. Once the source data has been cleansed, transformed, aggregated, summarized, and dissected in numerous other ways, the users will never find it again in the data warehouse without the help of meta data. Capturing the meta data, i.e., the data definitions, the domains, the algorithms for transforming the source data, the columns and tables the resulting data reside in, and all the other technical components, is only half of the solution. The other half is making meta data easily accessible and useful to the users (a minimal sketch of such a repository entry follows).
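To make that concrete, here is a minimal sketch of what a repository entry might carry and how a user could look up where a business term landed in the warehouse. The structure, field names, and sample entry are hypothetical, not a feature of any particular repository product; the point is simply that definitions, domains, transformation rules, and target tables travel together and are searchable.

```python
# Minimal sketch of a meta data entry supporting both sharing and navigation.
# The structure and the sample entry are hypothetical.
from dataclasses import dataclass

@dataclass
class MetaDataEntry:
    business_name: str      # commonly agreed upon name
    definition: str         # commonly agreed upon definition
    domain: list[str]       # valid data values
    transformation: str     # algorithm applied to the source data
    source: str             # source system and field
    target_table: str       # where the data lands in the warehouse
    target_column: str

repository = [
    MetaDataEntry(
        business_name="customer status",
        definition="Current standing of a customer account.",
        domain=["ACTIVE", "DORMANT", "CLOSED"],
        transformation="Map legacy codes 01/02/09 to ACTIVE/DORMANT/CLOSED.",
        source="BILLING.CUST_MASTER.STAT_CD",
        target_table="DW_CUSTOMER",
        target_column="CUST_STATUS",
    ),
]

def navigate(term: str) -> list[MetaDataEntry]:
    """Return every entry whose business name or definition mentions the term."""
    term = term.lower()
    return [e for e in repository
            if term in e.business_name.lower() or term in e.definition.lower()]

for entry in navigate("customer"):
    print(f"{entry.business_name} -> {entry.target_table}.{entry.target_column}")
```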
- Provide capability for data sharing – If data sharing is one of your data warehouse goals, you also must include some data cleansing, data dispute resolution, and meta data access components as means of achieving the goal. These components are prerequisites to data sharing. Two other vital components are database design and database access.
- Database design – After the requirements have been analyzed, the requested data has been logically modeled, and the related meta data has been captured in the repository, the next step is database design. Designing a stand-alone database for one business unit is different than designing a shared database for multiple business units. It isn’t just a matter of granting access to more users, but a matter of designing a database based on (a) the lowest level of detail and granularity necessary to satisfy all the different data needs and (b) the type of access required by the different business units.
There are many design choices, depending on the mix of requirements. When you define data sharing as a goal, you must be specific about the following (a toy sketch of the grain decision follows the list):
- technical literacy level of users,
- business knowledge,
- the level of detail data required by all users,
- the types of summarization and aggregation requested,
- the types of queries each user will write,
- the periodicity needed, i.e. daily, weekly, or monthly snapshots.
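The level-of-detail decision mentioned above can be illustrated with a toy example. The business units, their requested granularities, and the ordering of grains below are hypothetical; the point is that a shared database must carry the finest grain any sharing unit needs, with coarser requests served by summaries built from it.

```python
# Toy sketch: the shared table must be designed at the finest level of detail
# any business unit requires; coarser needs become summary tables.
# Business units, requests, and the grain ordering are hypothetical.

# Coarse-to-fine ordering of candidate grains (higher index = finer detail).
GRAINS = ["yearly", "monthly", "weekly", "daily", "transaction"]

requested = {
    "Finance":   "monthly",
    "Marketing": "daily",
    "Sales":     "weekly",
}

# The base table carries the finest grain anyone asked for ...
base_grain = max(requested.values(), key=GRAINS.index)
print(f"Base table grain: {base_grain}")        # -> daily

# ... while coarser requests are served by summaries built from that base.
summaries = sorted({g for g in requested.values() if g != base_grain},
                   key=GRAINS.index)
print(f"Summary tables: {summaries}")           # -> ['monthly', 'weekly']
```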
- Database access – As with meta data, getting the data into the database is only half the battle. Providing easy access to it is the other half. Not all users are created equal. There are power users, some of whom may even qualify as programmers, and there are technophobes, who need pull-down menus and radio buttons to navigate. Then there are all the competency levels in between. You need to accommodate this wide spectrum of users with multiple query and reporting tools.
When you document your data sharing goal, describe the users in terms of:
- overall technical literacy level,
- the types of queries they are capable of writing,
- whether they will need to manipulate query results,
- what summary views they need,
- what ad hoc versus report writing capabilities they need,
- how often they will access the system,
- whether they have prior experience with a query or reporting tool,
- their proficiency level with a query or reporting tool,
- their ability and speed of learning new tools.
This information will be most valuable for tool evaluation and selection, as well as for training.
- Integrate data from multiple sources – This is another primary goal for all data warehouses, because it addresses a primary deficiency in current DSS. A frequent lament is “it takes me days to merge data manually from four different systems because there is no common key between the files”. Stand-alone systems that identify the same data by different keys are only one of many reasons why data integration does not exist in most companies. Some other reasons are that the data content in one file is at a different level of granularity from that in another file, or that the same data in one file is updated in a different time period from that in another file. In a shared data environment, the requirements from different business units regularly include data relationships that do not exist in current systems. This often means that the foreign key necessary to implement the requested relationship does not exist in the source files.
Before you define your data integration goal, review your current DSS deficiencies and analyze the source systems you have identified as possible feeds to your data warehouse. Document the following (a small sketch of such a key comparison follows the list):
- whether the keys for the same data have the same data type, length, and domain,
- whether the same data is identified by the same key value,
- whether new data relationships can be implemented,
- the granularity of the data content,
- the periodicity of data updates.
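As an illustration of the first two checklist items, the small sketch below compares how two hypothetical source systems identify the same customer data. The file and field layouts are invented; the point is to surface mismatches in data type, length, and domain before you commit to an integration goal, since every mismatch implies conversion or cross-reference work.

```python
# Minimal sketch of the key comparison called for above.
# The source systems and their key layouts are hypothetical.

def key_mismatches(key_a: dict, key_b: dict) -> list[str]:
    """Return a description of every attribute on which two key definitions differ."""
    problems = []
    for attribute in ("data_type", "length", "domain"):
        if key_a[attribute] != key_b[attribute]:
            problems.append(f"{attribute}: {key_a[attribute]!r} vs {key_b[attribute]!r}")
    return problems

billing_customer_key = {"data_type": "CHAR", "length": 10, "domain": "Cnnnnnnnnn"}
orders_customer_key  = {"data_type": "NUM",  "length": 8,  "domain": "nnnnnnnn"}

mismatches = key_mismatches(billing_customer_key, orders_customer_key)
if mismatches:
    # Merging these files will require a cross-reference (conversion) table.
    print("Customer key mismatches:", "; ".join(mismatches))
```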
- Merge historical data with current data – A typical data warehouse objective is to store history. This objective comes with its own challenges. Historical data is seldom kept on the operational systems, and even if it is kept, rarely will you find three to five years of history in one file. First, historical data is not of as much use to the daily operational processing of a business function as it is to decision support. Second, operational files do change over time, and reloading historical data to match the new record layouts would not be cost-justified. Third, operational history is a point-in-time transaction history, not a periodic snapshot in time. Point-in-time transaction history means a record is written to the file each time a transaction (change) occurs. A periodic snapshot means a record is written to the file once for each period (daily, monthly, etc.), regardless of how many transactions occurred within that period.
Having said all that, you must define the following details for your goal to merge historical data with current data (a toy example of the snapshot distinction follows the list):
- the number of years for which you wish to keep history
- whether history will be collected from this point (initial loading) forward or whether you will load “X” number of past years
- whether the history files have the same record layout
- whether the format of the data has changed over time
- whether the meaning of the domain (valid values) has changed over time
- whether the organizational hierarchy has changed over time
- how much history is actually available on disk or on tape
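The difference between point-in-time transaction history and a periodic snapshot can be shown with a toy conversion: derive a monthly snapshot from transaction records by keeping only the last record within each period. The account, dates, and balances below are hypothetical.

```python
# Toy sketch: derive a monthly periodic snapshot from point-in-time
# transaction history by keeping the last record in each month.
# The account, dates, and balances are hypothetical.
from datetime import date

# Point-in-time history: a record exists for every change that occurred.
transactions = [
    (date(1999, 1, 5),  "ACCT-1", 1_000),
    (date(1999, 1, 28), "ACCT-1", 1_400),
    (date(1999, 2, 3),  "ACCT-1", 1_250),
    (date(1999, 2, 17), "ACCT-1", 1_600),
]

# Periodic snapshot: one record per account per month, regardless of how
# many transactions occurred within that month.
snapshot: dict[tuple[str, str], int] = {}
for when, account, balance in sorted(transactions):
    period = when.strftime("%Y-%m")
    snapshot[(account, period)] = balance   # the last record in a month wins

for (account, period), balance in sorted(snapshot.items()):
    print(account, period, balance)
# ACCT-1 1999-01 1400
# ACCT-1 1999-02 1600
```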
Conclusion
It is important to have clear objectives for a data warehouse, and it is equally important for these objectives to be realistic and cost effective. In addition, objectives should be prioritized, because your project schedule may not allow you to accomplish them all.
In our next article we will explore the data management topic further by discussing long term data warehouse objectives.
About the Authors
Larissa Moss is president of Method Focus Inc., a consulting firm specializing in data warehousing. Larissa has published numerous articles on various data management topics. She frequently teaches seminars in the United States, Canada and Europe.
Larissa can be reached at [email protected]
Sid Adelman is a charter member of Business Information Alliance. He co-authored a methodology and project planning product tailored specifically for data warehouses. Sid is an international speaker at data warehouse and industry conferences.
He can be reached at [email protected]