The Data Warehouse Database Explosion
By Sid Adelman
At a recent data warehouse (DW) conference, the discussion centered on the incredible explosion of the size of DW databases. Many companies are already implementing in the multiple terabyte range and it’s almost a macho boast that “my DW database is bigger than yours.” There seem to be few who question the unbridled growth or ask if all that data is really needed or if the data will even be accessed.
Conferences of Very Large Databases (VLDB) and tracks on VLDB at DW seminars attract speakers from vendors who promote the ability of their products to easily handle multiple terabytes. Attendees busily take notes and collect brochures that will help them with their technical evaluations of the products. These conferences provide little to help the attendees determine if their databases should be growing so large and so fast. The focus is on how to establish the VLDBs, not on how to control the growth or whether the VLDBs should be created at all.
There is the old joke of God telling a skeptical Moses about the commandments. When Moses asked the price, God told him they were free. Moses then responded, “I’ll take ten.” In the same way, when there is no cost to the users of storing data in the DW, when they are asked what data they want, they respond, “I want all the data.” In most DWs today, the users are not charged for their use of the data or for the data they requested to be in the DW.
Users have a requirement to perform trend analysis and so historical data must be maintained. The question then is “How much data should be stored?” If it costs the users the same to keep five years of data as it does to keep two years, the users will ask for five. However, when they are charged the cost of the additional data, users will more carefully weigh the benefits and not automatically ask for the additional years.
The frequency of keeping historical data is also determined by the user. It may be necessary (or cost effective) to keep daily or weekly data but less frequent capture may be adequate. Keeping data weekly or monthly rather than daily or weekly can significantly reduce the data storage requirement.
For certain analysis, instance data is required. For other analyses, summarized data is adequate and instance data may not be necessary. Monitoring usage provides information on which mode of data is accessed.
In many DW implementations, there appears to be little governance of the data being migrated, the frequency of the updates, the storage of instance versus summarized data or any controls on the number of years being stored. Without clear standards the data will proliferate at a rate that will warm the hearts of the hardware vendors.
Reasons for the Explosion
The explosion has many causes. There often exist extract programs which pull the data from existing databases. These programs can be used without much thought of which data to migrate to the DW. This means that all the data is delivered including data that will never be accessed.
In their haste to get something up and running, the process of accurately determining user requirements is either bypassed or done so rapidly and with such little user orientation and involvement that the analysts feel it is less risky to store all data rather than risk excluding some data that may be needed.
Some users play hard to get. It’s difficult to schedule enough time with them to determine their requirements. Some analysts don’t know how to ask about which data is required. Analysts are sometimes intimidated by users and bypass the requirements gathering to avoid talking to users. To be on the safe side, data analysts may choose to dump all the data from the operational systems to the DW.
Data may be stored in multiple places because new requesters may not know that the DW data is already there. This often comes from the absence of a populated and maintained repository of metadata or a lack of a methodology and standards that includes research into existing DW data. The responsibility for controlling what goes into the DW may not have been defined or the position may not have been staffed. Even when such a role has been identified and staffed — usually the role is that of a Data Administrator — the authority may not have been given to provide stewardship to the contents of the DW.
Another cause of explosive growth is the transformation of codes to values that are more easily understood by the users. For example, the code “M” in the Human Resource database would be transformed to “Male,” the procedure 0402 in a doctor’s office would be transformed to “Flu immunization.” This transformation has resulted from the mistaken notion that the only way the user can see “Flu immunization” is for the words, not just the code to be stored in the database. The transformation need not reside in the data, it can be accomplished readily by many of the BI tools. Few organizations have standards that guide analysts in their decision as to where the transformations should be made.
A further cause of the explosion is the replication of the DW in more than one database or replication in a data distribution process. The data may have been legitimately replicated as part of a datamart strategy. The replicated data may be justifiable and well-considered or may be the result of a lack of a data architecture that includes guidelines for replicating data. Data may be replicated because of the availability of software that provides replication (Unused software is a thorn in the side of a purchaser.).
The wrong question may have been asked. “What data do you want?” Because the user is not sure and may feel that if data is not requested now, it may never become available. They will ask for all the data in the operational databases. However, if the question is phrased “Which data do you want first?” the user is more likely to carefully consider the answer and not be so frightened about a partial request foreclosing future data requests. The second phrasing also reinforces the recommended approach of not attempting to deliver all the data in the first phase.
Problems/Costs of the Explosion
Why should an organization spend their valuable time trying to control the size of their databases since it is so inexpensive to store the data? The hardware is getting faster and cheaper. The RDBMS are getting smarter in supporting VLDBs and performance is improving all the time. The answer is that the data explosion does result in increased cost, poor performance, unavailability, unhappy users and greater risk.
While disks are cheap, they still have some cost. The hardware to migrate the data and run the queries increases as the size of the database increases. As costs increase, systems may no longer be cost justified. The knowledge and size of the staff to support the VLDBs also increases as the databases get larger. It takes smart DBAs to design and manage a VLDB. Knowledgeable people are always in short supply and may be unavailable to the DW project.
Large databases almost always cause performance problems. The impact is on queries, network traffic, and on the loads/refreshes/updates. Smaller databases will take less time to load/refresh/update. Long load/refresh/update times may result in the DW being unavailable to the users in a timely fashion. They may not see yesterday’s results until 11:00 AM. They wanted the information when they walked in at 8:30 AM.
Depending on the type of query and the indexing scheme, the query will almost always take longer with a large database and users may be unhappy with poor response times. If there is a chargeback process, the users will see much higher bills.
VLDBs with their performance problems, unhappy users, increased cost and availability problems magnify the risk of the project failing.
What’s An Organization To Do?
A company recently monitored which of their reports were actually being used. The sad results were that less than 25% of the reports were ever read. The company shrewdly halted the unread reports and only reinstated them when users demanded their return. Understanding the situation can reveal opportunities to improve.
If an organization understands the problems of letting the databases increase in an uncontrolled manner and is willing to take appropriate steps, it should consider the following:
Recognize that size is important (in this case, big is bad) and that it can have a strongly negative impact on the cost, performance, availability, risk and the management of the DW. This recognition needs to be communicated to all those on the DW team.
Develop data standards that, among other things, include standards on transformations, data replication, data stewardship and naming standards.
Implement a cost justification methodology that can be used uniformly as funds are being requested. The hard numbers can help management understand and evaluate options.
If possible, charge the users for the data they request. This can be very difficult because other departments may also be using the data and distributing the costs may become complex. Chargebacks are never appreciated but they are one of the few techniques that force users to make a business case for additional data.
Allow enough time to implement the DW properly. This admonition extends far beyond the data explosion concern. Management should understand the costs and risks of an unrealistic schedule.
The project should be phased and the users should understand the plan and know when each grouping of data would become available to them.
Educate and sell the users on the importance of their participation in the process of determining data requirements. Instruct them on the costs and problems of frivolous requests for data.
With the users, evaluate the necessity of keeping instance data as well as summarized data. This requirement will probably change as the data ages. Older instance data may be archived while the summarized data will probably be retained.
Carefully consider the need to replicate data. Many organization replicate for questionable reasons (“The user wants his data to be under his physical control.”).
Understand the need for how frequently the data is stored (daily, weekly, monthly).
Allow sharing of data rather than each user having their own copy of DW data. If possible, encourage sharing by lowering user costs or appealing to higher motives. This might be a good time to consider master data management.
If data models have captured the structure of the DW and if a repository has been populated with metadata about the DW, the models and the metadata should be used to understand the inventory of DW data and to minimize storing redundant data.
Give Data Administration the authority to determine data sources and give them the responsibility for populating metadata in a repository.
Give Data Administration or the DW Administrator the authority to control what data will be stored in the DW. This would minimize storing redundant or unneeded data.
Be sure the right analysts are working on the project. They need to feel comfortable with the users and know what questions to ask. Consider a team that has analysts that come from the user departments.
Monitor the usage of the data. Consider archiving data that is rarely accessed and archiving data that is never accessed (Murphy’s corollary is that purged data will be requested three weeks following deletion.).
Review archiving schemes to determine how much of historical data must be online and whether all instance data must be kept online. Determine response time requirements and if historical data may be restored from less costly, but slower, storage media.
Consider the use of datamarts. They will not decrease the overall amount of data to be stored or the CPU to run the loads but they can substantially improve the performance of the queries.
The steps above are relevant for most DWs regardless of size.
A very large DW database should not automatically be considered as a source of pride buy may be an indication of IT sloth. Organizations need to recognize the source of the disorderly growth of their databases and understand whether that growth is justified.
Many of the problems highlighted in this piece are a result of lack of a strategic DW plan. The problems resulting from the explosion of data may just be symptomatic of a larger problem and could be seen as a wake-up call to organizations to review their DW architecture, strategy, standards and methodologies.
About the Author
Sid Adelman is a principal consultant with Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses, performing data warehouse and BI assessments, and in establishing effective data strategies. He is a regular speaker at “The Data Warehouse Institute” and IBM’s “DB2 and Data Warehouse Conference”. Sid chairs the “Ask the Experts” column on www.dmreview.com, and has had a bi-monthly column in DMReview. He is a frequent contributor to journals that focus on data warehousing. He co-authored one of the initial works in data warehousing, Data Warehousing, Practical Advice from the Experts, and is co-author of Data Warehouse Project Management with Larissa Moss. He is the principal author of Impossible Data Warehouse Situations with Solutions from the Experts and his newest book, Data Strategy, was co-authored by Larissa Moss and Majid Abai. He can be reached at 818 783 9634and firstname.lastname@example.org. His web site is www.sidadelman.com.