Measuring the Data Warehouse
By Sid Adelman
Running a data warehouse without the advantage of metrics is like trying to navigate a ship without a chart, compass, or sextant. Without metrics, we have no way of knowing if we have delivered a data warehouse that anyone could consider to be successful. We would have no idea about response time, machine utilization, availability, user satisfaction or the quality of the data in the warehouse.
This column suggests metrics that are appropriate to the data warehouse, recommends standards, sometimes represented as service level agreements, suggests who should be responsible for measuring, who should be responsible for taking action to correct situations that are out of compliance with the standards, and recommends how to represent the results of the measurements to management.
Conformance to Measures of Success
Most projects have explicit or implicit measures of success and most of these can be measured. The measurements of a data warehouse will determine if it was or was not a success.
Types of Metrics
There are a number of types of metrics that are relevant for understanding how well we are doing.
Usage– Usage tells us if the data warehouse is being used, to what extent and by whom. For example, we may have a goal of 90% of the trained users using the data warehouse. Our metrics may show that our goal was met. 92% of the users, who were trained on the tools, ran a query or report during the previous month.
Metrics on usage are often an eye opener as you discover large numbers of identified users do not use the system or use it only sporadically. This type of information can point to deficiencies in training, in poor targeting of intended users, lack of predefined queries and reports, inadequate support, or lack of emphasis by the users’ management.
Performance– Performance is usually reflected in response time. While I don’t recommend a service level agreement (SLA) for response time, it is very important to measure how long a user has to wait for his or her answer to come back. It’s usually too late once the users start calling or screaming that response time is terrible – the damage has already been done to the reputation of the warehouse. You will want to know what percentage of the queries ran longer than, for example, 10 minutes, how long they actually ran, and you will want to know which departments experienced the long run times.
Measurements sometimes uncover poor performing queries that are the result of users not understanding the ramifications of some of their actions. The appropriate response could be more comprehensive training that focuses on the dos and don’ts of writing queries. The realization that a field is very frequently being summarized could lead to the creation of a summary table. As it becomes apparent that another field is frequently being accessed, or that tables are being joined on a specific key, the DBA would build an index for those fields.
ETL performance is particularly important for large and very large databases. The time to perform the processes of transformation, cleansing, aggregations, and loads can sometimes exceed the available window for the ETL process. The metrics on each of these processes can help determine the source of the problem and help direct where to focus the resources.
Performance metrics can anticipate problems and provide the diagnostic capability that suggests remediation before a performance problem rears its ugly head.
Availability– Availability is the percentage of time the system can be accessed by the users during scheduled hours. Scheduled hours may be 24 hours/day and 7 days/week for global systems or a subset such as 18 hours/day and five or six days/week. Most organizations do not have the same availability requirements for their data warehouse as they do for their operational systems.
Operations will sometimes not give the data warehouse environment the right level of attention and this results in poor availability. In addition, the ETL process aborting or not completing on time can also impact availability.
Resource utilization– This would include the number of machine cycles, memory usage and the accesses to the disk. The information about the percentage of disk utilization should result in a better distribution and partitioning of the data on the array of disks or should point out the necessity of purchasing additional disk.
User Satisfaction– User satisfaction surveys tell us how well the data warehouse meets the users’ expectations. Satisfaction surveys should be taken two to four times/year and the results of those surveys should promote action to improve the areas where the users are unhappy. Sometimes these surveys uncover misunderstandings in the way the system was intended to be used. User satisfaction surveys typically include questions on the use of the access and analysis tool, data quality, availability, response time, and support.
Data quality– Some aspects of the quality of the data can be automatically quantified and should include the percentage of values that are outside of the valid values, the percentage of fields that are missing, non-unique data, data that is the wrong data type, data that is outside of the acceptable ranges, and data that violates business rules. Metrics that have a bearing on the quality of data would include record counts, the number of distinct values, the number of records with null values, the number of records within a certain range, the number of records with a certain distinct value, and the rate of change of the data.
Management is always surprised and dismayed when they discover just how dirty their data is. The data quality metric will serve as a barometer of the constant data quality improvement that should be a part of every data warehouse initiative.
Dormant data– Dormant data is data that is never, ever accessed. Loading this unused data night after night is expensive, consumes disk space, wastes energy, and may reduce the likelihood that the system will be available to the users by 8:00 AM (when they were expecting it). Dormant data is a total waste to the organization and is the albatross that will weigh down the budget, extend ETL time, and tax the skills of the DBA staff.
Dormant data exists either because the requirements gathering process was lax or because the users were unable to sufficiently articulate their requirements resulting in loading useless data for fear that the users may possibly need it in the future. Dormant data remains because there may be no tools in place to even let the DBAs know that the data is not being accessed.
Use of tools(which tools are being used and at what level) – Since many data warehouse installations have multiple business intelligence tools, this metric should tell us which tools are used and to what extent. Most organizations have anticipated which tools will be used by which departments and by which category of users, e.g., power users, casual users, report recipients. When the tool use does not match the department and user profile, it’s important to understand why. There may be a misunderstanding of the tools and their purpose or the profiling may have been incorrect.
An organization may be purchasing software when it already has seats available and does not need to purchase any additional software. The organization may be paying maintenance for software sitting on the shelf. In both cases, metrics could identify opportunities to save money.
Costs– This is what the data warehouse costs, both on initial installation and as an ongoing expense. The value of measuring the costs, and comparing those costs to the anticipated costs (budget) will give the organization the information it needs to better anticipate costs and then to help determine if a project is cost justified before it actually gets implemented. Ongoing costs are often grossly underestimated. When anticipating costs, consider the total cost of ownership that would include items such as servers, software licenses, support staff, and so on.
Benefits– Presumably, the warehouse was justified by anticipated tangible and intangible benefits and it’s important to identify and quantify those benefits which would include cost reductions and increased revenue. Some of the intangibles do not lend themselves to quantification but it is important to identify them whenever possible. The organization must measure the benefits to determine if the anticipated benefits were achieved. This knowledge will help, along with the information gleaned from measuring actual costs, in adding to the enterprise knowledge of anticipating the ROI of each project and in setting priorities.
Security conformance– What security violations have been detected, where are the violations coming from and have there been any breaches? This information will help plug security holes and will also provide a level of comfort to upper management as they worry about security exposures.
Service Level Agreements (SLAs)
SLAs are written agreements between the business – the folks who will be using the data warehouse – and IT – the people who are responsible for building and providing the data warehouse infrastructure. The SLAs will identify your goals and the metrics will tell you if these goals have been met. Metrics supporting the SLAs may include:
- Response time
- Response to problems
SLAs let you know which of the user requirements are being met. SLAs also serve to hold user expectations in check.
Responsibility for Measurement
It’s not always clear who should be responsible for measurement. Some organizations have groups dedicated to performance and these are the folks who normally have the primary responsibility. Depending on the specific measurement, the responsibility may fall to the DBA group, the Architecture Group, or to Capacity Planning. Measurement is usually not a full-time job but it is a job that cannot be forgotten or denied. If there are serious performance problems or availability problems, near real-time awareness is paramount so that the responsible persons can be alerted and effective actions can be taken.
Means to Measure
A number of the data warehouse products have built-in abilities to capture and to report on the system. These metrics can be accessed and delivered in a form (with some effort) that is meaningful to the technical people, to the users, and to management. In addition there are add-on products that supplement in various areas such as data quality and performance. Quite often, organizations have these means of measuring but are either unaware of their existence, or no one has been assigned to execute the measurements and to then take action on their results.
Use of Measurements
We need to measure because the data warehouse should always be considered a work in progress. None have gone in without the need to make changes. There is always opportunity to enhance the data warehouse and, in fact, without enhancement, a data warehouse would rarely meet any specified measure of success. The process should always be to measure, identify problems and opportunities and take appropriate action to solve the problems and exploit the opportunities.
Chargebacks are rarely welcomed (no one wants to have additional costs assigned to his or her department) but for those organizations that do charge back the use of their systems to the departments that employ them, metrics are critical to an equitable distribution of costs. This becomes even more important when money is transferred from one organization to another.
Reporting results to management
Management wants to know how things are going. They just spent $5,000,000 for the data warehouse and they want to know if they are getting their money’s worth. Are the users using the system, are they happy, and are they achieving the benefits they were expecting?
Management is usually content with monthly metric reports unless there are serious problems. In which case, management will want to be briefed more frequently on the problems, the steps that are being taken to resolve those problems, and results of the resolutions. Metrics should be reported with just the information that is of interest to each manager. A good approach is to use conventional data warehouse tools accessing a small metrics data mart. Any metrics that represent problems should be highlighted or shown in red. A dashboard is appropriate for metrics of performance and availability.
Each organization should identify the metrics they will need and use as they continually work to improve their own data warehouse. An understanding of the appropriate metrics, the responsibility for gathering the metrics, and the use of those metrics can make the difference between success and failure of the data warehouse project.
About the Author
Sid Adelman is a principal consultant with Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses, performing data warehouse and BI assessments, and in establishing effective data strategies. He is a regular speaker at “The Data Warehouse Institute” and IBM’s “DB2 and Data Warehouse Conference”. Sid chairs the “Ask the Experts” column on www.dmreview.com, and has had a bi-monthly column in DMReview. He is a frequent contributor to journals that focus on data warehousing. He co-authored one of the initial works in data warehousing, Data Warehousing, Practical Advice from the Experts, and is co-author of Data Warehouse Project Management with Larissa Moss. He is the principal author of Impossible Data Warehouse Situations with Solutions from the Experts and his newest book, Data Strategy, was co-authored by Larissa Moss and Majid Abai. He can be reached at 818 783 9634and firstname.lastname@example.org. His web site is www.sidadelman.com.