Data Mapping – Out Of The Gate
Whenever you embark on a lengthy adventure to a place you haven’t been before, it starts with some high-level planning. When planning my vacations, once I have an idea of where I want to go and whether it is feasible, I typically start by planning the various approaches to travel and the high-level events I am excited about, and then factor in costs and budgets. This exercise usually starts with 15 or 20 things on a list and ends up with about 6 to 8. Those 6 to 8 are the events I can actually attend.
One of the foundational elements of sound data design for an enterprise data warehouse is the Subject Area Model (SAM). A SAM is a data classification tool that identifies the top 10 to 30 subjects of data that best define your organization. For most industries I would never recommend going over 15 subjects, but in healthcare it is almost impossible to get by with fewer than that, and the count more frequently approaches 30. These subjects are then used in classification, planning, governance, metadata, and even incremental data warehouse development.
One common data warehouse tool that is used more specifically in the design and development of ETL jobs is the Source to Target Map (S2T). An S2T map defines the data that will be pulled from source systems, how it will be validated and transformed, and ultimately where it will land in your target data warehouse.
In this article we are going to look at enterprise data warehousing efforts and define how a tracking mechanism for the mapping of sources to subjects will help you plan and design an incremental approach to building and populating your data warehouse.
There are many examples of enterprise data warehouse development efforts marred by projects that get well into development, sometimes even testing, before anyone realizes the data is not available, accurate, or valuable. Most of these efforts have their origins in good intentions and a drive to get moving quickly so that significant progress can be made. Unfortunately, without a quantified, prioritized list of sources and use cases, these efforts often end up spending inordinate amounts of time and funding only to fail to deliver a solution that realizes significant value. Too often it is a case of asking one user what they want and starting to build without understanding the viability of the request and its value to the rest of the organization.
Generally the trend in enterprise data warehouse design is to move straight from a Subject Area Model into the creation of detailed data models and then source to target maps. In many industries, the number of source systems and the amount of data to model and define is small enough that this approach is more than adequate. However, in industries like healthcare, where dozens or even hundreds of source systems exist in one corporation, it isn’t feasible. Healthcare also carries the challenge of a massive landscape of solution demands and needs that stretch across practice areas, operational responsibilities, external demands, finances, and research. Unfortunately, this is often where the temptation to boil the ocean rears its head.
Rather than abandoning the enterprise aspect and its associated processes, I would suggest refining the approach. Specifically, I am referring to the use of a Source to Subject map (S2S).
If your goal is to incrementally build a data warehouse you generally have two choices: build it one source system at a time, or build it based on the target, one subject at a time. If your goal is to have data that is truly integrated, I would recommend a subject at a time. This lets you model and design only the target data that you need for initial value. If you model based on individual sources, your model will look more like your sources, and integration of additional systems will be extremely challenging or even impossible. For example, many healthcare organizations can satisfy significant cross-functional needs by gathering data on these subjects right away: patients, providers, diagnoses, medications, procedures, and appointments. If you already have a SAM that describes the areas of data in your organization, you can build out your warehouse incrementally, a subject at a time. Before you begin your detailed data modeling efforts you should undertake the S2S mapping exercise. This will help you tie the content and quality of the source data to use cases that demonstrate clear value, while also producing enough information to develop a validated prioritization of data and usage scenarios.
An effective source to subject map consists of two main areas – the sources and subjects.
Sources: This list need not contain every system in your organization, but it should contain the systems that hold the data that is of value to your organization. Again, this is most applicable to industries like healthcare, where data often resides on many different systems. Identifying the “gold standard” sources can help whittle your list of key sources down dramatically. Information captured should include history, volume, key content, source experts, and any other associated metadata.
Subjects: This list consists of the subjects of data you identified in the subject area modeling exercise, along with all of their descriptions, examples of data, and any associated metadata.
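As a rough illustration of the two areas described above, an S2S map can be sketched as a small data structure. This is a minimal sketch, not a prescribed design: the class names, the field names, the 1-to-5 quality rating, and the sample systems ("EHR", "Billing", "Scheduling") are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Source:
    """A source system and the descriptive information captured about it."""
    name: str
    history_years: int          # how far back the data goes (assumed unit)
    volume_rows: int            # approximate volume (assumed unit)
    key_content: str
    experts: list[str] = field(default_factory=list)


@dataclass
class Subject:
    """A subject taken from the Subject Area Model."""
    name: str
    description: str


@dataclass
class S2SMap:
    sources: dict[str, Source] = field(default_factory=dict)
    subjects: dict[str, Subject] = field(default_factory=dict)
    # (source name, subject name) -> quality rating, 1 (poor) to 5 (gold standard)
    links: dict[tuple[str, str], int] = field(default_factory=dict)

    def add_source(self, source: Source) -> None:
        self.sources[source.name] = source

    def add_subject(self, subject: Subject) -> None:
        self.subjects[subject.name] = subject

    def link(self, source_name: str, subject_name: str, quality: int) -> None:
        """Record that a source holds data for a subject, with a quality rating."""
        if source_name not in self.sources:
            raise KeyError(f"unknown source: {source_name}")
        if subject_name not in self.subjects:
            raise KeyError(f"unknown subject: {subject_name}")
        self.links[(source_name, subject_name)] = quality

    def sources_for(self, subject_name: str) -> list[str]:
        """Sources that feed a subject, best quality first."""
        rated = [(q, src) for (src, subj), q in self.links.items()
                 if subj == subject_name]
        return [src for q, src in sorted(rated, key=lambda pair: -pair[0])]


# Hypothetical sample data for illustration only.
s2s = S2SMap()
s2s.add_source(Source("EHR", 10, 5_000_000, "clinical records", ["J. Doe"]))
s2s.add_source(Source("Billing", 7, 2_000_000, "claims and charges"))
s2s.add_source(Source("Scheduling", 5, 800_000, "appointment bookings"))
s2s.add_subject(Subject("patients", "People who receive care"))
s2s.add_subject(Subject("appointments", "Scheduled encounters"))
s2s.link("EHR", "patients", 5)
s2s.link("Billing", "patients", 3)
s2s.link("Scheduling", "appointments", 4)
```

Calling `s2s.sources_for("patients")` lists the candidate sources for that subject with the highest-rated one first, which is one simple way to surface a “gold standard” source.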
Validating and Prioritizing
Effective prioritization of the solutions produced by a data warehouse is unquestionably a governance responsibility that rests outside of IT. However, IT should be producing all of the materials that enable the warehouse oversight group to make educated decisions that are achievable in a reasonable timeframe. To avoid projects that never succeed or don’t even get off the ground, no prioritization of these efforts should occur without having information about the data availability, data quality, costs of acquisition, timeframe to deliver, and project risk.
Now that you have an S2S listing of how the data sources map to the subjects, you can begin identifying which use cases map to data that is available and of high enough quality to be ready for use. To keep detailed requirements gathering to a minimum, high-level use cases with some specific samples are all the detail you need at this stage. Once you complete the mapping of use cases to data, the oversight group can look at the value of the use cases, at how many use cases can be solved by implementing data needed by many rather than just one, and at the time it would take to actually integrate the data.
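The mapping of use cases to data described above can be sketched in a few lines. This is only an illustration under stated assumptions: the use case names, the quality ratings (1 to 5), and the minimum quality threshold of 3 are hypothetical, and a real effort would also weigh cost, timeframe, and risk.

```python
from collections import Counter

# Best available quality per subject (1 to 5); hypothetical ratings.
subject_quality = {
    "patients": 5,
    "providers": 4,
    "appointments": 4,
    "medications": 2,
}

# High-level use cases and the subjects each one needs; hypothetical examples.
use_cases = {
    "no-show analysis": {"patients", "appointments"},
    "provider workload": {"providers", "appointments"},
    "medication adherence": {"patients", "medications"},
}


def viable_use_cases(use_cases, subject_quality, min_quality=3):
    """Use cases whose every required subject meets the quality threshold."""
    return sorted(
        name
        for name, subjects in use_cases.items()
        if all(subject_quality.get(s, 0) >= min_quality for s in subjects)
    )


def subject_demand(use_cases):
    """Count how many use cases depend on each subject."""
    counts = Counter()
    for subjects in use_cases.values():
        counts.update(subjects)
    return counts


ready = viable_use_cases(use_cases, subject_quality)
demand = subject_demand(use_cases)
```

Here `ready` holds the use cases that could be served with data already of usable quality, and `demand` shows which subjects serve many use cases rather than just one, which is the kind of summary an oversight group could use when setting priorities.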
This process is then finalized by creating a roadmap of the data, solutions, and processes that can occur over the next few years.
Creating information-based prioritization processes that interact with the leadership and governance overseeing your warehousing efforts is critical for getting the most value out of those efforts. By having a business-driven planning process that leverages a high-level S2S map before any development begins, you significantly increase both the odds of project success and the overall value received by all future users of the system.
About the Author
Bruce has over 20 years of IT experience focused on data/application architecture and IT management, mostly relating to data warehousing. His work spans the industries of healthcare, finance, travel, transportation, retailing, and other areas working formally as an IT architect, manager/director, and consultant. Bruce has successfully engaged business leadership in understanding the value of enterprise data management and establishing the backing and funding to build enterprise data architecture programs for large companies. He has taught classes to business and IT resources ranging from data modeling and ETL architecture to specific BI/ETL tools and subjects like “getting business value from BI tools”. He enjoys speaking at conferences and seminars on data delivery and data architectures. Bruce D. Johnson is the Managing Director of Data Architecture, Strategy, and Governance for Recombinant Data (a healthcare solutions provider) and can be reached at firstname.lastname@example.org