Is It A Data Warehouse or a Replicated Data Store?
By Bruce Johnson
I am fascinated by the wide variety of database solutions in place across industries that are referred to as data warehouses. It seems that any time data is placed in a database, whether from one source system or from several, it gets called a data warehouse. In healthcare in particular, the replicated data store is a common occurrence (unfortunately, few actually call it that), probably due to the complexity and depth of healthcare data combined with the massive number of source systems. These complexities make true integration of data a significant challenge, and providing a replicated data store on one platform at least provides a one-stop shop. Replicated data stores can be invaluable for any number of purposes, but they should definitely not be confused with a data warehouse. In this article we will investigate some of the characteristics and values of each of these architectural approaches.
Generally, data warehouses are carefully designed databases that hold integrated data for secondary usage. If you only need data from one system, but can't impact the performance of that system, I would suggest taking a copy (a replicated data store); unless the data is unusually complex and wide, and must be accessed in many different ways, a data warehouse would be overkill. Whether a data warehouse is normalized or de-normalized, atomic or star schema, there are sound database design principles behind the standard accepted options. Those architectural design principles are too deep to cover in a brief article.
The key to the design of any successful data warehouse architecture is the ease with which data can be gotten out (the opposite of a transactional system, which is designed for getting data in), whether through direct access or output to data marts. This is exactly where the principle of integration becomes critical. If you put data into a target schema (normalized or de-normalized), you give anyone accessing the data the ability to pull it by knowing only what type of data they are looking for, without having to worry about exactly where it came from. Again, in healthcare this is even more important. With the number and variety of systems, establishing "gold standards" for data definitions is a requirement for providing easy access. If a user or even a developer has to navigate the naming conventions of dozens or even hundreds of different system schemas to get one result, they will soon fall back to getting data a different way. This approach is also more difficult, however: it requires more IT/business interaction to create an agreed-upon target schema and to map the appropriate systems and fields to it properly. I have found that putting in the work to get the data correct and integrated in preparation for usage results in much happier, more engaged users who depend on the warehouse instead of cursing it.
Replicated Data Stores
A replicated data store is a database that holds schemas from other systems, but doesn’t truly integrate the data. This means it is typically in a format similar to what the source systems had.
The value in a replicated data store is that it provides a single source for resources to go to in order to access data from any system without negatively impacting the performance of that system.
The challenge with a replicated data store is that typically only IT resources can access anything within it. Dynamic applications and BI tools typically run extremely poorly against replicated data stores because the vast number of joins needed to connect the data requires massive processing power. Most of the performance problems I have consulted on in data warehousing are very easy to spot and point directly to an architecture like this being pressed into service to get data out quickly (not to mention users also want it easy, with good metadata). The architecture simply isn't designed for that.
For smaller organizations, one thing you can easily do with a replicated data store is add aggregation/summary tables for common views; these can be built programmatically so that dynamic applications and BI tools can use them effectively. This is not recommended for larger organizations, where the sheer number of such tables will result in data inconsistencies, contention, masses of rollups, and no standard metrics.
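As a minimal sketch of building such a summary table programmatically, the following uses an in-memory SQLite database; the table and column names (`src_encounters`, `agg_encounters_by_dept`) are hypothetical, standing in for a replicated source schema:

```python
import sqlite3

# Stand-in for one replicated source-system table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_encounters (
        encounter_id INTEGER PRIMARY KEY,
        dept_code    TEXT,
        visit_date   TEXT
    );
    INSERT INTO src_encounters VALUES
        (1, 'CARD', '2023-01-05'),
        (2, 'CARD', '2023-01-06'),
        (3, 'ONC',  '2023-01-06');
""")

# Build (or rebuild) a summary table for a common view: encounter counts
# per department. BI tools can then query this small table instead of
# joining across the raw replicated schemas.
conn.executescript("""
    DROP TABLE IF EXISTS agg_encounters_by_dept;
    CREATE TABLE agg_encounters_by_dept AS
        SELECT dept_code, COUNT(*) AS encounter_count
        FROM src_encounters
        GROUP BY dept_code;
""")

rows = list(conn.execute(
    "SELECT dept_code, encounter_count FROM agg_encounters_by_dept"
    " ORDER BY dept_code"))
print(rows)  # prints [('CARD', 2), ('ONC', 1)]
```

In practice a script like this would be scheduled to rerun after each refresh of the replicated store, which is exactly where the inconsistency risk for larger organizations creeps in.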
In healthcare, if you are unsure which type of architectural design you have, one question you can ask yourself is whether you can effectively and properly define a group of patients. When you look at individual source records from individual systems, what you get is a count of what is in a given column based on some combination of criteria/characteristics present in other columns. Too often this approach comes from IT resources who have worked in other, much simpler industries where it is an acceptable practice, because the complexities and correlations across data groupings there are just not that difficult. (I spent the first 18 years of my career in those industries; looking back now it is easy to see the difference, though I wouldn't accept it at first when I started working in healthcare.)
If I want a count of diabetics, for example, in a replicated structure I would look for a field that I think identifies that and run record counts; I could add layers of complexity and even tie IDs across systems, but only with lots of programming and in a very error-prone way. In a true data warehouse, I should be able to ask the target schema for the definition of a diabetic, and then even a user should be able to work with that subset of data – easily and quickly.
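The warehouse side of that contrast can be sketched in a few lines. This assumes a hypothetical target-schema table (`patient_condition`) whose `condition_code` values were standardized during integration; none of these names come from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical target schema: one row per patient/condition, with
# condition codes already standardized during the integration step
# (in a replicated store this information would be scattered across
# source-specific schemas and coding systems).
conn.executescript("""
    CREATE TABLE patient_condition (
        patient_id     INTEGER,
        condition_code TEXT
    );
    INSERT INTO patient_condition VALUES
        (1, 'DIABETES'),
        (2, 'ASTHMA'),
        (3, 'DIABETES');
""")

# With a gold-standard definition, the count is one simple query --
# no per-source-system code mapping is needed at query time.
(diabetic_count,) = conn.execute("""
    SELECT COUNT(DISTINCT patient_id)
    FROM patient_condition
    WHERE condition_code = 'DIABETES'
""").fetchone()
print(diabetic_count)  # prints 2
```

The point is where the mapping effort lives: in the warehouse it is paid once, during integration, rather than by every analyst on every query.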
What is a Landing Zone?
A replicated data store can be a critical component of a data warehousing layer: the landing zone. This is very commonly required in healthcare, where the plethora of systems you need data from run on a variety of technical platforms, some of them non-standard (like Caché). A landing zone allows a data warehousing team to gather data from all systems onto one technical platform so that querying, cleansing, validation, and ultimately integration can happen, enabling loading into a formal data warehouse. Effective ETL in these cases differs radically from the basic principles that ETL vendors and consultants teach and promote; those may work well in other industries, but the unique challenges of healthcare make them obsolete. Instead, ETL should:
- Separate the E: extract data as a straight copy into the landing zone
- Use the T to cross the wide variety of systems and resolve identifiers, achieving data cleansing and integration
- Use the L to load the prepared data into the single target schema
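The E/T/L split above can be sketched as three separate functions. This is a toy illustration, with in-memory dictionaries standing in for real source systems and an assumed cross-system identifier map; all names here are hypothetical:

```python
def extract(source_systems):
    """E: copy raw records from each source into the landing zone, unchanged."""
    landing_zone = {}
    for name, records in source_systems.items():
        landing_zone[name] = list(records)  # straight copy, no transformation
    return landing_zone

def transform(landing_zone, id_map):
    """T: resolve identifiers across systems; cleanse and integrate."""
    integrated = {}
    for system, records in landing_zone.items():
        for rec in records:
            # Map each source-local patient id to one enterprise id.
            eid = id_map[(system, rec["patient_id"])]
            integrated.setdefault(eid, {}).update(
                {k: v for k, v in rec.items() if k != "patient_id"})
    return integrated

def load(integrated):
    """L: load prepared records into the single target schema (a list here)."""
    return [{"enterprise_id": eid, **attrs}
            for eid, attrs in sorted(integrated.items())]

# The same patient appears under different local ids in two systems.
systems = {
    "ehr":  [{"patient_id": "E-7", "dx": "diabetes"}],
    "labs": [{"patient_id": "L-42", "a1c": 8.1}],
}
id_map = {("ehr", "E-7"): 1, ("labs", "L-42"): 1}

warehouse = load(transform(extract(systems), id_map))
print(warehouse)  # prints [{'enterprise_id': 1, 'dx': 'diabetes', 'a1c': 8.1}]
```

Keeping the E a pure copy is what makes the landing zone useful on its own: the hard cross-system work is isolated in the T step, where it can be validated before anything reaches the target schema.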
The ability to understand the various possible data architectures, and to identify the one that best suits your environment, is very important when planning and designing an effective data/analytics solution. Understanding the architecture you already have, and when, where, and how to populate and use it, is equally important. Too often healthcare leaders complain that they cannot get information out of the data warehouses their IT teams have built, and so they find other ways to get what they need. This is of negative value both to business/clinical/research needs and to the reputation and respect of IT. If you already have a replicated data structure but demand for a true data warehouse is high, you already have a significant part of the challenge out of the way. Understanding what steps to take next to truly enable that data for access and usage within your organization is your challenge. Be careful that you don't throw out or abandon your replicated database, but also don't try to make it your data warehouse by tweaking it. Beating a square peg into a round hole doesn't make it round.
About the Author
Bruce has over 20 years of IT experience focused on data/application architecture and IT management, mostly relating to data warehousing. His work spans the healthcare, finance, travel, transportation, and retail industries, among others, working formally as an IT architect, manager/director, and consultant. Bruce has successfully engaged business leadership in understanding the value of enterprise data management and in establishing the backing and funding to build enterprise data architecture programs for large companies. He has taught classes to business and IT resources ranging from data modeling and ETL architecture to specific BI/ETL tools and subjects like "getting business value from BI tools". He enjoys speaking at conferences and seminars on data delivery and data architectures. Bruce D. Johnson is the Managing Director of Data Architecture, Strategy, and Governance for Recombinant Data (a healthcare solutions provider) and can be reached at firstname.lastname@example.org