Building Scalability Into Your Warehouse

By Bruce Johnson

The ability of your infrastructure to grow with the needs and usage of your solution is what defines it as scalable.  Most architects are comfortable selecting and implementing hardware that scales.  Data warehousing and analytics, however, give users access to information for analysis where it may be impossible to predict the various ways they will use it and the various metrics they will need.  Once users can see and analyze information effectively, it typically generates many more questions that you could not have predicted beforehand.  How do we build a data model and solution development approach that can also scale with the needs of the user?

In this article we will look at an approach that lets you develop your warehouse and its associated metrics and tools in a manner that scales with usage.


Root Level Data Is Key

Root level data is best defined as the native data elements as they are captured.  Ideally, if you capture the data in its native state, you can then apply many future views to it.  The simplest example is the concept of birth date.  If you capture someone’s birth date, you can calculate that person’s age at the time of any event tied to your business cycle.  That doesn’t mean one date; it means unlimited dates.  If, for example, I have the date of a patient’s surgery captured, I can compare it to the birth date to calculate the patient’s age at the time of surgery.
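The birth-date example can be sketched in a few lines.  This is a minimal illustration, not production code; the function name and dates are hypothetical.

```python
from datetime import date

def age_at(birth_date: date, event_date: date) -> int:
    """Age in whole years on the event date, derived from the root element."""
    years = event_date.year - birth_date.year
    # Subtract one year if the birthday hasn't yet occurred in the event year.
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# One stored root element (birth date) supports any number of event dates:
birth = date(1960, 6, 15)
print(age_at(birth, date(2010, 6, 14)))  # 49 (day before birthday)
print(age_at(birth, date(2010, 6, 15)))  # 50 (age at time of surgery, say)
```

Because the root element is stored rather than a precomputed "age" column, every new event date in the business cycle gets this derivation for free.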

The goal is to enable many ways to look at the data.  Get this root data right and many options become available.  It is also much easier and faster to gather root data.  Rather than spending significant amounts of time during requirements trying to define every metric, calculation, derivative, and view, focus on the capture and integration of root level data.  Metric and view definitions should emerge from the actual usage of the data, not from the initial storage.  You may decide to store data in a summarized or aggregated manner once you have worked out those metrics, but you should never store it summarized or aggregated while discarding the initial root data.  Any future needs that require that root data then become impossible to meet without significant effort and rework.
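The point about deriving metrics at usage time can be made concrete.  A minimal sketch, assuming hypothetical surgery fact records kept at their native grain: any dimension present in the root data can be aggregated on demand, without deciding the summaries up front.

```python
from collections import defaultdict

# Hypothetical root-level records: one row per surgery event, no pre-aggregation.
surgeries = [
    {"patient_id": 1, "procedure": "hip replacement",  "year": 2010, "cost": 12000.0},
    {"patient_id": 2, "procedure": "hip replacement",  "year": 2011, "cost": 13500.0},
    {"patient_id": 3, "procedure": "knee replacement", "year": 2010, "cost": 11000.0},
]

def total_cost_by(dimension: str) -> dict:
    """Aggregate at usage time, along any dimension the root data carries."""
    totals = defaultdict(float)
    for row in surgeries:
        totals[row[dimension]] += row["cost"]
    return dict(totals)

print(total_cost_by("procedure"))  # cost by procedure type
print(total_cost_by("year"))       # the same root rows, summarized another way
```

Had only the by-procedure summary been stored, the by-year question (and every question after it) would require rework back at the source.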

Organizations with a mature data warehousing environment that leverages comprehensive atomic level data in this way can build many analytics solutions rapidly by focusing on the usage of root data that is already integrated.  If the data isn’t truly integrated or modeled as root data, that value is unlikely to be realized.


Divide And Conquer Data Modeling

Another key to success lies in having a subject area model that divides the major parts of your model.  This is a must for all large or enterprise level data warehouses and should take little effort to achieve.  Too often, Subject Area Models become a massive undertaking that is misrepresented and misunderstood.  Without one, you either take a “boiling the ocean” approach to data definition, or you model parts of the model one project at a time, which invariably results in rearchitecting the model project after project (which in turn means rebuilding your database, tool access, and metrics).

Once you have a Subject Area Model in place, you can build out the data model a subject at a time, or, if you are really savvy in your project approach, parts of a subject at a time.  It is very reasonable to model a subject in parts, but explaining that approach would take more space than is available here.  This divide and conquer approach to data modeling has to be driven by layers of priorities.  To be effective, you must choose what to model based on these factors:

  • Business Value – Have specific use cases and areas of analysis that would add immediate business value.  These opportunities for business growth or improvement are identified at a high level, and the next three items are then applied to set an appropriate priority.
  • Data Availability – Even though the business value of an opportunity could be sky high, if the data to support it isn’t available or isn’t even captured, it is best not to start there.
  • Data Quality – The quality of the captured data is also critical to prioritizing opportunities.  That doesn’t mean quality must be perfect, but it must be good enough to satisfy the business usage.  Some efforts may require near-perfect data, while others just require access to whatever data you have.
  • Showstoppers – Most organizations have several stories of projects that were started and progressed very far down the path before being cancelled, partially completed, or flat out dropped in their tracks.  There are many hidden showstoppers around analytics opportunities that need to be understood before prioritizing your efforts.
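One way to apply these four factors is a simple scoring pass over candidate subject areas.  This is a hypothetical sketch, not a method from the article: the 1–5 scales and the subject names are assumptions, business value sets the ceiling, the weaker of availability and quality gates it, and a showstopper vetoes outright.

```python
def priority_score(business_value: int, data_availability: int,
                   data_quality: int, showstopper: bool) -> int:
    """Score a candidate subject area; inputs on an assumed 1-5 scale."""
    if showstopper:
        return 0  # a known showstopper vetoes the effort regardless of value
    # Value sets the ceiling; the weakest enabler gates how much is realizable.
    return business_value * min(data_availability, data_quality)

# Hypothetical candidate subject areas:
candidates = {
    "Patient":  priority_score(5, 4, 3, showstopper=False),
    "Claims":   priority_score(5, 1, 4, showstopper=False),  # data barely captured
    "Provider": priority_score(3, 5, 5, showstopper=False),
    "Genomics": priority_score(5, 5, 5, showstopper=True),   # vetoed outright
}
for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

The exact arithmetic matters less than the shape: value alone does not pick the first subject area; the enabling factors and showstoppers do.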


Separation Of Data Integration And Data Delivery

When you are taking a large/enterprise data warehousing approach, it is extremely important to separate the data modeling and data integration activities from the application development or delivery of data via applications.  By separating logical areas of the data by Subjects, you can now focus on the most valuable or easiest to acquire subject areas first, getting the process down and receiving real value without completing the whole project in one shot.  While some see this as limiting in that one user won’t get everything they want right away, it is actually the opposite.  Many users will get more value initially and many more will get full value in the long run.  In comparison to other approaches, overall cost will go down and value will go up significantly.

This approach allows Data Integration resources to focus on getting all of the root level data acquired (one of the most difficult aspects of any data warehouse project), while your BI resources focus on defining the metrics and display mechanisms that the users need.  This also enables many different applications, and even many different user areas, to be served off of the same data.  Each user area or group can have its specific view of the data, separate from how other areas view it.
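The idea of many user-area views over one integrated data set can be sketched as follows.  The records, field names, and view functions here are all hypothetical; the point is that delivery-layer views project and label the same root rows differently without duplicating the integration work.

```python
# One integrated root data set (hypothetical encounter records).
encounters = [
    {"patient_id": 1, "dept": "ortho",  "charge": 1200.0, "los_days": 3},
    {"patient_id": 2, "dept": "cardio", "charge": 4800.0, "los_days": 5},
]

def finance_view(rows):
    """Finance area: charges by department, in its own vocabulary."""
    return [{"department": r["dept"], "revenue": r["charge"]} for r in rows]

def clinical_view(rows):
    """Clinical operations area: length of stay, from the same rows."""
    return [{"department": r["dept"], "length_of_stay": r["los_days"]} for r in rows]

print(finance_view(encounters))
print(clinical_view(encounters))
```

In a real warehouse these would typically be database views or semantic-layer definitions rather than Python functions, but the separation is the same: integration populates the root rows once; each delivery view is cheap to add.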



Scalability in hardware and development approaches is a value proposition that allows you to implement what you need to get started and begin to realize value, knowing that as future needs and resources permit, you can continue to grow and serve a broader audience.  This also greatly helps in seeking funding for your efforts.  No matter the business reasons for funding, it is always a challenge to justify really large efforts.  By taking this approach, you can justify the first component, demonstrate success by delivering as promised, and justify each future effort by understanding specific use cases around the next components of data you integrate into the warehouse.

About the Author

Bruce has over 20 years of IT experience focused on data/application architecture and IT management, mostly relating to Data Warehousing. His work spans the industries of healthcare, finance, travel, transportation, and retailing, working formally as an IT architect, manager/director, and consultant. Bruce has successfully engaged business leadership in understanding the value of enterprise data management and in establishing the backing and funding to build enterprise data architecture programs for large companies. He has taught classes to business and IT resources ranging from data modeling and ETL architecture to specific BI/ETL tools and subjects like “getting business value from BI tools”. He enjoys speaking at conferences and seminars on data delivery and data architectures. Bruce D. Johnson is the Managing Director of Data Architecture, Strategy, and Governance for Recombinant Data (a healthcare solutions provider) and can be reached at