Probable Cause Of Religious Wars
By Larissa Moss
I’m talking about the occasional religious wars we wage in IT. Here are some examples: What type of requirements modeling is better – data flow diagrams or entity relationship diagrams (circa 1976)? What type of DBMS is better – hierarchical or relational (circa 1982)? What type of data warehouse design is better – relational or multidimensional (circa 1996)? What type of methodology is better – waterfall or agile (circa 2001)? What type of agile methodology is better for DW/BI – software-centric Scrum/XP or data-centric Extreme Scoping™? The latter appears to have been triggered by my last article “Beware of Scrum Fanatics on DW/BI Projects.”
After the article was published, I received the usual handful of emails from readers offering their comments. Only this time, the emails fell into two distinct camps. The data people completely understood and agreed with all the points I made. Some Scrum people who wrote to me didn’t. Both camps seemed equally convinced and passionate about being right. That reminded me of an earlier feud between Bill Inmon and Ralph Kimball, who argued vigorously about the best design for a DW. Both of them were equally convinced and passionate about being right. They both used the same term “data warehouse” but had a completely different understanding of, and philosophy about, what a DW was or what it should be. That’s when I had an “Aha.” Religious wars erupt when people use the same terminology but have completely different interpretations of what the term means. Let’s take a look at some important DW/BI terms and what they mean to different people.
Data Warehouse
Wikipedia (http://en.wikipedia.org/wiki/Data_warehouse#History_of_data_warehousing) offers the following history of data warehousing:
“The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the “business data warehouse”. … In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. … Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources was typically in part replicated for each environment. …Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to “retail stores” or “data marts” that were tailored for ready access by users.”
It seems that Bill Inmon and Ralph Kimball interpreted the need and impetus for data warehousing differently, because they developed completely different design and implementation solutions for a DW. Bill Inmon obviously tried to tackle the data management aspect of a DW by creating a trusted database as the “one central point” addressing the original reason for a DW, namely “gathering, cleaning and integrating data from various sources” once and only once – and then reusing that data from that point forward. Ralph Kimball obviously tried to focus more on the data delivery aspect of a DW by considering it to be the “one central point” in terms of a collection of “retail stores or data marts that were tailored for ready access by users.” Needless to say, these two gentlemen had a completely different understanding and philosophy of a DW, and both created the “best” database design for their interpretation of a DW.
This discussion is still relevant when deciding which agile method is the “best” to use for your DW/BI projects. It depends on your understanding and philosophy of what a DW means to your company. Is your company’s priority data management or data delivery? Of course, everybody addresses both to some degree, but what is the overriding objective for your DW/BI initiative? What is more important to your company? Gathering and delivering data from your operational systems as quickly as possible or taking the time to standardize, integrate, cleanse, document and store the data in a way that it can be reused by everyone in the organization? These are two completely different objectives that require different activities, different skill sets, different participants, and a different type of project effort. The first is a software-centric project because the primary goal is to get data into users’ hands as quickly as possible. In too many projects of this type, I have seen the term data warehouse used to describe a data dump, which is either a database of unrelated tables or a database logically partitioned to hold copied replicas of related source data tables (normally called a persistent staging area [PSA] database). The second is a data-centric project because the primary goal is to clean up the existing data chaos and store a trusted pool of integrated, ratified and reusable data. My point is that the first can be accomplished with any of the popular software-centric agile approaches (with just enough modifications to address data quality) but the second needs a data-centric agile approach that accommodates all the data management activities Michael Brackett describes in his two books: The Data Warehouse Challenge: Taming Data Chaos and Data Resource Quality: Turning Bad Habits into Good Practices.
Business Intelligence
You may be surprised to learn that the term BI is much older than the term DW. Here is an excerpt of its history and definition from Wikipedia (http://en.wikipedia.org/wiki/Business_Intelligence#History):
“In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as: “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” In 1989 Howard Dresner (later a Gartner Group analyst) proposed BI as an umbrella term to describe “concepts and methods to improve business decision making by using fact-based support systems.” It was not until the late 1990s that this usage was widespread. … Business intelligence (BI) refers to skills, technologies, applications and practices used to help a business acquire a better understanding of its commercial context. Business intelligence may also refer to the collected information itself. BI applications provide historical, current, and predictive views of business operations. Common functions of business intelligence applications are reporting, OLAP, analytics, data mining, business performance management, benchmarks, text mining, and predictive analytics.”
Before I introduce a definition of BI from The Data Warehousing Institute (TDWI), I want to point out the distinction made by Wikipedia (and industry experts) between BI and BI applications. I do that because I have seen many companies think that the collection of their BI applications is “BI” itself.
TDWI defines BI as a program that is best described by its three-layer framework, which is taught at almost every TDWI conference in the course TDWI Business Intelligence Fundamentals: From Data Warehousing to Business Impact. The first layer is the business layer, which requires participation from the business community to identify, agree upon and prioritize business requirements and business value, and to help manage the program as well as the interdependent projects and resources. The second layer is the administration and operation layer, which requires collaboration between the businesspeople and the DW/BI teams. This layer includes the establishment of common standards, conventions, processes and procedures to be used in the BI environment, such as database architecture, ETL architecture, project organization, technology and tool standards, proper use of data in BI applications, ongoing DW/BI operations, and data governance. The third layer is the implementation layer – or, as I like to call it, the plumbing of BI. This layer is divided into two sections: data warehousing and information services – or, as Claudia Imhoff likes to say, the “get the data in” and “get the data out” aspects of DW/BI. Data warehousing includes identifying potential data sources and having businesspeople agree on the most appropriate ones, naming and describing data elements and having businesspeople agree on the business metadata, identifying dirty data and writing agreed-upon cleansing specifications and code, collectively architecting and incrementally developing the ETL process and target databases (ODS, EDW, PSA, DM), and populating the target databases and reconciling them back to the source systems as well as to each other (where appropriate). Information services, in turn, has two components: information delivery and business analytics. Both refer to applications, processes and procedures to turn trusted DW data into information and to train the business community how to use this information appropriately and effectively. Information delivery refers to “state of the practice” information tools and services like OLAP tools, querying and reporting tools, and data mining. Business analytics refers to “state of the art” business-focused performance monitoring applications, such as scorecards and dashboards for customer profitability, market basket analysis, and supply chain analysis.
How does all this relate to our latest religious war about which agile method is the “best” to use for your DW/BI projects? It depends on your understanding and philosophy of what BI means to your company. Once again, is your company’s priority data management or data delivery? If it is data management, you will have to spend considerable time and effort on all three layers of BI as described by TDWI and you will need a data-centric agile approach. On the other hand, if your company just expects you to concentrate on delivering BI applications with or without a central database (whatever you call it and however you architect it) and calls that BI, then using software-centric agile methodologies with the appropriate adjustments will probably work.
Data Integration
The term data integration also means different things to different people. Some people, especially technicians, think of data integration in technical terms. To them, data integration is nothing more than gathering data from multiple sources, cleaning it up a bit if required, and then storing it in one central database. The data people call that data consolidation. Data integration goes far beyond data consolidation. In addition to consolidating data, integration enforces data uniqueness – the building block of the “single version of the truth” that enables you to reuse the same data. What I mean by data uniqueness is that every data element has one and only one semantic meaning, one and only one unique name that is not being reused for a different data element by a different user, one distinct definition that is not applicable to any other data element, and one set of approved values (domain) and business rules. Data uniqueness does not mean that a data element can physically exist in only one table in one database. But it does mean that if a data element is stored redundantly for any reason, the name, definition, and domain of the redundant occurrences must be identical to those of the original. Most companies do not enforce data uniqueness in their operational systems. Most companies don’t even inventory their data. Most companies have no idea how many synonyms a data element has (i.e., how many different names the same data element has across different systems), how many homonyms exist (i.e., how many times the same name is reused for different data elements), or how much dirty data they have (i.e., data elements with values that violate specified domains and business rules).
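To make the distinction between synonyms, homonyms and dirty data concrete, here is a minimal sketch of an inventory scan, assuming a simple, invented list of data elements with their systems, names and definitions. In practice, deciding that two definitions carry the same semantic meaning takes human review; an exact-match comparison on definitions is only a stand-in.

```python
# A minimal, hypothetical sketch of a data element inventory scan.
# The inventory structure, systems, and element names are invented for
# illustration; real metadata repositories and profiling tools are far richer.
from collections import defaultdict

# Each entry: (system, element_name, definition)
inventory = [
    ("Billing", "cust_no",   "Unique identifier assigned to a customer"),
    ("CRM",     "client_id", "Unique identifier assigned to a customer"),
    ("Orders",  "status",    "Shipping state of an order"),
    ("HR",      "status",    "Employment state of an employee"),
]

# Synonyms: the same semantic meaning carried under different names.
# (Exact-match on definitions is only a stand-in for human semantic review.)
names_by_definition = defaultdict(set)
for system, name, definition in inventory:
    names_by_definition[definition].add(name)
synonyms = {d: n for d, n in names_by_definition.items() if len(n) > 1}

# Homonyms: the same name reused for data elements with different definitions.
definitions_by_name = defaultdict(set)
for system, name, definition in inventory:
    definitions_by_name[name].add(definition)
homonyms = {n: d for n, d in definitions_by_name.items() if len(d) > 1}

print("Possible synonyms:", synonyms)
print("Possible homonyms:", homonyms)
```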
Some companies think the resolution is simple: just assign a data owner to the various data subject areas and let the owners dictate the names, definitions, and domains. But it’s not that simple. Not all data owners have the highest rank and most political power in the company, and their dictated names, definitions and domains are often not accepted by other businesspeople. The result is either compromised quality of business metadata or, worse, redundancy and inconsistency of what is essentially one and the same data element.
Data governance is a practice that tries to address this issue. A prerequisite for successful data governance is the willingness and availability of businesspeople to get together with the designated data owners and a facilitating data administrator (not the same profession as database architect) to discuss data discrepancies as they are exposed. Then the businesspeople (data owners and information consumers) have to agree on semantically correct and standardized data names, definitions and domains that will be acceptable to everybody in the organization (regardless of pay grade). But first, the data administrator – hopefully with help from data stewards or business analysts – has to find the synonyms, homonyms, and dirty data. If this sounds like a monumental task, it is. Therefore, the businesspeople have to categorize their data: data that is mission critical and cannot tolerate any discrepancies, data that is important but not necessarily mission critical, so that a small threshold (maybe 1-10%) of dirty data is acceptable, and data that is insignificant from a decision-making perspective and can bear a larger threshold of dirty data. Keep in mind that what is deemed insignificant by one businessperson can be mission critical to another. The rule must be that the highest categorization wins. Otherwise you will open the door to redundancy and inconsistency again. Once these and other guidelines are established, data archeology can begin.
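As a rough illustration of how such thresholds might be applied once they are agreed upon, here is a hedged sketch. The categories, threshold values, domain and sample data are all hypothetical; the real figures would come out of the governance discussions described above.

```python
# A hypothetical sketch of applying category-based dirty-data thresholds.
# The categories, threshold values, domain, and sample data are illustrative;
# real figures would be agreed upon by the data governance participants.
thresholds = {
    "mission-critical": 0.00,  # no discrepancies tolerated
    "important":        0.05,  # e.g., up to 5% dirty data acceptable
    "insignificant":    0.20,  # a larger threshold can be tolerated
}

def dirty_ratio(values, is_valid):
    """Share of values that violate the agreed domain or business rule."""
    if not values:
        return 0.0
    return sum(1 for v in values if not is_valid(v)) / len(values)

# Example: a status code whose agreed domain is {"A", "I", "P"}.
sample = ["A", "I", "X", "P", "A", "?"]
ratio = dirty_ratio(sample, lambda v: v in {"A", "I", "P"})

category = "important"  # remember: the highest categorization wins
verdict = "within" if ratio <= thresholds[category] else "exceeds"
print(f"{ratio:.0%} dirty, which {verdict} the {category} threshold")
```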
The most effective technique to find, analyze and resolve synonyms, homonyms and dirty data is to apply the six normalization rules during entity-relationship modeling with the business community. Data people have called this process logical data modeling almost since Peter Chen invented the technique in 1976. It has nothing to do with database design. This is business architecture, not database architecture. Modeling the business from a data perspective is not just mapping existing source data names, definitions and domains into a business data model. Instead, it requires the following actions (a small illustrative sketch follows the list):
- Examine the name, definition, and the semantic intent of each logical entity to find potential duplicates of business objects that would otherwise not be discovered because the objects are known under different names in different systems (synonyms). At the same time, identify all business objects that appear to be different, yet businesspeople call these different objects by the same name (homonyms).
- Ensure that each entity instance has one and only one unique identifier (business key), which, in turn, is never reassigned to a new entity instance even after the old instance expired and was deleted from the database.
- Use the six normalization rules to put “one fact in one place”, which means one attribute (source data element, not measure) in one and only one owning entity. This means that an attribute can be assigned to only one entity, either as an identifier of that entity or as a descriptive attribute of that and no other entity. This modeling activity ensures that each attribute is captured once and only once, and that it remains unique within the data universe of the organization. Hence, the “single version of the truth.”
- When attributing the entities, examine the name, definition, and the semantic intent of each data element to find potential duplicates that would otherwise not be discovered because the data elements are known under different names in different systems (synonyms). At the same time, identify all data elements that appear to be different but are known to the businesspeople by the same name (homonyms).
- Determine and document the domain (allowable values) for all attributes. Ensure that each value is descriptive of and congruent with the attribute’s name, definition and semantic intent. In other words, do not overload attributes.
- The last and most important activity is to capture the business actions (business processes) that connect the business objects in the real world. These business actions are drawn as data relationships among the entities. It is paramount to capture them from a logical business perspective (not from a reporting pattern or data storage perspective) because these relationships are the basis for all potential access patterns, known and unknown, now and in the future. Businesspeople will never ask for information about two completely unrelated business objects because any such information would be irrelevant, if not nonsensical.
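The sketch below illustrates two of these checks in code form, namely one business key per entity and “one fact in one place”, against a tiny, hypothetical logical model. The entity and attribute names are invented for illustration; real logical data modeling is a facilitated activity with the business community, not a script.

```python
# A much-simplified, hypothetical sketch of two of the checks above:
# every entity has exactly one business key, and every attribute is owned
# by one and only one entity ("one fact in one place"). Entity and attribute
# names are invented; a real logical model is agreed upon with businesspeople.
from collections import defaultdict

# entity -> (business key, descriptive attributes)
logical_model = {
    "Customer": ("customer_number", {"customer_name", "credit_rating"}),
    "Order":    ("order_number",    {"order_date", "customer_name"}),
}

owners = defaultdict(list)
for entity, (business_key, attributes) in logical_model.items():
    for attribute in {business_key} | attributes:
        owners[attribute].append(entity)

# customer_name shows up in both Customer and Order; in the logical model it
# should be owned by Customer only and reached through their relationship.
violations = {a: e for a, e in owners.items() if len(e) > 1}
print("Attributes owned by more than one entity:", violations)
```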
You may ask again: How does all this relate to which agile method is the “best” to use for your DW/BI projects? It depends on your understanding and philosophy of the term data integration. If your primary focus is data management, then (using the 80/20 rule) you will spend 80% of your project effort on data-related activities, and you will most likely want to use a data-centric agile approach, which bases project scope and schedule on data efforts rather than code complexity and which does not force you into unreasonable deadlines or cadences. If, on the other hand, your company is only interested in data consolidation with little or no data integration as I described above and little or no data management as Michael Brackett describes in his books, then you can probably modify one of the popular software-centric agile methodologies to fit your project.
Conclusion
One of the comments I received on my last article hit the nail on the head. I will paraphrase: as long as the chosen agile methodology helps the project team produce the results the businesspeople want, there should be no religious war over it. I agree in principle. But I would argue that many businesspeople lack education in and understanding of DW and BI. Many think that BI applications are customized systems for their own consumption, similar to the old traditional decision support systems – the only difference being the use of new BI technology. That may actually be an acceptable or even desired result in some companies. However, other companies have matured to a higher level on the BI Maturity Model. They place a higher priority on data management than on data delivery, in which case their education, understanding and interpretation of DW/BI terms and efforts must match the agile methodology you choose. In closing, I would like to recommend reading about Wayne Eckerson’s BI Maturity Model in his book Performance Dashboards: Measuring, Monitoring, and Managing Your Business, ISBN 0471724173, Chapter 5.
About the Author
Larissa Moss is president of Method Focus Inc., and a senior consultant for the BI Practice at the Cutter Consortium. She has 27 years of IT experience, focused on information management. She frequently speaks at conferences worldwide on the topics of data warehousing, business intelligence, master data management, project management, development methodologies, enterprise architecture, data integration, and information quality. She is widely published and has co-authored the books Data Warehouse Project Management, Impossible Data Warehouse Situations, Business Intelligence Roadmap, and Data Strategy. Her present and past associations include Friends of NCR-Teradata, the IBM Gold Group, the Cutter Consortium, DAMA Los Angeles Chapter, the Relational Institute, and Codd & Date Consulting Group. She was a part-time faculty member at the Extended University of California Polytechnic University Pomona, and has been lecturing for TDWI, the Cutter Consortium, MIS Training Institute, Digital Consulting Inc. and Professional Education Strategies Group, Inc. She can be reached at [email protected]