Enterprise Data Modeling – Is It Worth It?
By Larissa Moss
Enterprise data modeling (EDM) got a bad reputation in the late 1980s, and for good reason. It took too long to model the entire enterprise, and the model had little or no effect on how systems were built. It was perceived as a wasted effort, and the practice was abandoned by most companies. However, in the late 1990s, it became clear that we had thrown the baby (sound data management principles) out with the bathwater (tedious and ineffective approaches). As a result, new disciplines started to emerge, such as metadata management, data governance, data stewardship, master data management, enterprise information management, information center of excellence, and so on. These new disciplines all strive to achieve the same goal as EDM, namely to create and maintain an inventory of data assets that are unique, accurate, reusable, and traceable. Can these new disciplines replace EDM or does EDM still have a place in the data management practice? The few data administrators who survived the layoff sweep in the early 1990s believe it still does.
Back in the early 1980s, organizations that embarked on EDM followed the best practices of that time and attempted to model the entire enterprise all at once. The business people who participated in the modeling sessions gained invaluable insights into the business processes and the data of the organization, but they had little to show for their efforts besides a lot of entity-relationship diagrams decorating their cubicle walls. The main reason for this was that creating the EDM was never part of any system development project because the big-bang approach took too long and would have delayed the implementation of new systems. By the time the EDM was completed, the systems had already been developed independently and without adherence to the EDM. When money got tight in the late 1980s, business people were wondering what enterprise data modelers were contributing to the bottom line, and many EDM efforts were stopped.
Big-Bang Pros And Cons
Advocates of the big-bang approach will still argue that its benefits outweigh its shortcomings. They will point out that unless all major business entities and their relationships are identified up-front, the EDM will contain errors and will need to be reworked. They will also declare that unless all the attributes pass all of the normalization rules, some of the attributes are guaranteed to be misplaced and will need to be moved. In addition, they will say that unless all the definitions, domains, and business rules are defined for all the attributes, the EDM cannot be considered complete or accurate. And, if that EDM, fraught with all those errors, were used to design one or more application databases, those databases would have to be redesigned as the errors are discovered, which would have a domino effect on all the programs accessing those databases. While there is some truth to all of these points, the practical arguments against a big-bang approach far outweigh these risks.
There are two practical arguments against the big-bang approach. First, it takes much too long to find and resolve all the conflicting data views and business rules that currently exist among business people, not to mention finding and resolving all the data redundancies and data inconsistencies that currently exist among systems. Second, it requires participants from too many business areas (e.g., various data owners, data stewards, subject matter experts, and information consumers) to be involved in all the modeling sessions. The coordination of people's availability and meeting-room availability alone would be a major challenge, not to mention the facilitation effort to keep these meetings productive and moving forward. The big-bang approach required that all major participants be involved throughout the data modeling process; otherwise, the EDM could not be considered complete or validated. In practice, this proved physically impractical.
Incremental Pros And Cons
The arguments against an incremental approach to building the EDM are the very same arguments used to defend the benefits of the big-bang approach.
Building the EDM incrementally means building a project-specific, fully normalized, fully attributed, fully defined, and fully documented logical data model during the development of each system, while limiting the logical data modeling activities to only those data elements that are in the scope of that project. Assuming that no prior EDM exists, the first project-specific logical data model is the first iteration of the EDM. The second project-specific logical data model, which is also limited to the data elements that are in the scope of the second project, is then merged with the first one to produce the second iteration of the EDM. Any discrepancies, ambiguities, or inconsistencies between the two models are presented to the participants of both projects to be resolved. This process is repeated with all subsequent projects. This way, over time, an EDM can be achieved with incremental value obtained along the way, because the systems being built now share the same data names, data definitions, business rules, and so on.
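The merge step described above can be sketched in code. The following is a minimal, illustrative Python sketch (not a real modeling-tool API); the entity names, attribute definitions, and the `merge_models` helper are all hypothetical, and a real merge would compare far richer metadata than a single definition string:

```python
def merge_models(edm, project_model):
    """Merge a project-specific logical data model into the EDM,
    collecting discrepancies that must be resolved jointly by the
    participants of both projects."""
    conflicts = []
    for entity, attributes in project_model.items():
        if entity not in edm:
            edm[entity] = dict(attributes)   # new entity: add as-is
            continue
        for attr, definition in attributes.items():
            existing = edm[entity].get(attr)
            if existing is None:
                edm[entity][attr] = definition          # new attribute
            elif existing != definition:
                # same attribute, conflicting definitions -> escalate
                conflicts.append((entity, attr, existing, definition))
    return conflicts

# First iteration of the EDM comes from the first project's model.
edm = {"Customer": {"Customer Number": "unique identifier",
                    "Customer Name": "legal name"}}

# The second project models Customer with a conflicting definition
# and introduces a new Account entity.
project2 = {"Customer": {"Customer Number": "unique identifier",
                         "Customer Name": "preferred mailing name"},
            "Account": {"Account Number": "unique identifier"}}

conflicts = merge_models(edm, project2)
print(conflicts)  # one conflict on Customer Name, to be resolved jointly
```

The point of the sketch is that each merge either grows the EDM quietly or surfaces a conflict that goes back to the business participants, which is exactly the incremental resolution process described above.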
Note: To avoid excessive rework, the normalization rules must be followed without exception. That is the only way to ensure that the disjointed pieces of the EDM will fit seamlessly as they are assembled from the various project-specific logical data models over time.
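To make the normalization point concrete, here is a small, assumed example of the kind of check a logical data modeler performs: detecting that an attribute depends on only part of a composite key (a second-normal-form violation), which means the attribute is misplaced and belongs in its own entity. The data and the `depends_only_on` helper are illustrative, not from any tool:

```python
def depends_only_on(rows, determinant, attribute):
    """True if `attribute` has exactly one value per value of
    `determinant`, i.e. the attribute is functionally determined
    by that single column alone."""
    seen = {}
    for row in rows:
        key = row[determinant]
        if key in seen and seen[key] != row[attribute]:
            return False
        seen[key] = row[attribute]
    return True

# Denormalized order lines with composite key (order_id, product_id).
rows = [
    {"order_id": 1, "product_id": "A", "product_name": "Widget", "qty": 2},
    {"order_id": 1, "product_id": "B", "product_name": "Gadget", "qty": 1},
    {"order_id": 2, "product_id": "A", "product_name": "Widget", "qty": 5},
]

# product_name is determined by product_id alone, so it belongs in a
# separate Product entity; qty genuinely depends on the full key.
print(depends_only_on(rows, "product_id", "product_name"))  # True
print(depends_only_on(rows, "product_id", "qty"))           # False
```

Catching this kind of partial dependency in each project-specific model is what keeps the EDM pieces fitting together as they are merged over time.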
The benefits of incrementally building the EDM can be reaped immediately. The additional effort and time it takes to produce a project-specific logical data model is minuscule in comparison to a big-bang effort. This approach involves fewer participants at one time because of the reduced project scope. The validation time is also shortened due to the reduced number of participants. Business rules and business metadata can be collected, validated, refined, and revised slowly over time, which comfortably accommodates the rate of change in business policies in most organizations. Errors, misunderstandings, omissions, hidden obstacles, and other unavoidable issues can be resolved incrementally, which gives the organization time to mature into a learning organization, one that has an increased capacity to adapt to change rapidly.
Another debate that often occurs is the topic of top-down versus bottom-up logical data modeling. This is an ill-conceived debate because both techniques must be applied to ensure a robust and realistic logical data model, regardless of whether you are creating a project-specific logical data model or the EDM. Since I recommend building the EDM incrementally, we will use project-specific logical data models as examples to continue our discussion of modeling approaches.
Top-down logical data modeling refers to the technique of facilitating fact-finding discussions among participants from various business areas, who own, originate, or use the data. The logical data modeler (I use the term logical data modeler deliberately to distinguish between a business data modeler and a database designer) assigned to the project is usually the facilitator of these discovery sessions. He or she helps the participants to identify the major business entities, the data relationships among them, the unique identifiers of those entities, the most significant and obvious attributes of those entities, and the main business rules for those attributes.
This top-down fact-finding activity produces a first-cut “straw-man” logical data model. The logical data modeler and one of the participants then go through several iterations of revising and fleshing out the details of this project-specific logical data model, while writing the definitions for the entities and the attributes. Throughout this process, they periodically call back the participants to validate their work using follow-up top-down discovery sessions. Once the logical data model for the project is deemed to be finished, the “forward-engineering” function of the data modeling tool, which was used to capture the logical data model, can then produce the first-cut physical data model that the DBA can use to start his or her database design activities.
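The forward-engineering step mentioned above can be illustrated conceptually. This is a hedged sketch of what a modeling tool does when it generates a first-cut physical model from a logical one; the `TYPE_MAP` domains, the `forward_engineer` function, and the type choices are all assumptions for the example, not any particular tool's behavior:

```python
# Illustrative mapping from logical attribute domains to physical types.
TYPE_MAP = {"identifier": "INTEGER PRIMARY KEY",
            "text": "VARCHAR(100)",
            "money": "DECIMAL(12,2)"}

def forward_engineer(entity, attributes):
    """Emit a first-cut CREATE TABLE statement from one logical entity,
    converting business names to physical column names."""
    cols = [f'  {name.lower().replace(" ", "_")} {TYPE_MAP[domain]}'
            for name, domain in attributes.items()]
    return f"CREATE TABLE {entity.lower()} (\n" + ",\n".join(cols) + "\n);"

ddl = forward_engineer("Account",
                       {"Account Number": "identifier",
                        "Account Name": "text",
                        "Monthly Checking Account Balance": "money"})
print(ddl)
```

The output is only a starting point: the DBA still makes denormalization, indexing, and partitioning decisions before the database design is done, which is why the article calls it a first-cut physical data model.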
Using the top-down data modeling technique produces a relatively valid logical data model in a relatively short time. However, using this technique alone has several inadequacies.
The project-specific logical data model will most likely not be complete because it is impossible to think of all the details during discovery sessions without performing some amount of “bottom-up” analysis of the operational source files, databases, and programs. That type of analysis often reveals additional attributes, relationships, and business rules, and in some cases even additional entities.
The project-specific logical data model will most likely not reflect the correct representation of all the data because it was based on the participants’ understanding of their business or their industry and on the business rules they thought “should” govern the data. Analyzing the existing source files and databases often reveals that some of the stated business rules are either not enforced or incorrect.
Top-down discovery sessions are extremely effective for finding and resolving many high-level issues with diverse participants in a short time, but they are not helpful for detailed data analysis and research. On the contrary, the more people who try to analyze the same thing, the longer it will take. In addition, participants get anxious and feel they are wasting their time when the research activities bog down in laborious details.
Bottom line, top-down logical data modeling only goes so far before it has to be augmented with bottom-up analysis.
Bottom-up logical data modeling refers to the painstaking task of normalizing existing process-oriented (denormalized) data structures into a “best-guess” logical data model. The steps include:
Reviewing record layouts of files and the DDL of relational databases to locate potentially embedded (hidden) entities.
Tracing through primary and foreign keys to locate inferred entities and their potential data relationships to other entities.
Applying the normalization rules to each data element.
Converting technical column and table names to fully qualified business names and assigning prime, qualifier, and class word components. (An example of a standardized business name for a data element is Monthly Checking Account Balance. The main component (prime word) is "Account," which is further qualified by the word "Checking" to indicate the type of account. The class word, which indicates the type of data value contained in the element, is "Balance," which is further qualified by the word "Monthly" to indicate the type of balance.)
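Step 4 above can be sketched in code. This is an illustrative example only: the abbreviation dictionary, the list of class words, and the `business_name` helper are assumptions made for the sketch, since every organization maintains its own naming standards:

```python
# Hypothetical organization-specific abbreviation dictionary.
ABBREVIATIONS = {"mo": "Monthly", "chk": "Checking", "acct": "Account",
                 "bal": "Balance", "amt": "Amount", "cust": "Customer"}

# Class words name the type of data value an element contains.
CLASS_WORDS = {"Balance", "Amount", "Number", "Name", "Date", "Code"}

def business_name(column):
    """Expand a technical column name such as MO_CHK_ACCT_BAL into a
    fully qualified business name, and verify it ends in a class word."""
    words = [ABBREVIATIONS.get(part.lower(), part.title())
             for part in column.split("_")]
    if words[-1] not in CLASS_WORDS:
        raise ValueError(f"{column}: missing class word - name incomplete")
    return " ".join(words)

print(business_name("MO_CHK_ACCT_BAL"))  # Monthly Checking Account Balance
```

A dictionary lookup handles the mechanical expansion, but deciding which word is the prime word and whether the qualifiers are correct remains a human judgment, which is the article's larger point about bottom-up modeling.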
This process is sometimes referred to as reverse engineering, which is a common feature of data modeling tools. However, the reverse-engineering capabilities of these tools cover only a small portion of the four-step process for bottom-up logical data modeling – and only if the existing data structure is a relational database. Using the reverse-engineering function, the data modeling tools produce a physical data model of the underlying database by tracing through the primary and foreign key relationships between tables in the database (step 2), and then simply calling the tables "entities" and showing the traced key relationships as data relationships among the entities. The tools do not automatically perform any analysis, normalization, or business name qualification. In some cases (for example, in the case of a normalized database implementation), such a physical data model can be a useful starting point to approximate "best-guess" logical business entities. The logical data modeler can use these inferred potential entities to continue the normalization process. But without further normalization, a reverse-engineered physical data model is never equivalent to a true bottom-up logical data model, and thus could not – and should not – be used "as-is" for EDM. And it is doubtful that tools will ever be able to mimic the human reasoning and judgment processes that are intrinsic to bottom-up logical data modeling.
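The key-tracing part of reverse engineering (step 2) is mechanical enough to demonstrate. The following minimal sketch uses an in-memory SQLite database purely for illustration; the schema is invented, and, as the article notes, this tracing yields only candidate entities and relationships, with no normalization or business naming:

```python
import sqlite3

# Invented two-table schema with one foreign key, for demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE account (
  account_id INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customer(customer_id)
);
""")

# Each table becomes a candidate "entity"; each traced foreign key
# becomes a candidate relationship between two entities.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
relationships = []
for table in tables:
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        relationships.append((table, fk[2]))  # fk[2] = referenced table

print(tables)         # candidate entities, named after the tables
print(relationships)  # [('account', 'customer')]
```

This is roughly all the automated portion delivers: tables relabeled as entities and key paths relabeled as relationships. The analysis, normalization, and name qualification that make it a logical model remain manual work.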
As just described, using the four steps of the bottom-up logical data modeling technique produces a relatively complete project-specific logical data model. However, using this technique alone also has several inadequacies.
The "best-guess" logical data model produced by the bottom-up approach cannot be trusted until it is meticulously reviewed and validated by all the business people who own, originate, or use the data.
The project-specific logical data model is immediately very large and complex because it includes all the details from the existing source files and databases from the start. This makes it overwhelming for some business people to review the logical data model, and some even refuse to participate in any review sessions. This immediate complexity is also problematic for some less-experienced logical data modelers who can easily get lost in the ambiguities and open questions during the iterative refinement process.
The project-specific logical data model will most likely not reflect the correct representation of all the data because it was based on the built-in structural limitations of the existing files and databases, and on the programmed-in business rules of the existing applications. When business people are asked to validate these derived logical data models, they frequently point out incorrect or missing business rules, data relationships, attributes, and even entities.
Bottom line, bottom-up logical data modeling only goes so far before it has to be augmented with facilitated top-down validation. As you can see, both the bottom-up and top-down approaches bring solid benefits, but each is somewhat incomplete without the other.
Data reusability requires having an inventory of all unique business entities and their unique attributes, which in total represent the “single version of the truth” about the organization. This inventory can most easily be achieved through EDM. An EDM can be created incrementally, as it evolves over time when the project-specific logical data models are merged one by one, a project at a time. The project-specific data models must use a combination of top-down and bottom-up logical data modeling techniques to be complete and correct, including the six normalization rules, which reliably place each unique atomic attribute into its one – and only one – owning entity.
About the Author
Larissa Moss is president of Method Focus Inc., and a senior consultant for the BI Practice at the Cutter Consortium. She has 27 years of IT experience, focused on information management. She frequently speaks at conferences worldwide on the topics of data warehousing, business intelligence, master data management, project management, development methodologies, enterprise architecture, data integration, and information quality. She is widely published and has co-authored the books Data Warehouse Project Management, Impossible Data Warehouse Situations, Business Intelligence Roadmap, and Data Strategy. Her present and past associations include Friends of NCR-Teradata, the IBM Gold Group, the Cutter Consortium, DAMA Los Angeles Chapter, the Relational Institute, and Codd & Date Consulting Group. She was a part-time faculty member at the Extended University of California Polytechnic University Pomona, and has been lecturing for TDWI, the Cutter Consortium, MIS Training Institute, Digital Consulting Inc. and Professional Education Strategies Group, Inc. She can be reached at firstname.lastname@example.org