No Silver Bullets for Data Integration
By Larissa Moss
If you ask ten knowledgeable people for their definition of integration, you are likely to get ten different answers. Most of the answers will be an interpretation of Webster’s dictionary definition for integration, which is “to make whole or complete by adding or bringing together parts.” This could mean anything from two systems passing data back and forth (loosely coupled) to a shared data environment where all data elements are unique and non-redundant and are being reused by multiple applications (tightly coupled).
For example, when creating an integrated view of a customer you may start by bringing together customer-related data elements from multiple sources, such as:
- What the customer filled out in a loan application
- Transactions this customer performs with our organization
- Demographic data on the customer available from external vendors
- Data from sales force automation or order management applications
- Information we get when the customer calls our customer service department or asks for information on our Web site
Unfortunately, a lot of the customer data is usually duplicated in several of these source systems. As a result, the redundant data is often inconsistent. Thus, just bringing together parts (the inconsistent data) to make whole the image of the business entity “customer” is obviously not the only thing we strive for with data integration. Therefore, we have to refine Webster’s definition into a more complete definition of data integration: “To make a business entity or subject area whole or complete by adding or bringing together their unique data elements and storing them non-redundantly (or at least consistently) so that they can be reused by multiple applications.” This adds a new level of complexity to data integration, which technology solutions alone cannot adequately address.
Ineffective “Silver Bullet” Technology Solutions
Realizing that data integration is the key to thriving in the fast-paced information-oriented economy, companies are looking for the silver bullet solution to their data integration problems. However, they discover quickly that the real issues underlying the current “dis-integrated” data chaos cannot be solved by technology solutions without fundamentally changing the way they manage their data redundancy.
Enterprise Resource Planning (ERP)
ERP solutions are a collection of functional modules used to integrate operational data to support seamless operational business processes for the enterprise. ERP products were meant to solve the redundant and inconsistent operational data mess by consolidating operational data into ERP modules. Good idea – if properly implemented. Properly implemented meant that this was a chance to perform extensive research on the data to be converted into the ERP modules. Extensive research would have included analyzing all data elements for their definitions, contents, semantics, and business rules. This extensive business analysis activity would have included finding and fixing gaps in business knowledge and lost data interrelationships. Correcting existing data impairments should have been a mandate for every ERP conversion. But it rarely, if ever, was. Instead, most ERP conversion projects were performed the old-fashioned way using traditional source to target mapping, which does not include extensive data domain (content) analysis, such as:
- Finding data elements which have multiple meanings and are used for multiple purposes
- Finding missing relationships between business entities
- Finding and resolving duplicate keys
- Validating data contents among dependent data elements across files
- Deciphering, separating, and translating one-byte or two-byte codes
- Finding and extracting processing logic embedded in the data values
Therefore, as a result of performing a traditional conversion, the promises of the ERP technology solution to solve data integration and data quality problems did not materialize to the extent expected.
Data Warehousing (DW)
Another serious attempt at integrating data has been without question the drive towards data warehousing. The definition of data warehousing speaks to this attempt: “A DW delivers a collection of integrated data used to support the decision making process for the enterprise.” A solution at last, and a sound plan – if properly implemented. Properly implemented once again meant extensive analysis of the operational data to find the data redundancy and data inconsistency problems and to correct them. This detailed analysis had to be performed on all data elements within the scope of the DW in order to deliver a pool of non-redundant, clean, consistent, and integrated data. This type of analysis cannot be magically performed by a tool but requires business analysts who have the knowledge to define the organization in business terms and with a cross-functional scope. These business analysts should be business people who are data owners or data consumers, guided by a skilled IT data analyst who documents the results of the analysis.
In reality, very few DW projects have the necessary user involvement. Users are still trapped in the habit of expecting IT to build a customized silo system just for them. IT is still trapped in obliging. Both are trapped in not understanding the real issues, or choosing to ignore them. As a result, DW initiatives are more often than not little more than silo data mart projects on a new platform, and many of the data integration problems remain to the dismay of disappointed users.
Customer Relationship Management (CRM)
CRM attempts to integrate customer information with product information through related business functions, such as sales, marketing, and order fulfillment. Over the last two decades, CRM has metamorphosed into a sophisticated set of tools and applications. Unfortunately, as with ERP systems, too many CRM conversions follow the traditional habits of source to target mapping without extensive data analysis and without too much data cleansing. Data is usually moved “as-is” into the new CRM modules with the unreasonable expectation of magically having clean, consistent, and integrated customer data in their CRM system as a result. Once again, simply using new technology is not the whole solution.
Enterprise Application Integration (EAI) and Enterprise Information Integration (EII)
The need for integration will only increase when more competitive demands are placed on organizations. Data integration must happen by hook or by crook, and it must happen quickly. EAI and EII vendors understand this pressure placed on organizations. They also realize that organizations have a colossal investment in their existing systems, which should be leveraged, if at all possible. EAI and EII middleware technology provides that leverage by allowing the unrestricted sharing of data among any connected applications and data sources in the organization. The obvious advantage of these technologies is that existing data can be shared without having to make any changes to existing systems and without having to build new ones. This could be a very cost-effective solution to solve data access and data integration problems across heterogeneous systems and platforms, if – and only if, the disparate data were non-redundant or at least consistent, and if it were reasonable performance-wise to expect sharing and collaboration.
The reality is that EAI and EII tools only eliminate writing customized bridges between existing systems. The tools have no affect on the quality of the data in the existing systems, nor do they help the business people interpret the data. For example, the data element Annual Income Amount could be stored redundantly in 45 customer files and in 30 of the 45 files these “duplicate” amount fields have a different value. EAI and EII tools are of no value to determine which one of these amount fields is the correct one. That determination must be made by data owners or negotiated with data consumers.
As mentioned above, integrating data is much more than connecting a few databases, building bridges among applications, or consolidating disparate data into one database without addressing the high percentage of data inconsistency and redundancy found in every organization. Integrating data is not a rote task that can be relegated to a tool. It is an important, precise, detailed, and rigorous analysis process performed by business analysts and data administrators. This process produces an inventory of data (either as logical data models or physical databases) where all data is integrated within a common business context and appropriately deployed for maximum use in supporting the business information needs of the organization. This task is not without challenges.
Knowing Your Business Entities
Each organization has key business entities that are the heart of their business. The automobile manufacturer has dealers, the retailer has suppliers, the mortgage loan company has brokers, and almost every organization has customers of one sort or another (policyholders, depositors, borrowers, students, patients, citizens, non-profit contributors, convicts, etc.). To be successful and to effectively compete, organizations must understand these entities to the point of knowing their profiles including who they are, how they shop, what they buy, what services they use, what channels they prefer, their preferences for receiving marketing information, price sensitivity, service level requirements, and quality requirements. The information about these entities must be integrated in a fashion that will allow the organization to take clever and most-effective actions. For example, a banking customer can choose to use an automated teller machine (ATM), the bank’s Web site, a voice response unit (VRU), a customer service representative (CSR), or the teller at the counter in the bank. The choice of customer channels will have a major impact on the cost to the bank. Speaking to a CSR will cost the bank between $6.00 US and $10.00 while a Web inquiry will cost only pennies.
Mergers and Acquisitions
When organizations are merged and when businesses are acquired, the normal process is to integrate the two companies. The potential for cost savings related to the integration is usually one of the primary justifications for the merger. The operative word is potential. The reality may be quite different and it is up to the IT folks to determine just how technically difficult it would be to for the two organizations in question to merge. The following are some of the difficulties involved in the process:
- Duplicate records – a customer may belong to the two companies that are merging.
- Duplicate keys – while the supplier numbers are unique within each company’s system, there are overlapping numbers between the two companies.
- Different data types and different field lengths are used by the two companies.
- Fields with the same names have different meanings.
- Fields with the same meaning have different names.
- Corresponding fields have different data values.
- Corresponding fields have different business rules.
- If the two companies use different database management systems (DBMS), there may be some technical or design incompatibilities.
Many companies only resolve items that are technical hurdles to physical data integration, but they rarely address the other items that only have a business impact, but not a technical impact. Sometimes, the reason given for their lack of rigorous data analysis is that expectations from the business people are difficult to manage. Managers and end-users think that the integration process should be easy and fast (unless they have been through it before).
Data redundancy is rampant in almost every organization. It is not uncommon to find data repeated ten, twenty, or even more times within the organization, and nobody knows which file is the system of record and which copy of the data most accurately reflects the real world. While there are legitimate reasons for consciously maintaining controlled redundant data (e.g., performance, legal, and security reasons), uncontrolled redundant data contributes significantly to the costs of developing and maintaining multiple redundant databases and applications. The costs include programmer effort, disk and machine costs, reconciling inconsistent results, needing to update (sometimes manually) multiple systems. Redundant data results in a loss of control, misunderstandings, and a continuing bad reputation for IT. The goal is to minimize data redundancy and have a single version of true and accurate data.
How did the data get there and why is it redundant and different? It might have started off with the same raw data but it was filtered, transformed, and had different business rules applied. A redundant set of data may have been created if the application developer for a new system was unaware of the existing data. The redundant data may have been consciously created when a new system did not want the political burdens of having to rely on another data owner.
Webster’s definition for lineage is “the direct descent from an ancestor.” Data lineage is therefore the process of tracking the descendent data elements from their origins to their current instantiations. Since most organizations have a plethora of redundant data, it is important to know where the data came from to determine which data source is most reliable. Documenting the data lineage for each data element in a data dictionary or a metadata repository would provide the origin and subsequent alterations and duplications of each data element. This data inventory is an invaluable data management resource. Not only does it provide guidance for using data elements correctly, but it is also extremely useful for impact analysis. One only has to remember the immense effort with Y2K impact analysis because most companies didn’t – and still don’t track their data lineage.
Establishing data lineage for each data element is a monumental effort given the fact that organizations have millions of data elements in thousands of files and tables. Being overwhelmed by the scope of the effort, most organizations do nothing. Rather than taking the all or nothing approach, a good place to start would be on a DW project. For example, you could capture data lineage for the subset of data elements used by your DW applications with the metadata in the ETL tool. It will track what data is sourced, how the data is transformed, and where it is stored for access. In addition to the ETL tool vendors, most of the BI and DW software vendors have some metadata capability to trace data lineage.
What Data To Integrate
Data integration is performed in two layers: logical and physical. Logical data integration is the process of building an enterprise logical data model – not all at once, but one project at a time as new systems are developed or old systems are converted. Physical data integration is the process of filtering redundant data, retiring redundant files and databases, and combining data elements for the same business entity into one physical database. This process also includes stopping uncontrolled data propagation where data originates from a single provider application and is then propagated to individual, subscriber applications.
Data Integration Prioritization
You cannot boil the ocean; you have to prioritize your physical data integration deliverables. An enterprise-wide data integration effort has to be carved up into small iterative projects, starting with the most critical data and working down to the less significant data. The data integration team must determine which data is most appropriate for integration. Some data may not be suitable for integration, such as private data (e.g., department-specific data), highly secured data, and data that is too risky to integrate. The team also needs to look at historical data and decide how much of it to include in their data integration process. This is especially true for DW and BI projects. The following questions should be considered when determining which data to include in the integration process: How will the data be used? What are the political issues surrounding data sharing? What are the security issues? Are there regulatory and legal requirements that would or should prevent data sharing?
Risks Of Data Integration
There are a number of risks associated with data integration, such as lack of management commitment, cost and effort, sustainability, external data, data selection and validation. These should be considered carefully before the integration process begins, so that the investment in data integration is not wasted.
A few words on performing data consolidation or data federation in lieu of data integration:
Many people confuse data integration with data consolidation. Consolidating data simply means gathering data elements that describe the same business entity (e.g., customer or product) from multiple source databases and storing them in one database. Integrating data goes beyond that. In addition to consolidating data, integration enforces data uniqueness by standardizing the business data, enforcing business rules, and eliminating (or reducing and controlling) data redundancy. Integration also means that data is formally named, unambiguously defined, appropriately architected, and its lineage is properly documented.
In a federated approach, data does not have to be consolidated or moved to a common, integrated database. A simplistic definition of data federation is leaving the data where it is, installing middleware like EAI or EII, and providing metadata to make people aware of the existence of the data, how current the data is, and how to get to it. This eliminates the need to convert, match, filter, and merge the data and avoids the struggles with data anomalies. However, this approach requires complete and current metadata as well as clean and consistent business data, not to mention good quality processes to promote and allow federation. The federated approach is not mutually exclusive with consolidation or integration. If the data is to be infrequently accessed, federation may make sense.
Data Integration Capability Maturity Model
The generic capability maturity model (CMM) can easily be adapted to a company’s data integration strategy, as summarized in Table 1.
Table 1 Data Integration CMM Levels
|Level 1||Limited data federation; often with redundant and inconsistent data|
|Level 2||Limited data consolidation; documenting redundancies and inconsistencies|
|Level 3||Data integration initiated; new “dis-integration” is discouraged|
|Level 4||Data integration widely adopted; “dis-integration” is penalized|
|Level 5||Enterprise-wide data integration and other data strategy principles practiced by all departments in the organization|
Data integration is a topic that is well publicized, much talked about, and heavily hyped by many tool vendors. It is a complicated topic that requires considerable effort if implemented to its fullest extent. Some less effective alternatives to data integration are data consolidation and data federation. EAI and EII middleware technology can also be utilized when appropriate and cost-effective, but realize that true data integration cannot be achieved by technology alone. It requires a strategy, a plan, a team, skills, sponsorship, and an ongoing commitment from the business people.
About the Author
Larissa Moss is president of Method Focus Inc., and a senior consultant for the BI Practice at the Cutter Consortium. She has 27 years of IT experience, focused on information management. She frequently speaks at conferences worldwide on the topics of data warehousing, business intelligence, master data management, project management, development methodologies, enterprise architecture, data integration, and information quality. She is widely published and has co-authored the books Data Warehouse Project Management, Impossible Data Warehouse Situations, Business Intelligence Roadmap, and Data Strategy. Her present and past associations include Friends of NCR-Teradata, the IBM Gold Group, the Cutter Consortium, DAMA Los Angeles Chapter, the Relational Institute, and Codd & Date Consulting Group. She was a part-time faculty member at the Extended University of California Polytechnic University Pomona, and has been lecturing for TDWI, the Cutter Consortium, MIS Training Institute, Digital Consulting Inc. and Professional Education Strategies Group, Inc. She can be reached at firstname.lastname@example.org
Adelman, Sid, Larissa Moss, and Majid Abai, Data Strategy, Addison-Wesley, 2005
Brackett, Michael H. The Data Warehouse Challenge: Taming Data Chaos. New York, John Wiley & Sons, 1996
Hall Curt. BI Advisory Service, Executive Update Vol. 4, No. 6. Cutter Consortium, 2004.
Linthicum, David. Enterprise Application Integration. Addison Wesley, 2000.
Levy, Evan. Baseline Consulting Group. Architectural Alternatives for Data Integration, TDWI FlashPoint, September 22, 2004
Moss, Larissa T. Data Strategy: Survival Guide for the Information Age. Cutter IT Journal, Vol. 19, No. 8, August 2006