Data Warehouse Roles and Responsibilities
By Anne Marie Smith, Ph.D.
Introduction
The classic definition of a Data Warehouse is architecture used to maintain critical historical data that has been extracted from operational data storage and transformed into formats accessible to the organization’s analytical community. The creation, implementation and maintenance of a data warehouse requires the active participation of a large cast of characters, each with his or her own set of skills, but all functioning as a series of teams within a large team.
An identification of the major roles and responsibilities for managing a data warehouse environment would normally include these functions. They are presented in the general order in which they would participate in the warehouse development and implementation.
Business Sponsorship
This is the primary officer responsible for the targeted business area. He or she should have prior experience in systems development efforts, to understand the issues that will confront the warehouse team, and the managerial authority over some of the business team members. He or she will own the data stored in the data warehouse for the targeted business area. As each business area is included in the warehouse, the targeted area’s business sponsor will own the portion of the data for their area. The business sponsor champions the data warehouse effort, helps communicate the value of data warehousing to the business community, and serves as project owner for that business area’s phase of the warehouse development. The business sponsor shares funding responsibility for the inclusion of the targeted business phase’s data into the warehouse.
Systems Sponsorship
This is the person with ultimate responsibility for all systems warehouse activities such as those listed in the roles and responsibilities chart, and should have managerial authority over the systems staff. The systems sponsor shares funding responsibility for the warehouse with the targeted business area sponsor. The systems sponsor will own and control the operational aspects of the warehouse, such as software, hardware. Usually, the systems sponsor is the organization’s Chief Information Officer.
Facilitation Coordination
Most of the development of the data warehouse will be performed by teams. The use of a skilled facilitator will enable the group to properly structure and conduct the meetings to gather the information and maintain the focus needed to achieve the goals within the project’s deadlines. The facilitator should be engaged at the start of the warehouse development process and should be a member of the development team through the modification of the chosen development methodology. The facilitator’s skills will be needed as each new area is incorporated into the warehouse, as much of the development effort will be repeated for each business area.
Education, Training and Documentation
With complex technologies, educating users and systems staff members is essential to a successful implementation. Educators should have a deep understanding of warehouse concepts and technological solutions. Strong interpersonal and communication talents are also important. Educators should develop the materials they will use in these sessions. Some areas for education should include the business case for a data warehouse, the differences between developing operational systems and warehouses, the need for strong Data Administration (also called Information Resource Management) in building and maintaining a data warehouse.
There will be several layers of training during the warehouse development process. The initial training will be for those assigned to integrate the various infrastructure tools within the pilot warehouse. This training will be oriented primarily to systems staff and business members of the development team. Later, training in the use of the warehouse, in analysis skills and in the tools given to the warehouse user community will involve both systems and business staff. Skilled trainers should conduct initial training, with later user training possibly taught by experienced warehouse users from the organization.
Data Administration
The goals of Data Administration (also called Information Resource Management – IRM) include minimizing redundant operational data and structuring data to serve the organization’s overall purpose in capturing that data. These functions are essential in successfully creating and maintaining a data warehouse. Before serious warehouse development effort can begin business and systems staff members must develop and publish naming standards, data element mapping procedures and guidelines for database creation. Without solid definitions of the data, users will not have confidence in the information they are accessing from the warehouse, and systems staff cannot readily identify the data needed from the operational systems to populate the warehouse. The organization should have a commitment to active data administration with a small group of employees overseeing the IRM efforts, especially if contractors are used in the data warehouse development and maintenance.
Database Architecture
The database architect has primary administrative responsibility for the actual physical design of the warehouse environment. He or she participates in the modeling activities as the representative of the physical implementation of the model’s entities. He or she oversees the creation of the database tables and the maintenance of the warehouse’s physical environment and monitors the changes made to the environment by more junior database analysts. The database architect’s strength lies in his or her ability to develop the vision of the physical view of the warehouse.
Repository Administration
The purpose of a repository is to provide a single point of access and control for an enterprise’s metadata. Metadata indicates where the data comes from, how it should be translated or transformed, its form and functions. Data models and processes can be stored in a repository, giving a central location for business and systems data gathered during the development effort. Managing a repository for a warehouse environment requires the skills of both a data (or IRM) administrator and a database administrator (not one person). A Repository Administrator serves as the liaison between the technical and user communities for the operational and warehouse metadata, manages the integration of the logical models of the operational and warehouse systems and participates on the standards development team.
Business Area Requirements Analysis
The goal of a data warehouse business area analysis is to provide understanding of the analytical processes and data needed for business examination. Representatives from the business area and Information Resource Management will convene to identify their requirements for a data warehouse, which will lead to the creation of a warehouse data model. The requirements gathering phase should answer the questions: “What data do we want analytical access to? What processes will this data come from? What critical decisions will this data support?” The facilitator’s function is to ensure that all the proper questions have been identified and answered satisfactorily, and that the meetings adhere to a stated agenda to conserve time and staff resources.
Data Modeling
Modeling for operational systems and modeling for informational systems use the same techniques but result in two different types of models: a.) a detailed, transaction processing representation of the operational business requirements; b.) a less detailed, more summarized analytical processing representation of the informational business requirements. Both are necessary for the proper, cohesive development of systems, and should be conducted as part of any systems development or enhancement effort. The informational data model should support the targeted business area’s analytical requirements, and be based on the area’s operational data model. A team of users builds the models, and then validates them by mapping data from the operational models to the model of the warehouse. A well-designed model, either operational or informational, should: a.) answer business users’ questions, b.) map business processes to customer data needs, c.) recognize the similarity and differences in data stored in various operational systems and d.) extract only the pertinent data. Participants in this exercise should have some modeling training, and should be led by an experienced modeler who is part of the data administration staff.
Selection of Infrastructure Tools
Infrastructure can be described as the platforms, databases and front-end access tools necessary for a functioning warehouse environment. Their selection is of paramount importance, since many subsequent decisions will be based on the tools selected in this phase. Both business and systems staff members as well as consultants should be on the selection committee, since usability and compatibility may be competing needs and must be adequately addressed for a successful warehouse implementation. A systems architect who will be ultimately responsible for procuring and integrating the various components should lead this team.
Methodology Modification
Generally, a methodology guides complex development efforts by providing a map of tasks and roles. No methodology is complete for all types of development and no methodology takes an individual enterprise’s characteristics into account. Therefore, the chosen methodology must be modified to encompass these developmental and organizational needs. A team of methodologists, business and systems staff will be assembled to customize the methodology.
Physical development of warehouse tables
When the modeling tasks are complete and the DBMS and hardware platforms have been delivered, the creation of the physical warehouse tables can begin with the database analysts and database architect. There will be some interaction between the modeling team and the database administration team to address the recurring question of normalized-versus-denormalized data. Performance and data accessibility are often competing ideals and the relative merits of each position must be weighed. The database architect and the modeling team will resolve most outstanding issues between the logical and physical representations of the data, and the questions of “star schema, snowflakes, et al” can be addressed here. Database administrators and database analysts can perform actual creation and maintenance of the physical tables and their associated structures such as indices.
Creation of extraction and transformation programs
The actual creation of a data warehouse entails the extraction of specific data elements from various systems, databases and files into a relational database. Before being loaded into this database, some form of data transformation or integration will be required. This may involve changing the formats of certain elements to conform to a standard, making all iterations of an element have the same data type and size or altering the algorithm that summarizes certain elements. These programs for transforming and extracting data can be developed, or they can be purchased from vendors and customized for the particular environment.
Warehouse physical access, cross-environment connectivity and security
The issues of access, connectivity and security will be addressed at first during the infrastructure tools selection process. Later, during database design and programming, further identification of issues and their resolutions should occur.
Creation of complex queries for warehouse users
Even though users will begin to access the warehouse data through the tools provided, some analysis may require the creation of complex queries against the database. These queries should be written by an experienced SQL applications programmer, to maximize the performance of the warehouse’s database. A series of expected complex queries should be developed before the warehouse is implemented, so users can perform their analysis independently and with little delay. A process must be created to ensure that all queries created for use within the warehouse are saved for reuse, and are available for modification.
Testing component integration, data loading and query execution
Any development effort is not complete until it has been thoroughly tested. The testing for the data warehouse should include developing test cases, scenarios and scripts to ensure the quality of the application; verification of the usability of the tools and the proper execution of all functions of the warehouse. Testers should have a quality assurance background and use accepted testing methods. Some understanding of the infrastructure’s integration is expected, so the test team can identify all possible problem areas and test the interfaces and application usages. Members of the test teams should be certain that no test issue is overlooked.
Warehouse Administration
As with any complex system, a data warehouse will need a warehouse administrator. He or she should be familiar with both the logical and physical models of the warehouse, and will co-ordinate the scheduling of warehouse refreshment processes, manage the creation and reuse of queries written for the warehouse, serve as liaison to the warehouse with the repository administrator, and act as a coordinator during the integration of business areas into the warehouse environment.
Warehouse production support for databases and users
All production environments require continual support, such as help desk activities for users and systems support for the hardware and applications. Hardware and software upgrades would also be included in the production support team’s responsibilities.
Further Discussion on Selected Functions
Sponsorship
Without sponsorship, no project will be successful, especially one as complex as a data warehouse. The key to a successful warehouse development and implementation is teamwork and support at all levels of the organization, with the data warehouse identified as a method for meeting critical business requirements. It is crucial that management for both business and systems understands the roles and responsibilities of all parties, and deadlines are realistically set. Information Resource Management staff must relate the concepts and benefits of the warehouse to management and technicians. Management must anticipate how the warehouse will change roles within the organization. Technicians must evaluate and understand the tools and methodology that can best support the business objectives. Business users must personally understand the concepts and benefits of a warehouse, and accept the commitment such development requires. Active sponsorship should enable these objectives to be met. Active sponsors occasionally must break decision deadlocks and should be available for all critical meetings of the entire development team. Since sponsors are usually responsible for funding the project, their acceptance of decisions made, the deadlines set and the roles to be played by members of their organizations is essential for achieving the warehouse implementation.
Education and Training
Before any development occurs, certain members of the development team should give overviews of data warehouse and related concepts. This education should involve all staff members who will have some association with the development of the warehouse, including management and sponsors. There are many ways to educate people in warehouse concepts, opportunities, limitations and processes. Some of these forms could include: widely distributed documents that cover a single topic in the warehouse environment, such as mapping operational data to the warehouse; concept presentations given by a warehouse development team member to a diverse audience or to a selected group of technicians or business people; open forums between the development team and interested people for questions and answers on warehouse process creation. More opportunities for education will present themselves during the development of the warehouse environment. Every available opportunity should be seized for furthering the understanding of warehousing at all levels of the organization, so everyone will know of its possibilities and its constraints. Some examples of the education that will be required are: data warehouse purpose and concepts, the information environment at the organization, and the changes to information gathering and analysis that the introduction of a warehouse creates.
Education is a continuing process, while training is a finite one. Initially, training will focus on the proper use of the infrastructure tools chosen for the warehouse and their integration into the architecture. Basic training should be followed by advanced sessions for those who require this level of knowledge. During implementation, training of analysts and other warehouse users and support staff will be needed. This training includes some concept review for novice users; detailed procedures for accomplishing warehouse tasks such as extracting, refining, moving and accessing data for users and support staff; possible analysis training for understanding the types of questions that can be answered by the data warehouse and the methods with which to answer them; and personal computer interface training for those not comfortable in that environment.
Data Administration and Data Modeling
The ability to define, organize and use data has been acknowledged as a critical factor in successfully using information technology such as a data warehouse. Equally critical in the heterogeneous environments that comprise the sources of the data warehouse is the need to manage the relationships of data meanings, key business facts and processes. Active, proper data administration or Information Resource Management (IRM) is essential to the success of a warehouse that really permits users to access data and gather information.
Data warehouses can represent a major opportunity for data administration (IRM). Since the focus of a warehouse is to deliver accurate, integrated data to be used by the organization’s analysts and management, data administration can serve as the logical steward of the warehouse. Metadata – the meanings, formats and relationships of the data – is a very important aspect of the warehouse environment, and should be viewed as a business resource. Standards for naming data elements in both business and technical formats, procedures for the identification and acceptance of new data elements and the guidelines for logical database design are all parts of the data administration portfolio. In a warehouse, metadata management takes on a new dimension, as the historical aspect of the data is an important consideration when retrieving data in a warehouse application. The versions of metadata for a warehouse is a significant issue that data administration should address and manage. Data administration should be the leader in developing the strategies for achieving data quality and data integrity through metadata management, and for modeling the data to be included in the warehouse.
Data modeling is valuable and absolutely essential for effectively managing the enterprise’s information resources. Data modeling identifies the things (entities) the enterprise needs to know about. The business facts (attributes) are represented in the data model along with the relationships and business rules that govern these attributes. A data model enables stable and flexible database development, especially in a data warehouse. Stability means that new applications can reuse the database without modifying the existing entities, attributes or relationships. Flexibility means the database can support changes to the business processes without major modifications. Both these qualities are necessary in any information environment, but are essential to the basic foundation of the information gathering and refining to be done in creating a warehouse. All systems development efforts should have a modeling phase as an initial activity. Using any application data models that exist for targeted areas of the warehouse will reduce the time needed for requirements identification of warehouse data and processes. However, if the targeted business areas do not have a proper operational data model, the first step in building a data warehouse that serves those business areas should be the creation of an operational data model for those areas. Informational data modeling can then proceed using the operational models as a basis. The informational data model serves as the road map for the data warehouse and as a guide for compatibility with concurrent and future efforts in development or enhancement.
Repository Administration
In many organizations repositories are viewed as simply large data dictionaries, with all the negative connotations associated with the cumbersome data dictionaries of the past. However, a repository is much more than a data dictionary, although its initial function was to replace the unwieldy dictionaries of earlier generations. A repository serves as the information directory and information warehouse gateway of an enterprise, functions that satisfy the need for users and technicians to understand the data stored in disparate systems and applications. An information directory does not provide access to the data itself, but to the information about the data. Examples of this type of information could include: “Which reports contain these data elements?; How is the average calculated on this report?; What are the data type and size of this field, and how does it compare to a field with a similar name?; “What applications feed this database?”
Repositories store data about the data elements found in an organization’s systems; it also stores the data models, program code and data administration standards for the organization. As such, it can serve as the source for information about the data and applications that are associated with the warehouse. With this information, a user could discover exactly which iteration of the data element “Customer ID” should be included in the warehouse. A repository can also identify all the variations of “Customer ID” or “Customer Number” and relate them to the primary element for the benefit of both users and technicians.
Developing a data warehouse without a repository is difficult to impossible, since information about the data – metadata – permeates the warehouse environment. Without a repository, developers will attempt to design a system that accesses other systems to retrieve data without knowing if the data needed for the warehouse is truly the data available in the feeder system. Data warehouses collect data from various sources, transform it in some manner, and place it in a database for retrieval as part of a collection. The meaning of the data is dependent on the collection in its entirety. This makes the task of defining the metadata extremely critical. This is a primary task of a functioning enterprise repository. These definitions of the metadata can be viewed as the users’ gateway to the warehouse, integrating the warehouse with the rest of the other systems within the organization, automating population of the warehouse, and graphically representing the warehouse for discovery and navigation by users and technicians. Using a repository as an integral part of the warehouse development and implementation will ensure a successful and popular warehouse for both users and technicians.
Infrastructure Tools
The data warehouse is a part of a large infrastructure. Surrounding the warehouse is a complex and sophisticated set of hardware, software and procedures. These components include: the platform and software that actually holds the warehouse databases; warehouse development and management software; warehouse access and analysis software and hardware.
The data warehouse sits on one or more hardware platforms: mainframe, parallel processor or client/server. The best platform for an organization is determined by the considerations of data volume and processing cost. Parts of the warehouse can be found on one platform and other parts can be located on another, as long as strong interfaces exist among the parts.
Generally, the best relational database management systems (DBMS) are the best for warehousing, and are most commonly used. These systems have recently been joined in the marketplace by specialized warehousing DBMS. Although the specialized software has been altered to perform query retrieval and indexing more efficiently than the generic DBMS, most warehouse developers choose the generic DBMS for their broad functionality, their established customer base and technical support capabilities.
Warehouse development and management tools create and maintain the interface between the operational systems environment and the warehouse. This interface can be extremely complex and requires significant functionality. Some of these data functions include: converting, mapping, reformatting, recalculating, restructuring, summing, etc.Certain vendors have a complete suite of tools that cover all these needs, others serve only one or two areas and their tools must be combined to perform all the data management functions of a warehouse. Choosing between a suite or a collection of tools from multiple vendors is a basic duty of the infrastructure tools selection team.
A very important aspect of the warehouse infrastructure resides between the operational systems and the warehouse itself: middleware. This software is designed for the regular movement of large, variable amounts of data over different technologies. The choice of middleware is often influenced by the chosen DBMS, platforms and operational environments.
There are several layers of access and analysis tools for a warehouse, with the choice depending on the ultimate user of the analysis. Executive Information Systems (EIS) provide much summary information, with pre-fabricated queries and screens. There is little or no individual browsing or querying performed within an EIS.
Middle management analysis requires very flexible analysis of data with finite dimensions. Multidimensional tools allow the user to perform searching, iterative queries over a subset of retrieved data. Some pre-fabricated and some individual browsing and querying are expected in this level.
Workstation analysis is done by someone who may want to access and analyze a portion of the entire data warehouse. Most workstation access and analysis is individually developed, since one day’s activities may have little or no relevance to the next day’s. Both detailed and summary data needs can be found at this level. Tools are generally robust, with powerful user interfaces and online help support. However, users must understand the various types of analysis and the results desired from a particular query, since different questions require different types of analysis.
Selection and discovery software is used to access the data warehouse and select data for transport back to the analytical platforms. This type of software shields users from the inadequacies and complexities of SQL as an access tool.
Methodology
A methodology is a formal outline of the tasks needed to accomplish a certain process. Over the last ten years, several different methodologies have evolved for instituting information systems. All of these methodologies are concerned primarily with operational systems, not analytical data warehouse systems. The experience of many companies has shown that attempting to implement a data warehouse without clearly identified business objectives and a formal development approach can prove to be disastrous. A new framework is needed for delivering critical analytical information systems, one which can build on accepted methodologies but is enhanced to address the needs of data warehouse implementation.
Whatever the business objectives of the warehouse development may be, they should be concretely stated in the mission of the project. This is the first tenet of any systems development methodology. Commitments of adherence to the task maps developed by the methodologist are essential; the methodology will be modified by the development team, and the final process/task list should be clearly accessible to all concerned with the warehouse implementation. Only those who know the methodology can follow the methodology, so education, training and reinforcement in methodology and systems development are very important, as is the need to establish and reevaluate priorities.
The services of a skilled methodologist, who may also function as the warehouse facilitator, are crucial during such a complex development as a data warehouse. This person will be responsible for refining the chosen methodology to reflect the steps needed for data warehouse development in general and the organization’s dynamics in particular. With the methodologist’s assistance, business requirements analysis and data modeling can proceed smoothly, and the results from those efforts can be incorporated into the later stages of the warehouse implementation. Without a formal, refined and consistently applied methodology, systems development and implementation can lose focus and continue interminably.
Physical Database Development, Programming, Connectivity, etc…
The majority of the work involved in the physical aspects of data warehouse construction involve only certain systems team members with very specialized skills. In some organizations, these people will be drawn from the ranks of existing systems staff. However, the more likely occurrence will be that many of these functions will be done by contract personnel, hired solely to support the physical implementation of the warehouse. Most companies do not have a staff with the depth and breadth of skills needed for such an endeavor. Contract programmers, database administrators and network specialists can train the maintenance support staff from the hiring organization, or the contractors can become the support staff after the warehouse is implemented. However, the senior member of the physical team, the database architect, should be an employee of the organization, since they will have ultimate responsibility for the implementation and functioning of the entire warehouse environment.
A fully normalized data model should be the first step in deciding what data from what current operational systems for what periods goes into the warehouse, but the true art of physical modeling for the warehouse involves the dreaded word “denormalizing” for access and performance. Since the data warehouse does not support updates from end-users, synchronizing data after updating is not an issue; therefore, the model and database can contain some redundant data to provide efficiency and ease-of-use.
The task lists for physical development must be formulated with the team’s active participation; no plan is correct without the input of those assigned to perform the tasks. Some of the tasks will include: creating physical tables from the logical data model; writing or modifying the extract, transformation and load applications for the operational data; creating the required connections from the database to the client machines; creating the linkages from the database to the presentation applications; writing the initial set of complex queries for the user staff. Each of these categories conceals a host of issues and tasks – and the implementation team will need to identify and satisfy all of them.
Testing
Testing should be performed by a team of business users and testers. Testers should be skilled in quality assurance and accepted testing methods, with some knowledge of the warehouse environment and the applications’ intentions. Systems support staff should be in attendance during the testing phase to resolve problems with the environment and tune the applications as necessary. Testers are drawn from the ranks of both business and systems staff, and can be augmented by consultants, as the development dictates.
A formal testing plan should be developed by the business users and testers. It should address all expected conditions, both usual and rare. The plan should be executed over a sufficient period of time to ensure that all probable combinations of events and data have been transacted. Only after the warehouse and the associated applications have been thoroughly tested should the user public be granted permission to access the warehouse. If the applications are to be released in stages, each stage should have a formal test to validate the stage and give ownership of the stage to the user community.
Conclusion
The roles and responsibilities in a complex systems development and implementation process such as a data warehouse can be generally identified, but refinement and assignment of these roles will continue over the life of the project. Most work will occur in teams, with the active participation of consultants and contract personnel to enhance the staff capabilities within the organization. Where teams consist of both company employees and contract personnel for systems and business areas, there should be one person identified as the team member with primary responsibility for the tasks assigned to that team.