XML’s Uses In Data Warehousing: Getting Data In
By David Marco
Over the past 10 years, data warehousing has proven to be a highly valuable technology that the vast majority of corporations have leveraged to provide them with a competitive edge in the marketplace. As we enter the next decade XML (eXtensible Markup Language) is poised to accomplish much the same. The one unanswered question is how will these two essential technologies function together? This month we will present how XML enables our data warehouses to bring in external data (from websites and other corporations). Next month we will examine how XML is used in a data warehousing environment to publish information.
Integrate Website Data
Virtually all websites have been built with HTML (HyperText Markup Language), which describes how data will be formatted, but does not provide information on this data. Consequently, this “unstructured” website data is very difficult to bring into a data warehouse system. XML provides a remedy to this situation by assigning “data tags” to this website information. To understand how these data tags function let’s use XML to describe the information about a text book:
<title>Building and Managing the Meta Data Repository</title>
<publisher>John Wiley & Sons</publisher>
By adding context to the content on a website, XML enables corporations to bring in unstructured, website data into their data warehouses. This is critical for many companies’ analysts that need this information to make better decisions. Let’s walkthrough an example using a health care company. Many doctors that research drugs will publish their results to their websites. Often the decision-makers in these health care organizations want to know about the latest developments with this drug research in order to make better patient-care decisions. To see how XML simplifies this challenge we will examine Figure 1: XML Bringing Data Into The Data Warehouse.
Figure 1 : XML Bringing Data Into The Data Warehouse
This figure illustrates data being read from a physician’s website and brought into a XML transformation process (see Figure 1, bullet 1). This transformation process (bullet 3) matches the website data to the corresponding XML schema (data tag layout). Remember that one of the key challenges for XML is to standardize on the names and meaning of the data tags. As an industry IT (information technology) has had limited success in defining global standards and I don’t expect XML to change this trend. Therefore, we will have to juggle multiple XML schemas in our corporations. Next, the XML transformation process converts the tagged website data into record format by removing the XML data tags. The removing of the XML data tags is important since these tags do increase processing overhead. These records are then sent to the ETL (extraction, transformation, and load) process of the data warehouse (bullet 4). The ETL process will clean, integrate, and load this data into the data warehouse and its corresponding data marts (bullet 5). Keep in mind that as several ETL tool vendors are looking to expand their current tool sets to include functionality to do XML transformations this XML transformation process (bullet 3) could be completely merged into the ETL process.
Often times when we think of the internet we think about B2C (business to customer) transactions, however the potential for B2B (business to business) commerce on the internet is far greater than that of B2C. Many companies are in the business of selling information. XML plays a major role in this effort as it allows B2B transactions to be brought directly into a data warehouse. Figure 1, bullet 2 shows how the B2B trading partner sends their information into the XML transformation process. As before not all B2B trading partners will use the standard XML schemas so multiple XML schemas will need to be maintained. This process (bullet 3) uses the XML schemas stored in the XML database and moves these converted transactions into the ETL (extraction, transformation, and load) process of the data warehouse (bullet 4). The ETL process then integrates this information into the data warehouse and its data marts (bullet 5).
As we can see, XML is critical technology and it coming to a data warehouse near you! Next month we will examine how XML is used to bring data out of the data warehouse.