A BRIEF HISTORY OF DATA WAREHOUSING: FROM THE VENDOR’S PERSPECTIVE (PART I)
By Bill Inmon
If you have just entered the data warehouse profession, you will find an almost bewildering choice of technologies, all of which purport to be data warehouse technologies. Indeed, the world of data warehousing is complex and has many facets, and it seems that each facet has its own technology.
So, exactly how did things get to be the way they are today? It may help to stand back and look, from a vendor’s perspective, at how the events and technologies of yesterday led to the marketplace of today.
First and foremost, the movement to data warehousing was a consumer-driven movement. Some technology trends are vendor and product driven; others are consumer driven. A vendor- and product-driven marketplace is characterized by a few well-defined products that have a broad range of functionality. A consumer-driven marketplace is characterized by lots of technologies, where the consumer takes what is already in place and adapts it to suit his or her needs. Consumers have to knit together different technologies to achieve the effects they want. Data warehousing, in that regard, is definitely a consumer-driven marketplace.
There are many important facets of data warehousing, from the standpoint of products and tasks that need to be done. As a general grouping of products, there are:
- data base management products
- hardware products – processors
- hardware products – storage devices
- ETL (extract/transform/load) products
- analytical products
- metadata products
Each of these product categories fills an important and definite need. In some cases, a product or a company overlaps into more than one category.
DATA BASE MANAGEMENT PRODUCTS
The most basic of the products needed for the data warehouse environment is the data base management system. Data base management systems long preceded data warehousing. Most of the early data base management systems were oriented toward transaction processing and record-at-a-time processing. Some of the dbms made the transition to data warehousing; some didn’t. The dbms that made the transition to the world of data warehousing were Oracle, IBM’s DB2, NT SQL Server, and Teradata. Oracle appealed to the “let’s get started cheap” set with its SMP architecture. IBM’s DB2 was supported by both an SMP and an MPP architecture. NT SQL Server appealed to the organizations that wanted to start really small and really inexpensively.
But most curious of all was Teradata. Teradata was different from all other dbms because it specialized in handling large amounts of data, was not a particularly good transaction processing platform, and offered a shared-nothing MPP solution. From an architectural standpoint, Teradata was different from its brethren.
It is worth noting that all of these dbms technologies were in existence when data warehousing emerged. None of these technologies was built specifically for data warehousing. Instead, customers adapted these technologies to suit their needs for data warehousing.
There were a few other dbms in existence that met some of the characteristics required for data warehousing. One of those was Model 204. In many regards Model 204 had the characteristics needed for a good data warehouse dbms. But Model 204 chose not to pursue the data warehouse marketplace. Another dbms that was in existence was Sybase. Sybase was known for its distributed transaction processing. Sybase chose not to aggressively pursue the data warehouse marketplace. (I shall never forget that in the early days of data warehousing, a senior manager from Sybase told me, “We do not believe in the data warehouse marketplace. Please don’t ever mention Sybase and data warehouse together. We are purely, plain and simple, a transaction processing dbms.”)
Another dbms of the time was Informix. Informix had many of the characteristics required for data warehousing. It aggressively went after the marketplace and was soon bought by IBM.
Finally, there was Red Brick. Red Brick was a dbms of sorts. It was more of a super data mart processor than a data warehouse processor. It was good for data that had a single focus and that did not semantically change a great deal over time. It was the only technology that appeared to serve the dbms space for data warehousing. After its entrance into the marketplace, it was bought by IBM.
HARDWARE PRODUCTS – PROCESSORS
While dbms were certainly necessary, they were useless unless they ran on a processor. There were essentially two styles of processor: SMP and MPP. An SMP processor was relatively inexpensive compared to an MPP processor. In an SMP configuration, data and memory were tightly shared, and there was a finite limit to the number of processors that could be bound together. In an MPP configuration, data streams ran in parallel and nothing, including memory, was shared. An MPP environment was expandable: more processors could be bought in order to process a bigger workload.
There were advantages and disadvantages to both the SMP and MPP environments:
- The SMP environment was cheaper, simpler, and easier to get started with. But there was a finite limit to the capacity of the SMP processor.
- The MPP environment was more expensive and more complex to set up and operate. But there was effectively no limit to the volume of data that could be handled by the MPP environment.
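The expandability of the MPP approach can be illustrated with a minimal sketch. It is not any vendor's implementation: the function names, the modulo partitioning rule, and the sample rows are all assumptions made for illustration, and the "nodes" here are plain lists processed one after another in a single process, whereas a real MPP system would run them in parallel on separate processors.

```python
from collections import defaultdict


def partition(rows, node_count):
    """Spread rows across nodes by customer id; each node owns its slice outright."""
    nodes = [[] for _ in range(node_count)]
    for customer_id, amount in rows:
        # Hypothetical partitioning rule: customer id modulo the node count.
        nodes[customer_id % node_count].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Work one node does on its own data, with nothing shared with other nodes."""
    totals = defaultdict(int)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def mpp_totals(rows, node_count):
    """Scatter the rows, aggregate on each node, then gather the partial results."""
    merged = {}
    for node_rows in partition(rows, node_count):
        # A customer lives on exactly one node, so partial results never collide.
        merged.update(local_totals(node_rows))
    return merged


# Illustrative data: 1,000 customers with three purchases of 10 each.
rows = [(customer_id, 10) for customer_id in range(1000) for _ in range(3)]
two_nodes = partition(rows, 2)
four_nodes = partition(rows, 4)
```

Doubling the node count in this sketch halves the number of rows each node has to scan while the gathered answer stays the same, which is the sense in which more processors can be bought to handle a bigger workload.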
Some of the early vendors servicing the processor marketplace were Teradata, IBM, Tandem Computers, HP (before the acquisition of Tandem), and others. IBM had both an MPP approach and an SMP, mainframe-oriented approach. Sun Microsystems also served the marketplace in a niche capacity. (Of the hardware vendors, Sun Microsystems had by far the most tangential approach to sales. Sun sold boxes while the other vendors sold solutions or close to solutions.)
Another company worthy of mention is Netezza. Netezza lowers the cost of data warehousing while raising the bar for response time. Netezza is in a new class of technology that can be called a data warehouse utility.
HARDWARE PRODUCTS – STORAGE DEVICES
Nearly all data in a data warehouse environment is placed on disk storage. The vendors for disk storage that serviced the data warehouse were IBM, Hitachi, and EMC. (I once was asked if I was an agent for the disk storage vendors because of the enormous volumes of data that were being dedicated to data warehousing. I am not.)
But over time, as data accumulated, the sheer volume dictated that other types of storage be used as data aged and the probability of access diminished. There were essentially three storage needs for a data warehouse other than standard high performance access. Those needs were:
- archival processing,
- backup and recovery,
- online overflow processing, sometimes called near-line storage.
There was one vendor of these types of products, StorageTek, and for reasons known only to StorageTek, it chose not to pursue the marketplace. (I was once told by a StorageTek manager, “Don’t even bring up these possibilities for the usage of our products. We really don’t need a bunch of new customers running around here.”) StorageTek was bought by Sun Microsystems.
Another storage vendor of note is Network Appliance (NetApp). NetApp was the first vendor to bring out products for secondary storage.
About the Author
Bill Inmon, the father of the data warehouse concept, the corporate information factory, and the government information factory, has written 47 books on data warehouse, data base, and information technology management. His publishers include John Wiley, Prentice-Hall, and QED. His books have been translated into nine languages. More than thirty of his books have been book club selections. In addition, Bill has written over 750 articles for trade journals such as Data Management Review, Byte, Datamation, Computerworld, and many others. Currently Bill has a newsletter with b-eye-network.com that reaches 55,000 people. Bill founded Inmon Data Systems (IDS), a company that reads and manages unstructured data (emails, telephone transcripts, documents) and processes it for inclusion in a structured data warehouse. In addition, IDS creates visualizations for unstructured data. IDS has technology that crosses the bridge between unstructured data and structured data, currently protected by seven patents. Bill can be reached at BInmon@BillInmon.com.