The Forgotten Key
By Bruce Johnson
Databases are the root of the success for the applications that they support. Creating an effective database for any need requires understanding the business needs you are trying to fulfill, the data that is encompassed in it, and the people who are going to use it. Getting the business resources together to gather the information needed to build the data model is the first step. Spending time with the right subject matter experts to understand the business processes and relationships between the data gives you the information required to create a data model that will accurately and adequately serve the application you build. Looking at the actual data content, where possible, can help validate the data definition constructs collected.
The success of any data model and ultimately, the physical database schema hinges on how one entity relates to another. It really is, all about the data. One of the most important aspects of a successful data design is the definition of what makes each entity unique.
In this article we are going to talk about the incorporation of key field identification in the data model and database design process.
Defining Uniqueness
Whether you are defining the unique attributes of an entity in a data model or the unique key components for each table, knowing what makes each entity or table unique is a critical component for success. In order to both successfully capture data and navigate data for using and analyzing, you must be able to identify what makes each row in each table unique.
It is very interesting that many technical resources now lean on the surrogate key as the unique identifier in all tables. Surrogate keys are generated unique identifiers for database tables that have no actual meaning in the contents of the ID – that is what the concept of surrogate is, to fill in for the original. The root reason for surrogate keys stems from the need to relate one table to another and rather than storing all fields of a key as a foreign key in any table that the information relates to, you have a unique identifier that is used as a foreign key for all of those tables. This can greatly reduce storage, processing, and complexity of code/data access. It is hard to imagine building effective applications of any size or complexity without incorporating surrogate keys for much of the solution.
Surrogate keys are critical to the success of most database solutions, but a surrogate key is practically useless without a UAK – Unique Alternate Key (or Natural Key as many refer to it). This key is the true definition of the uniqueness of the contents of the table. Without properly identifying this key for every table, a barrier is put in place for the applications that capture and leverage the data from the database as for what it is designed to serve. If you cannot identify what makes a table unique, yet go forward with creating a surrogate key, it will not be possible to apply it consistently and accurately, resulting in poor quality data and difficultly in accessing and trusting what you have captured.
UAK Examples
Let’s take a look at a couple of simplistic examples of how a UAK is represented in some common table structures:
An employee table is almost always identified by a surrogate key that is something likeEmpl_ID. It is indeed a number that does not exist and is completely unknown to every employee prior to starting with that company. Once hired, you may or may not know the ID that is actually stored as the common identifier in the tables. What is important is that if there are 2 employees with the same name, that your system does not let them both have the same number. If we are dealing with only U.S. based employees, your company could theoretically rely on Social Security Number as a unique identifier – in and of itself, SSN is a surrogate identifier for a person in the US from a payroll perspective. Often times for the employee table, the natural key would be name and birth date (one can always make the argument that even these fields could change, you can certainly change your name, and technically, you could enter a birth date wrong in the system and have to correct it – neither of these cases requires a change in the unique characteristics of an employee, just in how the process is applied to update the base record for the employee properly).
Another example might be Cust_ID. As you can imagine, this is the surrogate key for the customer table. How do you recognize a new customer from an existing one? Without defining the logical differentiators in customer uniqueness, creating rules that can be incorporated into your program is not possible. A customer can be many things across various industries and companies, but perhaps a simplistic option for the UAK would be the Customer Name andBilling Address – recognizing again that a customer address can change. I recently experienced a scenario where the concept of customer was a company and because of the nature of the business, a website address was used for identifying the uniqueness – I wouldn’t recommend this unless you are very clear that it is indeed a valid option for what you call a customer.
Summary
A process of using UAK/Natural Keys does not replace or change the need for surrogate keys. Without a formally defined UAK, how do you know what rules to apply for when to insert a record vs. update an existing one?
As your organization goes forward building data models, database schemas, and databased applications, try adopting the UAK/Natural Key for every project and every entity. In fact, rely on it early on in your data definition tasks. Your project development time will decrease by more accuracy and less redesign once the data is being populated and manipulated. At the same time, your data quality and accuracy will increase as the rules and relationships are now clearly defined.
Even if you are working with an EAV (Entity, Attribute, Value) type schema, they are only effective with a logical foundation that serves to describe the entities, attributes, and how entities relate to each other.
About the Author
Bruce has over 20 years of IT experience focused on data / application architecture, and IT management, mostly relating to Data Warehousing. His work spans the industries of healthcare, finance, travel, transportation, retailing, and other areas working formally as an IT architect, manager/director, and consultant. Bruce has successfully engaged business leadership in understanding the value of enterprise data management and establishing the backing and funding to build enterprise data architecture programs for large companies. He has taught classes to business and IT resources ranging from data modeling and ETL architecture to specific BI/ETL tools and subjects like “getting business value from BI tools”. He enjoys speaking at conferences and seminars on data delivery and data architectures. Bruce D. Johnson is the Managing director of Data Architecture, Strategy, and Governance for Recombinant Data (a healthcare solutions provider) and can be reached at [email protected]