The Semantic Web and EIM: Part 1 – Data Governance and Stewardship

By Pete Stiglich

In this first of a series of articles on the Semantic Web and EIM, I will explore how the Semantic Web and semantic technologies will affect the different components of EIM. This article discusses how they can help enable perhaps the most critical aspect of EIM: Data Governance and Stewardship.

Overview of the Semantic Web

According to the W3C, “The Semantic Web is a Web of data” [1]. The Semantic Web and semantic technologies will make it easier to find, share, and use information, and will increase opportunities for automation. For example, NASA is using semantic technologies to find experts in its organization of over 70,000 civil servants. In the past, the World Wide Web (WWW) was a “web of documents”, whereas the Semantic Web is a web of data that identifies and ties together the data elements within web documents. The WWW is being transformed into the GGG (Giant Global Graph) of interconnected, discrete data elements: in effect, a giant database. Many public US government datasets have been serialized in RDF/XML (RDF is a key Semantic Web enabling technology) on data.gov. For a more in-depth introduction to the Semantic Web, please see my article “Getting started on your Semantic (Data) Web”.

The Semantic Web is enabled by semantic technologies, most notably RDF (Resource Description Framework) and OWL (Web Ontology Language). RDF is a W3C standard for describing data resources, which lets us easily tie together disparate information from across the internet (or an intranet). For example, XML or HTML documents can be encoded with the Dublin Core “creator” meta data element to identify the author of the document. We can then identify documents authored by a particular writer, rather than doing a Google-type search on the author’s name (which retrieves every document that includes the name, not just those the writer authored). For our internal documents, we could encode our data with Dublin Core “creator” (and other DC elements) so that we can find information more effectively within the enterprise. Dublin Core is an example of a standard vocabulary expressed in RDF. Other RDF-based vocabularies include FOAF (Friend of a Friend, for social networking) and SKOS (Simple Knowledge Organization System, for describing concepts). Take advantage of existing RDF-based vocabularies (there are many) for describing your data, so as to avoid reinventing the wheel. Of course, each enterprise would probably define its own vocabularies as well.
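To make the difference between a “creator” lookup and a keyword search concrete, here is a minimal Python sketch. It is only an illustration: the document URIs, author names, and document text are made up, and a real implementation would read the triples from RDF rather than from a Python list. The only real identifier is the Dublin Core “creator” property URI.

```python
# Illustrative only: the document URIs, names, and text below are invented.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

# Each RDF statement is a (subject, predicate, object) triple.
triples = [
    ("http://www.example.com/docs/report1", DC_CREATOR, "Jane Smith"),
    ("http://www.example.com/docs/report2", DC_CREATOR, "John Doe"),
]

# Full text of the same documents, as a keyword search engine would see them.
full_text = {
    "http://www.example.com/docs/report1": "Quarterly report by Jane Smith.",
    "http://www.example.com/docs/report2": "Review of work by Jane Smith, written by John Doe.",
}


def authored_by(name):
    """Documents whose Dublin Core creator is exactly `name`."""
    return sorted(s for s, p, o in triples if p == DC_CREATOR and o == name)


def mentions(name):
    """Documents whose text contains `name` anywhere (keyword-style search)."""
    return sorted(doc for doc, text in full_text.items() if name in text)


print(authored_by("Jane Smith"))  # only report1
print(mentions("Jane Smith"))     # report1 and report2
```

The keyword search returns both documents because both mention the name, while the creator lookup returns only the document the writer actually authored.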

Inferencing

Similarly, we can use OWL to tie together ontologies (which contain taxonomies and thesauri about a domain [2]) from many domains, providing a very rich knowledge base with which to analyze our information. From this knowledge base we can draw inferences that would otherwise require time-consuming human investigation, specialized programs, or complex queries. When our data and meta data are described in RDF and tied to ontologies, many inferences can be drawn simply because the information is expressed in these languages.

For example, if we assert that fruit is a subclass of food,

Before inferencing:

:fruit rdfs:subClassOf :food

then an inferencing engine could infer that both food and fruit are classes.

After inferencing, the following triples would be created:

:food rdf:type rdfs:Class

:fruit rdf:type rdfs:Class

If we’ve asserted elsewhere that the food class is equivalent to the Nahrung class (German word for food) as below,

:food owl:equivalentClass :nahrung

we can now do more thorough knowledge discovery of food, regardless of whether the English or German word for food is used.

After inferencing, we would see that

:fruit rdfs:subClassOf :nahrung
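The inferences above can be reproduced with a few lines of Python. This is a toy sketch, not how a production reasoner works: it applies three simplified rules (both ends of a subClassOf statement are classes, subClassOf is transitive, and equivalent classes are subclasses of each other) over and over until nothing new appears. The function name and the use of prefixed names as plain strings are assumptions made for illustration.

```python
SUBCLASS = "rdfs:subClassOf"
EQUIV = "owl:equivalentClass"
TYPE = "rdf:type"
CLASS = "rdfs:Class"


def infer(asserted):
    """Apply three simplified rules repeatedly until no new triples appear."""
    triples = set(asserted)
    while True:
        new = set()
        for s, p, o in triples:
            if p == SUBCLASS:
                # Both ends of a subClassOf statement are classes.
                new.add((s, TYPE, CLASS))
                new.add((o, TYPE, CLASS))
                # subClassOf is transitive: s < o and o < o2 gives s < o2.
                for s2, p2, o2 in triples:
                    if p2 == SUBCLASS and s2 == o:
                        new.add((s, SUBCLASS, o2))
            elif p == EQUIV:
                # Equivalent classes are subclasses of each other.
                new.add((s, SUBCLASS, o))
                new.add((o, SUBCLASS, s))
        if new <= triples:
            return triples
        triples |= new


asserted = {
    (":fruit", SUBCLASS, ":food"),
    (":food", EQUIV, ":nahrung"),
}
inferred = infer(asserted) - asserted
for triple in sorted(inferred):
    print(" ".join(triple))
```

Among the printed triples are the class memberships of :food and :fruit and the statement that :fruit is a subclass of :nahrung. The sketch also produces a few statements a person would consider trivial, such as :food being a subclass of itself, which is typical of what an inferencing engine materializes.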

We can also use SWRL (Semantic Web Rule Language) to define our own inferencing rules.

Semantic Web and Data Stewardship

A central and critical aspect of Data Stewardship is knowing where our data is, so that its definition, quality, security, and usage can be measured and improved. For example, we could define a rule in SWRL which says “If a data object (e.g. database table, data model entity) has a primary key which includes an attribute equivalent[3] to Customer Key or Customer Id and the entity contains a Social Security Number or Last Name attribute, then the Data Steward for that data object is John Doe”[4]. When our RDF and OWL data is ready, we would use an inferencing engine such as Pellet to make the inferences, which would in turn be expressed in RDF/OWL. We could then interrogate our source triples and our inferred triples together as an enriched source of information. Using the rule above, RDF triples such as the following might be inferred and made available for query in an RDF file or RDF triple store.

edw:CUST_MSTR eim:hasSteward hr:JohnDoe

sap:CFCA354 eim:hasSteward hr:JohnDoe

ecdm:Customer eim:hasSteward hr:JohnDoe

Now it will be easy for someone to find all of the data objects for which John Doe is the data steward. Note that the prefix before the colon in each of the RDF triples above indicates the namespace. In the above examples, assume that edw is the namespace for the CUST_MSTR table in the Enterprise Data Warehouse, sap is the namespace for the table CFCA354 in the order entry system (this is not too contrived an example, since almost no one understands what the actual database table names represent in some ERP systems), ecdm is the namespace for the Customer entity in the Enterprise Conceptual Data Model, and hr:JohnDoe is a URI reference which points to John Doe’s record in the HR namespace. The eim namespace might represent an RDF file which defines the enterprise’s EIM vocabulary (classes and properties).
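The stewardship rule itself can be sketched in plain Python to show the logic an inferencing engine would apply. This is not SWRL and not Pellet’s API; it is a hand-written stand-in. The attribute names, the alternate-name lookup table, and the extra edw:PROD_MSTR object are all hypothetical and exist only to show one object that the rule does not match.

```python
# Alternate attribute names mapped to a canonical term. This dictionary is a
# hypothetical stand-in for the ontology lookup described in footnote [3].
CANONICAL = {
    "CUST_KEY": "Customer Key",
    "CUSTOMER_ID": "Customer Id",
    "CST_ID": "Customer Id",
    "LAST_NM": "Last Name",
    "LAST_NAME": "Last Name",
    "SSN": "Social Security Number",
}

# Invented meta data for four data objects.
data_objects = {
    "edw:CUST_MSTR": {"primary_key": ["CUST_KEY"],
                      "attributes": ["CUST_KEY", "LAST_NM", "SSN"]},
    "sap:CFCA354": {"primary_key": ["CST_ID"],
                    "attributes": ["CST_ID", "LAST_NAME"]},
    "ecdm:Customer": {"primary_key": ["CUSTOMER_ID"],
                      "attributes": ["CUSTOMER_ID", "SSN"]},
    "edw:PROD_MSTR": {"primary_key": ["PROD_KEY"],
                      "attributes": ["PROD_KEY", "PROD_NM"]},
}


def canonical_terms(names):
    """Translate physical attribute names to canonical terms, skipping unknowns."""
    return {CANONICAL[n] for n in names if n in CANONICAL}


def infer_stewards(objects):
    """Return a hasSteward triple for every object the rule matches."""
    triples = []
    for name, meta in sorted(objects.items()):
        key_terms = canonical_terms(meta["primary_key"])
        attr_terms = canonical_terms(meta["attributes"])
        has_customer_key = bool(key_terms & {"Customer Key", "Customer Id"})
        has_personal_attr = bool(attr_terms & {"Social Security Number", "Last Name"})
        if has_customer_key and has_personal_attr:
            triples.append((name, "eim:hasSteward", "hr:JohnDoe"))
    return triples


for triple in infer_stewards(data_objects):
    print(" ".join(triple))
```

The three customer-related objects receive a steward triple, while the product table does not, because its primary key is not equivalent to Customer Key or Customer Id.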

Finding all of the objects for which John Doe is the data steward requires only a simple SPARQL query (which isn’t to say that all SPARQL queries are simple). For this example, our SPARQL query might look like:

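# The eim namespace IRI is assumed for illustration.

PREFIX hr: <http://www.example.com/HumanResources#>

PREFIX eim: <http://www.example.com/EIM#>

SELECT ?dataobject

FROM <http://www.example.com/DataObjects.rdf>

WHERE { ?dataobject eim:hasSteward hr:JohnDoe }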

Note that “?dataobject” is a SPARQL variable or placeholder. In our query results we would get back edw:CUST_MSTR, sap:CFCA354, and ecdm:Customer. Our source RDF file would contain meta data about these data objects (i.e., resource descriptions). Because the query results are all URI references, we could double click on edw:CUST_MSTR, etc., in the result set to find out more about the object, or we could continue our analysis with additional SPARQL queries. If we have other asserted or inferred RDF triples describing our data assets, we can now use SPARQL to measure and monitor the quality, definition, security, compliance with data standards, and usage of our data resources.
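For readers new to SPARQL, the following Python sketch imitates what the query engine does with a single triple pattern: any term starting with “?” is a variable that binds to whatever value it finds, and every other term must match exactly. This is a simplification written for illustration (real SPARQL engines handle joins, filters, and much more), and the hr:JaneRoe row is a hypothetical addition so that there is something for the pattern to reject.

```python
triples = [
    ("edw:CUST_MSTR", "eim:hasSteward", "hr:JohnDoe"),
    ("sap:CFCA354", "eim:hasSteward", "hr:JohnDoe"),
    ("ecdm:Customer", "eim:hasSteward", "hr:JohnDoe"),
    ("edw:PROD_MSTR", "eim:hasSteward", "hr:JaneRoe"),  # hypothetical extra row
]


def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so keep this binding.
            results.append(binding)
    return results


rows = match(("?dataobject", "eim:hasSteward", "hr:JohnDoe"), triples)
print([row["?dataobject"] for row in rows])
# ['edw:CUST_MSTR', 'sap:CFCA354', 'ecdm:Customer']
```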

Data Stewards and Ontologies/Vocabularies

A Data Steward (who has assigned responsibility, authority, and accountability for oversight of the definition, quality, usage, and security of a data subject area) also needs to be involved when new ontologies and/or vocabularies (these are related but distinct concepts) are developed, changed, or incorporated. These ontologies and vocabularies may be internally or externally developed. For example, assume that an HR department has an internally developed vocabulary (in RDF) for describing employees, which was approved for use across the enterprise by the Data Governance board. Assume one property in the vocabulary is hr:employeeTitle.

The HR or Employee data steward would want to ensure that, for employee data expressed in RDF, the FOAF (Friend of a Friend) property “foaf:title” is not used, because (we’ll assume) hr:employeeTitle has company-specific definitions and rules and is part of the HR taxonomy. If foaf:title is used instead of hr:employeeTitle to describe an employee’s title, this would severely impact our ability to analyze the data or draw inferences from it. Of course, we could use OWL to specify that hr:employeeTitle and foaf:title are equivalent (see below). If we chose to do so, we could find the titles of people regardless of which vocabulary is used.

hr:employeeTitle owl:equivalentProperty foaf:title
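The effect of declaring the two properties equivalent can be sketched in Python as follows. The employee identifiers and titles are invented, and the lookup is a simplified stand-in for what an OWL reasoner would do: a query for one property is widened to include any property declared equivalent to it.

```python
# Pairs of properties declared equivalent (either direction).
EQUIVALENT = {("hr:employeeTitle", "foaf:title")}

# Invented employee data, described with two different vocabularies.
triples = [
    ("hr:emp1", "hr:employeeTitle", "Data Architect"),
    ("hr:emp2", "foaf:title", "Data Steward"),
]


def equivalents(prop):
    """The property itself plus anything declared equivalent to it."""
    names = {prop}
    for a, b in EQUIVALENT:
        if a == prop:
            names.add(b)
        elif b == prop:
            names.add(a)
    return names


def titles(prop, use_equivalence):
    """(employee, title) pairs found under `prop`, optionally widened."""
    wanted = equivalents(prop) if use_equivalence else {prop}
    return sorted((s, o) for s, p, o in triples if p in wanted)


print(titles("hr:employeeTitle", use_equivalence=False))  # hr:emp1 only
print(titles("hr:employeeTitle", use_equivalence=True))   # hr:emp1 and hr:emp2
```

Without the equivalence, the employee described with foaf:title is silently missed, which is exactly the analysis gap the data steward would want to prevent.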

Data Stewards should also be involved in or have oversight of testing ontologies before they are deployed, to ensure that incorrect inferences can’t be made.

Semantic Web and Data Governance

Data Governance will interact with semantic technologies principally through its governance and oversight function. A Data Governance board or council should make strategic decisions regarding:

  • Sponsorship and oversight of semantic technology initiatives for EIM and other purposes
  • Sponsorship of projects for the development of enterprise ontologies and vocabularies, and approval for use of these
  • Approval of the use of external ontologies with enterprise data. For example, a data governance board for a hospital would make decisions around the use of a healthcare industry ontology, e.g., Disease Ontology [5]
  • The sharing of corporate data in an RDF/OWL format outside of the enterprise

The last point is not insignificant. Of course, any time data will be shared outside of the enterprise is a cause for possible concern, and therefore data governance oversight is called for (at least from a policy perspective). When sharing data in RDF, it might be possible for unintended inferences to be made which could cause concern. For example, if we’re sharing the last 4 digits of a phone number (along with other data), an external party could potentially combine the data with other pieces of data (e.g. type of disease, admitting hospital, etc.) in order to identify a patient, which might violate PHI (Protected Health Information) regulations.
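A governance board could ask for a simple check like the following before a dataset is released. The Python sketch below uses entirely fabricated records and a hypothetical external directory; it shows how a shared record becomes identifying when the phone suffix and admitting hospital together match exactly one person in outside data.

```python
# All records below are fabricated for illustration.
shared = [
    {"phone_last4": "1234", "disease": "asthma", "hospital": "General"},
    {"phone_last4": "9876", "disease": "diabetes", "hospital": "Mercy"},
]

# A hypothetical directory an external party might already hold.
external = [
    {"name": "A. Patient", "phone": "555-010-1234", "hospital": "General"},
    {"name": "B. Person", "phone": "555-010-5678", "hospital": "General"},
    {"name": "C. Someone", "phone": "555-010-9876", "hospital": "Mercy"},
]


def reidentify(shared, external):
    """Map each shared record's index to the external names it could match."""
    links = {}
    for i, rec in enumerate(shared):
        links[i] = [
            person["name"]
            for person in external
            if person["phone"].endswith(rec["phone_last4"])
            and person["hospital"] == rec["hospital"]
        ]
    return links


links = reidentify(shared, external)
# A record is at risk when exactly one external person matches it.
at_risk = [i for i, names in links.items() if len(names) == 1]
print(links)
print(at_risk)
```

In this made-up data both shared records match exactly one person, so the disease attached to each record could be tied to a named individual.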

A Data Governance board might wish to form a sub-committee just to understand and oversee semantic technology initiatives.

Conclusion

Semantic technologies provide powerful capabilities for the governance, stewardship, and knowledge discovery of our data and meta data resources. With inferencing, we can uncover relationships which aren’t physically recorded or asserted, and so we can better understand, measure, and manage these data resources. EIM initiatives which struggle with getting a holistic view of enterprise data should investigate the use of semantic technologies. Enterprise ontologies and vocabularies are assets which should have the same degree of governance and stewardship as other data assets. Governance and stewardship are especially important if there is any risk of falling out of regulatory compliance when sharing RDF datasets outside of the enterprise, as a result of potentially damaging inferences being made. Please let me know if you have any questions about how semantic technologies can be implemented in your organization, or if you have any questions regarding this article. I can be reached at pstiglich@ewsolutions.com.

About the Author

Pete Stiglich is a Principal Consultant/Trainer with EWSolutions with nearly 25 years of IT experience in the fields of data modeling, Data Warehousing (DW), Business Intelligence (BI), Meta Data management, Enterprise Architecture (EA), Data Governance, Data Integration, Customer Relationship Management (CRM), Customer Data Integration (CDI), Master Data Management (MDM), Database Design and Administration, Data Quality, and Project Management. Pete has taught courses on Managed Meta Data Environments (MME), Data Modeling, Dimensional Data Modeling, Conceptual Data Modeling, ER/Studio, and SQL. Pete has presented at the 2008 MIT Information Quality conference, 2007 and 2008 Marco Masters Series, at DAMA at the international and local level, and at the 2007 IADQ Conference. Pete’s articles on Information Architecture have been published in Real World Decision Support, DMForum, InfoAdvisors, and the Information and Data Quality Newsletter. Pete is a listed expert for SearchDataManagement on the topics of data modeling and data warehousing. Pete is an industry thought leader in the field of Conceptual Data Modeling. He can be reached at pstiglich@ewsolutions.com

[1] http://www.w3.org/RDF/FAQ – I highly recommend reading this very good overview.

[2] Seth Early

[3] We could identify equivalent attribute names by using our ontology for alternate names – for example, cust, cst, custmr could be identified as common abbreviations for the term “customer”.

[4] Best practice would be to identify the data stewards in data model meta data, and perform data rationalization analysis using a meta data repository so that the association of data objects to data stewards could be performed more easily. In real life, meta data management in many organizations leaves much to be desired….

[5] Developed by the Center for Genetic Medicine of Northwestern University

 