The Semantic Web and EIM: Part 3 – Meta Data Management
By Pete Stiglich
This is the third article in a series on the Semantic Web and EIM. In this article I discuss how the Semantic Web and semantic technologies can be used for meta data management and meta data analysis.
Before we discuss Meta Data Management as per a Managed Meta Data Environment (MME), it is important to understand that meta data is an important aspect of the Semantic Web in itself. For example, unstructured data can be described with Dublin Core RDF meta data elements such as dc:creator. Business concepts and terms can be described with SKOS (Simple Knowledge Organization System) elements such as skos:broader and skos:narrower (referring to broader or narrower terms). People and their relationships can be described using FOAF (Friend of a Friend). Using these and other RDF vocabularies in conjunction with your internal vocabularies, ontologies, and data can provide outstanding information sharing and knowledge discovery capabilities.
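To make the vocabulary examples above concrete, here is a minimal sketch in plain Python that treats each statement as a (subject, predicate, object) tuple and looks statements up by pattern. It is not a real triplestore: the "ex:" resources, the people, and the match helper are all hypothetical and exist only for illustration. Only the dc:, skos:, and foaf: property names come from the vocabularies mentioned above.

```python
# Each statement is a (subject, predicate, object) tuple.
# All "ex:" resources are invented for illustration.
TRIPLES = {
    ("ex:report42", "dc:creator", "ex:alice"),
    ("ex:report42", "dc:title", "Quarterly Inventory Report"),
    ("ex:onHandQuantity", "skos:broader", "ex:inventoryMeasure"),
    ("ex:inventoryMeasure", "skos:narrower", "ex:onHandQuantity"),
    ("ex:alice", "foaf:knows", "ex:bob"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who created the report, and which term is broader than On Hand Quantity?
creators = [o for _, _, o in match(TRIPLES, s="ex:report42", p="dc:creator")]
broader_terms = [o for _, _, o in match(TRIPLES, s="ex:onHandQuantity", p="skos:broader")]

print(creators)
print(broader_terms)
```

The same pattern-matching idea is what a SPARQL basic graph pattern expresses, with variables in place of the None wildcards.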
RDF/OWL for Meta Data Management
Given the simple (but powerful) paradigm of RDF triples (on which OWL is built) of “subject-predicate-object”, could an enterprise MME use an RDF triplestore as the Meta Data Repository (MDR)? The short answer is “yes” – meta data is currently stored using different modeling paradigms such as relational (e.g., Adaptive Metadata Manager) or associative (e.g., Rochade) – and these modeling paradigms all have strengths and weaknesses. I am not aware of any enterprise-level MDR tools which currently use RDF for the repository, but that is not to say that this will always be the case.
I will identify pros and cons of using RDF for the MDR, and I will propose a hybrid approach in which the strengths of RDF can be leveraged and the downsides minimized. I will also discuss using RDF for a registry approach to Meta Data Management.
Pros of RDF for Meta Data Repositories
Some pros of using RDF for a Meta Data Repository:
Enable deductive reasoning over meta data AND data. Many inferences can be drawn which uncover knowledge which might be more difficult to find with existing MDR tools.
Easy meta data portability as RDF/OWL are vendor neutral standards.
Many semantic technologies are open source or freeware thus reducing costs.
Relative ease of query (once some familiarity with SPARQL is achieved).
Triples can be stored in platforms such as Oracle 11g, Sesame, Jena TDB, and others that can handle billions of triples, so you’re not limited to storing triples in an RDF/XML or N3 file.
With some MDR tools, much thought must be put into metamodeling, and you might need to convert tool-specific meta data into a more generic model in order to facilitate end-to-end data lineage or impact analysis. For example, in Rochade, you load ERwin meta data into an ERwin subject area, Informatica meta data into an Informatica subject area, Oracle meta data into an Oracle subject area, and so on. At this point you still have islands of meta data: everything sits in the same repository, but the subject areas aren’t integrated. To enable this (physical) integration, you extract the meta data from these tool-specific subject areas into a tool-agnostic metamodel (e.g., CWM).
With RDF/OWL we can bypass physical integration into a tool agnostic metamodel. The MME repository could just be RDF files spread across the world and tied together with an enterprise metamodel (expressed in RDF/OWL) – in other words, virtual integration. Or all of these RDF files can be combined together in an RDF triplestore for more efficient querying and management. For example, assume we have an abstract concept called “Attribute” which, depending on the context, might represent a logical data model attribute, a physical model column, a database column, a column in an ETL data source or target, an XML attribute, a flat file field, a report field, etc. Instead of integrating all of these into a higher level metamodel, with RDF/OWL we could simply map these as subproperties of “Attribute” and with just that we will be able to do quite a lot.
In the example below (Figure 1), we have meta data from a limited number of sources. In these, the namespace prefixes are contrived – assume that “dwldm” is the namespace for a specific Data Warehouse Logical Data Model, “dwpdm” represents the namespace for a specific Data Warehouse Physical Data Model, “dw” the namespace for the Data Warehouse database, and so on (i.e., MOF M1 level models). The namespace prefixes “ldm”, “pdm”, “db”, and “rpt” represent ontologies which represent the metamodels (M2 level models in MOF) for logical data models, physical data models, databases, etc.
Keep in mind that the fragment IDs in our URI references (e.g., productNumber in the namespace dwldm) might not represent the actual name of the Attribute. There might be another triple which has the actual name from the model (e.g., dwldm:productNumber rdfs:label “Product Number”).
Figure 1 – Example RDF triples from multiple meta data sources – showing how the meta data is logically related.
So now we have multiple meta data sources represented in RDF – but these aren’t integrated yet. Let’s tie these together with a simple taxonomy.
Figure 2 – Attribute Taxonomy
Now we can ask questions such as (assuming more meta data has been captured than expressed above): What attributes have a length of 11 and which can be null? What have “SSN” or “social” in the name? What attributes do not have a description? What attributes and meta data sources talk about On Hand Quantity? We would be able to answer these without having to know in advance if we’re talking about a logical attribute, a database column, a report field, etc.
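The questions above can be sketched in plain Python over tuples. This is a rough illustration under several assumptions, not the content of Figures 1 and 2: it assumes the taxonomy is expressed with rdfs:subClassOf under a top-level mme:Attribute class, and the property names mme:length and mme:nullable, the "rpt1" prefix, and all sample values are hypothetical.

```python
# Assumed taxonomy: every tool-specific attribute class is a subclass of mme:Attribute.
# Property names, the rpt1 prefix, and all values are illustrative assumptions.
TRIPLES = [
    ("ldm:Attribute", "rdfs:subClassOf", "mme:Attribute"),
    ("pdm:Column", "rdfs:subClassOf", "mme:Attribute"),
    ("db:Column", "rdfs:subClassOf", "mme:Attribute"),
    ("rpt:Field", "rdfs:subClassOf", "mme:Attribute"),
    ("dwldm:productNumber", "rdf:type", "ldm:Attribute"),
    ("dwldm:productNumber", "rdfs:label", "Product Number"),
    ("dwldm:productNumber", "dc:description", "Identifier assigned to a product"),
    ("dwldm:productNumber", "mme:length", 11),
    ("dwldm:productNumber", "mme:nullable", False),
    ("dwpdm:prod_nbr", "rdf:type", "pdm:Column"),
    ("dwpdm:prod_nbr", "rdfs:label", "PROD_NBR"),
    ("dwpdm:prod_nbr", "mme:length", 11),
    ("dwpdm:prod_nbr", "mme:nullable", False),
    ("dw:cust_ssn", "rdf:type", "db:Column"),
    ("dw:cust_ssn", "rdfs:label", "CUST_SSN"),
    ("dw:cust_ssn", "mme:length", 11),
    ("dw:cust_ssn", "mme:nullable", True),
    ("rpt1:socialSecurityNo", "rdf:type", "rpt:Field"),
    ("rpt1:socialSecurityNo", "rdfs:label", "Social Security No"),
]


def subclasses_of(triples, root):
    """Return root plus every class that reaches it via rdfs:subClassOf."""
    found, frontier = {root}, [root]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def describe_attributes(triples, root="mme:Attribute"):
    """Collect the properties of every individual typed as any kind of Attribute."""
    classes = subclasses_of(triples, root)
    props = {s: {} for s, p, o in triples if p == "rdf:type" and o in classes}
    for s, p, o in triples:
        if s in props:
            props[s][p] = o
    return props


attrs = describe_attributes(TRIPLES)

# What attributes have a length of 11 and can be null?
len11_nullable = sorted(
    a for a, d in attrs.items()
    if d.get("mme:length") == 11 and d.get("mme:nullable") is True
)
# What attributes have "SSN" or "social" in the name?
ssn_like = sorted(
    a for a, d in attrs.items()
    if any(k in str(d.get("rdfs:label", "")).lower() for k in ("ssn", "social"))
)
# What attributes do not have a description?
undescribed = sorted(a for a, d in attrs.items() if "dc:description" not in d)

print(len11_nullable, ssn_like, undescribed)
```

None of the three queries needs to know whether a result is a logical attribute, a database column, or a report field; the subclass closure takes care of that.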
To do Information Supply Chain (horizontal data lineage) or Data Rationalization (vertical data lineage) analysis or impact analysis across these attributes, we will need a way to map the instances (or individuals) – e.g., map dwldm:productNumber to dwpdm:prod_nbr. This can be done with RDF triples as well. First we have to define our mapping properties, and then we can map the instances.
Figure 3 – Mapping Properties In A Mapping Ontology
Figure 4 – Mapping Instances
We can now ask questions such as “Where is the logical attribute ‘Product Number’ used in reports?” and “If I need to change the length of Product Number, what are all the downstream impacts?” SPARQL can’t (currently) query hierarchical structures of indeterminate depth, but there are ways to get around this issue. Of course, handling indeterminate or ragged hierarchies efficiently is bread and butter for MDR tools. One way to address this issue in SPARQL is to define these mapping properties as subproperties of one another, so that inferencing works around the problem. For example, define “mme:implementedBy rdfs:subPropertyOf mme:physicallizedBy” and “mme:usedBy rdfs:subPropertyOf mme:implementedBy”, and so on. However, this would not be correct from a modeling perspective: mme:usedBy is not a more specific mme:implementedBy. There are some extensions to SPARQL and other techniques to address this issue.
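One technique outside SPARQL is to walk the mapping triples programmatically. The sketch below is a plain Python breadth-first traversal that follows any mapping property to any depth. The three mme: property names come from the text above; the mapping instances beyond dwldm:productNumber to dwpdm:prod_nbr, along with the "etl" and "salesrpt" prefixes, are hypothetical.

```python
from collections import deque

# Mapping properties named in the text above.
MAPPING_PROPERTIES = {"mme:physicallizedBy", "mme:implementedBy", "mme:usedBy"}

# Mapping instances; everything after the first triple is a made-up example.
MAPPINGS = [
    ("dwldm:productNumber", "mme:physicallizedBy", "dwpdm:prod_nbr"),
    ("dwpdm:prod_nbr", "mme:implementedBy", "dw:prod_nbr"),
    ("dw:prod_nbr", "mme:usedBy", "etl:prod_nbr"),
    ("dw:prod_nbr", "mme:usedBy", "salesrpt:productNumber"),
    ("dwldm:productName", "mme:physicallizedBy", "dwpdm:prod_nm"),
]


def downstream(triples, start):
    """Breadth-first walk over mapping properties.

    Returns (node, depth) pairs for everything reachable from start.
    The seen set keeps the walk from looping if the mappings contain a cycle.
    """
    seen = {start}
    queue = deque([(start, 0)])
    result = []
    while queue:
        node, depth = queue.popleft()
        for s, p, o in triples:
            if s == node and p in MAPPING_PROPERTIES and o not in seen:
                seen.add(o)
                result.append((o, depth + 1))
                queue.append((o, depth + 1))
    return result


# "If I change Product Number, what is impacted downstream?"
impacts = downstream(MAPPINGS, "dwldm:productNumber")
for node, depth in impacts:
    print("  " * depth + node)
```

Because the traversal treats all mapping properties alike, it avoids asserting the incorrect subproperty relationships described above; the trade-off is that the logic lives in application code rather than in the query language.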
All of the above types of queries can be done using an MDR tool, but the most powerful reason for using RDF for an MDR is that meta data triples can be easily combined with data triples in order to draw inferences that would otherwise be much more difficult to obtain – now we can have a much richer knowledge base. In other words, we’re forming the Giant Global Graph (GGG) Tim Berners-Lee envisioned for the Semantic Web – even if our GGG is limited to our enterprise, behind the firewall.
For example, I’m interested in a business concept “Industry Classification” and I want to find out as much about “Industry Classification” in the enterprise as possible. Using RDF, we could retrieve the meta data and data about this concept. We could see the different ways it is implemented, different names for the concept, descriptions, taxonomies (e.g. an NAICS Industry rolls up to an Industry Group which rolls up to a Sub Sector, etc.), and actual data values. I don’t want to imply that RDF is a silver bullet – of course it requires architecture, resources, and effort (e.g., ontology modeling is not for the faint of heart) to set this up, and there are other complexities to think about.
However, the alternative is that our analysts might perform some meta data analysis against an MDR (if you’re fortunate, a single MME MDR), and then have to access the data with different toolsets. Since the data is probably in many different kinds of databases, we might have to wait to get access, so we’re talking about serious roadblocks to performing analysis. But the biggest drawback to not utilizing RDF/OWL (and SWRL) is that the automated deductive reasoning that these technologies can provide has to be performed manually or by ad-hoc programming. Using these semantic technologies means that data, meta data, business rules, and much logic can be encapsulated together. Currently, a significant part of application development is encoding business rules to create new knowledge.
Cons of RDF for Meta Data Repositories
Of course, the repository is just one of many components of a robust MME. MDR tools are mature, proven technologies that provide visualization capabilities, scanners and buses, management, direct user update, and other capabilities which would have to be duplicated were an RDF/OWL store used for the MDR. While RDF/OWL are not new technologies (they’ve been around for at least 10 years), the semantic technologies surrounding these standards might not be as mature. For example, SPARQL can’t yet perform aggregate functions (e.g., SUM, MIN, MAX) or DML operations (insert, update, delete), and so managing the RDF data store requires other means to perform these operations. However, these and other issues should be addressed with SPARQL 2. Again, there is the issue of indeterminate or ragged hierarchies as mentioned above. There are ways to get around these problems, but a mature MDR tool might not face these same issues.
Visualization capabilities of MDR tools are not to be taken for granted – for example, you should be able to see graphically where the data for a report field came from and how it was calculated and transformed along the path back to the original source.
Metamodeling takes data modeling to the next level – but ontology modeling in OWL is even more challenging (but can provide for extremely powerful benefits). Only a mature organization with semantic modeling and programming resources and with at least a couple of successful RDF/OWL projects under its belt should consider RDF/OWL for an MME MDR. Even then, the scope for the initial project should be sufficiently narrow in order to be successful.
For these reasons, an RDF/OWL based MDR is probably not the best choice for an MME at this time. Interestingly, it doesn’t seem like MDR tool vendors have RDF on the radar – even for sourcing RDF/OWL meta data or serializing MDR meta data out to RDF/OWL. I searched a few prominent MDR vendor websites for the term “RDF” and was surprised to get no hits on several of them. With the visibility RDF (e.g., data.gov) and OWL are receiving these days, these technologies should definitely be on the MDR tool vendors’ radar.
The best approach in my opinion at this time is a hybrid one where a traditional MDR tool is used so that these mature technologies can support a robust and possibly mission critical MME, but with a mechanism to be able to serialize meta data to an RDF/OWL format (e.g., RDF/XML, N3) so that our meta data can be used to automate deductive reasoning by inferencing.
Serializing meta data into RDF/OWL from an MDR tool repository might be more or less difficult based on the storage platform the MDR tool uses. For example, if the MDR tool uses an RDBMS, then serializing the meta data using a tool such as D2R Map will probably be less difficult than if a proprietary platform is used. At a minimum, MDR meta data could be exported to XML (e.g., XMI) and translated into RDF/OWL via XSLT or other means.
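As a rough illustration of the RDBMS case, the sketch below reads column meta data from a relational table and writes it out as N-Triples using only the Python standard library. It is not how D2R Map works; it is a hand-rolled stand-in. The column_meta table layout, the http://example.org/mme/ namespace, and the Column class are all assumptions made for the example.

```python
import sqlite3
from urllib.parse import quote

BASE = "http://example.org/mme/"  # hypothetical namespace for this sketch
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def literal(value):
    """Quote a value as an N-Triples string literal, escaping special characters."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_ntriples(rows):
    """Turn (table, column, label) rows into N-Triples lines."""
    lines = []
    for table, column, label in rows:
        # Percent-encode names so they are safe inside a URI.
        subject = f"<{BASE}{quote(table, safe='')}/{quote(column, safe='')}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}Column> .")
        lines.append(f"{subject} <{RDFS_LABEL}> {literal(label)} .")
    return lines


# An in-memory database stands in for the MDR's relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE column_meta (table_name TEXT, column_name TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO column_meta VALUES (?, ?, ?)",
    [
        ("product", "prod_nbr", "Product Number"),
        ("product", "prod_nm", 'Product "Display" Name'),
    ],
)
rows = conn.execute(
    "SELECT table_name, column_name, label FROM column_meta ORDER BY column_name"
).fetchall()
conn.close()

ntriples = rows_to_ntriples(rows)
print("\n".join(ntriples))
```

N-Triples is used here because each line stands alone, which makes the output easy to generate and to load into most triplestores; a real export would also need to cover data types, relationships between tables, and the rest of the repository's metamodel.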
In the hybrid approach, we’re sacrificing centralization – our meta data would be found in the MDR repository and (at least some of it) in an RDF/OWL format. However, the payoff of having both a mature MDR technology for meta data sourcing, integration, visualization, and management and a new way to be able to leverage meta data for powerful knowledge discovery using data and meta data together seems to make this the best choice at this point in time.
Low Cost, Distributed MME?
Acquiring an MDR tool is not inexpensive. Collecting meta data into a central repository can pose funding and political issues. Could RDF/OWL be used instead to provide a low cost, distributed option for an MME? RDF/OWL is already being used extensively for Linked Open Data across the internet. Why not use this same approach, where meta data sources are translated to RDF/OWL and integration is enabled via URI references? A registry (using RDF/OWL) would identify where these resources are and bridge between these meta data sources. This is definitely an option and even an appealing one, though questions around management (e.g., what if an RDF meta data source is unavailable, or how often will the meta data in an RDF source be refreshed) need to be taken into account. With this approach meta data can be locally managed. Local teams simply expose the meta data they choose to make available as RDF/OWL and then go on their merry way and don’t have to worry about that nasty enterprise group asking for access to their system and possibly affecting their architecture.
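The registry idea, including the question of an unavailable source, can be sketched as follows. This is a simplified stand-in in plain Python: each registered source is represented by a hypothetical loader function that returns triples, whereas a real registry would hold URIs and fetch RDF documents or query endpoints. The source names and triples are invented for the example.

```python
# Hypothetical loaders standing in for locally managed RDF meta data sources.
def load_dw_logical_model():
    return {("dwldm:productNumber", "rdfs:label", "Product Number")}


def load_dw_physical_model():
    return {("dwpdm:prod_nbr", "rdfs:label", "PROD_NBR")}


def load_report_catalog():
    # Simulates a source that is offline when the registry is consulted.
    raise ConnectionError("report catalog is not reachable")


# The registry records which sources exist and how to reach them.
REGISTRY = [
    {"source": "dwldm", "loader": load_dw_logical_model},
    {"source": "dwpdm", "loader": load_dw_physical_model},
    {"source": "rpt", "loader": load_report_catalog},
]


def federate(registry):
    """Merge triples from every reachable source and report the ones that failed."""
    merged, unavailable = set(), []
    for entry in registry:
        try:
            merged |= entry["loader"]()
        except ConnectionError:
            unavailable.append(entry["source"])
    return merged, unavailable


graph, missing = federate(REGISTRY)
print(len(graph), "triples loaded; unavailable sources:", missing)
```

Reporting the missing sources alongside the merged graph lets an analyst see that an answer may be incomplete, which is one of the management questions a distributed MME has to address.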
Semantic technologies such as RDF/OWL and SPARQL are meant to gracefully enable integration across numerous distributed sources. Of course, the greater the degree of standardization, the easier the integration task will be. For example, if a Data Governance Board approves a list of standard vocabulary elements (e.g., from Dublin Core, FOAF, etc.), this will minimize the amount of modeling required to map the varying terms and properties. But in real life, for whatever reason (e.g., different software packages), things will be named differently. Using equivalence features of OWL we can state that dc:creator owl:equivalentProperty xyz:author and so easily be able to find all the books or articles written by a certain person.
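The effect of that equivalence statement can be imitated in a few lines of plain Python. This is only a sketch of what an OWL reasoner would do for this one axiom, not a reasoner itself; the "lib", "cms", and "ppl" prefixes and the sample works are hypothetical.

```python
TRIPLES = [
    ("dc:creator", "owl:equivalentProperty", "xyz:author"),
    ("lib:book1", "dc:creator", "ppl:jsmith"),      # catalogued with Dublin Core
    ("cms:article7", "xyz:author", "ppl:jsmith"),   # catalogued with a local vocabulary
    ("cms:article9", "xyz:author", "ppl:mjones"),
]


def equivalent_properties(triples, prop):
    """Return prop plus every property linked to it by owl:equivalentProperty.

    The link is followed in both directions and transitively.
    """
    group, frontier = {prop}, [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != "owl:equivalentProperty":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in group:
                    group.add(b)
                    frontier.append(b)
    return group


def works_by(triples, person, prop="dc:creator"):
    """Find everything created by person under prop or any equivalent property."""
    props = equivalent_properties(triples, prop)
    return sorted(s for s, p, o in triples if p in props and o == person)


smith_works = works_by(TRIPLES, "ppl:jsmith")
print(smith_works)
```

The query asks only about dc:creator, yet the article recorded with xyz:author is found as well, which is the payoff of stating the equivalence once instead of rewriting every query.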
Meta data is already an important part of the Semantic Web. While semantic technologies for enterprise meta data management might not be ready for prime time currently, this is definitely something to keep on the radar. In any case, we should begin looking into using semantic technologies to enable deductive reasoning against data and meta data to discover new knowledge. Many organizations still do not have enterprise MMEs. Perhaps coupling MMEs with semantic technologies can enable killer apps that make our enterprises more competitive, profitable, agile, and efficient through the increased automation that MMEs and semantic technologies can enable.
About the Author
Pete Stiglich is a Principal Consultant/Trainer with EWSolutions with nearly 25 years of IT experience in the fields of data modeling, Data Warehousing (DW), Business Intelligence (BI), Meta Data management, Enterprise Architecture (EA), Data Governance, Data Integration, Customer Relationship Management (CRM), Customer Data Integration (CDI), Master Data Management (MDM), Database Design and Administration, Data Quality, and Project Management. Pete has taught courses on Managed Meta Data Environments (MME), Data Modeling, Dimensional Data Modeling, Conceptual Data Modeling, ER/Studio, and SQL. Pete has presented at the 2008 MIT Information Quality conference, 2007 and 2008 Marco Masters Series, at DAMA at the international and local level, and at the 2007 IADQ Conference. Pete’s articles on Information Architecture have been published in Real World Decision Support, DMForum, InfoAdvisors, and the Information and Data Quality Newsletter. Pete is a listed expert for SearchDataManagement on the topics of data modeling and data warehousing. Pete is an industry thought leader in the field of Conceptual Data Modeling. He can be reached at email@example.com