Managing Unstructured Data
By Larissa Moss
This article is an excerpt from the book Data Strategy (Addison-Wesley, 2005) by Sid Adelman, Larissa T. Moss and Majid Abai. It is based on Chapter 11, “Strategies for Managing Unstructured Data,” written by Majid Abai, and is reprinted with permission.
According to a 2003 study by the University of California at Berkeley, about 5 exabytes of unique analog and digital information were produced worldwide in 2002, twice the amount produced in 1999. (An exabyte is roughly the equivalent of 1,000 petabytes, 1 million terabytes, or 1 billion gigabytes.) That’s a data explosion equivalent to half a million new libraries the size of the print collection of the Library of Congress, and the volume will continue to expand exponentially. We haven’t seen any further studies, but today, in 2009, after the massive adoption of social networks such as Facebook, YouTube, MySpace and Twitter, the number must be far larger. IBM estimates that about 85 percent of all data is unstructured and about 50 percent of the unstructured data is duplicated. Therefore, any discussion about a data strategy is incomplete without formulating a tactic for maintaining unstructured data.
Why is it that after all these years, organizations suddenly focus on unstructured data, whether it is internal (emails, documents, pictures, audio files, video files) or external (content on social networking sites such as YouTube, Facebook, MySpace, Twitter, etc.)? The primary reason is that this huge volume of (mostly duplicated) unstructured data costs organizations money in storage and backup, not to mention the hidden cost of lost productivity. In addition, because there is no central strategy for managing and retaining unstructured data, it continuously gets duplicated.
For example, an engineering organization (an ISO-9000 shop) designed the same part 19 times, and some other parts were reengineered between 2 and 17 times, because the organization did not realize it had already manufactured those parts. How could they not know? The answer is simple if you look at it from a data perspective. The organization had manufactured over 5,000 parts for various clients, and although the specifications and designs were maintained in digital format, there was no centralized strategy for maintaining and searching designed parts. So, when an order came from a client to create a similar part, the engineering department did a cursory search only through that client’s previous orders. When they did not find any matching specifications, they reengineered the same part, even though it was in the inventory under another name for another client.
The second major reason is that organizations have learned to listen to and enhance the chatter on blogs, forums, and social networks in order to improve their brand recognition and customer service. Using techniques such as viral and conversational marketing to motivate brand evangelists and tame unhappy customers has yielded massive ROI for marketing departments. But how is this related to unstructured data? Blogs, forums, and social networks are all about unstructured data. There is hardly an ounce of structured data in social media. The ability to harness, track, and analyze this media to an organization’s advantage is enormously valuable.
Another major reason for the push to develop a strategy for managing unstructured data is the Sarbanes-Oxley Act of 2002. There are two sections in the Act that relate to reports and documents. One is Title VIII, which makes it a felony to knowingly destroy any documents to “impede, obstruct, or influence” any existing or contemplated federal investigation. The other section makes it a crime for any person to corruptly alter, destroy, mutilate, or conceal any document with the intent to impair the object’s integrity or availability for use in an official proceeding. Therefore, all public corporations must ensure that relevant documents are maintained in electronic format for 5-7 years and must be readily available for scrutiny by various governmental entities.
In addition to Sarbanes-Oxley, other legislation in the US, EU, UK, Australia, and other jurisdictions has forced public organizations to maintain all electronic data (including documents, reports, publications, and even email) for up to 7 years. In one case, a major financial institution estimated having 350 terabytes of just electronic mail in multiple languages that needed to be stored, retained, and searched if and when regulators requested. Such a request might call for a search for emails produced by “Joe Smith” between January and March 2001 that contain the word “bribe.” Imagine the time it would take to respond to such a request with no strategy except for the basic backup of shared drives and email servers. Even if the search involved one email server and one shared drive, it would take months to restore, search, and find every piece of email produced by a specific person in that period containing the word “bribe.” CTOs and, more importantly, legal departments cannot wait that long.
Yet, in most organizations, unstructured data is created and maintained in content silos where content authors work in isolation from one another. Such isolation causes redundancy, poor communication, lack of standards, and higher costs for creation of the content. In addition, the search and consumption of such content is harder and more costly for both internal and external users because the content sitting in isolated silos is not necessarily inventoried or accessible to consumers.
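The regulator's request described above (one author, one date range, one keyword) becomes inexpensive once each message is indexed when it is captured rather than restored from backup when it is requested. The following is a minimal sketch of that idea, not any vendor's product: the `Email` record, the `EmailIndex` class, and the sample messages are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
import re


@dataclass(frozen=True)
class Email:
    # Hypothetical record layout, for illustration only.
    msg_id: int
    author: str
    sent: date
    body: str


class EmailIndex:
    """Toy inverted index: word -> ids of the messages that contain it."""

    def __init__(self):
        self._by_id = {}
        self._by_word = defaultdict(set)

    def add(self, email):
        # Index the message once, at capture time.
        self._by_id[email.msg_id] = email
        for word in re.findall(r"[a-z']+", email.body.lower()):
            self._by_word[word].add(email.msg_id)

    def search(self, author, start, end, word):
        # Start from the messages containing the word, then filter by
        # author and date range; return matches ordered by send date.
        hits = self._by_word.get(word.lower(), set())
        matches = [
            m for m in (self._by_id[i] for i in hits)
            if m.author == author and start <= m.sent <= end
        ]
        return sorted(matches, key=lambda m: m.sent)


index = EmailIndex()
index.add(Email(1, "Joe Smith", date(2001, 2, 14), "Call it a gift, not a bribe."))
index.add(Email(2, "Joe Smith", date(2001, 6, 1), "No bribe was offered."))
index.add(Email(3, "Ann Lee", date(2001, 2, 20), "Was that a bribe?"))

results = index.search("Joe Smith", date(2001, 1, 1), date(2001, 3, 31), "bribe")
```

Only the first message satisfies all three conditions. A production archive would add stemming, multilingual tokenization, and access control, none of which this sketch attempts.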
Unified Content Strategy
Ann Rockley, one of the leading consultants in the area of enterprise content management, describes a unified content strategy as “a repeatable method of identifying all content requirements upfront, creating consistently structured content for reuse, managing that content in a definitive source, and assembling content on demand to meet your customers’ needs.” Although this is an excellent definition, we believe that the organization’s data strategy should also anticipate that new data types, content requirements, and data sources, both structured and unstructured, will be added in the future. As such, the strategy should be standardized to remove the risk of content silos and yet be flexible enough to embrace various new types of data.
Storage and Administration
In the past 30 years, the world of structured data has evolved from disparate files into central databases administered by database management systems or DBMS. All DBMS have been designed to perform the same types of tasks regardless of whether they are in a hierarchical, network, object, or relational format. They have methods for storing, manipulating, searching, and retrieving records. In addition, all DBMS allow systematic backups and restores, data reorganizations, index management, and performance management, as well as securing the data against unauthorized access.
As unstructured data goes mainstream in organizations, we expect the same evolution to happen in managing unstructured data objects. The recognition of unstructured data as a vital corporate asset has forced leading IT departments to adopt enterprise content management systems (ECMS) for managing all unstructured data objects. Such systems allow for storing, administering, searching, securing, and retrieving content from a centralized base in the same way that a DBMS allows for administering structured data. ECMSs provide a repository for unstructured objects, capture metadata for each object to support future searches, and create relationships among objects. They also provide centralized check-in and check-out functionality as well as backup, restore, and disaster recovery capabilities; they enable users to search across various objects; and they secure objects from unauthorized access. Such products often provide functions that fall outside the scope of structured DBMSs, such as versioning, retention, and archiving. They also provide electronic workflow management capabilities to end users. Electronic workflows simplify the routing of documents and other electronic objects. A good example is a mortgage company where an applicant submits an application online, which is reviewed by a mortgage processor and then electronically forwarded to the manager for approval.
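To make the repository functions named above more concrete, here is a small sketch of a content store that captures metadata, enforces check-out before check-in, and keeps every version. The `Repository` and `StoredObject` classes and their method names are assumptions made for this illustration; real ECMS products expose their own, much richer interfaces.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StoredObject:
    # Hypothetical record for one managed object.
    name: str
    metadata: dict
    versions: list = field(default_factory=list)  # oldest first
    checked_out_by: Optional[str] = None


class Repository:
    """Toy content repository: metadata, check-in/check-out, versioning."""

    def __init__(self):
        self._objects = {}

    def add(self, name, content, **metadata):
        self._objects[name] = StoredObject(name, metadata, [content])

    def check_out(self, name, user):
        obj = self._objects[name]
        if obj.checked_out_by is not None:
            raise PermissionError(f"{name} is checked out by {obj.checked_out_by}")
        obj.checked_out_by = user
        return obj.versions[-1]

    def check_in(self, name, user, new_content):
        obj = self._objects[name]
        if obj.checked_out_by != user:
            raise PermissionError(f"{user} has not checked out {name}")
        obj.versions.append(new_content)  # earlier versions are kept
        obj.checked_out_by = None

    def history(self, name):
        return list(self._objects[name].versions)

    def find(self, **criteria):
        # Metadata search: every given key must match exactly.
        return sorted(
            name for name, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


repo = Repository()
repo.add("loan-app-1001", b"draft", doc_type="mortgage application")
repo.add("brochure", b"text", doc_type="marketing")

draft = repo.check_out("loan-app-1001", "processor")
repo.check_in("loan-app-1001", "processor", draft + b" reviewed")
matches = repo.find(doc_type="mortgage application")
```

The mortgage workflow in the text would sit on top of a store like this one: each routing step is a check-out by one role followed by a check-in that produces a new version.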
Why not use a DBMS to manage unstructured data objects? After all, most DBMS products allow for capturing binary large objects (BLOBs) or character large objects (CLOBs). There are several reasons for not utilizing structured DBMSs. First and foremost, these DBMSs are not designed to manage content. They are structured data management tools designed to relate sets of tables and columns. Unstructured data objects do not fit into tables and columns. How would one store the many sections of a manual, including pictures, diagrams, indexes, footnotes, fonts, and formats, in tables and columns? Secondly, DBMSs are not designed to manage versioning, retention, and archiving of objects. Another reason is the size of the database and its management. There are usually two ways to store LOBs in a DBMS: store them in the database, or store them on the file system and store a link to each LOB in the DBMS. If one uses the first technique, the database becomes gigantic, and basic functions such as backup and restore take a huge amount of time. If the second option is chosen, the integrity of the database is compromised because the typical DBMS places no restriction on viewing, changing, or deleting the object file, and therefore an unauthorized person can effectively view or remove a secured file.
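The two LOB storage options, and the integrity gap in the second one, can be shown with Python's built-in `sqlite3` module. This is only an illustration: the table names, column names, and file name are invented, and SQLite stands in for whatever DBMS an organization actually runs.

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inline_docs (id INTEGER PRIMARY KEY, name TEXT, body BLOB)")
conn.execute("CREATE TABLE linked_docs (id INTEGER PRIMARY KEY, name TEXT, path TEXT)")

payload = b"stand-in bytes for a large manual"

# Option 1: the object lives inside the database, so the database grows
# by the full size of every object.
conn.execute(
    "INSERT INTO inline_docs (name, body) VALUES (?, ?)", ("manual.pdf", payload)
)

# Option 2: the object lives on the file system; the database holds a path.
store_dir = tempfile.mkdtemp()
file_path = os.path.join(store_dir, "manual.pdf")
with open(file_path, "wb") as f:
    f.write(payload)
conn.execute(
    "INSERT INTO linked_docs (name, path) VALUES (?, ?)", ("manual.pdf", file_path)
)

# Nothing in the database stops someone from deleting the file directly.
os.remove(file_path)
os.rmdir(store_dir)

inline_body = conn.execute(
    "SELECT body FROM inline_docs WHERE name = ?", ("manual.pdf",)
).fetchone()[0]
linked_path = conn.execute(
    "SELECT path FROM linked_docs WHERE name = ?", ("manual.pdf",)
).fetchone()[0]

inline_intact = inline_body == payload
link_is_dangling = not os.path.exists(linked_path)
```

After the file is removed, the inline copy is still intact, while the linked row still exists but points at nothing, and the database has no way of knowing.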
Archiving and Retention
Archiving is an important function in the unstructured data world. The administrators of unstructured data should plan for a time when the active objects (or old versions) should be archived and/or removed from the repository. Archiving methods are slightly different for unstructured data than for structured data. On the structured side, archiving is usually an afterthought: several years after deployment of the original database, we are asked to archive data in a format that can be retrieved later. With a typical DBMS, there is no out-of-the-box way to archive the basic records. In an ECMS, there is usually an archive flag associated with each object that signals that the object is to be archived.
Retention requirements have applied to non-electronic records for years. Records retention varies from organization to organization, department to department, and even document (object) type to document type. The same requirements have now been expanded to include electronic documents. To reduce the risk of deleting required records, the current approach is to generally remove the responsibility for retention of an object from all employees and to centralize it in a small group with only a few administrators. Utilizing specialized records retention software, the administrators create various categories and subcategories that are several layers deep. They can then set up retention rules for each layer. The rules vary from a certain period after the creation of the object to a number of years after the termination of its author to dependency on another object. The same records retention software is used to identify the new and changed files in the network drive (or email server) and then to group them into categories and subcategories (automatically or with the help of administrators). This software utilizes internal security to prohibit everybody except authorized users from deleting the object. Any modification to the object will take place in the form of a new version and is subject to all mentioned scrutiny. All the major ECMSs have introduced their own version of records retention that not only allows for centrally managing the content but also managing the retention rules.
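The layered categories and per-layer rules described above can be sketched as a lookup that falls back from the most specific category to its parents. The example covers only the simplest rule type, a fixed period after creation; the category names and retention periods are invented for illustration and are not legal guidance.

```python
from datetime import date

# Hypothetical retention schedule keyed by category path. The most specific
# matching layer wins. Periods are illustrative only.
RETENTION_YEARS = {
    ("finance",): 7,
    ("finance", "invoices"): 5,
    ("hr",): 3,
}


def retention_years(category):
    """Walk from the full category path up to its parents until a rule matches."""
    path = tuple(category)
    while path:
        if path in RETENTION_YEARS:
            return RETENTION_YEARS[path]
        path = path[:-1]
    raise KeyError(f"no retention rule covers {category!r}")


def earliest_disposal(category, created):
    """Date from which an object created on `created` may be disposed of."""
    years = retention_years(category)
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # 29 February mapped onto a non-leap year: fall back to 28 February.
        return created.replace(year=created.year + years, day=28)


def may_dispose(category, created, today):
    return today >= earliest_disposal(category, created)
```

For example, `retention_years(("finance", "invoices", "2003"))` finds no rule at the third layer and falls back to the five-year rule for invoices. Rules tied to an author's termination date or to another object would need additional inputs that this sketch leaves out.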
One of the most important aspects of managing unstructured data is the need and ability to reuse an existing content component in a new object. For example, an organization needs to create a website and printed marketing material; however, it does not need to create content for each of the objects separately. It can create a set of content for one, and then use paragraphs, pictures, graphs, and even sentences from one in the other. Utilizing this method, a content change to a specific object automatically changes the content on the other, ensuring the consistency of the message across the organization.
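The reuse idea above amounts to storing each content component once and having every deliverable refer to it rather than copy it. A minimal sketch, with invented component names and text:

```python
# Single-source components, keyed by a hypothetical id.
components = {
    "tagline": "Precision parts, delivered on time.",
    "contact": "Call 555-0100 for a quote.",
}

# Each deliverable stores references to components, not copies of their text.
deliverables = {
    "website_home": ["tagline", "contact"],
    "print_brochure": ["tagline"],
}


def assemble(name):
    """Build a deliverable's text from the current state of its components."""
    return " ".join(components[cid] for cid in deliverables[name])


before = assemble("print_brochure")

# One edit, made in one place.
components["tagline"] = "Precision parts, delivered early."

after_web = assemble("website_home")
after_print = assemble("print_brochure")
```

Because both deliverables are assembled on demand from the same source, the edited tagline appears in the website and the brochure without either being touched.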
Search and Delivery
Structured data is stored to be searched, manipulated, and viewed. Unstructured data objects are no different. Therefore, your unified content strategy should include methodologies for easy access, search, and delivery of content from various ECMS tools in the organization. In addition, all enterprise-level ECMS tools allow explicit or implicit capture of metadata to enhance future searches. They also provide application programming interfaces (APIs) for developers to search the metadata of a specific object. Most ECMS tools also provide caching methods for faster search and delivery of the content to end users. Since users access data not only on their PCs but also on their home computers, laptops, PDAs, and cell phones, ensure that regardless of the ECMS tool used, its search and delivery methods are standardized.
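As a rough illustration of metadata search with caching, the sketch below wraps a lookup in Python's standard `functools.lru_cache`. The catalog contents and the `search` function are hypothetical; an actual ECMS would expose its own API and its own cache.

```python
from functools import lru_cache

# Hypothetical metadata catalog: object name -> metadata.
CATALOG = {
    "manual-v2.pdf": {"type": "manual", "language": "en"},
    "manual-v2-de.pdf": {"type": "manual", "language": "de"},
    "logo.png": {"type": "image", "language": None},
}

# Counts real catalog scans so the effect of the cache is visible.
stats = {"backend_calls": 0}


@lru_cache(maxsize=128)
def search(field, value):
    """Return the names whose metadata field equals value; results are cached."""
    stats["backend_calls"] += 1
    return tuple(sorted(n for n, md in CATALOG.items() if md.get(field) == value))


first = search("type", "manual")
second = search("type", "manual")  # same arguments, served from the cache
```

The second call returns the same result without scanning the catalog again. The trade-off is staleness: whenever an object is added or its metadata changes, the cache has to be invalidated, for example with `search.cache_clear()`.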
Although content has been around for a long time, it’s been just a short time since some organizations have started to consider it data. For that reason, there are few technologies related to unstructured data. At the same time, there are some emerging markets and these promise a bright future for products and services in the complete enterprise data management market.
Digital Asset Management Software
The relationship between digital asset management (DAM) software and ECMS is like the relationship of application software (e.g. ERP, CRM) to its DBMS. ECMS is the heart and soul of a DAM system. DAM contains applications that resolve a specific operational (and, in the future, analytical) need of an organization by providing automatic batch entry, manual entry, workflow, and of course search, retrieval, and archiving of digital content from an ECMS. A great example of DAM is software developed for the entertainment industry in which the detailed aspects of a movie, including screenplay, storyboards, still photographs, daily shoots, edited versions, soundtracks, songs, marketing posters, interviews, junket clips, and so on, are loaded into a central repository regardless of where the shoot takes place around the world. In this case, the director, the producers, and the executives can track and view the daily work performed by all teams associated with the movie from the comfort of their homes or offices. In addition, the software allows for a complete archive of all digital information related to a movie in one central place, assuring availability of all the data associated with a movie for future generations. Another (currently much discussed) application of DAM is electronic medical records (EMR) software. This application allows medical offices and hospitals to maintain information about patients in a digital format and communicate that information electronically to doctors and other clinics. In addition, it can give researchers access to detailed information about patients so they can analyze patterns in treatments and suggest better methods of treatment for future patients with the same disease.
Digital Rights Management Software
Copyright laws in the US and most of the world give the author or rights holder of a document a say in the reproduction of her material and allow her to be paid royalties and licensing fees associated with its use. Digital rights management (DRM) software utilizes technological methods to enforce these rights. You have probably heard of the old Napster™. The problem with the original Napster™ was that it allowed people to share copyrighted material without its rights holders receiving royalties from the people who downloaded the music files. DRM software focuses on allowing only users who hold the right to do so to listen to a song or watch a video, and only for as long as they maintain that right. DRM software products also protect any other digital asset from unauthorized usage, including competition-sensitive corporate assets. Imagine a corporate executive who downloads a confidential report onto his laptop and then loses the laptop. The person who finds or steals the laptop can break into the system and view the contents of its files. How does DRM protect against these problems? In this scenario, the contract between the executive and the object (the confidential report) is maintained on, and always checked through, a corporate license server. When the laptop is lost, the executive notifies a corporate security officer, who will (as part of the normal procedure) remove the rights to all sensitive files on the executive’s laptop.
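The lost-laptop scenario can be sketched as a license check that runs every time a protected document is opened. The `LicenseServer` class, the `open_document` function, and all identifiers below are hypothetical stand-ins; commercial DRM products add encryption, offline grace periods, and audit trails that are not modeled here.

```python
class LicenseServer:
    """Toy stand-in for a corporate license server (hypothetical API)."""

    def __init__(self):
        self._grants = set()  # (user, device, document) triples

    def grant(self, user, device, document):
        self._grants.add((user, device, document))

    def revoke_device(self, device):
        # Drop every right tied to a lost or stolen device.
        self._grants = {g for g in self._grants if g[1] != device}

    def is_allowed(self, user, device, document):
        return (user, device, document) in self._grants


def open_document(server, user, device, document):
    """Release content only if the license server confirms the right."""
    if not server.is_allowed(user, device, document):
        raise PermissionError(f"no valid license for {document} on {device}")
    # A real product would decrypt and return the protected content here.
    return f"contents of {document}"


server = LicenseServer()
server.grant("exec", "laptop-17", "q3-report")
opened = open_document(server, "exec", "laptop-17", "q3-report")

# The executive reports the laptop lost; the security officer revokes it.
server.revoke_device("laptop-17")
still_licensed = server.is_allowed("exec", "laptop-17", "q3-report")
```

Because the right is checked at open time rather than at download time, the copy already sitting on the lost laptop stops being usable once the revocation is recorded.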
Unstructured data plays a major role in an organization, and utilizing it correctly can improve processes, save lives, and directly enhance the bottom line. As such, the need to capture unstructured data and make it available to other parts of the organization must be an important part of any organization’s data strategy.
Majid Abai is the chief executive of Pringo, an enterprise-class social networking development platform located in Los Angeles, CA. With over 25 years of technology and technology management experience, Majid has focused on delivery of high-level enterprise information management technologies to various groups of clients. He can be reached at firstname.lastname@example.org and @MajidAbai on Twitter.
About the Author
Larissa Moss is president of Method Focus Inc., and a senior consultant for the BI Practice at the Cutter Consortium. She has 27 years of IT experience, focused on information management. She frequently speaks at conferences worldwide on the topics of data warehousing, business intelligence, master data management, project management, development methodologies, enterprise architecture, data integration, and information quality. She is widely published and has co-authored the books Data Warehouse Project Management, Impossible Data Warehouse Situations, Business Intelligence Roadmap, and Data Strategy. Her present and past associations include Friends of NCR-Teradata, the IBM Gold Group, the Cutter Consortium, DAMA Los Angeles Chapter, the Relational Institute, and Codd & Date Consulting Group. She was a part-time faculty member at the Extended University of California Polytechnic University Pomona, and has been lecturing for TDWI, the Cutter Consortium, MIS Training Institute, Digital Consulting Inc. and Professional Education Strategies Group, Inc. She can be reached at email@example.com
References and Additional Reading
Adelman, Sid, Larissa Moss and Majid Abai. Data Strategy. Upper Saddle River, NJ: Addison-Wesley, 2005.
Rockley, Ann, et al. Managing Enterprise Content. Indianapolis, IN: New Riders Publishing, 2003.
Reimer, James, Ph.D., Chief Architect. “Enterprise Content Management Products.” IBM, Content Management and WebSphere Portal Technical Conference, 9 June 2003.