Building an Ontology, or: Don’t Put the Cart Before the Horse

I got so excited about creating a linked data set that I tried to jump right into populating a triple store…whoops! As I was reading about the various triple store systems and serializations of RDF, it became clear that, even though this is intended to be an iterative process, the data was not yet up to snuff for even an initial conversion to RDF. I had basically just copied all the information straight from our bucket diagrams into an Excel spreadsheet without really thinking about how the data should be modeled. Here’s a sample from the first pass at data-gathering:

Initial Data from Bucket Diagrams as a Spreadsheet

You can see how I tried to go about creating columns for each attribute of a given bucket (What content type does it belong in? Does it include born-digital content? How about digitized?). There are many, many more columns not shown in this picture, each representing some aspect of the conventions in our diagrams (there’s a sample diagram here if you missed that post or forgot what they look like). The diagrams, which are inherently simple visual representations, couldn’t be too detailed or they would quickly have become cluttered. In the data, however, we’re free to be much more specific and, most importantly, flexible. For example, rather than listing born-digital and digitized as two separate properties with yes/no statements, we can have a field for “creation method” that may include born-digital, digitized, and/or some additional yet-to-be-determined method.
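
To make that concrete, here’s a minimal sketch of the idea using Python’s rdflib library. The namespace, class, and property names (dcm:, Bucket, creationMethod, and so on) are placeholders I’ve made up for illustration, not the actual terms we’ll end up using:

```python
# A made-up example: one bucket with two "creation method" statements,
# rather than separate yes/no born-digital and digitized columns.
from rdflib import Graph, Namespace, RDF

DCM = Namespace("http://example.org/dcm#")  # placeholder namespace

g = Graph()
g.bind("dcm", DCM)

bucket = DCM.ExampleBucket
g.add((bucket, RDF.type, DCM.Bucket))

# Each creation method is its own triple, so a third method could be
# added later without restructuring anything.
g.add((bucket, DCM.creationMethod, DCM.BornDigital))
g.add((bucket, DCM.creationMethod, DCM.Digitized))

print(g.serialize(format="turtle"))  # returns a str in rdflib 6+
```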

Realizing that I needed a better way to approach this process, I backtracked a few steps to the data modeling phase, which led me to some Google searching for ontologies and ontology creation. The results were not pretty, but through a lot of twists and turns I ended up at the website of Stanford’s Protégé project, which not only develops and hosts the fantastic Protégé software (and web service!) for ontology development, but also provides excellent tutorials on creating and working with ontologies. Jackpot!

So without further ado, I started reading through their Ontology Development 101 tutorial, which is written as though the reader has no prior ontology-creation experience. Just my speed. By following the process they outlined, I’m well on my way to creating a class hierarchy for a digital content management ontology, to be shared in future posts. I also found several existing ontologies that I can pull from…no need to reinvent the wheel. It looks like there are definitely terms for a lot of the classes, entities, and properties we need, but there is also information we’re modeling that doesn’t seem to exist in a current ontology. At least not that I could find. Here are the most useful related ontologies I’ve found so far. If you know of others I should look at, please share in the comments!

Following the Protégé tutorial’s step-by-step process has been incredibly useful. It very clearly describes how to break down the information in terms of entities with properties and relationships to each other, as opposed to a flat table. In my next couple of posts I’ll talk about the classes that have emerged to describe our digital content, and some of the initial properties we’re interested in recording.
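
To give a flavor of that difference, here’s a hypothetical sketch, again using rdflib with made-up names, of a tiny class hierarchy plus one bucket described as an entity with relationships to other entities rather than as a flat spreadsheet row. The real classes and properties will be the subject of those upcoming posts:

```python
# Hypothetical sketch only: a small RDFS class hierarchy and one bucket
# linked to other entities, instead of a flat row of columns.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

DCM = Namespace("http://example.org/dcm#")  # placeholder namespace

g = Graph()
g.bind("dcm", DCM)

# A minimal class hierarchy: specific content types as subclasses of a
# general ContentType class, with Bucket as a separate class.
g.add((DCM.ContentType, RDF.type, RDFS.Class))
g.add((DCM.Bucket, RDF.type, RDFS.Class))
g.add((DCM.DigitalVideo, RDFS.subClassOf, DCM.ContentType))
g.add((DCM.ResearchData, RDFS.subClassOf, DCM.ContentType))

# One bucket described through relationships to other entities.
bucket = DCM.ExampleBucket
g.add((bucket, RDF.type, DCM.Bucket))
g.add((bucket, RDFS.label, Literal("An example bucket")))
g.add((bucket, DCM.belongsToContentType, DCM.DigitalVideo))
g.add((bucket, DCM.managedBy, DCM.ExampleDepartment))
```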

Digital Content Management Initiative: It’s All About the Data

One of the major projects I’m working on in the fellowship is the Libraries’ Digital Content Management initiative, or DCM for short. This project started as a library-wide initiative in fiscal year 2013, focused on reviewing the MIT Libraries’ digital content to ensure that we are able and ready to manage it over time. The digital content review process we followed was developed and led by Nancy McGovern, Head of Curation and Preservation Services, and coordinated by a Life Cycle Group representing several departments across the Libraries. There’s more information about the DCM initiative on our Life Cycle Management libguide, and we’ll be posting updates on the process and results of year one soon. In the meantime, I wanted to share some work I’m doing as part of the next phase of the project now that FY13 is over.

The first part of the digital content review involved completing overviews of nine different content types managed by the Libraries: architectural and design documentation, digital audio, digital video, e-books, faculty archives, geospatial information, research data, theses, and web content. We also identified three additional content types to review in the coming year: institute records, scholarly record, and visual resources. In the process of creating these overviews, we gathered a lot of information about each content type. For the first phase of the project, that information was compiled into written overview documents and visualized as diagrams showing each of the buckets, or categories of items, within a content type. Here’s an example for our Architectural and Design Documentation content type:

Architectural and Design Documentation Bucket Diagram

These diagrams and overview reports have been very helpful in giving us a bird’s-eye view of all the digital content the Libraries manage (or may manage in the future). For the next phase of the project, we want to take all this information and turn it into a data set that can be queried, visualized, and added to as it changes over time. As you can see, there’s a lot of data already there; for example, we have information on who currently manages the content, how much content we have, whether it’s born-digital or digitized, and much more in the corresponding overview document.

Before we can create a data set from all this information, however, we have to decide how we want to model the data. I’ve been reading a lot about linked data and RDF (short for Resource Description Framework) over the past few years, and this data seems like it might be a good fit for the graph model for a few reasons:

  1. we need flexibility, as we don’t know yet how much more data we’ll gather or what shape it will take as we go
  2. it will allow us to easily link to existing digital preservation standards and services such as the PREMIS Preservation Metadata Ontology and the Unified Digital Format Registry (UDFR), as sketched just after this list
  3. one of our primary interests is exploring the relationships between the content types, a task for which linked data is well-suited
  4. using a web-based method for managing this information could simplify future sharing and data exchange
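
On that second point, here’s a rough sketch of what linking to an external vocabulary might look like, again in rdflib. The PREMIS namespace URI and the premis:Object term are written from memory and should be checked against the published ontology; the dcm: terms are still placeholders:

```python
# Rough sketch: binding an external preservation vocabulary alongside
# our own placeholder namespace and pointing a local class at it.
from rdflib import Graph, Namespace, RDF, RDFS

DCM = Namespace("http://example.org/dcm#")  # placeholder namespace
# Assumed PREMIS OWL ontology namespace -- verify against the spec.
PREMIS = Namespace("http://www.loc.gov/premis/rdf/v1#")

g = Graph()
g.bind("dcm", DCM)
g.bind("premis", PREMIS)

# Point our local Bucket class at a (possibly) related PREMIS class.
g.add((DCM.Bucket, RDF.type, RDFS.Class))
g.add((DCM.Bucket, RDFS.seeAlso, PREMIS.Object))
```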

So the next steps in the DCM initiative present an opportunity for me to kill two birds with one stone: convert the qualitative visual and textual information we’ve compiled into a data set, and learn the process of creating and working with a triple store. This is definitely an exploratory undertaking. It may turn out that a simpler model, even a very basic relational database, will be better suited to this particular set of data. And I will definitely have to stretch my technology skills to learn how to create and query a triple store using whatever documentation I can find. But this is part of the joy of being a fellow: exploration and learning new skills are all part of the deal!
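
As a preview of the “query” half of that, here’s a toy example of loading a few triples into an in-memory rdflib graph (standing in for a real triple store) and asking it a question with SPARQL. As before, every term is a placeholder of my own invention:

```python
# Toy example: an in-memory graph standing in for a triple store,
# queried with SPARQL for buckets that include born-digital content.
from rdflib import Graph, Namespace, RDF

DCM = Namespace("http://example.org/dcm#")  # placeholder namespace

g = Graph()
g.bind("dcm", DCM)
g.add((DCM.ExampleBucket, RDF.type, DCM.Bucket))
g.add((DCM.ExampleBucket, DCM.creationMethod, DCM.BornDigital))

results = g.query("""
    PREFIX dcm: <http://example.org/dcm#>
    SELECT ?bucket WHERE {
        ?bucket a dcm:Bucket ;
                dcm:creationMethod dcm:BornDigital .
    }
""")
for row in results:
    print(row.bucket)  # prints http://example.org/dcm#ExampleBucket
```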

I’m going to share all the details of the process as I go, so feel free to follow along on the blog, make suggestions when I get stuck, and/or laugh (in a friendly way, I hope) when I make silly newbie mistakes. It’s going to be a very interesting project.