I got so excited about creating a linked data set that I tried to jump right into populating a triple store…whoops! As I was reading about the various triple store systems and serializations of RDF, it became clear that, even though this is intended to be an iterative process, the data was not yet up to snuff for even an initial conversion to RDF. I had basically just copied all the information straight from our bucket diagrams into an Excel spreadsheet without really thinking about how the data should be modeled. Here’s a sample from the first pass at data-gathering:
You can see how I tried to go about creating columns for each attribute of a given bucket (What content type does it belong in? Does it include born-digital content? How about digitized?). There are many, many more columns not shown in this picture, each representing some aspect of the conventions in our diagrams (there’s a sample diagram here if you missed that post or forgot what they look like). The diagrams, which are inherently simple visual representations, couldn’t be too detailed or they would have quickly gotten very cluttered. In the data, however, we’re free to be much more specific and, most importantly, flexible. For example, rather than listing born-digital and digitized as two separate properties with yes/no statements, we can have a field for “creation method” that may include born-digital, digitized, and/or some additional yet-to-be-determined method.
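To make that flexibility concrete, here’s a rough sketch of how a multi-valued “creation method” could be expressed as triples, using Python’s rdflib (just one option for building RDF; I haven’t settled on tooling). The namespace, class, and property names here are placeholders I made up for illustration, not a finished vocabulary.

```python
# Purely illustrative: the ex: namespace and every class/property name below
# are placeholders, not our actual vocabulary.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/digital-content/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

bucket = EX["bucket/student-photographs"]  # a made-up bucket
g.add((bucket, RDF.type, EX.Bucket))

# One "creation method" property, repeated as many times as needed --
# easy to extend later with some yet-to-be-determined method.
g.add((bucket, EX.creationMethod, EX.bornDigital))
g.add((bucket, EX.creationMethod, EX.digitized))

print(g.serialize(format="turtle"))
```

The nice part is that adding a third creation method later is just one more triple; there’s no new yes/no column to retrofit across the whole spreadsheet.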
Realizing that I needed a better way to approach this process, I backtracked a few steps to the data modeling phase, which led me to some Google searching for ontologies and ontology creation. The results were not pretty, but through a lot of twists and turns I ended up at Stanford’s Protégé website. The Protégé team not only develops and hosts the fantastic Protégé software (and web service!) for ontology development, but also provides excellent tutorials on creating and working with ontologies. Jackpot!
So without further ado, I started reading through their Ontology Development 101 tutorial, which is written as though the reader has no prior ontology-creation experience. Just my speed. By following the process they outlined, I’m well on my way to creating a class hierarchy for a digital content management ontology, to be shared in future posts. I also found several existing ontologies that I can pull from…no need to reinvent the wheel. There are definitely existing terms for many of the classes, entities, and properties we need, but some of the information we’re modeling doesn’t seem to be covered by any current ontology. At least not one that I could find. Here are the most useful related ontologies I’ve found so far (there’s a small sketch after the list of how their terms might mix with our own). If you know of others I should look at, please share in the comments!
- Dublin Core (DCMI Metadata Terms): http://dublincore.org/documents/dcmi-terms/
- FRBR (Functional Requirements for Bibliographic Records): http://vocab.org/frbr/core.html
- BIBO (Bibliographic Ontology): http://bibliontology.com/specification#sec-documentation
- AIISO (Academic Institution Internal Structure Ontology): http://vocab.org/aiiso/schema
- PREMIS (preservation metadata): http://id.loc.gov/ontologies/premis.html
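To give a feel for what reuse could look like, here’s a small, hypothetical sketch (again in Python with rdflib) that mixes Dublin Core terms with placeholder local terms for the pieces that don’t seem to be covered elsewhere; PREMIS and the other vocabularies above could be pulled in the same way. Everything in the `ex:` namespace is invented for illustration.

```python
# Illustrative only: dcterms: terms come straight from Dublin Core; the ex:
# namespace and its terms are placeholders for local terms not yet defined.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/digital-content/")  # hypothetical namespace

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("ex", EX)

item = EX["item/0001"]                                 # a made-up digital object
g.add((item, RDF.type, EX.DigitalContent))             # placeholder local class
g.add((item, DCTERMS.title, Literal("Campus aerial photograph")))
g.add((item, DCTERMS.created, Literal("1987")))
g.add((item, EX.creationMethod, EX.digitized))         # placeholder local property

print(g.serialize(format="turtle"))
```

Borrowing terms like dcterms:title and dcterms:created means anyone consuming the data can recognize those fields right away, and we only coin new terms where nothing suitable already exists.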
Following the Protégé tutorial’s step-by-step process has been incredibly useful. It very clearly describes how to break down the information in terms of entities with properties and relationships to each other, as opposed to a flat table. In my next couple of posts I’ll talk about the classes that have emerged to describe our digital content, and some of the initial properties we’re interested in recording.