PDF Analysis Tools: Part 2 of the PDF Experiment

A while back I posted about the PDF experiment I’m doing to explore some character encoding anomalies in our PDF e-theses. The project is complete now, but I wanted to outline some of the steps we went through before I jump right to what we found. Step one was analyzing the PDFs to figure out what, exactly, was going on underneath the hood. But in order to do that we actually needed a preliminary step, which was finding a tool that would help us do that analysis effectively.

This actually turned out to be more difficult than it seemed at first. The errors we were looking for aren’t related to the common issues digital preservationists are most often concerned with, such as format identification and validation. Tools like JHOVE weren’t quite right for our purposes, because the errors we encountered were all deep within the content stream. We had plenty of perfectly valid PDFs (checked against JHOVE just to be sure) that would render fine in a PDF reader, but the text came out with unusual characters if copied and pasted into another document (any other document, we tried all the text editors we could find).

So our challenge was to find a tool that would help us analyze the text encoding within the content stream. It turned out Adobe Acrobat Pro was the most useful tool for manually inspecting the errors, since its preflight function allowed us to view the actual content stream encoding at the point in the document where we had a known error, such as a ligature that wouldn’t convert to its correct ASCII characters on copy-and-paste. However, this required that we already know the PDF had said error, and what we really wanted was a tool that would help us identify PDFs with those errors from a larger collection of PDF e-theses.

We tested 16 tools to see if any would solve this problem for us. Questions we asked as we tested each tool included:

  • Can the tool accurately and reliably diagnose the problems we are working on?
  • Does it have repair capabilities, or is it diagnostic-only?
  • Can it be run as a batch process (from the command line) or does it require a GUI?
  • Does it run on Linux, Windows, Mac?
  • Can it report on whether the document meets archival standards like PDF/X, PDF/A etc.?
  • Who creates and maintains the tool?
  • Is there good documentation/support?
  • Is there a cost?
  • Is it open source?

We were hoping to not only find a tool to suit our needs, but one that was free and open source if possible. For each tool, we went through the following steps for initial testing:

  1. Install the tool on local machine or server.
  2. Run the tool on each of 32 sample PDF theses that represent the different types of problems we had encountered.
  3. Record results/output and comments about the tool’s functions in a shared spreadsheet.
  4. After testing the tool and getting a feel for how it works, record general observations and answers to the questions above on a wiki page for the project.

After going through this process for all 16 tools, we hadn’t found a tool that did exactly what we needed. But we found that there were two tools whose text extraction features could be used to serve our purposes. One, PDFBox, was able to work with the ligature issue for most of the PDFs, providing cleanly extracted text that resolved to its correct ASCII characters. This was helpful in two ways: one, it led us to believe that the issue we were dealing with was also one of rendering rather than strictly encoding (whatever the encoding is for those ligatures, it can be converted correctly by at least one program); and two, it gave us an option for extracting readable text for both cataloging and indexing in our repository.

The other tool, PDFMiner, helped us by, in a sense, not working: on extracting the text, every single PDF that we had identified as one with ligature issues showed those ligature issues in the extracted text. Therefore, we decided that whatever the encoding is that’s causing the problem, PDFMiner is not able to resolve it to the correct ASCII characters. While this might sound like a deficiency in the tool, for our needs it actually allowed us to create the exact tool we were looking for!

My colleague Rich Wenger wrote a series of scripts that extract text from a PDF using PDFMiner and then search for the anomalous characters that appear when known text encoding issues are present. This allowed us to run his scripts over the entire collection of e-thesis PDFs to see how many have underlying text encoding issues. After running these scripts, we determined that about 30% of the collection presents the ligature encoding error, and a much smaller percentage presents other errors such as excessive line breaks and extra characters inserted in the encoded text. In my next post, I’ll discuss possible implications of these results and some of the steps we’re taking to mitigate these issues in the future. Stay tuned!
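
For the curious, the general idea can be sketched in a few lines of Python. This is not Rich’s actual code, just a minimal illustration under some assumptions: it uses the pdfminer.six package (a maintained fork of PDFMiner) for extraction, and the set of "anomalous" characters it looks for (Unicode ligature code points, the replacement character, and stray control or private-use characters) is my guess at reasonable markers rather than the list we actually used. The names find_anomalies and scan_pdf are hypothetical.

```python
import unicodedata

# Assumed markers of a text-encoding problem; the real scripts may look for others.
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}  # U+FB00..U+FB06 (ff, fi, fl, ...)
REPLACEMENT = "\ufffd"                                  # Unicode replacement character
ALLOWED_CONTROLS = {"\n", "\r", "\t", "\f"}             # ordinary layout characters


def find_anomalies(text):
    """Return a list of (offset, character, reason) for suspicious characters."""
    hits = []
    for offset, ch in enumerate(text):
        if ch in LIGATURES:
            hits.append((offset, ch, "unresolved ligature"))
        elif ch == REPLACEMENT:
            hits.append((offset, ch, "replacement character"))
        elif unicodedata.category(ch) in ("Cc", "Co") and ch not in ALLOWED_CONTROLS:
            hits.append((offset, ch, "control or private-use character"))
    return hits


def scan_pdf(path):
    """Extract text from one PDF with pdfminer.six (assumed installed) and scan it."""
    from pdfminer.high_level import extract_text  # imported lazily on purpose
    return find_anomalies(extract_text(path))


# Demonstration on a plain string, so no PDF is needed to try it out.
sample = "The e\ufb03cient \x02eld"
for hit in find_anomalies(sample):
    print(hit)
```

On the sample string, the scan reports the ligature at offset 5 and the control character at offset 12. Running scan_pdf over a folder of theses and counting the files with at least one hit would give a rough percentage like the one described above.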


Preserving Your Family’s Digital Legacy: A Talk at the Winchester Historical Society

A few weeks ago I had the great pleasure of speaking at the Winchester Historical Society on preserving personal digital documents. It was great fun, with a very engaged and interested audience and tons of really good questions. I posted my slides and handout online, but since the slides aren’t really a complete representation of my talk on their own, I thought I’d also share my script for the talk here. And yes, in case you’re wondering, I typically do actually write out a script for most of my professional presentations. I don’t usually follow it exactly, but the process helps me get my thoughts in order so I can present them in a cohesive way. So without further ado, here’s my talk. Many thanks to Nancy Schrock and the Winchester Historical Society board for inviting me to speak!

Keeping Memories Alive: Preserving Your Family’s Digital Legacy

My talk is titled “Keeping Memories Alive: Preserving Your Family’s Digital Legacy,” and I want to start off by talking about why this even matters. I don’t have to tell you that preserving your family’s legacy is important, but preserving their digital legacy? When I talk to people about what I do, they often ask “but why do we need to preserve digital stuff? It’s digital, it’s online, it’s there forever!” And that would be nice, but it simply isn’t true. If any of you still have working files that you originally created on this machine (see image below), you’re in the minority, and you probably don’t need to be at this talk.


Photo credit: jurvetson via Compfight cc

There are many ways digital records can be permanently lost. These include everyday accidents: you accidentally delete a file; you lose your digital camera, your cell phone, or the flash drive you were carrying around in your pocket; your computer crashes, etc. There are natural and human-caused disasters that can destroy the hardware that your digital files are stored on. There are security and privacy threats, both on a small scale, say, personal theft of your phone or computer, and on a larger scale in the form of attacks on service providers that you may use to store some of your digital assets. For example, you all may have heard in the news this week that there’s a major internet security vulnerability called Heartbleed that’s affecting pretty much the entire internet right now.

There’s also the potential for some of those same service providers to go out of business. This company is one example. Nirvanix was a cloud storage provider that went out of business in October of last year, after seven years of providing service. Fortunately, they contracted with IBM to help clients move their content over to a different storage provider, but unfortunately they only gave their customers two weeks’ notice before they shut down their servers and closed up shop. Let that be a warning about the realities of the tech business world.

Another threat is the rapid evolution of technology itself, which means that the hardware and software you use to access your digital files in 30 years will look and function nothing like the hardware and software you use today. The term we use for this is obsolescence…the technology you’re using now will almost certainly become obsolete in the relatively near future. And finally, perhaps the scariest-sounding concern is what we call bit rot, a process by which the actual 1’s and 0’s, the bits that make up every digital file, simply degrade over time. This can happen either as a result of alterations in the electric charge of stored memory, or as a result of the storage media itself decaying, which is common with CDs and DVDs.

But before anyone starts to panic, let me calm your fears and say that there are strategies to avoid, mitigate, and prepare for each of the threats I mentioned. And I’m going to share them with you! But first, I want to set the stage a little bit with a framework, an approach for dealing with digital content. Because one of the biggest challenges with digital content, what I hear from just about every person I talk to, and that I struggle with myself, is that there’s just so darn much of it! It can be really overwhelming to even figure out where to start when it comes to preserving digital records. They’re everywhere, and a part of just about everything we do.

So I’m going to introduce you to a concept we use in the library digital preservation field, called the life cycle model. This model outlines a cycle of stages in the process of preserving digital objects, and in many ways it’s actually very similar to the process you might go through when sorting, storing, and saving print materials, but it’s extra helpful when dealing with the huge quantity of digital stuff you’re likely to have. I’ve created a somewhat simplified version of this model that takes out some of the professional jargon.


Simplified Life Cycle Model


The cycle starts with creating the files, which may be done by you or by someone else, depending on what you’re trying to preserve. The next step is gathering all the digital content you know of so you have a starting point for assessment. Then there’s selection, and this is key, because the truth is you can’t save everything, and you probably wouldn’t want to even if you could, so you have to select what’s really important to preserve. Many files require some sort of processing before they’re ready for preservation. There’s organizing, which isn’t always fun but is really important, especially for digital files. Then you have to actually store the files, and just as with most things you care about and want to keep long-term, there’s periodic maintenance required to keep it in good shape. And finally, because this is a cycle, it doesn’t end…you’ll want to review the content you have to make sure it’s still worth keeping, and repeat the whole process for new digital content you create or acquire.

So now I’m going to go through each of these stages and talk about some of the important steps to take and things to consider when you’re making decisions about how to carry them out. Let’s walk through the life cycle model again in a little more detail. And as we do this, I’m going to try and point out particular considerations for some of the most common types of digital items people are usually concerned with: photos, documents, and videos.

Alright, so the first stage in the life cycle is creating the files. The biggest decision you can make at this stage is what file format you’re going to save your file in. For example, your digital camera probably has options for which file formats it will export, most likely JPEG and possibly TIFF or a raw image format. Your phone may have similar options for format and quality or resolution of photo and video output in its settings (note that the iPhone does not have settings you can change for format or image size), and of course you have many “save as” format options when saving documents. There are no hard and fast rules about what format to use, but there are some best practices. Whenever possible, you want to use open, non-proprietary formats because the specifications for those are maintained by standards organizations and are publicly available to anyone, which means they aren’t dependent on one company that owns and controls the format and may go out of business or stop updating the format to work with current technology.

You also, ideally, want a format that is uncompressed, because many compressed formats use what’s called lossy compression: the algorithms shrink the file by removing data from it, and that data, once removed, can’t be recovered. The exception is a special type of compression called lossless compression, which makes the file smaller without throwing anything away. You also want a file format that’s been around for a while and proven its utility, and one that is commonly used, because the more people there are using it, the less likely it is to become obsolete rapidly and without warning.
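
For anyone who likes to see this in action, here is a small Python sketch of what "lossless" means in practice. It uses the zlib module from Python’s standard library, which I’ve picked purely as a convenient example of a lossless compressor; the sample text is made up.

```python
import zlib

# Made-up sample data: repetitive text compresses well.
original = b"Dear family, " * 1000

compressed = zlib.compress(original)   # shrink the data
restored = zlib.decompress(compressed)  # expand it again

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after the round trip:", restored == original)
```

The restored data matches the original byte for byte, which is exactly what a lossy format such as JPEG or MP3 cannot promise.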

Common File Format Properties

* Proprietary doesn’t necessarily mean closed, but it does mean it’s owned by a private company.
** Audio and video formats are usually wrappers that can contain either compressed or uncompressed encodings of the content.

This may be more technical information than you need, but here are some common file formats and how they measure up on these characteristics. You’ll note that there’s no perfect file format; often popular formats are compressed formats, because they’re intended to be portable (such as MP3). You have to weigh the options and choose the file format that seems best for the content you have. For example, if you’re dealing with video files, you may choose to use a popular, high-quality, but compressed format because uncompressed video tends to be very large and it can be really expensive to store a lot of uncompressed video. There are a few other important considerations when creating files for long-term preservation. The resolution of images can impact the quality and size of the images if you ever want to print or reformat them. Similarly, the encoding of audio and video drastically impacts the quality, but again here there’s often a trade-off between the quality of the sound or image and the size of the file, which impacts the cost of storing it.

So, let’s say you’ve created a bunch of digital files, your friends and relatives have sent you a bunch of digital files, and you know you want to preserve at least some of them. What’s the next step? Collecting them all. This is actually trickier than it sounds. Files may be stored on your current computer, your previous computer, external hard drives, flash drives, CDs, websites you uploaded pictures or documents to, your email inbox as attachments, your other email inbox as yet more attachments, and elsewhere. Depending on how long you’ve been collecting content you might have things like this lying around in storage.

Unfortunately, I can’t tell you where all your digital content is, but it would be worth your time to sit down and really think about the various places you might have important digital materials stored. As you’re doing this, you might also think about non-digital items you have that would be better off converted to digital format. Certain home video formats, VHS tapes and Super 8 film, for example, can degrade fairly quickly under typical storage conditions, and the technology to view them is already obsolete and becoming hard to find. So if you have any of those you might want to pull them out and put them in a “reformatting” pile.

So once you’ve collected everything, whether literally into a pile or just mentally into a list, then you hit the very important stage of selection. You have to pick and choose which digital files you really want to preserve. Digital preservation is a process of ongoing management, so make sure that everything you choose to save is content you’re willing to spend time and money to preserve. Do you really need to keep 600 photos of the same sunset from your trip to Hawaii? If you’re a professional photographer, maybe, otherwise probably not.

When doing the selection, here are some general suggestions for things you might want to preserve: of course, anything that’s truly unique or irreplaceable. You probably have files that are part of your personal record, such as digital copies of birth certificates, marriage licenses, transcripts, etc. You probably also have files you’ll want to refer back to or use again, and here you may want to distinguish between files you want to make sure are preserved for short-term use, vs. files that you actually want to keep long-term. And finally, you’ll probably want to limit the items you invest time and resources into preserving to those you created and own. That means the albums and movies you bought from iTunes probably shouldn’t be your highest priority. They are replaceable, and in fact there may be some digital rights management in place that prevents you from reformatting those files or making more than a few copies.

And speaking of reformatting, that may be part of the next life cycle stage, which is processing. Once you’ve selected the digital files you want to preserve, you may encounter some that weren’t created in the ideal format for long-term preservation. Maybe you created them with sharing in mind, rather than preserving. Maybe someone else created them and gave them to you. Regardless, you may want to think about converting them (which goes back to the file format information I shared earlier).

Photo credit: old_skool_paul via Compfight cc


You might also want to consider encrypting certain files as you go. If you have documents that contain sensitive or confidential information, you can password protect them. Software like Microsoft Word and Excel will allow you to encrypt documents, Windows and Mac both have built-in encryption software, and there are also third-party tools that will allow you to password-protect files of any type. Note, however, that encryption is inherently risky, because if you forget the password you won’t be able to access the file! So if you choose to encrypt files, you definitely want to write down the password on paper and keep it in a safe place, separate from the digital storage location.

Organizing your files can include a number of steps, and it often starts with file naming. Using generic file names, such as a date or the sequential number that typically comes off your digital camera, can result in accidentally deleting or overwriting a file because you can easily end up with multiple files having the same name. You’ll be in much better shape if you use file names that are unique and descriptive of the file’s contents. However, you don’t want to be too descriptive because there are limits to file name length, and if you’re transferring files between devices or over a network sometimes the end of a file name can get cut off. So it’s also a good idea to put the most distinctive information at the beginning, or left side, of the file name, in case something does get cut off the end. Generally speaking, it’s also better to avoid spaces and special characters. Dashes and underscores are both ok, and you can use those instead of spaces in between words. And you probably know this already, but it’s important not to delete or change the file extension, because that can make it much harder to open the file down the road.
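
If you’re comfortable with a little scripting, these naming guidelines can be applied automatically. The Python sketch below is only an illustration: the function name safe_file_name and the 50-character limit are my own assumptions, not a standard, and the rule of keeping only plain letters, digits, dashes, and underscores will also strip accented letters, which may not be what you want.

```python
import re
from pathlib import Path

MAX_STEM_LENGTH = 50  # assumed limit, chosen only for illustration


def safe_file_name(name, max_stem_length=MAX_STEM_LENGTH):
    """Return a cleaned-up file name, leaving the file extension untouched."""
    path = Path(name)
    stem, extension = path.stem, path.suffix
    # Replace runs of whitespace with a single underscore.
    stem = re.sub(r"\s+", "_", stem.strip())
    # Drop everything except letters, digits, underscores, and dashes.
    stem = re.sub(r"[^A-Za-z0-9_-]", "", stem)
    # Trim from the right, so the start of the name survives.
    return stem[:max_stem_length] + extension


print(safe_file_name("Grandma's 80th birthday (Winchester).JPG"))
# prints: Grandmas_80th_birthday_Winchester.JPG
```

Because the trimming happens at the right-hand end, putting the most distinctive information first means it is the part that survives.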

In addition to naming your files well, you’ll want to organize them carefully in a file or directory structure, which just means how you name and organize the folders in which you keep your files. The folder structure should be simple and meaningful to you, but also easy to understand. Short, descriptive folder names are good, maybe categorized by the type of file such as photos, videos, and documents. Some people find organizing files by year to be helpful when dealing with financial or other documents. One thing to consider is whether your organizational structure would be clear to someone else trying to find something in it, and also whether it will make sense to you in five or ten years.

Another really useful tool you can use to organize your files for the future is tagging or embedding information – this is what we in the library field call metadata, which really just means, helpful information that tells you more about an item. You’re all familiar with this concept even if you don’t call it that. Here’s an example from the print world: my grandma would always carefully turn over every photograph she had developed and write, in pencil, the names of every person in the picture, the date the picture was taken, and the event happening, if there was one. That’s metadata! It’s really useful information to have, especially for non-textual materials like photos, audio, and video, where you can’t search the contents of the file like you can with a text document. But for reasons of length that I mentioned before, it’s way more information than you would want to try to cram into the file name.

Some common metadata elements include the date, subjects, location, purpose or event, and for documents, the version. Note that a lot of this can be automatically added by your device. The date and location, for example, are often automatically embedded in pictures taken on your phone or GPS-enabled camera. But you will have to add some information yourself. There are many tools that will allow you to embed metadata in files, and your handout includes links to a resource guide created by the MIT Libraries’ digital archivist that has a much more extensive list of such tools. But here are a few that will help you tag photos: Adobe Bridge, which is cross-platform, Apple iPhoto for Macs, and Windows Photo Gallery for PCs.

The next stage in the life cycle is storage, and this can be overwhelming because there are a lot of options. I’ll get to those options, but first, the absolute most important point I can make about storage is that redundancy is good! You want to have multiple copies of all your files, meaning identical copies on two or more separate storage devices. For very important files, more copies would be better. Six copies is the recommendation for institutional preservation, although that might be a little excessive for most personal files. When making your copies, put them on different types of storage media (and I’ll talk more about the preferred types of storage media in a minute). And finally, try to incorporate some kind of geographic distribution. If you can, have a copy in one location, say your house, and another copy at a trusted friend or family member’s house, or at work. This kind of redundancy is far and away the best prevention you have against loss due to theft, natural disaster, or failure of a particular piece of hardware.
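
One practical question is how to tell whether your copies really are identical. A common technique (my suggestion here, not something covered in the talk itself) is to compare checksums, which act like fingerprints of a file’s contents. The Python sketch below uses the standard hashlib module; the function names sha256_of and copies_match and the file names are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """True if every file in paths has exactly the same contents."""
    return len({sha256_of(p) for p in paths}) == 1


# Demonstration with two throwaway files standing in for two storage devices.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "photo_copy1.jpg"
    second = Path(folder) / "photo_copy2.jpg"
    first.write_bytes(b"example image data")
    second.write_bytes(b"example image data")
    print(copies_match([first, second]))  # True: the copies are identical

    second.write_bytes(b"example image dat4")  # simulate one damaged byte
    print(copies_match([first, second]))  # False: the copies now differ
```

If the copies ever stop matching, you know one of them has changed, and a third copy can help you work out which one is still good.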

External Hard Drives

Photo credit: Forrestal PL via Compfight cc

Now I’m going to talk about some of the different storage media options, and here are the factors that you might want to think about when you’re deciding between the various choices. Some are more expensive than others. Some might be less expensive but require more time invested into copying and maintaining files. Some options are better for very large amounts of data, but they may also require more technical expertise, so consider your own ability and comfort level with various technologies as you think about how to store files. The other thing to consider is the purpose of the file storage. There are some files that you may want to keep on your current computer, and always move them to your newest computer because you need ready access to them frequently. You may have other files that you want to save but don’t necessarily need ready access to, so those might do better stored somewhere other than your active computer.

There are a couple of local storage options, and the first of those is a spinning hard disk drive, which can be either internal or external to your computer. Hard disk drives or HDDs for short are relatively inexpensive and getting cheaper, but they can also be time intensive: you have to remember to attach them if they’re external drives, copy over the files, make redundant copies and send them to other locations, etc. External drives, as you may already know, are also great for backing up your entire operating system so you don’t lose user preferences and installed software if your computer crashes, and both Mac and Windows machines offer this feature. They can be a great, inexpensive choice for files that you really just need to archive and likely won’t need much ready access to. Note that hard disk drives are susceptible to mechanical failure because they have moving parts, which makes them somewhat less reliable than the other local storage option, which is flash drives.

Flash memory, or solid state drives, function the same way as hard disk drives, but they are faster to use and often come in smaller versions that can be useful for transporting content. You’re probably familiar with these as small jump drives. Because they don’t rely on magnetic, moving parts, they aren’t susceptible to mechanical failures in the same way that spinning disk drives are. However, they are susceptible to firmware bugs which can result in data loss, and the reliability tends to vary by manufacturer. They are also about 10 times more expensive, per amount of storage, than hard disk drives.

Now, just a quick note: Technically media such as CDs and DVDs can also be used for local storage, but the preservation community generally does not recommend them for long-term storage because they have inherent preservation issues. CDs and DVDs tend to have very short shelf lives, they’re easily damaged, and since they have a very small amount of storage space anyway, they’re both less reliable and less convenient than hard drives. They can be good for transporting content; they’re just not great for longer-term storage.

Another, increasingly popular option is to move digital content to the cloud, using the internet to transfer files to a storage infrastructure managed by a third party. Common cloud storage providers include Dropbox, Evernote and Google Drive for documents, and Carbonite, which is a cloud backup service. Cloud storage is usually more expensive than external drives for large amounts of data, although it can be very inexpensive for relatively small amounts of personal data, and the cost varies by service provider. The cost can also vary depending on the use, so some service providers will charge for uploading and downloading content, but the actual data storage cost is very cheap, so be aware of that. One benefit of the cloud storage option is that many services offer an automatic syncing process, which reduces the amount of time you will have to spend copying data.

There are a few other things to keep in mind with cloud storage. One is that you are relying on a third party, which has some inherent risks. This is likely not the best option for storing content that has very sensitive information, because there is the security risk of the company’s servers getting hacked. Always read the privacy policies to see what the company itself can do with regard to your personal digital content. It’s also important to consider that the company may not be around forever, so although they might have robust backup and recovery options with geographically distributed servers and wonderful security, you want to make sure you can access your files if and when you need them for as long as you need. Read the fine print and be aware of what their policies are, and consider that you probably also want to keep one or more local copies of content.

Ok, now we’re done with storage and we can move on to the next stage of the life cycle, which is maintaining. Unlike print materials, which will typically last a very long time if stored properly and left alone, digital content is safest when it gets used, because you can check to see whether it’s still working properly. This means spot checking files of a few different types to see if they’ll open, and playing a video or two to make sure it plays. It’s important to check your stored content occasionally, maybe once a year or so, and if you have any problems accessing the drive or its content, replace the storage media immediately from one of your other copies.

If you choose to use external drives, be aware that both HDDs and SSDs have limited shelf lives. This can vary widely by manufacturer, how frequently the drives are used, and what kind of storage conditions they are exposed to (neither does well in very high temperatures, for example), but you can expect that they will all fail at some point. So it’s important to be aware that you will have to copy your files over to new storage media periodically. This failure is part of the reason I suggested having multiple copies, because it is unlikely that multiple drives will fail at the exact same time, and the more copies you have, the less likely it is that they will all fail together. But regardless, you’ll want to replace your drives about once every five years, and it’s smart to stagger that so you’re not replacing all of your drives at once, but instead replacing one of the drives every 2-3 years.
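
To make the staggering concrete, here is a small worked example in Python. It assumes two drives, a five-year lifespan, and a 2014 starting point; the function name replacement_schedule is made up, and the one-time early replacement of the second drive is simply my way of setting up the stagger.

```python
def replacement_schedule(start_year, drives, lifespan=5, horizon=15):
    """Return sorted (year, drive) pairs with replacements spread out over time."""
    events = []
    for i, drive in enumerate(drives):
        # Shift each drive's cycle by an even share of the lifespan.
        offset = (lifespan * i) // len(drives)
        # The first drive waits a full lifespan; later drives are
        # replaced early once so that the cycles stop lining up.
        first = start_year + (offset or lifespan)
        for year in range(first, start_year + horizon + 1, lifespan):
            events.append((year, drive))
    return sorted(events)


for year, drive in replacement_schedule(2014, ["drive A", "drive B"]):
    print(year, "replace", drive)
```

With these numbers, drive B is replaced in 2016, 2021, and 2026 and drive A in 2019, 2024, and 2029, so you are buying one new drive every two to three years and no drive is ever older than five years.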

And finally, the last stage of the life cycle is reviewing, or what we in the library world affectionately refer to as “weeding”. Digital storage isn’t cheap, especially since you have to store multiple copies, so it’s worth going carefully through your files every now and then to make sure you actually want to keep everything you have. There may be financial records, for example, that you just don’t need to keep forever and that you actually might not want to keep around for security reasons.

Preserved! Photo credit: Lisa Ouellette via Compfight cc


And then you’re done! And all your digital content will be safe forever and you’ll never have to think about it again! No, I’m just kidding. Now you’re ready to start the cycle over, with creating and gathering the next round of important digital content that you want to preserve. I know it probably seems like a lot of work, but the digital record we save is the memory we’ll pass down to the next generations, so it’s worth it to take the time and make sure we give them something of value, and that we have something to give them at all.

LaTeX and Theses and Glyphs, Oh My! Part 1 of the PDF Experiment

One of these days I’m going to write some more detailed posts about other projects I’ve mentioned in the past…but not today. Today I’m going to introduce yet another project I’m working on, although one that is about to wrap up. The project is an analysis of text encoding errors in PDF theses submitted to the Libraries for deposit into our DSpace@MIT repository. It’s just one component of a larger set of life cycle experiments currently going on as part of the Digital Content Management Initiative. There will be much more to share about the life cycle experiments over the next year, so more posts to come on that.

In the meantime, a little background on the PDF theses: graduating students here submit paper theses as the official copy of record for the Institute, which are accessioned into the Institute Archives for preservation. We scan the theses and provide electronic access to them via DSpace@MIT. However, if students wish to provide an electronic version of their theses in PDF format, in addition to the required paper copy, they may do so. After some checking to make sure the electronic version matches the official print copy, we can use the e-submission as the online access copy, thus saving us the trouble of scanning it from print.

All well and good; however, our meticulous catalogers have noticed some interesting problems appearing in a few of the electronically-submitted PDF theses (locally known as e-theses) over time. Often, these problems were encountered when attempting to copy and paste the text of a thesis abstract from the PDF document into another program, such as the library catalog. Certain character combinations tended not to copy over so well; they would paste into the new program as symbols or blank spaces instead of the expected letters.

Many of the catalogers observed that these characters were often ligatures in the original text, including such common letter combinations as “ff”, “fi”, and “th”. It was unclear what, exactly, was causing this to happen, but there was a suspicion that this problem stemmed from PDFs created by LaTeX applications. For those not familiar with LaTeX (I wasn’t, when I started this project) it’s a suite of typesetting tools built on the TeX typesetting system. This system is often used to write academic papers in the scientific and mathematical disciplines because it can handle typesetting of complex math formulas more easily than word processing programs. Consequently, it is very popular here at MIT.
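To make the symptom concrete, here is a minimal sketch of how pasted text could be scanned for characters that stand in for ordinary letters. It is not the method used in the project and it says nothing about the cause inside the PDF. It assumes the damage shows up either as Unicode's Latin ligature code points (U+FB00 to U+FB06), as control, private-use, or unassigned characters, or as the replacement character. The function name and the reason labels are made up for illustration.

```python
import unicodedata


def suspicious_characters(text):
    """Return (index, character, reason) for characters that look like paste damage.

    Hypothetical helper: the three checks below are assumptions about how
    the damage might appear, not a diagnosis of any particular PDF.
    """
    findings = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if "\ufb00" <= ch <= "\ufb06":
            # Latin ligatures (ff, fi, fl, ffi, ffl, ...) encoded as single code points
            findings.append((i, ch, "ligature code point"))
        elif ch == "\ufffd":
            findings.append((i, ch, "replacement character"))
        elif category in ("Cc", "Co", "Cn") and ch not in "\n\r\t":
            findings.append((i, ch, "control, private-use, or unassigned character"))
    return findings


def expand_ligatures(text):
    """Expand ligature code points into plain letters using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```

For example, `suspicious_characters("di\ufb03cult")` flags the single "ffi" character at index 2, and `expand_ligatures` turns the same string into "difficult". Note that NFKC also rewrites other compatibility characters, so it is a blunt tool, and it cannot recover letters that were pasted as control characters or blanks.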

My job on this project, in coordination with my colleague Rich Wenger, is to identify and characterize the types of problems encountered in our PDF theses, identify the source of these problems when possible, and recommend solutions for avoiding such problems in future e-thesis submissions. We’ve been working on the analysis for several months now, and while we aren’t quite done yet, we have learned a lot about PDF documents, TeX/LaTeX, and the challenges of representing text in a visual format. In part 2, I’ll discuss the process we went through to analyze the PDF documents and develop methods for characterizing the text issues we encountered. Stay tuned!

Building an Ontology, or: Don’t Put the Cart Before the Horse

I got so excited about creating a linked data set that I tried to jump right into populating a triple store…whoops! As I was reading about the various triple store systems and serializations of RDF, it became clear that, even though this is intended to be an iterative process, the data was not yet up to snuff for even an initial conversion to RDF. I had basically just copied all the information straight from our bucket diagrams into an Excel spreadsheet without really thinking about how the data should be modeled. Here’s a sample from the first pass at data-gathering:

Initial Data from Bucket Diagrams as a Spreadsheet

You can see how I tried to go about creating columns for each attribute of a given bucket (What content type does it belong in? Does it include born-digital content? How about digitized?). There are many, many more columns not shown in this picture, each representing some aspect of the conventions in our diagrams (there’s a sample diagram here if you missed that post or forgot what they look like). The diagrams, which are inherently simple visual representations, couldn’t be too detailed or they would have quickly gotten very cluttered. In the data, however, we’re free to be much more specific and, most importantly, flexible. For example, rather than listing born-digital and digitized as two separate properties with yes/no statements, we can have a field for “creation method” that may include born-digital, digitized, and/or some additional yet-to-be-determined method.
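The move from yes/no columns to a single “creation method” field can be sketched in a few lines. This is only an illustration under assumptions: the column names (`born_digital`, `digitized`) and the sample row are hypothetical, not the actual spreadsheet headers.

```python
# Hypothetical mapping from a creation method to the yes/no column that records it.
# Supporting a new method later means adding one entry here, not a new property.
METHOD_COLUMNS = {
    "born-digital": "born_digital",
    "digitized": "digitized",
}


def creation_methods(row, method_columns=METHOD_COLUMNS):
    """Collapse yes/no columns of one spreadsheet row into a set of creation methods."""
    return {
        method
        for method, column in method_columns.items()
        if str(row.get(column, "")).strip().lower() == "yes"
    }


# Invented example row, standing in for one line of the spreadsheet.
example_row = {"bucket": "Example bucket", "born_digital": "Yes", "digitized": "no"}
print(creation_methods(example_row))
```

Because the result is a set, a bucket can carry one method, several, or none, which is the flexibility the flat yes/no columns lacked.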

Realizing that I needed a better way to approach this process, I backtracked a few steps to the data modeling phase, which led me to some Google searching for ontologies and ontology creation. The results were not pretty, but through a lot of twists and turns I ended up at Stanford’s Protégé website, which not only develops and hosts the fantastic Protégé software (and web service!) for ontology development, but also provides excellent tutorials on creating and working with ontologies. Jackpot!

So without further ado, I started reading through their Ontology Development 101 tutorial, which is written as though the reader has no prior ontology-creation experience. Just my speed. By following the process they outlined, I’m well on my way to creating a class hierarchy for a digital content management ontology, to be shared in future posts. I also found several existing ontologies that I can pull from…no need to reinvent the wheel. It looks like there are definitely terms for a lot of the classes, entities, and properties we need, but there is also information we’re modeling that doesn’t seem to exist in a current ontology. At least not that I could find. Here are the most useful related ontologies I’ve found so far. If you know of others I should look at, please share in the comments!

Following the Protégé tutorial’s step-by-step process has been incredibly useful. It very clearly describes how to break down the information in terms of entities with properties and relationships to each other, as opposed to a flat table. In my next couple of posts I’ll talk about the classes that have emerged to describe our digital content, and some of the initial properties we’re interested in recording.

Digital Content Management Initiative: It’s All About the Data

One of the major projects I’m working on in the fellowship is the Libraries’ Digital Content Management initiative, or DCM for short. This project started as a library-wide initiative in fiscal year 2013, focused on reviewing the MIT Libraries’ digital content to ensure that we are able and ready to manage it over time. The digital content review process we followed was developed and led by Nancy McGovern, Head of Curation and Preservation Services, and coordinated by a Life Cycle Group representing several departments across the Libraries. There’s more information about the DCM initiative on our Life Cycle Management libguide, and we’ll be posting updates on the process and results of year one soon. In the meantime, I wanted to share some work I’m doing as part of the next phase of the project now that FY13 is over.

The first part of the digital content review involved completing overviews of nine different content types managed by the Libraries: architectural and design documentation, digital audio, digital video, e-books, faculty archives, geospatial information, research data, theses, and web content. We also identified three additional content types to review in the coming year: institute records, scholarly record, and visual resources. In the process of creating these overviews, we gathered a lot of information about each content type. For the first phase of the project, that information was compiled into written overview documents and visualized as diagrams showing each of the buckets, or categories of items, within a content type. Here’s an example for our Architectural and Design Documentation content type:

Architectural and Design Documentation Bucket Diagram

These diagrams and overview reports have been very helpful for us to get a bird’s-eye view of all the digital content the Libraries manage (or may manage in the future). For the next phase of the project, we want to take all this information and turn it into a data set that can be queried, visualized, and added to as it changes over time. As you can see, there’s a lot of data already there; for example, we have information on who currently manages the content, how much content we have, whether it’s born-digital or digitized, and much more in the corresponding overview document.

Before we can create a data set from all this information, however, we have to decide how we want to model the data. I’ve been reading a lot about linked data and RDF (short for Resource Description Framework) over the past few years, and this data seems like it might be a good fit for the graph model for a few reasons:

  1. we need flexibility, as we don’t know yet how much more data we’ll gather or what shape it will take as we go
  2. it will allow us to easily link to existing digital preservation standards and services such as the PREMIS Preservation Metadata Ontology and the Unified Digital Format Registry (UDFR)
  3. one of our primary interests is exploring the relationships between the content types, a task for which linked data is well-suited
  4. using a web-based method for managing this information could simplify future sharing and data exchange
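To give a feel for the graph model behind the reasons above, here is a minimal sketch that stores statements as subject, predicate, object triples in plain Python, with no triple store or RDF library involved. Every identifier in it (the bucket names, the `partOf`, `managedBy`, and `creationMethod` predicates, the department) is invented for illustration and is not drawn from PREMIS, UDFR, or the actual DCM data.

```python
# Hypothetical triples: every identifier here is invented for illustration.
TRIPLES = [
    ("bucket:scanned-theses", "partOf", "type:theses"),
    ("bucket:e-theses", "partOf", "type:theses"),
    ("bucket:lecture-recordings", "partOf", "type:digital-video"),
    ("bucket:scanned-theses", "managedBy", "dept:example-a"),
    ("bucket:e-theses", "managedBy", "dept:example-a"),
    ("bucket:lecture-recordings", "managedBy", "dept:example-a"),
    ("bucket:e-theses", "creationMethod", "born-digital"),
    ("bucket:scanned-theses", "creationMethod", "digitized"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def content_types_managed_by(triples, manager):
    """Follow two relationships: manager -> buckets -> content types."""
    buckets = {s for s, _, _ in match(triples, predicate="managedBy", obj=manager)}
    return {o for s, _, o in match(triples, predicate="partOf") if s in buckets}


print(content_types_managed_by(TRIPLES, "dept:example-a"))
```

Adding a new kind of fact is just another tuple with a new predicate, with no table to restructure, and relationship questions become chains of pattern matches. A real triple store would use URIs for these names and a query language such as SPARQL, but the shape of the data is the same.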

So the next steps in the DCM initiative present an opportunity for me to kill two birds with one stone: convert the qualitative visual and textual information we’ve compiled into a data set, and learn the process of creating and working with a triple store. This is definitely an exploratory undertaking. It may turn out that a simpler model, even a very basic relational database, will be better suited to this particular set of data. And I will definitely have to stretch my technology skills to learn how to create and query a triple store using whatever documentation I can find. But this is part of the joy of being a fellow: exploration and learning new skills are all part of the deal!

I’m going to share all the details of the process as I go, so feel free to follow along on the blog, make suggestions when I get stuck, and/or laugh (in a friendly way, I hope) when I make silly newbie mistakes. It’s going to be a very interesting project.

It’s Conference Season!

Conference season (aka summer) is upon us, and May was the big month of professional development for me. It started off with a regional National Digital Stewardship Alliance meeting here in Boston, followed by the DigCCurr Professional Institute in Chapel Hill, NC, and ending with the American Institute for Conservation Annual Meeting in Indianapolis! Here’s a brief summary of each one:

The NDSA meeting was a one-day unconference-style event held at the beautiful WGBH studios in Brighton. A group of digital preservation specialists and interested colleagues from around the New England area got together to talk about the challenges we’re facing and identify ways we can collaborate to solve some of our problems. We started off with short presentations on new initiatives and local efforts in the morning and then split up into groups for discussions in the afternoon. Discussion topics included marketing and outreach to the community, preserving research data, staffing and skills for digital preservation, and digital forensics. The day’s agenda and notes from our afternoon discussions are posted here. We hope to have more of these in the future, so if you’re in the New England area, stay tuned for info on those. Thanks so much to WGBH and Harvard Library for organizing the first one!

The DigCCurr Professional Institute was a week-long workshop on digital curation held at the lovely UNC Chapel Hill campus. This intensive week of training included presentations by an excellent set of instructors, practical labs where we could put the lessons learned into action, and lots of opportunity for conversations with the other digital library practitioners in attendance. We had a great cohort of folks from libraries all over the U.S. and Canada, and it was both a valuable learning experience and a rollicking good time! The week culminated in each attendee selecting and planning a project to complete at his or her home institution. We’ll all return to Chapel Hill in January to report back on our projects and share updates about how we implemented strategies learned in the first session. My six-month project is a review of the preservation metadata in the MIT Libraries DSpace@MIT repository, to clarify and improve our alignment with PREMIS. I’ll be working on that a lot between now and January, so expect updates on the blog!

The final week of May took me to Indiana for the AIC Annual Meeting. I’ve been to AIC several times before, but this year was my first time attending as Chair of the Electronic Media Specialty Group, which meant I was much more involved than in years past. Perhaps I’m a bit biased, but I thought the EMG sessions this year were truly excellent! Some of the highlights included talks about mass migration of media, conserving custom electronic video art equipment, and a new topic of interest for me: documenting source code. Of course, now that the Annual Meeting is over we’re already hard at work planning next year’s sessions. The theme is “sustainable choices for collection care,” which should be very interesting to explore in relation to digital preservation.

I’m so grateful to the MIT Libraries for supporting me in attending each of these events. These experiences have already contributed much to my thinking, project planning, and research interests for the rest of the fellowship and beyond. I can guarantee there will be more posts on some of these threads in the near future!

Preservation Week: A Recap

Preservation Week, in case you aren’t familiar with it, is the time of year when preservation departments across the country emerge from their basement labs and take over the libraries! Ok, that might be a bit of an exaggeration. In reality, Preservation Week was started by the American Library Association in 2010 so libraries and archives could highlight the wonderful work they do to preserve our nation’s cultural heritage. I’ve been involved in planning Preservation Week events since year one, so when Nancy asked me to coordinate this year’s activities for the MIT Libraries, I said, “sure, no sweat!”

Well, maybe a little sweat, but it was made much easier by Ann Marie Willer and Andrew Haggerty from Curation and Preservation Services, who put in extra effort to help me get everything together. This year, we wanted to highlight the three different areas that CPS is responsible for: digital curation, analog preservation, and “hybrid” work that crosses the border between the two (which often involves reformatting analog collections into digital). We also wanted to step outside our own department and collaborate with folks in other areas of the MIT Libraries, as well as preservation folks outside our organization. We pulled together a great lineup of events that included all those things, and you can read the details about each of our events here.

What I want to talk about, though, is the process of putting a week like this together. How did we get from “let’s celebrate Preservation Week” to “we have four events planned, advertised, and ready to go”? It’s definitely a big job, but it helps to break it down into steps. We started with brainstorming possible ideas for events, an effort to which the entire CPS department contributed. Then a small group of us went through the loooong list and prioritized, noting which events we were most excited about, which events we could realistically pull together in a few short months, and which events simply weren’t going to happen this year. Once we had an outline of what we wanted to do, we started contacting the folks we hoped to collaborate with to make the events happen. This year, that included several MIT librarians, as well as MIT senior Shannon Taylor and Conservation Scientist Katherine Eremin from Harvard’s Straus Center for Conservation and Technical Studies.

After all the speakers very kindly agreed to participate and we coordinated the dates and times so there weren’t overlapping events (not easy, since we were also keeping the libraries’ other planned events in mind), the rest was all logistics. We scheduled rooms, set up a webpage, wrote marketing blurbs, gathered pictures and information about our presenters, put up posters, and sent out information to listservs. I have to put in a big plug for the Libraries’ web team and marketing team here, because they were immensely helpful in getting our webpage up and coordinating marketing for all the events. In order to help us all keep track of the various little tasks we had to complete, I created a master task list for the group noting deadlines and responsible parties, which we could all access on a shared drive. We also helped each other stay on target with regular meetings to check in about what we’d done and what was left to do.

The week was a big success, with attendees at the events ranging from MIT students and staff to library professionals from other local institutions. We had a lot of fun, and all of our presenters were just amazing. There were, of course, a few small hiccups. One challenge was the unexpected overlap on Wednesday of Preservation Week with the memorial service for MIT Officer Sean Collier. Since we knew many people would not be able to attend our Preservation Week event due to this much larger event taking place on campus, we decided to hold a repeat session of that event two weeks later. Fortunately Kari Smith, our presenter for that day, was perfectly happy to do a second session. However, this unforeseen event was a reminder that you can’t plan for everything and it’s important to stay flexible.

I want to say a quick thank you to all our excellent Preservation Week speakers: Katherine Eremin, Peter Munstedt, Kari Smith, Shannon Taylor, and Ann Marie Willer. And another thank you to all the members of Curation and Preservation Services, the Web Team, and the Marketing Team for their help in getting ready. We’re happy we were able to share some preservation joy, and we can’t wait to do it again next year!

Welcome to MIT!


Me in front of my new workplace: Hayden Library at MIT

Hi there! I’m Helen Bailey, Fellow for Digital Curation and Preservation at MIT Libraries, and this is my story.

In 2012, MIT Libraries decided to initiate a new fellowship program to offer early-career library professionals an opportunity to gain experience and contribute to initiatives in an academic research library. They selected two program areas to house these fellows: Scholarly Publishing and Licensing, and Digital Curation and Preservation (that’s me!). In October of 2012, I moved to Cambridge and started on this new adventure.

My position is hosted by the Curation and Preservation Services department of the Libraries, which is responsible for curation and preservation of the Libraries’ collections in all formats, from paper-based books to electronic resources and everything in between. I’ll be working on a variety of projects focused mostly on the digital side of things, although some of my work (like Preservation Week planning, more on that soon) encompasses curation and preservation as a whole.

One of my goals for the fellowship is to share what I learn from this experience with others: the MIT community, the library and information community, my digital preservation/curation colleagues, and the world at large. This is a pretty unique opportunity to explore new ideas in an emerging and fast-paced area of librarianship, in a setting that houses some of the most innovative new technology developments in the world. Surely there are others who may be interested in hearing what the journey is like?

So, there you have it. This blog will chronicle my experiences in the two-year fellowship at MIT Libraries. I’ll talk about the things I’m learning, the projects I’m working on, the fun adventures I get to have on the MIT campus, and the post-graduate fellowship experience in general. If there are other things you’d like to hear about, please let me know!