LaTeX and Theses and Glyphs, Oh My! Part 1 of the PDF Experiment

One of these days I’m going to write some more detailed posts about other projects I’ve mentioned in the past…but not today. Today I’m going to introduce yet another project I’m working on, although one that is about to wrap up. The project is an analysis of text encoding errors in PDF theses submitted to the Libraries for deposit into our DSpace@MIT repository. It’s just one component of a larger set of life cycle experiments currently going on as part of the Digital Content Management Initiative. There will be much more to share about the life cycle experiments over the next year, so more posts to come on that.

In the meantime, a little background on the PDF theses: graduating students here submit paper theses as the official copy of record for the Institute, which are accessioned into the Institute Archives for preservation. We scan the theses and provide electronic access to them via DSpace@MIT. However, if students wish to provide an electronic version of their theses in PDF format, in addition to the required paper copy, they may do so. After some checking to make sure the electronic version matches the official print copy, we can use the e-submission as the online access copy, thus saving us the trouble of scanning it from print.

All well and good; however, our meticulous catalogers have noticed some interesting problems appearing in a few of the electronically submitted PDF theses (locally known as e-theses) over time. Often, these problems were encountered when attempting to copy and paste the text of a thesis abstract from the PDF document into another program, such as the library catalog. Certain character combinations tended not to copy over so well; they would paste into the new program as symbols or blank spaces instead of the expected letters.

Many of the catalogers observed that these characters were often ligatures in the original text, including such common letter combinations as “ff”, “fi”, and “th”. It was unclear what, exactly, was causing this to happen, but there was a suspicion that this problem stemmed from PDFs created by LaTeX applications. For those not familiar with LaTeX (I wasn’t, when I started this project) it’s a suite of typesetting tools built on the TeX typesetting system. This system is often used to write academic papers in the scientific and mathematical disciplines because it can handle typesetting of complex math formulas more easily than word processing programs. Consequently, it is very popular here at MIT.
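To make the ligature problem a little more concrete, here is a minimal Python sketch of one way pasted text could be checked for ligature characters. It assumes the PDF maps its ligature glyphs to Unicode's precomposed ligature code points (such as U+FB01 for "fi"); the function names and sample string are hypothetical, and this is not the method used in our analysis. If a glyph pastes as an unrelated symbol or a blank space, as the catalogers saw, normalization alone cannot recover the original letters.

```python
import unicodedata

# A few precomposed Latin ligature code points from Unicode's
# "Alphabetic Presentation Forms" block, with their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def find_ligatures(text):
    """Return (index, character, expansion) for each ligature code point found."""
    return [(i, ch, LIGATURES[ch]) for i, ch in enumerate(text) if ch in LIGATURES]


def expand_ligatures(text):
    """Replace ligature code points with plain letters using NFKC normalization.

    Note: NFKC also rewrites other compatibility characters (for example,
    superscript digits), so it may change more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical pasted text containing the "ff" and "fi" ligature characters.
sample = "A di\ufb00erent \ufb01nding"
print(find_ligatures(sample))
print(expand_ligatures(sample))
```

A check like this only covers ligatures that have their own Unicode code points. A "th" ligature is a font-specific glyph without a standard precomposed code point (as far as I know), so it would need a different kind of detection.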

My job on this project, in coordination with my colleague Rich Wenger, is to identify and characterize the types of problems encountered in our PDF theses, identify the source of these problems when possible, and recommend solutions for avoiding such problems in future e-thesis submissions. We’ve been working on the analysis for several months now, and while we aren’t quite done yet, we have learned a lot about PDF documents, TeX/LaTeX, and the challenges of representing text in a visual format. In part 2, I’ll discuss the process we went through to analyze the PDF documents and develop methods for characterizing the text issues we encountered. Stay tuned!

Building an Ontology, or: Don’t Put the Cart Before the Horse

I got so excited about creating a linked data set that I tried to jump right into populating a triple store…whoops! As I was reading about the various triple store systems and serializations of RDF, it became clear that, even though this is intended to be an iterative process, the data was not yet up to snuff for even an initial conversion to RDF. I had basically just copied all the information straight from our bucket diagrams into an Excel spreadsheet without really thinking about how the data should be modeled. Here’s a sample from the first pass at data-gathering:

Initial Data from Bucket Diagrams as a Spreadsheet


You can see how I tried to go about creating columns for each attribute of a given bucket (What content type does it belong in? Does it include born-digital content? How about digitized?). There are many, many more columns not shown in this picture, each representing some aspect of the conventions in our diagrams (there’s a sample diagram here if you missed that post or forgot what they look like). The diagrams, which are inherently simple visual representations, couldn’t be too detailed or they would have quickly gotten very cluttered. In the data, however, we’re free to be much more specific and, most importantly, flexible. For example, rather than listing born-digital and digitized as two separate properties with yes/no statements, we can have a field for “creation method” that may include born-digital, digitized, and/or some additional yet-to-be-determined method.
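As a rough illustration of that last point, here is a small Python sketch that converts a spreadsheet-style row with yes/no columns into a record with a single multi-valued “creation method” field. The column names, the bucket name, and the extra method value are all made up for the example; they are assumptions, not our actual data or our final model.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """One bucket (category of items) within a content type."""
    name: str
    content_type: str
    # A set of values rather than one yes/no column per method,
    # so new methods can be added without changing the structure.
    creation_methods: set = field(default_factory=set)


def from_flat_row(row):
    """Convert a flat row with yes/no columns into a Bucket (hypothetical column names)."""
    methods = set()
    if row.get("born_digital") == "yes":
        methods.add("born-digital")
    if row.get("digitized") == "yes":
        methods.add("digitized")
    return Bucket(row["bucket"], row["content_type"], methods)


# Hypothetical spreadsheet row; the bucket name is invented for illustration.
row = {
    "bucket": "Example bucket",
    "content_type": "Theses",
    "born_digital": "yes",
    "digitized": "no",
}
bucket = from_flat_row(row)

# A yet-to-be-determined method is just another value, not another column.
bucket.creation_methods.add("some-future-method")
print(bucket)
```

The design choice here is simply that a multi-valued field grows by adding values, while the flat table grows by adding columns.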

Realizing that I needed a better way to approach this process, I backtracked a few steps to the data modeling phase, which led me to some Google searching for ontologies and ontology creation. The results were not pretty, but through a lot of twists and turns I ended up at Stanford’s Protégé website, which not only develops and hosts the fantastic Protégé software (and web service!) for ontology development, but also provides excellent tutorials on creating and working with ontologies. Jackpot!

So without further ado, I started reading through their Ontology Development 101 tutorial, which is written as though the reader has no prior ontology-creation experience. Just my speed. By following the process they outlined, I’m well on my way to creating a class hierarchy for a digital content management ontology, to be shared in future posts. I also found several existing ontologies that I can pull from…no need to reinvent the wheel. It looks like there are definitely terms for a lot of the classes, entities, and properties we need, but there is also information we’re modeling that doesn’t seem to exist in a current ontology. At least not that I could find. Here are the most useful related ontologies I’ve found so far. If you know of others I should look at, please share in the comments!

Following the Protégé tutorial’s step-by-step process has been incredibly useful. It very clearly describes how to break down the information in terms of entities with properties and relationships to each other, as opposed to a flat table. In my next couple of posts I’ll talk about the classes that have emerged to describe our digital content, and some of the initial properties we’re interested in recording.

Digital Content Management Initiative: It’s All About the Data

One of the major projects I’m working on in the fellowship is the Libraries’ Digital Content Management initiative, or DCM for short. This project started as a library-wide initiative in fiscal year 2013, focused on reviewing the MIT Libraries’ digital content to ensure that we are able and ready to manage it over time. The digital content review process we followed was developed and led by Nancy McGovern, Head of Curation and Preservation Services, and coordinated by a Life Cycle Group representing several departments across the Libraries. There’s more information about the DCM initiative on our Life Cycle Management libguide, and we’ll be posting updates on the process and results of year one soon. In the meantime, I wanted to share some work I’m doing as part of the next phase of the project now that FY13 is over.

The first part of the digital content review involved completing overviews of nine different content types managed by the Libraries: architectural and design documentation, digital audio, digital video, e-books, faculty archives, geospatial information, research data, theses, and web content. We also identified three additional content types to review in the coming year: institute records, scholarly record, and visual resources. In the process of creating these overviews, we gathered a lot of information about each content type. For the first phase of the project, that information was compiled into written overview documents and visualized as diagrams showing each of the buckets, or categories of items, within a content type. Here’s an example for our Architectural and Design Documentation content type:


Architectural and Design Documentation Bucket Diagram

These diagrams and overview reports have been very helpful in giving us a bird’s-eye view of all the digital content the Libraries manage (or may manage in the future). For the next phase of the project, we want to take all this information and turn it into a data set that can be queried, visualized, and added to as it changes over time. As you can see, there’s a lot of data already there; for example, we have information on who currently manages the content, how much content we have, whether it’s born-digital or digitized, and much more in the corresponding overview document.

Before we can create a data set from all this information, however, we have to decide how we want to model the data. I’ve been reading a lot about linked data and RDF (short for Resource Description Framework) over the past few years, and this data seems like it might be a good fit for the graph model for a few reasons:

  1. we need flexibility, as we don’t know yet how much more data we’ll gather or what shape it will take as we go
  2. it will allow us to easily link to existing digital preservation standards and services such as the PREMIS Preservation Metadata Ontology and the Unified Digital Format Registry (UDFR)
  3. one of our primary interests is exploring the relationships between the content types, a task for which linked data is well-suited
  4. using a web-based method for managing this information could simplify future sharing and data exchange
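To give a feel for what the graph model in the list above looks like, here is a minimal Python sketch that stores statements as subject-predicate-object triples and queries them by pattern. It is a toy stand-in for a real triple store, not a recommendation of any particular system; the `ex:` names are hypothetical placeholders, and only `rdf:type` comes from an existing vocabulary.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" identifiers are invented for this example.
triples = {
    ("ex:Theses", "rdf:type", "ex:ContentType"),
    ("ex:ResearchData", "rdf:type", "ex:ContentType"),
    ("ex:ScannedTheses", "ex:partOf", "ex:Theses"),
    ("ex:ScannedTheses", "ex:creationMethod", "ex:Digitized"),
    ("ex:EThesesPDF", "ex:partOf", "ex:Theses"),
    ("ex:EThesesPDF", "ex:creationMethod", "ex:BornDigital"),
}


def match(store, s=None, p=None, o=None):
    """Return a sorted list of triples matching the pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which buckets belong to the Theses content type?
theses_buckets = match(triples, p="ex:partOf", o="ex:Theses")
print(theses_buckets)

# Flexibility: a new kind of statement is just another triple,
# with no table schema to change first.
triples.add(("ex:EThesesPDF", "ex:managedBy", "ex:ExampleDepartment"))
print(match(triples, s="ex:EThesesPDF"))
```

In a real project the identifiers would be full URIs, which is what makes it possible to point at terms defined elsewhere, such as in the PREMIS ontology.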

So the next steps in the DCM initiative present an opportunity for me to kill two birds with one stone: convert the qualitative visual and textual information we’ve compiled into a data set, and learn the process of creating and working with a triple store. This is definitely an exploratory undertaking. It may turn out that a simpler model, even a very basic relational database, will be better suited to this particular set of data. And I will definitely have to stretch my technology skills to learn how to create and query a triple store using whatever documentation I can find. But this is part of the joy of being a fellow: exploration and learning new skills are all part of the deal!

I’m going to share all the details of the process as I go, so feel free to follow along on the blog, make suggestions when I get stuck, and/or laugh (in a friendly way, I hope) when I make silly newbie mistakes. It’s going to be a very interesting project.

It’s Conference Season!

Conference season (aka summer) is upon us, and May was the big month of professional development for me. It started off with a regional National Digital Stewardship Alliance meeting here in Boston, followed by the DigCCurr Professional Institute in Chapel Hill, NC, and ending with the American Institute for Conservation Annual Meeting in Indianapolis! Here’s a brief summary of each one:

The NDSA meeting was a one-day unconference-style event held at the beautiful WGBH studios in Brighton. A group of digital preservation specialists and interested colleagues from around the New England area got together to talk about the challenges we’re facing and identify ways we can collaborate to solve some of our problems. We started off with short presentations on new initiatives and local efforts in the morning and then split up into groups for discussions in the afternoon. Discussion topics included marketing and outreach to the community, preserving research data, staffing and skills for digital preservation, and digital forensics. The day’s agenda and notes from our afternoon discussions are posted here. We hope to have more of these in the future, so if you’re in the New England area, stay tuned for info on those. Thanks so much to WGBH and Harvard Library for organizing the first one!

The DigCCurr Professional Institute was a week-long workshop on digital curation held at the lovely UNC Chapel Hill campus. This intensive week of training included presentations by an excellent set of instructors, practical labs where we could put the lessons learned into action, and lots of opportunity for conversations with the other digital library practitioners in attendance. We had a great cohort of folks from libraries all over the U.S. and Canada, and it was both a valuable learning experience and a rollicking good time! The week culminated in each attendee selecting and planning a project to complete at his or her home institution. We’ll all return to Chapel Hill in January to report back on our projects and share updates about how we implemented strategies learned in the first session. My six-month project is a review of the preservation metadata in the MIT Libraries DSpace@MIT repository, to clarify and improve our alignment with PREMIS. I’ll be working on that a lot between now and January, so expect updates on the blog!

The final week of May took me to Indiana for the AIC Annual Meeting. I’ve been to AIC several times before, but this year was my first time attending as Chair of the Electronic Media Specialty Group, which meant I was much more involved than in years past. Perhaps I’m a bit biased, but I thought the EMG sessions this year were truly excellent! Some of the highlights included talks about mass migration of media, conserving custom electronic video art equipment, and a new topic of interest for me: documenting source code. Of course, now that the Annual Meeting is over we’re already hard at work planning next year’s sessions. The theme is “sustainable choices for collection care,” which should be very interesting to explore in relation to digital preservation.

I’m so grateful to the MIT Libraries for supporting me in attending each of these events. These experiences have already contributed much to my thinking, project planning, and research interests for the rest of the fellowship and beyond. I can guarantee there will be more posts on some of these threads in the near future!

Preservation Week: A Recap

Preservation Week, in case you aren’t familiar with it, is the time of year when preservation departments across the country emerge from their basement labs and take over the libraries! Ok, that might be a bit of an exaggeration. In reality, Preservation Week was started by the American Library Association in 2010 so libraries and archives could highlight the wonderful work they do to preserve our nation’s cultural heritage. I’ve been involved in planning Preservation Week events since year one, so when Nancy asked me to coordinate this year’s activities for the MIT Libraries, I said, “sure, no sweat!”

Well, maybe a little sweat, but it was made much easier by Ann Marie Willer and Andrew Haggerty from Curation and Preservation Services, who put in extra effort to help me get everything together. This year, we wanted to highlight the three different areas that CPS is responsible for: digital curation, analog preservation, and “hybrid” work that crosses the border between the two (which often involves reformatting analog collections into digital). We also wanted to step outside our own department and collaborate with folks in other areas of the MIT Libraries, as well as preservation folks outside our organization. We pulled together a great lineup of events that included all those things, and you can read the details about each of our events here.

What I want to talk about, though, is the process of putting a week like this together. How did we get from “let’s celebrate Preservation Week” to “we have four events planned, advertised, and ready to go”? It’s definitely a big job, but it helps to break it down into steps. We started with brainstorming possible ideas for events, an effort to which the entire CPS department contributed. Then a small group of us went through the loooong list and prioritized, noting which events we were most excited about, which events we could realistically pull together in a few short months, and which events simply weren’t going to happen this year. Once we had an outline of what we wanted to do, we started contacting the folks we hoped to collaborate with to make the events happen. This year, that included several MIT librarians, as well as MIT senior Shannon Taylor and Conservation Scientist Katherine Eremin from Harvard’s Straus Center for Conservation and Technical Studies.

After all the speakers very kindly agreed to participate and we coordinated the dates and times so there weren’t overlapping events (not easy, since we were also keeping the libraries’ other planned events in mind), the rest was all logistics. We scheduled rooms, set up a webpage, wrote marketing blurbs, gathered pictures and information about our presenters, put up posters, and sent out information to listservs. I have to put in a big plug for the Libraries’ web team and marketing team here, because they were immensely helpful in getting our webpage up and coordinating marketing for all the events. In order to help us all keep track of the various little tasks we had to complete, I created a master task list for the group noting deadlines and responsible parties, which we could all access on a shared drive. We also helped each other stay on target with regular meetings to check in about what we’d done and what was left to do.

The week was a big success, with attendees at the events ranging from MIT students and staff to library professionals from other local institutions. We had a lot of fun, and all of our presenters were just amazing. There were, of course, a few small hiccups. One challenge was the unexpected overlap on Wednesday of Preservation Week with the memorial service for MIT Officer Sean Collier. Since we knew many people would not be able to attend our Preservation Week event due to this much larger event taking place on campus, we decided to hold a repeat session of that event two weeks later. Fortunately, Kari Smith, our presenter for that day, was perfectly happy to do a second session. However, this unforeseen event was a reminder that you can’t plan for everything and it’s important to stay flexible.

I want to say a quick thank you to all our excellent Preservation Week speakers: Katherine Eremin, Peter Munstedt, Kari Smith, Shannon Taylor, and Ann Marie Willer. And another thank you to all the members of Curation and Preservation Services, the Web Team, and the Marketing Team for their help in getting ready. We’re happy we were able to share some preservation joy, and we can’t wait to do it again next year!

Welcome to MIT!


Me in front of my new workplace: Hayden Library at MIT

Hi there! I’m Helen Bailey, Fellow for Digital Curation and Preservation at MIT Libraries, and this is my story.

In 2012, MIT Libraries decided to initiate a new fellowship program to offer early-career library professionals an opportunity to gain experience and contribute to initiatives in an academic research library. They selected two program areas to house these fellows: Scholarly Publishing and Licensing, and Digital Curation and Preservation (that’s me!). In October of 2012, I moved to Cambridge and started on this new adventure.

My position is hosted by the Curation and Preservation Services department of the Libraries, which is responsible for curation and preservation of the Libraries’ collections in all formats, from paper-based books to electronic resources and everything in between. I’ll be working on a variety of projects focused mostly on the digital side of things, although some of my work (like Preservation Week planning, more on that soon) encompasses curation and preservation as a whole.

One of my goals for the fellowship is to share what I learn from this experience with others: the MIT community, the library and information community, my digital preservation/curation colleagues, and the world at large. This is a pretty unique opportunity to explore new ideas in an emerging and fast-paced area of librarianship, in a setting that houses some of the most innovative new technology developments in the world. Surely there are others who may be interested in hearing what the journey is like?

So, there you have it. This blog will chronicle my experiences in the two-year fellowship at MIT Libraries. I’ll talk about the things I’m learning, the projects I’m working on, the fun adventures I get to have on the MIT campus, and the post-graduate fellowship experience in general. If there are other things you’d like to hear about, please let me know!

