One of these days I’m going to write some more detailed posts about other projects I’ve mentioned in the past…but not today. Today I’m going to introduce yet another project I’m working on, although one that is about to wrap up. The project is an analysis of text encoding errors in PDF theses submitted to the Libraries for deposit into our DSpace@MIT repository. It’s just one component of a larger set of life cycle experiments currently going on as part of the Digital Content Management Initiative. There will be much more to share about the life cycle experiments over the next year, so more posts to come on that.
In the meantime, a little background on the PDF theses: graduating students here submit paper theses as the official copy of record for the Institute, which are accessioned into the Institute Archives for preservation. We scan the theses and provide electronic access to them via DSpace@MIT. However, if students wish to provide an electronic version of their theses in PDF format, in addition to the required paper copy, they may do so. After some checking to make sure the electronic version matches the official print copy, we can use the e-submission as the online access copy, thus saving us the trouble of scanning it from print.
All well and good; however, our meticulous catalogers have noticed some interesting problems in a few of the electronically submitted PDF theses (locally known as e-theses) over time. Often, these problems were encountered when attempting to copy and paste the text of a thesis abstract from the PDF document into another program, such as the library catalog. Certain character combinations tended not to copy over so well; they would paste into the new program as symbols or blank spaces instead of the expected letters.
Many of the catalogers observed that these characters were often ligatures in the original text, including such common letter combinations as “ff”, “fi”, and “th”. It was unclear what, exactly, was causing this to happen, but there was a suspicion that the problem stemmed from PDFs created by LaTeX applications. For those not familiar with LaTeX (I wasn’t when I started this project), it’s a suite of typesetting tools built on the TeX typesetting system. This system is often used to write academic papers in the scientific and mathematical disciplines because it can handle typesetting of complex math formulas more easily than word processing programs. Consequently, it is very popular here at MIT.
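To make the ligature symptom a bit more concrete, here is a minimal Python sketch of one way to spot ligature characters in text that has already been copied or extracted from a PDF. It is an illustration only, not the method we used in the analysis. It assumes the pasted text contains the Unicode ligature code points (U+FB00 through U+FB06, which cover “ff” and “fi”), and the function names `find_ligatures` and `expand_ligatures` are made up for this example. It will not catch a “th” ligature, which has no Unicode code point of its own, and it will not help when the PDF yields unrelated symbols or blanks instead.

```python
import unicodedata

# Assumption for this sketch: the extracted text uses the Latin ligature
# code points in Unicode's Alphabetic Presentation Forms block (U+FB00..U+FB06).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text):
    """Return (index, character, Unicode name) for each ligature code point."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in LIGATURE_RANGE
    ]


def expand_ligatures(text):
    """Replace only the ligature code points with their plain-letter equivalents."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )


# Hypothetical pasted text containing the "ff" and "fi" ligature characters.
sample = "The e\ufb00ect of \ufb01ltering"

for index, ch, name in find_ligatures(sample):
    print(index, repr(ch), name)

print(expand_ligatures(sample))
```

Normalizing only the ligature characters, rather than the whole string, leaves other characters (superscripts or full-width forms, for example) untouched, which seems the safer choice for catalog text.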
My job on this project, in coordination with my colleague Rich Wenger, is to identify and characterize the types of problems encountered in our PDF theses, identify the source of these problems when possible, and recommend solutions for avoiding such problems in future e-thesis submissions. We’ve been working on the analysis for several months now, and while we aren’t quite done yet, we have learned a lot about PDF documents, TeX/LaTeX, and the challenges of representing text in a visual format. In part 2, I’ll discuss the process we went through to analyze the PDF documents and develop methods for characterizing the text issues we encountered. Stay tuned!