PDF Analysis Tools: Part 2 of the PDF Experiment

A while back I posted about the PDF experiment I’m doing to explore some character encoding anomalies in our PDF e-theses. The project is complete now, but I wanted to outline some of the steps we went through before I jump right to what we found. Step one was analyzing the PDFs to figure out what, exactly, was going on under the hood. But before we could do that, we needed a preliminary step: finding a tool that would let us do that analysis effectively.

This actually turned out to be more difficult than it seemed at first. The errors we were looking for aren’t related to the common issues digital preservationists are most often concerned with, such as format identification and validation. Tools like JHOVE weren’t quite right for our purposes, because the errors we encountered were all buried deep within the content stream. We had plenty of perfectly valid PDFs (checked against JHOVE just to be sure) that would render fine in a PDF reader, but whose text came out with unusual characters when copied and pasted into another document (any other document; we tried every text editor we could find).
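
As an aside, the validity check itself is easy to script. Here’s a minimal sketch of what I mean by “checked against JHOVE,” assuming the jhove command-line tool is installed and on the PATH; the folder name is just a placeholder, not our actual setup.

    # Minimal sketch: run JHOVE's PDF module over a folder of PDFs and flag
    # anything it does not report as well-formed and valid.
    # Assumes the `jhove` CLI is installed and on the PATH; the folder name
    # below is a placeholder.
    import subprocess
    from pathlib import Path

    def jhove_is_valid(pdf_path):
        """Return True if JHOVE reports the PDF as well-formed and valid."""
        result = subprocess.run(
            ["jhove", "-m", "PDF-hul", str(pdf_path)],
            capture_output=True, text=True
        )
        return "Status: Well-Formed and valid" in result.stdout

    for pdf in sorted(Path("sample_theses").glob("*.pdf")):
        status = "valid" if jhove_is_valid(pdf) else "needs a closer look"
        print(f"{pdf.name}: {status}")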

So our challenge was to find a tool that would help us analyze the text encoding within the content stream. It turned out Adobe Acrobat Pro was the most useful tool for manually inspecting the errors, since its Preflight function allowed us to view the actual content stream encoding at the point in the document where we had a known error, such as a ligature that wouldn’t convert to its correct ASCII characters on copy-and-paste. However, this required that we already knew the PDF contained the error in question, and what we really wanted was a tool that would help us identify PDFs with those errors from a larger collection of PDF e-theses.

We tested 16 tools to see if any would solve this problem for us. Questions we asked as we tested each tool included:

  • Can the tool accurately and reliably diagnose the problems we are working on?
  • Does it have repair capabilities, or is it diagnostic-only?
  • Can it be run as a batch process (from the command line) or does it require a GUI?
  • Does it run on Linux, Windows, Mac?
  • Can it report on whether the document conforms to standards such as PDF/A or PDF/X?
  • Who creates and maintains the tool?
  • Is there good documentation/support?
  • Is there a cost?
  • Is it open source?

We were hoping not only to find a tool that suited our needs, but ideally one that was free and open source. For each tool, we went through the following steps for initial testing:

  1. Install the tool on a local machine or server.
  2. Run the tool on each of 32 sample PDF theses that represent the different types of problems we had encountered.
  3. Record results/output and comments about the tool’s functions in a shared spreadsheet.
  4. After testing the tool and getting a feel for how it works, record general observations and answers to the questions above on a wiki page for the project.

After going through this process for all 16 tools, we hadn’t found one that did exactly what we needed. But we did find two tools whose text extraction features could be put to work for our purposes. One, PDFBox, was able to handle the ligature issue for most of the PDFs, providing cleanly extracted text in which the ligatures resolved to their correct ASCII characters. This was helpful in two ways: one, it led us to believe that the issue we were dealing with was also one of rendering rather than strictly encoding (whatever the encoding is for those ligatures, at least one program can convert it correctly); and two, it gave us an option for extracting readable text for both cataloging and indexing in our repository.
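
To give a sense of what that extraction step can look like, here’s a rough sketch using PDFBox’s command-line ExtractText tool (PDFBox can also be used through its Java API, and this isn’t necessarily how we invoked it; I’m calling the app jar from Python just to keep all the examples in one language). The jar name and file names are placeholders.

    # Rough sketch: extract plain text from a PDF with PDFBox's command-line
    # ExtractText tool. The jar path and file names are placeholders.
    import subprocess

    def extract_with_pdfbox(pdf_path, txt_path, jar="pdfbox-app.jar"):
        """Run PDFBox ExtractText, writing UTF-8 text to txt_path."""
        subprocess.run(
            ["java", "-jar", jar, "ExtractText", "-encoding", "UTF-8",
             pdf_path, txt_path],
            check=True
        )

    extract_with_pdfbox("thesis.pdf", "thesis.txt")
    with open("thesis.txt", encoding="utf-8") as fh:
        text = fh.read()
    # For most of our samples, ligatures came through here as plain ASCII,
    # e.g. "efficient" rather than a symbol or a blank.
    print(text[:500])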

The other tool, PDFMiner, helped us by, in a sense, not working: every single PDF that we had identified as having ligature issues showed those same issues in the text PDFMiner extracted. We concluded that whatever encoding is causing the problem, PDFMiner does not resolve it to the correct ASCII characters. While this might sound like a deficiency in the tool, for our needs it actually allowed us to create the exact tool we were looking for!
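
To make that concrete, here is a minimal sketch of how PDFMiner’s pass-through behavior can be turned into a detector. This is not our actual code (the real scripts are described next), and the particular set of “anomalous” characters, the Unicode ligature presentation forms plus the replacement character, is my assumption for illustration; it uses the high-level API from the pdfminer.six fork.

    # Minimal sketch, not our actual scripts: extract text with PDFMiner and
    # look for characters that signal the encoding problems described above.
    # Uses pdfminer.six's high-level API; the character set and file name are
    # assumptions for illustration.
    from pdfminer.high_level import extract_text

    # Unicode ligature presentation forms (ff, fi, fl, ffi, ffl, ...) plus the
    # replacement character that often stands in for an unmappable glyph.
    ANOMALOUS = set("\ufb00\ufb01\ufb02\ufb03\ufb04\ufb05\ufb06\ufffd")

    def find_anomalies(pdf_path):
        """Return a count of each suspect character found in the extracted text."""
        text = extract_text(pdf_path)
        return {ch: text.count(ch) for ch in ANOMALOUS if ch in text}

    hits = find_anomalies("thesis.pdf")   # placeholder file name
    if hits:
        print("possible text encoding issues:", hits)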

My colleague Rich Wenger wrote a series of scripts that extract text from a PDF using PDFMiner and then search for the anomalous characters that appear when known text encoding issues are present. We ran these scripts over the entire collection of e-thesis PDFs to see how many had underlying text encoding issues, and determined that about 30% of the collection exhibits the ligature encoding error, while a much smaller percentage shows other errors, such as excessive line breaks and extra characters inserted into the encoded text. In my next post, I’ll discuss possible implications of these results and some of the steps we’re taking to mitigate these issues in the future. Stay tuned!

LaTeX and Theses and Glyphs, Oh My! Part 1 of the PDF Experiment

One of these days I’m going to write some more detailed posts about other projects I’ve mentioned in the past…but not today. Today I’m going to introduce yet another project I’m working on, although one that is about to wrap up. The project is an analysis of text encoding errors in PDF theses submitted to the Libraries for deposit into our DSpace@MIT repository. It’s just one component of a larger set of life cycle experiments currently going on as part of the Digital Content Management Initiative. There will be much more to share about the life cycle experiments over the next year, so more posts to come on that.

In the meantime, a little background on the PDF theses: graduating students here submit paper theses as the official copy of record for the Institute, and those paper copies are accessioned into the Institute Archives for preservation. We scan the theses and provide electronic access to them via DSpace@MIT. However, if students wish to provide an electronic version of their thesis in PDF format, in addition to the required paper copy, they may do so. After some checking to make sure the electronic version matches the official print copy, we can use the e-submission as the online access copy, saving us the trouble of scanning it from print.

All well and good; however, our meticulous catalogers have noticed some interesting problems appearing over time in a few of the electronically submitted PDF theses (locally known as e-theses). Often, these problems were encountered when attempting to copy and paste the text of a thesis abstract from the PDF document into another program, such as the library catalog. Certain character combinations tended not to copy over so well; they would paste into the new program as symbols or blank spaces instead of the expected letters.

Many of the catalogers observed that these characters were often ligatures in the original text, including such common letter combinations as “ff”, “fi”, and “th”. It was unclear what, exactly, was causing this to happen, but there was a suspicion that the problem stemmed from PDFs created with LaTeX. For those not familiar with LaTeX (I wasn’t when I started this project), it’s a suite of document preparation tools built on the TeX typesetting system. It is often used to write academic papers in the scientific and mathematical disciplines because it handles typesetting of complex math formulas more easily than word processing programs, and consequently it is very popular here at MIT.
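
A quick illustration of what a ligature looks like at the character level may help here. Fonts can draw letter pairs like “fi” as a single glyph, and if that glyph ends up mapped to a Unicode presentation form (or to nothing at all), copy-and-paste won’t give you the plain letters back. The snippet below is just my illustration of the general phenomenon; in our problem theses the pasted text showed symbols or blank spaces, which suggests the mapping back to ordinary letters was missing or wrong.

    # Illustration only: an "ffi" ligature stored as the single code point
    # U+FB03 instead of the three letters f, f, i. Where the presentation form
    # survives, Unicode compatibility normalization (NFKC) folds it back to
    # plain ASCII; where the glyph is unmapped, there is nothing to recover.
    import unicodedata

    word = "e\ufb03cient"                        # "efficient" with an ffi ligature
    print(ascii(word))                           # 'e\ufb03cient'  (one code point)
    print(unicodedata.normalize("NFKC", word))   # efficient      (plain f, f, i)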

My job on this project, in coordination with my colleague Rich Wenger, is to identify and characterize the types of problems encountered in our PDF theses, identify the source of these problems when possible, and recommend solutions for avoiding such problems in future e-thesis submissions. We’ve been working on the analysis for several months now, and while we aren’t quite done yet, we have learned a lot about PDF documents, TeX/LaTeX, and the challenges of representing text in a visual format. In part 2, I’ll discuss the process we went through to analyze the PDF documents and develop methods for characterizing the text issues we encountered. Stay tuned!