A common hurdle facing digitization projects is producing accurate optical character recognition (OCR) output. Library and archives users expect to be able to search the full text of digitized documents, and yet OCR technology—though extremely useful when it’s accurate—is far from perfect.
What conditions make for excellent OCR? Typically OCR excels when used on a book or similarly formatted text. The text reads from left to right, is free of illustrations, contains little negative space, and is printed in a consistent font type and size.
Historic newspapers commonly suffer from poor OCR outputs because OCR engines don’t like their column-style layouts. Advertisements also cause difficulties because they may contain illustrations; changes in font type, size, and orientation; and lots of negative space.
Upon digitizing our first reel of microfilm last spring, the Historic Maryland Newspapers Project had to deal with a few challenges regarding OCR:
- As explained in my first blog post, National Digital Newspaper Program (NDNP) projects digitize from microfilm, specifically from a second-generation silver negative copy that is made from the camera master. When undertaking any sort of digitization project, it’s always ideal to scan the original source materials, or something as close to them as possible, because this produces the clearest possible images and in turn the best possible OCR. Scanning from a copy of a copy means we start at a disadvantage.
- The first newspaper our project is digitizing is called Der Deutsche Correspondent, and as you may have guessed, it is a German-language newspaper. It was printed in a script called Fraktur, which was common to other German papers of the time. Fraktur looks like calligraphy, and several letters are easy to confuse. Fortunately, ABBYY began to develop OCR technology for Fraktur in 2003. Without OCR software designed specifically for the Fraktur typeface, it would have been impossible to produce OCR output for Der Deutsche Correspondent that was anything but garbled.
- The condition of the Der Deutsche Correspondent pages that were microfilmed varies from very good to very not good. In our first reel, faded print was a common problem. Faded print results in low contrast, which makes it difficult for the OCR engine to interpret what remains of the printed characters.
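- I, the project manager, cannot read, write, or speak German. How is it possible for me to evaluate the quality of our OCR output? The presence of odd characters is the biggest red flag. Compare the two excerpts of OCR output below: the first line is garbled, while the three lines that follow it are clean German text: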
ksM. >->’ lchn-a-» »-“»»»«»» . ‘
Der deutsche Dichter Albert Träger,
der wieder für den Reichstag „läuft,” hat
folgenden Mahnruf an feine Wähler gerichtet:
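The "odd characters" check described above can be roughed out in a few lines of Python. This is only a minimal sketch, not the method our project or our vendor actually uses: the function names `odd_char_ratio` and `looks_garbled` are made up for illustration, and the 0.3 cutoff is an arbitrary assumption that would need tuning against real pages.

```python
def odd_char_ratio(line: str) -> float:
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    # str.isalnum() is Unicode-aware, so letters such as "ä" count as letters.
    odd = sum(1 for c in chars if not c.isalnum())
    return odd / len(chars)


def looks_garbled(line: str, threshold: float = 0.3) -> bool:
    """Flag a line whose share of odd characters exceeds the (assumed) threshold."""
    return odd_char_ratio(line) > threshold


# The four lines of OCR output quoted above.
samples = [
    "ksM. >->’ lchn-a-» »-“»»»«»» . ‘",
    "Der deutsche Dichter Albert Träger,",
    "der wieder für den Reichstag „läuft,” hat",
    "folgenden Mahnruf an feine Wähler gerichtet:",
]

for text in samples:
    flag = "GARBLED" if looks_garbled(text) else "ok"
    print(f"{flag:8} {text}")
```

A check like this only catches lines full of punctuation and symbols. It says nothing about whether plausible-looking letters are the right letters, so it is a first-pass red flag rather than a measure of accuracy.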
The Library of Congress (LC) does not have a defined standard for how accurate the OCR output must be because accuracy can vary so widely (due to all the factors discussed above). It’s more important that the OCR output be as good as it can be, given the condition of the microfilmed paper. We are also asked to check that the OCR is zoned by columns and appears in reading order, that the language has been correctly encoded, and that search terms are correctly highlighted. All of these requirements can be met without knowing the language of the newspaper, although knowing it would certainly make the work easier and more enjoyable!
While it was clear that our OCR accuracy was not going to be 100%, our main concern was making sure that it was good enough for LC, as they had to approve our first reel before we could continue full steam ahead with the project. We worked with our digitization vendor to deskew some images to the text block edges, rather than to the page edges, to improve OCR quality, but to be honest it still wasn’t great. Luckily, LC agreed that the quality of the OCR, while not good, was as good as it could be given the condition of the filmed newspapers.
Some may wonder why we would go ahead with digitizing from microfilm knowing that the OCR accuracy would suffer. Although the full-text search won’t work very well for the images on our first reel, the pages will still be online and accessible to researchers, genealogists, students, and teachers. They will now have the added advantage of being able to limit their searches using metadata that is not reliant on the OCR, such as publication location, newspaper title, date range, and language. And we are happy to report that the quality of newspaper pages on subsequent reels has improved and OCR accuracy along with it.
While not a perfect technology, OCR output—along with all the other metadata created by NDNP projects—is making our country’s aging newspapers more accessible than they have ever been before.