Wednesday, July 4, 2012

OCRing typescript: A benchmarking test with PRImA

During our pilot phase, we plan to add over 1 million images of archives to the Wellcome Library website, all created during the 20th century. Much of this consists of typescript: letters, drafts of papers and articles, research notes, memos, and so on. Key to improving discoverability of these papers is OCR (optical character recognition), which allows us to encode the words as text and include them in a full-text index. We set out to test whether typescript material would yield OCR results accurate enough to include in our index.

OCR technology

OCR works by segmenting a block of text down to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source having clear-cut and common letter forms. The accuracy of word recognition relies on both character recognition and the availability of comprehensive dictionaries for comparison. OCR software can compare the recognised words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).
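
The dictionary step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular OCR engine works internally: the tiny word list, the function names and the 0.8 similarity cutoff are all hypothetical, and the matching uses the standard library's difflib module.

```python
import difflib

# Hypothetical, tiny dictionary; a real one would hold many thousands of words.
DICTIONARY = ["structure", "molecule", "double", "helix", "acid"]

def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the word itself if known, else the closest dictionary word, else the word unchanged."""
    if word in dictionary:
        return word
    # get_close_matches ranks candidates by similarity and drops those below the cutoff.
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def dictionary_hit_rate(words, dictionary=DICTIONARY):
    """Share of words found in the dictionary: a crude stand-in for a confidence level."""
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)

recognised = ["structurc", "molecule", "double", "helix"]
print([correct_word(w) for w in recognised])
print(dictionary_hit_rate(recognised))
```

A low hit rate on a page would suggest that the raw OCR output needs attention before it is indexed, although specialist vocabulary missing from the dictionary would also lower it.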

When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention - with word accuracy rates of 99% or higher for modern printed content (fewer than one word in one hundred having at least one inaccurate character), reducing as you recede in time to 95% for 1900 - 1950 printed material, and lower for 19th century material (see Tanner, Munoz and Ros, 2009). For some formats, OCR is worse still - as the same report shows, 19th century newspapers may only reach 70% significant word accuracy (that is, accuracy measured on words other than "stop" words such as definite/indefinite articles, and other non-search terms).
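
To make the "significant word accuracy" measure concrete, here is a minimal Python sketch. It assumes a simple bag-of-words comparison between a manual transcription and the OCR output, and a short illustrative stop-word list; the studies cited above may align and count words differently, so treat the function names and the method as assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real evaluations use longer lists).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "on"}

def significant_words(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def significant_word_accuracy(ground_truth, ocr_output):
    """Fraction of significant ground-truth words that also appear in the OCR output."""
    truth = Counter(significant_words(ground_truth))
    ocr = Counter(significant_words(ocr_output))
    total = sum(truth.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps, for each word, the smaller of the two counts.
    matched = sum((truth & ocr).values())
    return matched / total

truth = "The structure of the molecule is a double helix"
ocr = "The structurc of the molecule is a double helix"
print(significant_word_accuracy(truth, ocr))  # 3 of 4 significant words match: 0.75
```

A single wrong character is enough to make a word count as an error, which is why word accuracy is always lower than character accuracy for the same page.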

Typescript testing

Our archival collections contain a wide range of content that is theoretically OCR'able. Some, such as professionally printed matter, will OCR very well. But given the age of the material, much of the non-handwritten content is typescript, and we had no idea how well this would OCR. To find out, we commissioned Apostolos Antonacopoulos and Stefan Pletschacher, based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research), to carry out a benchmarking exercise. From it we could determine whether to rely on raw OCR outputs, not OCR this type of material at all, or test various methods to improve OCR'ability (such as post-processing of particular images).

Apostolos and Stefan chose 20 documents from a larger sample we provided from our Mourant and Crick digitised collections. These 20 documents were manually transcribed using the Aletheia groundtruthing tool for comparison with the output of three OCR engines: Abbyy FineReader Engines 9 and 10, and Tesseract, an open source OCR engine.

The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy FineReader Engine 10 (such as this example below):

At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:

What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:

And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:

The average across 16 images of good to poorer quality typescript (excluding the carbon copies) is 83% significant word accuracy.

Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates cover all the visible text on the page, including letterheads, pre-printed text such as contact details, text overwritten by manual annotations and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance in terms of indexing and discoverability. Further tests would be required to determine what the accuracy rate is with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):

There is some variation between FineReader 9 and 10: one or the other may have a small advantage, with a few cases showing as much as a 30% difference. Overall, there is only a 1% difference between the averages for 9 and 10. Tesseract, on the other hand, was far less accurate, especially for the poorer quality typescript (roughly half as accurate overall).

There are a few things we could do to improve accuracy: 
  1. Incorporate medical dictionaries to improve recognition and confidence of scientific terms
  2. Enhance images to "remove" any annotations prior to OCR'ing
  3. Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)
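
The third idea, a workflow that diverts images down different paths, might look something like the Python sketch below. Everything in it is hypothetical: the page traits, the plan fields and the rule of skipping carbon copies are assumptions chosen to mirror the findings above, not a description of any system we have built.

```python
from dataclasses import dataclass

@dataclass
class PageTraits:
    """Hypothetical traits recorded for each image before OCR."""
    is_typescript: bool
    has_annotations: bool
    is_carbon_copy: bool

@dataclass
class OcrPlan:
    """Hypothetical description of how one image should be processed."""
    run_ocr: bool
    engine_setting: str
    remove_annotations: bool

def plan_for(page):
    """Pick a processing path for one image from its traits."""
    # Assumption: carbon copies are skipped, since they scored close to 0% in the test.
    if page.is_carbon_copy:
        return OcrPlan(run_ocr=False, engine_setting="none", remove_annotations=False)
    setting = "typescript" if page.is_typescript else "default"
    # Annotated pages are flagged for image enhancement before OCR.
    return OcrPlan(run_ocr=True, engine_setting=setting, remove_annotations=page.has_annotations)

draft = PageTraits(is_typescript=True, has_annotations=True, is_carbon_copy=False)
print(plan_for(draft))
```

In practice the traits would need to be assigned either by cataloguers or by some automated image analysis, which is itself a piece of work to investigate.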

We may find that an average of 83% word accuracy overall is perfectly adequate for our needs in terms of indexing terms and allowing people to discover content efficiently. Further investigation is required, but this report has given us a good foundation from which to press on with our OCR'ing plans.

These digital collections are not yet available online, but will be accessible from autumn 2012.