OCR technology
OCR works by segmenting a block of text to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source information having clear-cut, and common, letter forms. The accuracy of word recognition relies on both character recognition, and the availability of comprehensive dictionaries for comparison. OCR software can compare these words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention - with rates of 99% or higher for modern printed content (less than one word out of one hundred words having at least one inaccurate character), reducing as you recede in time to 95% for 1900 - 1950 printed material, and lower for 19th century material (see Tanner, Munoz and Ros, 2009). For some formats, OCR is worse still - as the document mentioned above shows, 19th century newspapers may only reach 70% significant word accuracy (words not including "stop" words such as definite/indefinite articles, and other non-search terms).
Typescript testing
Regarding our archival collections, there is a wide range of content that is theoretically OCR'able. Some will OCR very well, such as professionally printed matter. But much non-handwritten content is by the nature of the age of this material in a typescript form. We had no idea how well this type of content would OCR. To find out we commissioned Apostolos Antonacopoulos and Stefan Pletschacher based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research) to do a benchmarking exercise from which we could determine whether we could rely on raw OCR outputs, should not OCR this type of material at all, or to test various methods to improve OCR'ability (such as post-processing of particular images).Apostolos and Stefan chose a selection of 20 documents from a larger sample we provided originating from our Mourant and Crick digitised collections. These 20 documents where manually transcribed using the Aletheia groundtruthing tool for comparison to the output of three OCR engines, Abbyy FineReader Engines 9 and 10 and Tesseract, open source OCR software.
The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy Fine Reader Engine 10 (such as this example below):
At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:
What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:
And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:
The average rate for 16 images of good to poorer quality typescript is 83% significant word accuracy (excluding the carbon copies).
Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates covers all the visible text on the page including letterheads, pre-printed text such as contact details, text overwritten by manual annotations and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance in terms of indexing and discoverability. Further tests would be required to determine what the accuracy rate is with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):
There is some variation between FineReader 9 and 10 where one or the other may have a small advantage with a few cases showing as much as a 30% difference. Overall, there is only 1% difference when looking at averages between 9 and 10. Tesseract, on the other hand, was far less accurate especially for the poorer quality typescript (roughly half as accurate overall).
There are a few things we could do to improve accuracy:
- Incorporate medical dictionaries to improve recognition and confidence of scientific terms
- Enhance images to "remove" any annotations prior to OCR'ing
- Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)
These digital collections are not yet available online, but will be accessible from autumn 2012.
Thanks for reporting on this interesting experiment!
ReplyDeleteFor those interested: such testing is also offered as a service through the IMPACT Centre of Competence (http://www.digitisation.eu/), in which PRImA take part as well.
I'd be interested to know why FineReader was chosen over other products, such as omnipage for example, and whether any work was done to compare FineReader to rivals other than Tesseract.
ReplyDeleteHi Christy,
ReplyDeleteThanks for sharing the results of your testing. We are also using FineReader, with accuracy varying with the characteristics of the page as you report.
I'm curious as to how you will handle the handwritten items in your image set. Do you plan to transcribe some or all of them for indexing purposes or is that prospect simply not practical?
Regards,
John
http://collections.nlm.nih.gov
Thank you for all the comments.
ReplyDeleteRobkirk, FineReader was selected as the typical OCR engine used by most suppliers; it and Tesseract are also the two the PRImA team have most experience with. Having the groundtruth means that we could of course compare other OCR technologies to these in future if required.
John, I'm glad you found the report interesting! We are not planning to transcribe the handwritten content - we are looking at hundreds of thousands of handwritten pages, and our budget just does't stretch to that! However we are keeping our eye on crowdsourcing developments, in case this becomes a feasible option in future.
Nice article, thanks for the information.
ReplyDeleteAnna @ rental mobil
Would it be an idea to step right back and compare the time and costings for manual transcription with those of staffing and running an OCR unit? The Thesaurus Linguae Graecae project, that has been creating machine-readable texts since 1972, has regularly re-evaluated OCR software throughout its history. I believe that they still out-source text conversion to double-key data entry companies because it is cheaper, faster, and most important of all, more accurate.
ReplyDeleteHi Dominik,
ReplyDeleteOCR'ing will always cost less than re-keying, especially if you can do it in bulk, and as we are getting better-than expected results, we are likely to dispense with manual correction too (which increases the cost of OCR considerably of course).
Double-rekeying can be more accurate, but I have never seen a scenario where it is cheaper.
I enjoyed reading it. I require to study more on this topic. Thanks for sharing a nice info..Any way I'm going to subscribe for your feed and I hope you post again soon.
ReplyDeleteTony@Jakarta Hotel
Two pass verification, also called double data entry, it is nothing but data entry quality control method that was originally employed when data records were entered onto sequential 80 column Hollerith cards with a keypunch. In the first pass through a set of records, the data keystrokes were entered onto each card as the data entry operator typed them. On the second pass through the batch, an operator at a separate machine, called a verifier, entered the same data. The verifier compared the second operator's keystrokes with the contents of the original card. If there were no differences, a verification notch was punched on the right edge of the card.
ReplyDeleteInformatics Outsourcing is an Offshore Data Management service company. Data Management Service includes all types of Data Entry Services like Text Data Entry, Numerical Data Entry, Double Key Data Entry, Image Data Entry with affordable price. Our team to give the solution quickly and given requirements.
great works. this blog digital library my inspiration. best regards :)
ReplyDeleteRumah Dijual Jakarta
thanks for share... visit me back
ReplyDeletegrosir pakaian dalam murah
jual cincin perak
bibit lengkeng
jual bibit durian
crystal x asli
This excellent article assisted me very much! Saved your site, extremely great categories everywhere that I read here! I really appreciate the info.
ReplyDeleteCara Pemakaian Crystal X yang Benar
Penyakit Kewanitaan yang Sembuh dengan Crystal X
Mengupas Manfaat dan Bahaya Crystal X
Ketahuilah Tingat Keasaman Miss V Mulai Saat Ini
In the first pass through a set of records, the data keystrokes were entered onto each card as the data entry operator typed them
ReplyDeleteCINCIN KAWIN
I never read this type of article before. I appreciate you for the article you have written. Thanks.
ReplyDeleteEcommerce website in Birmingham
Website Development Company in Birmingham
I know where I'm going and l know the truth, and I don't have to be what you want me to be. I'm free to be what I want.Thankyou i really love it.....
ReplyDeleteI find a free online ocr, supports 40+ languages, can convert image to plain txt file and searchable pdf document.
ReplyDeleteWow, incredible blog layout! you make blogging look easy. The overall look
ReplyDeleteof your web site is great.
Muvee Reveal 11 Serial Key + Crack
Disk Drill Pro Serial Number for Mac
Great info! I recently came across your blog and have been reading along. I thought I would leave my first comment. I don’t know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.
ReplyDeleteCrystal X Merapatkan Vagina Kembali Seperti Perawan
Merawat Kesehatan Vagina
Jual Crystal X Obat Keputihan Herbal
Mencegah Miom Dan Kista Crystal X Agar Cepat Hamil
Pencerah Wajah Herbal Collaskin NASA
Thanks for sharing your knowledge to install & crack the Time Tables, but you need to update it now.
ReplyDeleteBecause there is a 2024 version available now.
abbyy-finereader