Friday, October 19, 2012

Digesting Ingest

Harkiran Dhindsa, and Rioghnach Ahern, Digital Ingest Officers working on the Wellcome Digital Library, describe their experiences using Goobi on a daily basis, and some of the lessons learnt as we scaled up into full production over this past summer:

Goobi is a workflow-based management system that allows us to track and manage the workflows for various digitisation projects, whether archives, books, film or audio files. Many steps are fully automated, including the conversion of TIFF to JPEG2000 and the ingest of content into our repository, Safety Deposit Box.

We found Goobi's user interface intuitive, and training in the basic ingest processes was quick. Several members of the team use the system, and with regular use we became familiar with its functionality and were soon working efficiently. METS editing is done through a web form in which JPEG images of individual pages can be viewed, which eliminates the need to keep separate spreadsheets. Because Goobi tracks the workflow by registering each step, different staff can pick up tasks at any open step. If an error is noticed at any point - for example, a missing image in a book - a correction message can be sent back along the workflow to the appropriate person.

Goobi produces METS files, which describe objects including their access and licence status. Although Goobi writes the METS files, the structure of an object is created manually and varies by project. Much of our time is spent on METS editing, particularly adding restrictions to material that contains sensitive data. Goobi can handle a number of projects at the same time, so we can easily switch between working on archives and books, and it can run different tasks simultaneously: an ingest officer can let an image upload run for an archive collection, say, while continuing to edit METS data for books.
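Because METS is standard XML, the files Goobi writes can be inspected programmatically. As a rough sketch (the sample document below is invented for illustration, though the element names follow the METS schema), Python's standard library is enough to list the image files a METS document references:

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/",
           "xlink": "http://www.w3.org/1999/xlink"}

# A minimal, invented METS fragment: a fileSec listing two JPEG2000 page images.
sample = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="MASTER">
      <mets:file ID="FILE0001" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="0001.jp2"/>
      </mets:file>
      <mets:file ID="FILE0002" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="0002.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

def list_files(mets_xml: str) -> list[str]:
    """Return the file references (xlink:href) from a METS fileSec."""
    root = ET.fromstring(mets_xml)
    locs = root.findall(".//mets:fileSec//mets:FLocat", METS_NS)
    return [loc.get("{http://www.w3.org/1999/xlink}href") for loc in locs]

print(list_files(sample))  # → ['0001.jp2', '0002.jp2']
```

A real METS file would also carry the structural and descriptive metadata discussed above; this sketch shows only the file-listing part.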

Lessons learnt

As daily users of Goobi, here are some of the lessons we have learnt:

Prior to import into Goobi, catalogued items are photographed and the digitised images are checked for data sensitivity. In the early stages of the project, areas that could be improved for a more accurate and efficient workflow became obvious. In one of the first archival collections to be digitised, some of the backlog images due to be uploaded into Goobi did not reflect the archive catalogue (CALM), because changes had been made to the catalogue after photography. The lesson learnt from this experience is that photography should only be carried out once cataloguing is truly complete, so that the arrangement of the material is firmly established.

To upload images into Goobi, they are first copied from a working network directory to a temporary drive created by Goobi for the user who has accepted the upload task. This process can be terminated by other activities if the network to the local PC is running at full capacity. When this happens we have to redo the transfer, taking extra time to complete the task. Thankfully, this image upload task will be automated in the future, bypassing the local PCs completely. However, the running of several tasks simultaneously will still be limited by server capacity when uploading large files.

After METS editing was completed on one of the archive collections, we were given further sensitivity data. To add these new sensitivity restrictions, we had to "roll back" processes that had already been ingested, thereby re-running part of the workflow. Because the roll-back process is less intuitive and not intended for regular use, it is very easy to prompt a second ingest into the digital asset management system along the way, resulting in duplicated sets of files. Again, we learnt an important lesson: it will always be necessary to edit METS files. Changes to the workflow steps in Goobi to make this more straightforward would be useful, but it would be even better to finalise sensitivity lists before METS editing is completed, to minimise duplication of effort.

A workflow system such as Goobi becomes imperative when ingesting mass collections of archives and books. Images that have gone through the complete ingestion process in Goobi will be accessed online via the Player. Seeing the images in an attractive interface is a satisfying part of this work, as this is where all of the different tasks come to fruition: the digitised archives and books, soon to be available for the public to view in a user-friendly form.

Authors: Harkiran Dhindsa and Rioghnach Ahern

Monday, August 13, 2012

Half a million milestone

We have recently completed processing half a million images in our workflow system (Goobi) ahead of making our new website available to the public later on this year. These images are the product of on-site digitisation of our archives and genetics book collections for the WDL pilot programme.

In previous posts on this blog we have talked about our storage system, server environment, our digital asset management system, and our image "player," but central to all of this is our workflow system, an enhanced and customised version of the open-source software Goobi, developed originally at the University of Göttingen and supported by Intranda.

Goobi is an extremely flexible database system that allows us to create and modify workflows (series of tasks both manual and automatic) for specific projects. These tasks can be recorded in Goobi (such as "Image capture"), done by Goobi (such as "Image conversion JPEG" for viewing images in the Goobi system itself), or can be initiated by Goobi (such as "SDB ingest" which triggers our digital asset management system - SDB - to ingest content).

We are currently working on ingesting images for three different workflows, described and simplified below.

Archive backlog
This workflow allows us to ingest all the images we created from archives between December 2009, when the project began, and May 2012 - around 320,000 of them - from eight collections. So far, we have finished processing around 250,000 images from this set.

The steps for each item to be ingested are as follows (automatic steps are in italics):

  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi
  3. Import JPEG2000 images one folder at a time from temporary storage to Goobi-managed directories
  4. Goobi converts JPEGs for viewing in the Goobi interface
  5. Check that images are correctly associated with the metadata
  6. Add access control "code" to items with sensitive material (restricted or closed)
  7. Trigger SDB ingest, passing along key information to enable this
  8. Import administrative/technical metadata from SDB after ingest
  9. Export METS files to be used by the "player"
Archives digitisation
This workflow deals with the current digitisation by tracking and supporting activities from the very beginning of the digitisation process. So far, we have imported very few images for this project, having just started using this workflow in earnest. The steps include:
  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi
  3. Group metadata into "batches" in Goobi for each archive box (usually 5 - 10 folders or "items")
  4. Track the preparation status at the box level (in process/completed) and record information for the next stage (Image capture)
  5. Track photography status at the box level and record information for the next stage (QA)
  6. Track QA at the box level and return any items to photography if re-work is required
  7. Import TIFF images via Lightroom (which converts RAW files and exports TIFF files directly into Goobi)
  8. Convert TIFFs to JPEGs
  9. Check that images are correctly associated with the metadata 
  10. Add access control "code" to items with sensitive material (restricted or closed)
  11. Convert TIFFs to JPEG2000
  12. See steps 7-9 of the Archive backlog workflow above (SDB ingest, import of administrative/technical metadata, and METS export)
Genetics books
The other half of our ingest effort has focused on the Genetics books. We have imported over 250,000 images from this collection into Goobi since digitisation began in February of this year. This workflow is very similar to Archives digitisation, containing steps related to the entire end-to-end process. The main differences are that, as the digitisation is being done by an on-site contractor, images are delivered to us as TIFFs; and while there are no sensitivity issues, there is metadata editing to add structure to aid navigation, and a range of "conditions of use" codes depending on the restrictions copyright holders ask us to make.

  1. Export MARC XML as a batch from the Library catalogue (whole collection)
  2. Import metadata as a batch into Goobi
  3. Track preparation status at the book level and record information for next stage (Image capture)
  4. Track image capture status at the book level
  5. Track QA status (QA is done on the TIFFs supplied by the contractor)
  6. Import TIFF images one folder at a time from temporary storage to Goobi-managed directories
  7. Convert TIFFs to JPEGs
  8. Check that images are correctly associated with the metadata 
  9. Associate images with structural metadata (covers, titlepage, table of contents) thereby enabling navigation to these elements in the "player"
  10. Add page numbering 
  11. Add licence code to books that have use restrictions (such as no full download allowed) as per requests by copyright holders
  12. See steps 7-9 of the Archive backlog workflow above
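Steps such as "Convert TIFFs to JPEGs" above are simple derivative-generation tasks. As a hedged sketch of what such a conversion involves (using the Pillow library as a stand-in; Goobi's actual converter will differ, and the filenames here are invented):

```python
from io import BytesIO
from PIL import Image

def tiff_to_jpeg(tiff_bytes: bytes, quality: int = 85) -> bytes:
    """Convert a TIFF image to a JPEG derivative for on-screen viewing."""
    img = Image.open(BytesIO(tiff_bytes))
    if img.mode != "RGB":           # JPEG cannot store e.g. 16-bit or alpha data
        img = img.convert("RGB")
    out = BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()

# Round-trip demonstration with a tiny generated TIFF:
src = BytesIO()
Image.new("RGB", (32, 32), "white").save(src, format="TIFF")
jpeg = tiff_to_jpeg(src.getvalue())
print(Image.open(BytesIO(jpeg)).format)  # → JPEG
```

The JPEG2000 conversion in step 11 is the same idea with a different (and much slower) encoder; in production it is the archival master, so it is generated once and ingested rather than created on demand.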
We haven't got it all figured out yet
Other workflows we have not yet put into production include born-digital materials, A/V materials, items with multiple copies/volumes or "parts" (such as a video and its transcript), and manuscripts. We are also looking at implementing new or different functionality in Goobi in the near future, including JPEG2000 validation using the Jpylyzer tool, automated import of images, and configuration of existing Goobi functionality to support OCR and METS-ALTO files, to name a few. These changes are aimed at minimising manual interaction with the material, saving time and improving accuracy.

Wednesday, July 4, 2012

OCRing typescript: A benchmarking test with PRImA

During our pilot phase, we plan to add over 1 million images of archives to the Wellcome Library website, all created during the 20th century. Much of this is comprised of typescript: letters, drafts of papers and articles, research notes, memos, and so on. Key to improving discoverability of these papers is OCR (optical character recognition), which allows us to encode the words as text and include them in a full-text index. We set out to test whether typescript material would provide accurate OCR results that could be included in our index.

OCR technology

OCR works by segmenting a block of text to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source information having clear-cut, and common, letter forms. The accuracy of word recognition relies on both character recognition, and the availability of comprehensive dictionaries for comparison. OCR software can compare these words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).

When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention, with rates of 99% or higher for modern printed content (fewer than one word in a hundred having at least one inaccurate character), falling to around 95% for material printed between 1900 and 1950, and lower for 19th-century material (see Tanner, Munoz and Ros, 2009). For some formats OCR is worse still: as the study mentioned above shows, 19th-century newspapers may only reach 70% significant word accuracy (counting words other than "stop" words such as definite/indefinite articles and other non-search terms).
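Significant word accuracy of the kind quoted above can be measured by comparing OCR output against a manual transcription, ignoring stop words. A minimal sketch (the stop-word list and sample text below are invented for illustration, and real benchmarking tools are considerably more sophisticated):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def significant_word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Fraction of significant ground-truth words reproduced exactly by the OCR."""
    truth = [w for w in ground_truth.lower().split() if w not in STOP_WORDS]
    ocr = [w for w in ocr_output.lower().split() if w not in STOP_WORDS]
    if not truth:
        return 1.0
    # Count each truth word as correct if the OCR produced it (multiset match).
    ocr_counts = Counter(ocr)
    correct = 0
    for w in truth:
        if ocr_counts[w] > 0:
            ocr_counts[w] -= 1
            correct += 1
    return correct / len(truth)

truth = "the structure of deoxyribose nucleic acid"
ocr = "the structurc of deoxyribose nucleic acid"
print(round(significant_word_accuracy(truth, ocr), 2))  # → 0.75
```

Note that a single wrong character ("structurc") costs a whole word here, which is why character accuracy figures are always higher than word accuracy figures for the same page.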

Typescript testing

Our archival collections contain a wide range of content that is theoretically OCR'able. Some will OCR very well, such as professionally printed matter, but much of the non-handwritten content is, given the age of the material, typescript. We had no idea how well this type of content would OCR. To find out, we commissioned Apostolos Antonacopoulos and Stefan Pletschacher, based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research), to carry out a benchmarking exercise from which we could determine whether to rely on raw OCR outputs, to skip OCR'ing this type of material altogether, or to test various methods of improving OCR'ability (such as post-processing of particular images).

Apostolos and Stefan chose a selection of 20 documents from a larger sample we provided, originating from our digitised Mourant and Crick collections. These 20 documents were manually transcribed using the Aletheia ground-truthing tool for comparison with the output of three OCR engines: Abbyy FineReader Engine versions 9 and 10, and the open-source engine Tesseract.

The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy Fine Reader Engine 10 (such as this example below):

At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:

What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:

And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:

The average rate for 16 images of good to poorer quality typescript is 83% significant word accuracy (excluding the carbon copies).

Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates cover all the visible text on the page, including letterheads, pre-printed text such as contact details, text overwritten by manual annotations, and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance for indexing and discoverability. Further tests would be required to determine the accuracy rate with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):

There is some variation between FineReader 9 and 10 where one or the other may have a small advantage with a few cases showing as much as a 30% difference. Overall, there is only 1% difference when looking at averages between 9 and 10. Tesseract, on the other hand, was far less accurate especially for the poorer quality typescript (roughly half as accurate overall).

There are a few things we could do to improve accuracy: 
  1. Incorporate medical dictionaries to improve recognition and confidence of scientific terms
  2. Enhance images  to "remove" any annotations prior to OCR'ing
  3. Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)
We may find that an average of 83% word accuracy overall is perfectly adequate for our needs in terms of indexing terms and allowing people to discover content efficiently. Further investigation is required, but this report has given us a good foundation from which to press on with our OCR'ing plans.

These digital collections are not yet available online, but will be accessible from autumn 2012.

Tuesday, May 15, 2012

Serving servers: a technical infrastructure plan

As we aim to provide a fast, efficient and robust technical architecture for the Wellcome Digital Library, the Wellcome Trust IT department has been working closely with our software suppliers to specify a suitable server architecture. This work is still in progress, but we now have the skeleton idea of how many servers we are likely to need and for what purposes. The scale of the architecture requirements shows that setting up and delivering digital content is a significant undertaking.

In order to serve up millions of images, plus thousands of A/V files, born-digital content and the web applications that make them accessible, we believe we'll need around 17 (virtual) servers for the production environment (the "live" services), and a further 10 servers for our staging and development environments. In the production environment nearly every server is duplicated to ensure redundancy and a smooth delivery service, which is why the numbers are so high. The content management system coupled with its SQL database requires four servers, for example, while the image delivery environment needs six: for data delivery, for on-the-fly image conversion and tile creation, and for media proxy servers that create digital content URLs, divorcing the user-request mechanism from the actual content held on our servers for security reasons.

Most of the servers run Windows Server 2008, although our image server (IIPImage) will run on Ubuntu Linux. The virtual servers share CPUs, but there are enough available that each server gets the equivalent of either 2 or 4 CPUs, for a total requirement of 48 CPUs (288 cores, as each CPU has 6 cores). RAM varies from 2GB to 8GB depending on the anticipated usage of the application on each server; the total RAM requirement for the production architecture is estimated at 124GB. These specifications are currently our best guess, and will be tested in the weeks to come as we start to deploy the hardware.

The staging environment allows system upgrades, patches or new development work to be applied and tested  separately from the live production environment. This means that any changes can be tested thoroughly before changes are made publicly visible and/or usable. Actual development work is carried out in the development environment, before deployment for final testing on the staging servers. This means that applications such as the web content management system and the delivery system applications must be replicated in these two additional environments, along with their server requirements.

With thanks to David Martin, IT Project Manager, as the source of my information.

Friday, May 11, 2012

Developing a player for the Wellcome Digital Library

Previous posts here have covered the digitisation of books and archives and the storage of the resulting files (mostly JPEG2000 images, but some video and audio too). Now it’s time to figure out how visitors to the Wellcome Library site actually view these materials via a web browser.

The digitisation workflow ends with various files being saved to different Library back-end systems:

  • The METS file is a single XML document that describes the structure of the book or archive, providing metadata such as title and access conditions. 
  • Each page of the book (or image of an archive) is stored as a JPEG2000 file in the Library’s asset management system, Safety Deposit Box (SDB). Each image file in SDB has a unique filename (in fact a GUID), and this is referenced in the METS file. So given the METS file and access to the asset management system, we could retrieve the correct JPEG 2000 images in the correct order. 
  • Additional files might be created, such as METS-ALTO files containing information about the positions of individual words on a digitised page; we’ll want to use this information to highlight search results within the text. 
So how do we use these files to allow a site visitor to read a book?

Rendering JPEG 2000 files

Our first problem is that we can’t just serve up a JPEG2000 image to a web browser – the format is not supported. And even if it was, the archival JPEG2000 files are large: several megabytes each. The solution to this problem is familiar from services like Google Maps – we break the raw image up into web-friendly tiles and use them at different resolutions (zoom levels). When you use Google Maps, you can keep dragging the map around to explore pretty much anywhere on Earth – but your browser didn’t load one single enormous map of the world. Instead, the map is delivered to you as 256x256 pixel image files called tiles, and your browser only makes requests for those tiles that are needed to show the area of the map visible in your browser’s viewport. Each tile is quite small and hence very quick to download – here’s a Google map tile that shows the Wellcome Library:

Google Maps is a complex JavaScript application that causes your browser to load the right tiles at the right time (and in the right place). This keeps the user experience slick. We need that kind of user experience to view the pages of books.

There are several JavaScript libraries available that solve the difficult problem of handling the viewport and generating the correct tile requests in response to user pan and zoom activity. We’ve settled on Seadragon, because we really like the way it zooms smoothly (via alpha blending as you move from one zoom level’s tiles to another). A very nice existing example of this is at the Cambridge Digital Library’s Newton Papers project:

This site uses a viewer built around Seadragon; an individual tile looks like this:

The numbers on the end indicate that this JPEG tile is for zoom level 11, column 3, row 2. As you explore the image, your browser makes dozens, even hundreds, of individual tile requests like this. It feels fast because each individual tile is tiny and downloads in no time.
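The level/column/row numbering follows the Deep Zoom tiling scheme: each level halves the previous one's dimensions, down to a single pixel at level 0, and each level is cut into fixed-size tiles. A sketch of the arithmetic (the 3000 x 4000 px page is a made-up example):

```python
import math

TILE_SIZE = 256

def deepzoom_levels(width: int, height: int) -> int:
    """Number of pyramid levels: level 0 is a single pixel, the top is full size."""
    return math.ceil(math.log2(max(width, height))) + 1

def tiles_at_level(width: int, height: int, level: int) -> tuple[int, int]:
    """(columns, rows) of tiles at a given zoom level."""
    max_level = deepzoom_levels(width, height) - 1
    scale = 2 ** (level - max_level)          # <= 1: lower levels are downscaled
    w = math.ceil(width * scale)
    h = math.ceil(height * scale)
    return math.ceil(w / TILE_SIZE), math.ceil(h / TILE_SIZE)

# A hypothetical 3000 x 4000 px page scan:
print(deepzoom_levels(3000, 4000))      # → 13
print(tiles_at_level(3000, 4000, 12))   # → (12, 16)
```

At the top level the example page needs 12 x 16 = 192 tiles, but the browser only ever fetches the handful currently visible in the viewport, which is what keeps the experience fast.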

For more about tiled zoomable images, these blog posts are an excellent introduction:

So how do we get from a single JPEG2000 image to hundreds (or even thousands) of JPG tiles? It’s possible to prepare your image tiles in advance, so that you process the source image once and store folders of prepared tiles on your web server. For small collections of images this is a simple way to go and doesn’t require anything special on the server. But for the Library, it’s not practical – we don’t want to prepare tiles as part of the digitisation workflow. They are not “archival”, and they take up a lot of extra storage space. We need something that can generate tiles on the fly from the source image, given the tile requests coming from the browser.

For this we need an Image Server, and we've chosen IIPImage for its performance and native Seadragon (Deep Zoom) support. The Image Server generates browser-friendly JPEG images from regions of the source image at particular zoom levels. When your browser makes a request to the image server for a particular tile, the image server extracts the required region from the source JPEG2000 file and serves it up to you as an ordinary JPEG.

Viewer or Player? Or Reader? 

The next piece of the puzzle is the browser application that makes the requests to the server. A book or archive is a sequence of images along with a lot of other metadata. And it’s not just books – the Library also has video and audio content. All of these are described in detail by METS files produced during the digitisation/ingest workflow. In the world of tile-based imaging, the term “viewer” is often used to describe the browser component of the system, but we seem to have fallen naturally to using the term “Player” to describe it – it plays books, videos and audio, so “Player” it is. Our player needs to be given quite a lot of data to know what to play.

We could just expose the METS file directly, but it is large and complex and much of it is not required in the Player. So we’re developing an intermediate data format, which effectively acts as the public API of the Library. Given a Library catalogue number, the player requests a chunk of data from the server; this tells it everything it needs to know to play the work, in a much simpler format than the METS file. In the future other systems could make use of this API (at the moment it’s exposed as JSON).
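As an illustration of the kind of simplified chunk of data the Player might receive (the field names below are invented for this sketch and are not the actual API), a much-reduced view of a METS file could look like this:

```python
import json

# A hypothetical, much-reduced view of a METS file: just what the Player
# needs to page through a work. All field names are invented for illustration.
manifest = {
    "catalogueNumber": "b123456",
    "title": "An example digitised book",
    "accessCondition": "registration-required",
    "images": [
        {"index": 1, "label": "front cover", "assetId": "guid-0001"},
        {"index": 2, "label": "title page", "assetId": "guid-0002"},
    ],
}

# The server exposes this as JSON; the Player fetches it by catalogue number.
payload = json.dumps(manifest)
loaded = json.loads(payload)
print(loaded["images"][0]["label"])  # → front cover
```

The point of the intermediate format is exactly this flattening: the Player (or any future client of the API) gets the page sequence, labels and access conditions without having to understand METS itself.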

The user experience 

The user won’t just be viewing a sequence of images, like a slide show. It should be a pleasant experience to read a book from cover to cover. Many users will be using a tablet, reading pages in portrait aspect ratio. We aim to make this a good e-reading experience too, augmented by search and navigation tools.

The user experience might start with a search result from the Library's main search tool. For books that have been digitised, the results page will provide an additional link directly to the player "playing" the digitised book. The URL of the book is an important part of the user experience, and we want to keep it simple: in future, a short URL containing the catalogue reference number (b123456, say) would take you straight to the player.

We want to be able to link directly to a particular page of a particular book, just as a printed citation could. This deeper URL would be /player/b123456#/35. But we can do better than that; our URL structure should extend to describe the precise region of a page, so that one reader could line up a particular section of text on a page, or a picture, and send the URL to another reader; the second reader would see the work open at the same page, and zoomed in on the same detail.

Access Control 

Much of the material being made available is still subject to copyright. Those works that are cleared for online publication by the Trust’s copyright clearance strategy still need some degree of access control applied to them; typically the user will be required to register before viewing them. This represents a significant architectural challenge, because we need to enforce access restrictions down to the level of individual tile requests. We don’t want anyone “scraping” protected content by making requests for the tiles directly, bypassing the player.

Performance and Scale 

As well as the technical challenges involved in building the Player, we need to ensure that content is served to the player quickly. Ultimately the system will need to scale to serve millions of different book pages. Between the player and the back end files is a significant middle tier: the Digital Delivery System, of which the Player is a client. This layer is the Library’s API for Digital Delivery. The browser-based player interacts with it to retrieve data to display a book, highlight search results, generate navigation and so on. The Image Server is a key component of this system.

This post was written by Tom Crane, Lead Developer at Digirati, working with his colleagues on developing digital library solutions for the Wellcome Digital Library.

Monday, April 30, 2012

Will more data lead to different histories being told?

New technology is making information more widely available and, when it launches later this year, the WDL will make it easier to access historical evidence about the foundations of modern genetics.  Will this democratize our understanding of the history of genetics and lead to different versions of the history being told?

There is an African proverb which says that history is written by the hunter, not the lion. History inevitably simplifies the past, and the selection process can be subjective. When it launches, the WDL will start to put 21 archive collections and around 2,000 books online. The project's aim is to digitise as much as we can rather than cherry-pick the highlights. This means that the building blocks used by historians to piece together the past will be made freely available to a wider audience. A lot of this material may seem like mundane, workaday stuff, and users will have to wade through a lot of it to reach the bits they are interested in, but this is probably a more accurate reflection of the scientific research process.

Flashes of genius are essential but they do not happen in isolation. Thomas Edison’s phrase about invention being 1% inspiration and 99% perspiration applies to scientific research too. The discovery process needs both.

Watson and Crick were extremely clever to work out the helical structure of DNA, but they did not get there simply because they were lone geniuses. Before they made their discovery, a lot of people had spent years experimenting, writing and thinking about DNA. There had even been flashes of insight which ended up being wrong. I recently read a letter from Gerald Oster sent to Aaron Klug after Rosalind Franklin's death, in which he recalled his time working in London. He reflects that even though he had much of the relevant information by early 1950, he lacked the insight to work out the structure of DNA. This letter (FRKN/06/07/001-2) is held by the Churchill Archives Centre in Cambridge and a digitised version will become part of the WDL.

I am rather hoping that the WDL might help us to recognise that while flashes of inspiration are part of scientific discovery they are only possible because a team of other people paved the way.

Thursday, April 19, 2012

Clearing copyright for books: preliminary ARROW results

As part of the genetics books project, we are tackling issues of copyright clearance and due diligence head on. Up to 90% of this collection is in copyright, or is likely to be, so developing a copyright clearance strategy was one of our earliest considerations. This turned into a useful project to test-run the EC-funded ARROW system on a large scale. ARROW provides a workflow for libraries and other content repositories to determine whether books are in commerce and in copyright, and whether the copyright holders can be identified and traced. The system has undergone small-scale tests throughout Europe, including the UK (using collections and metadata from the British Library), but in order to determine whether ARROW is feasible on a large scale, a realistically sized project was needed.

The Wellcome's genetics books project provided this opportunity, and the challenge was taken up by the ALCS and the PLS jointly, as announced previously on our Library Blog. Results from ARROW, combined with the responses from contacted rights holders, determine whether the Wellcome Library will publish a work online.

The collection of (roughly) 1,700 potentially in-copyright books is not enormous, but it is diverse, and has already thrown up some interesting wrinkles in the copyright clearance workflow.

For example, according to the AACR2 standard used to catalogue these books, only up to three authors are included in the metadata record (followed by et al.). Works with more than three authors, and collected works such as conference proceedings, had to be consulted manually in order to identify all the named contributors. This inflated the known number of contributors to nearly 7,000 (an average of 4 authors per book).

Embedded below is a presentation I gave at the London Book Fair earlier this week, which provides an overview of the process, and preliminary statistics from the first 500 books to complete the ARROW workflow.