Monday, August 13, 2012

Half a million milestone

We have recently completed processing half a million images in our workflow system (Goobi) ahead of making our new website available to the public later on this year. These images are the product of on-site digitisation of our archives and genetics book collections for the WDL pilot programme.

In previous posts on this blog we have talked about our storage system, server environment, our digital asset management system, and our image "player," but central to all of this is our workflow system, an enhanced and customised version of the open-source software Goobi, originally developed at the University of Göttingen and supported by Intranda.

Goobi is an extremely flexible, database-driven system that allows us to create and modify workflows (series of tasks, both manual and automatic) for specific projects. These tasks can be recorded in Goobi (such as "Image capture"), done by Goobi (such as "Image conversion JPEG" for viewing images in the Goobi system itself), or initiated by Goobi (such as "SDB ingest", which triggers our digital asset management system - SDB - to ingest content).
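
To make that task model concrete, here is a minimal sketch (in Python, and emphatically not Goobi's real configuration format or API) of a workflow as an ordered list of steps that are either recorded by an operator, performed automatically, or used to trigger an external system; the task names are simply the examples mentioned above:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class TaskKind(Enum):
    RECORDED = auto()   # carried out by a person and ticked off in the workflow system
    AUTOMATIC = auto()  # carried out by the workflow system itself
    TRIGGERED = auto()  # initiated by the workflow system, carried out elsewhere (e.g. SDB)


@dataclass
class Task:
    name: str
    kind: TaskKind
    action: Optional[Callable[[], None]] = None  # only meaningful for AUTOMATIC/TRIGGERED tasks


# A much-simplified workflow using the example tasks mentioned above
workflow = [
    Task("Image capture", TaskKind.RECORDED),
    Task("Image conversion JPEG", TaskKind.AUTOMATIC, action=lambda: print("converting images...")),
    Task("SDB ingest", TaskKind.TRIGGERED, action=lambda: print("asking SDB to ingest...")),
]

for task in workflow:
    if task.kind is TaskKind.RECORDED:
        print(f"{task.name}: waiting for an operator to mark this step complete")
    else:
        task.action()
```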

We are currently working on ingesting images for three different workflows, described and simplified below.

Archive backlog
This workflow allows us to ingest all the images we created from archives between the start of the project in December 2009 and May 2012 - around 320,000 of them, from eight collections. So far, we have finished processing around 250,000 images from this set.

The steps for each item to be ingested are as follows (automatic steps are in italics):

  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi (a sketch of steps 1-2 follows the list)
  3. Import JPEG2000 images one folder at a time from temporary storage to Goobi-managed directories
  4. Goobi converts the JPEG2000s to JPEGs for viewing in the Goobi interface
  5. Check that images are correctly associated with the metadata
  6. Add access control "code" to items with sensitive material (restricted or closed)
  7. Trigger SDB ingest, passing along key information to enable this
  8. Import administrative/technical metadata from SDB after ingest
  9. Export METS files to be used by the "player"
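
As a rough illustration of steps 1-2, the sketch below uses the pymarc library to read a batch of MARC XML exported from a catalogue and pull out a couple of fields; the filename and the choice of fields are assumptions for the example, not our actual mapping into Goobi:

```python
# Requires pymarc (pip install pymarc)
from pymarc import parse_xml_to_array

# A batch MARC XML export for one collection (hypothetical filename)
records = parse_xml_to_array("archive_collection_export.xml")

for record in records:
    control_number = record["001"].value() if record["001"] else "unknown"
    title_field = record["245"]
    title = title_field["a"] if title_field else "untitled"
    print(control_number, title)
```
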
Archives digitisation
This workflow deals with current digitisation, tracking and supporting activities from the very beginning of the process. So far, we have imported very few images for this project, having only just started using this workflow in earnest. The steps include:
  1. Export MARC XML as a batch from the Library catalogue (per collection)
  2. Import metadata as a batch into Goobi
  3. Group metadata into "batches" in Goobi for each archive box (usually 5-10 folders or "items")
  4. Track the preparation status at the box level (in process/completed) and record information for the next stage (Image capture)
  5. Track photography status at the box level and record information for the next stage (QA)
  6. Track QA at the box level and return any items to photography if re-work is required
  7. Import TIFF images via Lightroom (which converts RAW files and exports TIFF files directly into Goobi)
  8. Convert TIFFs to JPEGs (a sketch of this step follows the list)
  9. Check that images are correctly associated with the metadata 
  10. Add access control "code" to items with sensitive material (restricted or closed)
  11. Convert TIFFs to JPEG2000
  12. As steps 7-9 of the Archive backlog workflow above (SDB ingest, metadata import from SDB and METS export)
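
Step 8 is essentially an image conversion pass over a Goobi-managed directory. A minimal sketch using the Pillow library is shown below; the directory names and JPEG quality are assumptions for the example, not our production settings:

```python
# Requires Pillow (pip install Pillow)
from pathlib import Path
from PIL import Image

source_dir = Path("goobi/master_tiffs")          # hypothetical Goobi-managed master directory
derivative_dir = Path("goobi/jpeg_derivatives")  # hypothetical derivatives directory
derivative_dir.mkdir(parents=True, exist_ok=True)

for tiff_path in sorted(source_dir.glob("*.tif")):
    with Image.open(tiff_path) as img:
        # Flatten to RGB and write a viewing-quality JPEG alongside the master
        img.convert("RGB").save(derivative_dir / (tiff_path.stem + ".jpg"), quality=85)
```
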
Genetics books
The other half of our ingest effort has focused on the Genetics books. We have imported into Goobi over 250,000 images from this collection since digitisation began in February of this year. This workflow is very similar to Archives digitisation, containing steps covering the entire end-to-end process. The main differences are that, as the digitisation is being done by an on-site contractor, images are delivered to us as TIFFs; and that while there are no sensitivity issues, there is metadata editing to add structure to aid navigation, and a range of "conditions of use" codes depending on the restrictions copyright holders ask us to apply.

  1. Export MARC XML as a batch from the Library catalogue (whole collection)
  2. Import metadata as a batch into Goobi
  3. Track preparation status at the book level and record information for next stage (Image capture)
  4. Track image capture status at the book level
  5. Track QA status (QA is done on the TIFFs supplied by the contractor)
  6. Import TIFF images one folder at a time from temporary storage to Goobi-managed directories
  7. Convert TIFFs to JPEGs
  8. Check that images are correctly associated with the metadata 
  9. Associate images with structural metadata (covers, title page, table of contents), thereby enabling navigation to these elements in the "player" (a sketch of this step follows the list)
  10. Add page numbering 
  11. Add licence code to books that have use restrictions (such as no full download allowed), as requested by copyright holders
  12. As steps 7-9 of the Archive backlog workflow above
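
Step 9 records logical structure (cover, title page, table of contents) that the "player" can use as navigation targets; in METS this kind of structure is typically expressed as a structural map. The fragment below is a deliberately reduced sketch of that idea, built with Python's standard library, and is not the full METS document we actually export:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# A logical structMap for one (hypothetical) book
struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "LOGICAL"})
book = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "Monograph", "LABEL": "Genetics book"})

# Divisions the "player" can offer as navigation targets
divisions = [("Cover", "Front cover"), ("TitlePage", "Title page"), ("TableOfContents", "Table of contents")]
for order, (div_type, label) in enumerate(divisions, start=1):
    ET.SubElement(book, f"{{{METS_NS}}}div", {"TYPE": div_type, "LABEL": label, "ORDER": str(order)})

print(ET.tostring(struct_map, encoding="unicode"))
```
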
We haven't got it all figured out yet
Other workflows we have not yet put into production include born-digital materials, A/V materials, items with multiple copies/volumes or "parts" (such as a video and its transcript), and manuscripts. We are also looking at implementing new or different functionality in Goobi in the near future, including JPEG2000 validation using the jpylyzer tool (sketched below), automated import of images, and configuring existing functionality in Goobi to support OCR and METS-ALTO files, to name a few. These changes are aimed at minimising manual interaction with the material, to save time and improve accuracy.
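
As a rough sketch of what the JPEG2000 validation step might look like, the snippet below calls the jpylyzer command-line tool on a single file and reads the isValid element from its XML report; the exact report layout varies between jpylyzer versions, and the filename is hypothetical:

```python
import subprocess
import xml.etree.ElementTree as ET


def jp2_is_valid(path: str) -> bool:
    # jpylyzer writes an XML report for the file to stdout
    report = subprocess.run(["jpylyzer", path], capture_output=True, text=True, check=True)
    root = ET.fromstring(report.stdout)
    # Find the isValid element regardless of which namespace this jpylyzer version uses
    for element in root.iter():
        if element.tag.endswith("isValid"):
            return (element.text or "").strip() == "True"
    return False


print(jp2_is_valid("b12345678_0001.jp2"))  # hypothetical filename
```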