Wellcome Digital Library

Digesting Ingest

2012-10-19T15:27:00.000+01:00

Harkiran Dhindsa and Rioghnach Ahern, Digital Ingest Officers working on the Wellcome Digital Library, describe their experiences using Goobi on a daily basis, and some of the lessons learnt as we scaled up into full production over this past summer:

Goobi is a workflow-based management system that allows us to track and manage the workflows for various digitisation projects, be that archives, books, film or audio files. Many steps are fully automated, including the conversion of TIFF to JPEG2000 and the ingest of content into our repository Safety Deposit Box.

We found the user interface of Goobi to be intuitive, and training in basic ingest processes was quick. A number of the team are using this system. With regular usage, we became familiar with the functionality and were soon working efficiently. METS editing is facilitated through a web form which allows JPEG images of individual pages to be viewed. Using such a system eliminates the need to keep separate spreadsheets. Because Goobi tracks the workflow by registering each step, different staff can continue with tasks at any open step. If an error is noticed at any point - for example a missing image in a book - a correction message can be sent back along the workflow to the appropriate person.

Goobi produces METS files, which describe objects including their access and license status. Although Goobi writes the METS files, the structure of an object is created manually, depending on the project. Much of our time is spent working on METS editing, particularly in adding restrictions for material which contains sensitive data. Goobi can handle a number of projects at the same time, so we can easily switch between working on archives and books. It can handle different tasks simultaneously. For example, an ingest officer can let an image upload task run for, let’s say, an archive collection, while continuing to edit METS data on books.

Lessons learnt

As daily users of Goobi, here are some of the lessons we have learnt:

Prior to import into Goobi, catalogued items are photographed and then the digitised images are checked for data sensitivity. In the early stages of the project, areas that could be improved for a more accurate and efficient workflow became obvious. In one of the first archival collections to be digitised, some of the backlog images that were due to be uploaded into Goobi did not reflect the archive catalogue (CALM). This was because changes had been made to the catalogue after photography. The lesson learnt from this experience is that photography should only be carried out after cataloguing has been truly completed, so that the arrangement of material is firmly established.

To upload images into Goobi, they are first copied from a working network directory to a temporary drive created by Goobi for the user who has accepted the upload task. This process can be terminated by other activities if the network to the local PC is running at full capacity. When this happens we have to redo the transfer, taking extra time to complete the task. Thankfully, this image upload task will be automated in the future, bypassing the local PCs completely. However, the running of several tasks simultaneously will still be limited by server capacity when uploading large files.

After METS editing was completed on one of the archive collections, we were given further sensitivity data. To add these new sensitivity restrictions, we had to “roll back” processes that had already been ingested, thereby re-running part of the workflow. It is very easy to prompt a second ingest into the digital asset management system in the process, resulting in duplicated sets of files, as the roll-back process is less intuitive and not intended for regular use. Again, we have learned an important lesson. It will always be necessary to edit METS files. Changes to the workflow steps in Goobi to make this more straightforward would be useful, but it would be even better to finalise sensitivity lists before METS editing is completed in order to minimise duplication of effort.

A workflow system such as Goobi becomes imperative when ingesting mass collections of archives and books. Images that have gone through the complete ingestion process in Goobi will be accessed online via the Player. Seeing the images in an attractive interface is a satisfying part of this work, as this is where all of the different tasks come to fruition: the digitised archives and books presented in a user-friendly form, soon to be publicly available!

Authors: Harkiran Dhindsa and Rioghnach Ahern

Half a million milestone

2012-08-13T16:29:00.000+01:00

We have recently completed processing half a million images in our workflow system (Goobi) ahead of making our new website available to the public later on this year. These images are the product of on-site digitisation of our archives and genetics book collections for the WDL pilot programme.

In previous posts on this blog we have talked about our storage system, server environment, our digital asset management system, and our image "player," but central to all of this is our workflow system, an enhanced and customised version of the open-source software Goobi developed originally at the University of Göttingen, and supported by Intranda.

Goobi is an extremely flexible database system that allows us to create and modify workflows (series of tasks both manual and automatic) for specific projects. These tasks can be recorded in Goobi (such as "Image capture"), done by Goobi (such as "Image conversion JPEG" for viewing images in the Goobi system itself), or can be initiated by Goobi (such as "SDB ingest" which triggers our digital asset management system - SDB - to ingest content).
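To make the three kinds of task a little more concrete, here is a minimal sketch in Python. It is not Goobi's data model or API: the class names (`StepKind`, `Step`, `Workflow`) are invented for illustration, and only the three step names are taken from the examples above.

```python
from dataclasses import dataclass
from enum import Enum


class StepKind(Enum):
    """The three kinds of task described above (names are illustrative)."""
    RECORDED = "recorded in Goobi"     # manual work logged by staff
    PERFORMED = "done by Goobi"        # automatic work inside Goobi
    TRIGGERED = "initiated by Goobi"   # work handed to another system


@dataclass
class Step:
    name: str
    kind: StepKind
    done: bool = False


class Workflow:
    """An ordered series of steps; only the first unfinished step is open."""

    def __init__(self, steps):
        self.steps = list(steps)

    def open_step(self):
        """Return the first step that is not yet done, or None if all are done."""
        return next((step for step in self.steps if not step.done), None)

    def complete(self, name):
        """Mark the currently open step as done, refusing out-of-order completion."""
        step = self.open_step()
        if step is None or step.name != name:
            raise ValueError(f"{name!r} is not the current open step")
        step.done = True


workflow = Workflow([
    Step("Image capture", StepKind.RECORDED),
    Step("Image conversion JPEG", StepKind.PERFORMED),
    Step("SDB ingest", StepKind.TRIGGERED),
])
workflow.complete("Image capture")
print(workflow.open_step().name)  # prints: Image conversion JPEG
```

Keeping the steps in a single ordered list is what lets any member of staff see which step is currently open for an item.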

We are currently working on ingesting images for three different workflows, described and simplified below.

Archive backlog
This workflow allows us to ingest all the images we created from archives since the project began in December 2009 up to May 2012 - around 320,000 of them - from eight collections. So far, we have finished processing around 250,000 images from this set.

The steps for each item to be ingested are as follows (automatic steps are in italics):

1. Export MARC XML as a batch from the Library catalogue (per collection)
2. Import metadata as a batch into Goobi
3. Import JPEG2000 images one folder at a time from temporary storage to Goobi-managed directories
4. Goobi converts images to JPEG for viewing in the Goobi interface
5. Check that images are correctly associated with the metadata
6. Add access control "code" to items with sensitive material (restricted or closed)
7. Trigger SDB ingest, passing along key information to enable this
8. Import administrative/technical metadata from SDB after ingest
9. Export METS files to be used by the "player"

Archives digitisation

This workflow deals with the current digitisation by tracking and supporting activities from the very beginning of the digitisation process. So far, we have imported very few images for this project, having just started using this workflow in earnest. The steps include:

1. Export MARC XML as a batch from the Library catalogue (per collection)
2. Import metadata as a batch into Goobi
3. Group metadata into "batches" in Goobi for each archive box (usually 5 - 10 folders or "items")
4. Track the preparation status at the box level (in process/completed) and record information for the next stage (Image capture)
5. Track photography status at the box level and record information for the next stage (QA)
6. Track QA at the box level and return any items to photography if re-work is required
7. Import TIFF images via Lightroom (which converts RAW files and exports TIFF files directly into Goobi)
8. Convert TIFFs to JPEGs
9. Check that images are correctly associated with the metadata
10. Add access control "code" to items with sensitive material (restricted or closed)
11. Convert TIFFs to JPEG2000
12. See steps 7-9 of the Archive backlog workflow above

Genetics books

The other half of our ingest effort has focused on the Genetics books. We have imported into Goobi over 250,000 images from this collection since digitisation began in February of this year. This workflow is very similar to Archives digitisation, containing steps related to the entire end-to-end process. The main differences are that the digitisation is being done by an on-site contractor, so images are delivered to us as TIFFs, and that, while there are no sensitivity issues, there is metadata editing to add structure to aid navigation, plus a range of "conditions of use" codes depending on the restrictions copyright holders ask us to apply.

1. Export MARC XML as a batch from the Library catalogue (whole collection)
2. Import metadata as a batch into Goobi
3. Track preparation status at the book level and record information for next stage (Image capture)
4. Track image capture status at the book level
5. Track QA status (QA is done on the TIFFs supplied by the contractor)
6. Import TIFF images one folder at a time from temporary storage to Goobi-managed directories
7. Convert TIFFs to JPEGs
8. Check that images are correctly associated with the metadata
9. Associate images with structural metadata (covers, titlepage, table of contents) thereby enabling navigation to these elements in the "player"
10. Add page numbering
11. Add licence code to books that have use restrictions (such as no full download allowed) as per requests by copyright holders
12. As above

We haven't got it all figured out yet

Other workflows that we have not yet put into production include born digital materials, A/V materials, items with multiple copies/volumes or "parts" (such as a video and its transcript), and manuscripts. We are also looking at implementing new or different functionality in Goobi in the near future, including JPEG2000 validation using the Jpylyzer script, automated import of images, and configuring existing functionality in Goobi to support OCR and METS-ALTO files, to name a few. These changes are aimed at minimising manual interaction with the material to save time and improve accuracy.
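Jpylyzer carries out a full structural validation of a JPEG2000 file, which is well beyond a short example. As a rough illustration of the simplest check involved, the sketch below only tests whether a file starts with the 12-byte JP2 signature box. It does not use Jpylyzer or its API, the function names are our own, and passing this check says nothing about whether the rest of the file is valid.

```python
# The JP2 signature box: a 12-byte marker at the very start of a JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(data: bytes) -> bool:
    """Return True if the given bytes begin with the JP2 signature box."""
    return data[:12] == JP2_SIGNATURE


def file_looks_like_jp2(path) -> bool:
    """Read only the first 12 bytes of a file and test the signature."""
    with open(path, "rb") as handle:
        return looks_like_jp2(handle.read(12))
```

A check like this could catch a TIFF that has been given a .jp2 extension by mistake before it reaches the ingest step, leaving the detailed validation to Jpylyzer.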

OCRing typescript: A benchmarking test with PRImA

2012-07-04T09:06:00.003+01:00

During our pilot phase, we plan to add over 1 million images of archives to the Wellcome Library website, all created during the 20th century. Much of this consists of typescript: letters, drafts of papers and articles, research notes, memos, and so on. Key to improving discoverability of these papers is OCR (optical character recognition), which allows us to encode the words as text and include them in a full-text index. We set out to test whether typescript material would provide accurate OCR results that could be included in our index.

OCR technology

OCR works by segmenting a block of text to the individual character level and then comparing the patterns to a known set of characters in a wide range of typefaces. The accuracy of character recognition relies on the source information having clear-cut, and common, letter forms. The accuracy of word recognition relies on both character recognition, and the availability of comprehensive dictionaries for comparison. OCR software can compare these words to dictionaries to enhance word recognition, and also to estimate accuracy rates (or levels of confidence).

When it comes to good quality, clearly printed text, OCR can be extremely accurate even without any human intervention - with rates of 99% or higher for modern printed content (less than one word out of one hundred words having at least one inaccurate character), reducing as you recede in time to 95% for 1900 - 1950 printed material, and lower for 19th century material (see Tanner, Munoz and Ros, 2009). For some formats, OCR is worse still - as the document mentioned above shows, 19th century newspapers may only reach 70% significant word accuracy (words not including "stop" words such as definite/indefinite articles, and other non-search terms).
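To show what "significant word accuracy" means in practice, here is a minimal sketch in Python. It assumes a simple definition: the share of non-stop words in a manual transcription that also appear, in sequence, in the OCR output. The stop word list is a tiny illustrative sample, and this is not the method or tooling used in any of the studies mentioned here.

```python
from difflib import SequenceMatcher

# A deliberately tiny, illustrative stop word list.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "to", "in"})
PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation and lower-case."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def significant_word_accuracy(ground_truth, ocr_output, stop_words=STOP_WORDS):
    """Fraction of significant ground-truth words matched in the OCR output."""
    truth = [w for w in tokenize(ground_truth) if w not in stop_words]
    ocr = [w for w in tokenize(ocr_output) if w not in stop_words]
    if not truth:
        return 1.0
    # Align the two word sequences and count the words that match exactly.
    matcher = SequenceMatcher(a=truth, b=ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)


truth = "The structure of the DNA molecule is helical"
ocr = "The structnre of the DNA molecule is helical"
print(significant_word_accuracy(truth, ocr))  # 4 of 5 significant words match: 0.8
```

A single wrong character ("structnre") costs the whole word, which is why word accuracy is always lower than character accuracy for the same page.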

Typescript testing

Regarding our archival collections, there is a wide range of content that is theoretically OCR'able. Some will OCR very well, such as professionally printed matter. But much of the non-handwritten content is, given the age of this material, in typescript form. We had no idea how well this type of content would OCR. To find out, we commissioned Apostolos Antonacopoulos and Stefan Pletschacher, based at the University of Salford and members of PRImA (Pattern Recognition and Image Analysis Research), to do a benchmarking exercise. From this we could determine whether we could rely on raw OCR outputs, should not OCR this type of material at all, or should test various methods to improve OCR'ability (such as post-processing of particular images).

Apostolos and Stefan chose a selection of 20 documents from a larger sample we provided, originating from our Mourant and Crick digitised collections. These 20 documents were manually transcribed using the Aletheia groundtruthing tool for comparison to the output of three OCR engines: Abbyy FineReader Engines 9 and 10, and Tesseract, an open source OCR engine.

The results of the OCR benchmarking test show that original, good quality typescript content can reach up to 97% significant word accuracy with Abbyy FineReader Engine 10 (such as this example below):

At the bottom end, carbon copies with fuzzy ink can result in virtually 0% accuracy in any OCR software:

What was pleasantly surprising was how the average-quality and poorer content fared. On the better end of the scale we have 93% accuracy despite some broken characters:

And here we have poorer quality typescript producing 72% accuracy with many faint and broken characters:

The average rate for 16 images of good to poorer quality typescript is 83% significant word accuracy (excluding the carbon copies).

Accuracy levels are reported here according to the results from Abbyy FineReader Engine 10 on the "typescript" setting. The reported accuracy rates cover all the visible text on the page including letterheads, pre-printed text such as contact details, text overwritten by manual annotations and so on. Naturally, errors are more likely to occur in these areas, which (except in the case of text overwritten by manual annotations) are not of much significance in terms of indexing and discoverability. Further tests would be required to determine what the accuracy rate is with these areas excluded. For example, it may be possible to digitally remove the annotations in this draft version of Francis Crick and James Watson's "A Structure for DNA" to raise the overall accuracy rate (currently only 45%):

There is some variation between FineReader 9 and 10 where one or the other may have a small advantage with a few cases showing as much as a 30% difference. Overall, there is only 1% difference when looking at averages between 9 and 10. Tesseract, on the other hand, was far less accurate especially for the poorer quality typescript (roughly half as accurate overall).

There are a few things we could do to improve accuracy:

Incorporate medical dictionaries to improve recognition and confidence of scientific terms
Enhance images to "remove" any annotations prior to OCR'ing
Develop a workflow that would divert images down different paths depending on content ("typescript" path, FineReader 9 or 10, enhancements to be applied or not applied, etc.)

We may find that an average of 83% word accuracy overall is perfectly adequate for our needs in terms of indexing terms and allowing people to discover content efficiently. Further investigation is required, but this report has given us a good foundation from which to press on with our OCR'ing plans.

These digital collections are not yet available online, but will be accessible from autumn 2012.

Serving servers: a technical infrastructure plan

2012-05-15T17:53:00.000+01:00

As we aim to provide a fast, efficient and robust technical architecture for the Wellcome Digital Library, the Wellcome Trust IT department has been working closely with our software suppliers to specify a suitable server architecture. This work is still in progress, but we now have the skeleton idea of how many servers we are likely to need and for what purposes. The scale of the architecture requirements shows that setting up and delivering digital content is a significant undertaking.

In order to serve up millions of images, plus thousands of A/V files, born digital content and the web applications that make them accessible, we believe we’ll need around 17 (virtual) servers for the production environment (the “live” services), and a further 10 servers for our staging and development environments. In the production environment, nearly every server is duplicated to ensure redundancy and a smooth delivery service, which is why the numbers are so high. The content management system coupled with its SQL database requires four servers, for example. The image delivery environment needs six servers for data delivery, on-the-fly image conversion and tile creation, and media proxy servers creating digital content URLs that divorce the user-request mechanism from the actual content held on our servers for security reasons.

Most of the servers run on Windows 2008, although our image server (IIPImage) will run on Ubuntu Linux. The virtual servers share CPUs, but the number of CPUs available means that each server gets the equivalent of either 2 or 4 CPUs, leading to a total requirement of 48 CPUs (288 cores, as each CPU has 6 cores). RAM varies from 2GB to 8GB depending on the anticipated usage of a particular application on that server. The total RAM requirement for the production architecture is estimated at 124GB. These specifications are currently our best guess, and will be tested in the weeks to come as we start to deploy the hardware.

The staging environment allows system upgrades, patches or new development work to be applied and tested separately from the live production environment. This means that any changes can be tested thoroughly before changes are made publicly visible and/or usable. Actual development work is carried out in the development environment, before deployment for final testing on the staging servers. This means that applications such as the web content management system and the delivery system applications must be replicated in these two additional environments, along with their server requirements.

With thanks to David Martin, IT Project Manager, as the source of my information.

Developing a player for the Wellcome Digital Library

2012-05-11T15:37:00.000+01:00

Previous posts here have covered the digitisation of books and archives and the storage of the resulting files (mostly JPEG2000 images, but some video and audio too). Now it’s time to figure out how visitors to the Wellcome Library site actually view these materials via a web browser.

The digitisation workflow ends with various files being saved to different Library back-end systems:

The METS file is a single XML document that describes the structure of the book or archive, providing metadata such as title and access conditions.
Each page of the book (or image of an archive) is stored as a JPEG2000 file in the Library’s asset management system, Safety Deposit Box (SDB). Each image file in SDB has a unique filename (in fact a GUID), and this is referenced in the METS file. So given the METS file and access to the asset management system, we could retrieve the correct JPEG 2000 images in the correct order.
Additional files might be created, such as METS-ALTO files containing information about the positions of individual words on a digitised page; we’ll want to use this information to highlight search results within the text.

So how do we use these files to allow a site visitor to read a book?

Rendering JPEG 2000 files

Our first problem is that we can’t just serve up a JPEG2000 image to a web browser – the format is not supported. And even if it was, the archival JPEG2000 files are large: several megabytes each. The solution to this problem is familiar from services like Google Maps – we break the raw image up into web-friendly tiles and use them at different resolutions (zoom levels). When you use Google Maps, you can keep dragging the map around to explore pretty much anywhere on Earth – but your browser didn’t load one single enormous map of the world. Instead, the map is delivered to you as 256x256 pixel image files called tiles, and your browser only makes requests for those tiles that are needed to show the area of the map visible in your browser’s viewport. Each tile is quite small and hence very quick to download – here’s a Google map tile that shows the Wellcome Library:

http://mt1.google.com/vt/lyrs=m@176000000&hl=en&src=app&x=65487&s=&y=43573&z=17&s=Ga

Google Maps is a complex JavaScript application that causes your browser to load the right tiles at the right time (and in the right place). This keeps the user experience slick. We need that kind of user experience to view the pages of books.

There are several JavaScript libraries available that solve the difficult problem of handling the viewport and generating the correct tile requests in response to user pan and zoom activity. We’ve settled on Seadragon, because we really like the way it zooms smoothly (via alpha blending as you move from one zoom level’s tiles to another). A very nice existing example of this is at the Cambridge Digital Library’s Newton Papers project:

http://cudl.lib.cam.ac.uk/view/PR-ADV-B-00039-00001/

This site uses a viewer built around Seadragon; an individual tile looks like this:

http://cudl.lib.cam.ac.uk/content/images/PR-ADV-B-00039-00001-000-00105_files/11/3_2.jpg

The numbers on the end indicate that this jpeg tile is for zoom level 11, column 3, row 2. As you explore the image, your browser makes dozens, even hundreds of individual tile requests like this. It feels fast because each individual tile is tiny and downloads in no time.
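The sketch below shows, in Python rather than JavaScript, the kind of arithmetic a tile viewer does to decide which tiles to request. It assumes Deep Zoom style levels (each level halves the one above it), 256-pixel tiles with no overlap, and a made-up 3000 x 4000 pixel source image. It is not Seadragon's actual code; only the `_files/level/column_row.jpg` URL shape is taken from the example above.

```python
def max_level(width, height):
    """Deep Zoom top level: ceil(log2(longest side)), computed with integers."""
    return (max(width, height) - 1).bit_length()


def level_size(width, height, level):
    """Pixel size of the whole image at the given zoom level."""
    scale = 2 ** (max_level(width, height) - level)
    return -(-width // scale), -(-height // scale)  # ceiling division


def visible_tiles(width, height, level, viewport, tile_size=256):
    """List (column, row) tiles overlapping the viewport.

    viewport is (left, top, view_width, view_height) in pixels at `level`.
    """
    level_w, level_h = level_size(width, height, level)
    left, top, view_w, view_h = viewport
    first_col = max(left, 0) // tile_size
    first_row = max(top, 0) // tile_size
    last_col = (min(left + view_w, level_w) - 1) // tile_size
    last_row = (min(top + view_h, level_h) - 1) // tile_size
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


def tile_url(base, level, col, row):
    """Build a tile path in the level/column_row.jpg form shown above."""
    return f"{base}_files/{level}/{col}_{row}.jpg"


# A 400 x 300 pixel viewport onto a hypothetical 3000 x 4000 image at level 11.
tiles = visible_tiles(3000, 4000, 11, (700, 500, 400, 300))
print(len(tiles))  # 9 tiles cover this viewport
print(tile_url("PR-ADV-B-00039-00001-000-00105", 11, 3, 2))
```

However large the source image, the number of tiles requested depends only on the size of the viewport, which is why the experience stays fast.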

For more about tiled zoomable images, these blog posts are an excellent introduction:

So how do we get from a single JPEG2000 image to hundreds (or even thousands) of JPG tiles? It’s possible to prepare your image tiles in advance, so that you process the source image once and store folders of prepared tiles on your web server. For small collections of images this is a simple way to go and doesn’t require anything special on the server. But for the Library, it’s not practical – we don’t want to prepare tiles as part of the digitisation workflow. They are not “archival”, and they take up a lot of extra storage space. We need something that can generate tiles on the fly from the source image, given the tile requests coming from the browser.

For this we need an Image Server, and we’ve chosen IIPImage for its performance and native Seadragon (Deep Zoom) support. The Image Server generates browser-friendly JPEG images from regions of the source image at particular zoom levels. When your browser makes a request to the image server for a particular tile, the image server extracts the required region from the source JPEG 2000 file and serves it up to you as an ordinary JPEG.
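On the server side, the first job is to work out which part of the source image a tile request refers to. The following Python sketch shows that mapping under the same assumptions as a basic Deep Zoom pyramid (levels halve in size, 256-pixel tiles, no overlap). It is an illustration of the idea only, not how IIPImage is implemented, and the function name is our own.

```python
def source_region(width, height, level, col, row, tile_size=256):
    """Map a tile request to (left, top, width, height) in source pixels."""
    top_level = (max(width, height) - 1).bit_length()
    if not 0 <= level <= top_level:
        raise ValueError("zoom level out of range")
    # One tile pixel at this level covers `scale` source pixels per side.
    scale = 2 ** (top_level - level)
    left = col * tile_size * scale
    top = row * tile_size * scale
    if left >= width or top >= height:
        raise ValueError("tile lies outside the image")
    # Tiles on the right and bottom edges are clipped to the image bounds.
    right = min(left + tile_size * scale, width)
    bottom = min(top + tile_size * scale, height)
    return left, top, right - left, bottom - top


# Zoom level 11, column 3, row 2 of a hypothetical 3000 x 4000 source image.
print(source_region(3000, 4000, 11, 3, 2))  # (1536, 1024, 512, 512)
```

The server would then decode just that region from the JPEG 2000 file, scale it down to the tile size and return it as a JPEG.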

Viewer or Player? Or Reader?

The next piece of the puzzle is the browser application that makes the requests to the server. A book or archive is a sequence of images along with a lot of other metadata. And it’s not just books – the Library also has video and audio content. All of these are described in detail by METS files produced during the digitisation/ingest workflow. In the world of tile-based imaging, the term “viewer” is often used to describe the browser component of the system, but we seem to have fallen naturally to using the term “Player” to describe it – it plays books, videos and audio, so “Player” it is. Our player needs to be given quite a lot of data to know what to play.

We could just expose the METS file directly, but it is large and complex and much of it is not required in the Player. So we’re developing an intermediate data format, which effectively acts as the public API of the Library. Given a Library catalogue number, the player requests a chunk of data from the server; this tells it everything it needs to know to play the work, in a much simpler format than the METS file. In the future other systems could make use of this API (at the moment it’s exposed as JSON).
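As a rough sketch of the idea, the Python below reads a drastically cut-down METS document and produces a small JSON structure listing the page images in reading order. The METS element names (`fileSec`, `file`, `FLocat`, `structMap`, `div`, `fptr`) are standard, but the sample document, the file names and the JSON field names (`id`, `title`, `pages`) are all invented for illustration and are not the Library's actual format.

```python
import json
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/",
      "xlink": "http://www.w3.org/1999/xlink"}

# A minimal, made-up METS document: two page images listed out of order.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink" LABEL="An example book">
  <mets:fileSec>
    <mets:fileGrp USE="OBJECTS">
      <mets:file ID="FILE_0002">
        <mets:FLocat LOCTYPE="URL" xlink:href="guid-bbbb.jp2"/>
      </mets:file>
      <mets:file ID="FILE_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="guid-aaaa.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ORDER="2"><mets:fptr FILEID="FILE_0002"/></mets:div>
      <mets:div TYPE="page" ORDER="1"><mets:fptr FILEID="FILE_0001"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def player_data(mets_xml, catalogue_id):
    """Reduce a METS document to the little the player needs (hypothetical)."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]
    # Map each file ID to the image file name it points at.
    files = {f.get("ID"): f.find("mets:FLocat", NS).get(href)
             for f in root.iterfind(".//mets:file", NS)}
    struct = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    pages = []
    for div in struct.iter("{%s}div" % NS["mets"]):
        if div.get("ORDER") is None:
            continue  # container divs carry no page order
        file_id = div.find("mets:fptr", NS).get("FILEID")
        pages.append({"order": int(div.get("ORDER")), "image": files[file_id]})
    pages.sort(key=lambda page: page["order"])
    return {"id": catalogue_id, "title": root.get("LABEL"), "pages": pages}


data = player_data(SAMPLE_METS, "b123456")
print(json.dumps(data, indent=2))
```

The point of the intermediate format is visible even at this scale: the player gets an ordered list of images and never has to understand METS itself.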

The user experience

The user won’t just be viewing a sequence of images, like a slide show. It should be a pleasant experience to read a book from cover to cover. Many users will be using a tablet, reading pages in portrait aspect ratio. We aim to make this a good e-reading experience too, augmented by search and navigation tools.

The user experience might start with a search result from the Library’s main search tool. For books that have been digitised, the results page will provide an additional link directly to the player “playing” the digitised book. The URL of the book is an important part of the user experience, and we want to keep it simple. In future, library.wellcome.ac.uk/player/b123456 would be the URL of the work with catalogue reference number b123456; that URL would take you straight to the player.

We want to be able to link directly to a particular page of a particular book, just as a printed citation could. This deeper URL would be /player/b123456#/35. But we can do better than that; our URL structure should extend to describe the precise region of a page, so that one reader could line up a particular section of text on a page, or a picture, and send the URL to another reader; the second reader would see the work open at the same page, and zoomed in on the same detail.
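Here is a small Python sketch of building and reading such URLs. The `/player/b123456` and `#/35` forms come from the text above; the `http://` scheme and the `x,y,width,height` syntax for a page region are purely our assumptions for illustration, as the region format has not been decided.

```python
from urllib.parse import urlsplit

BASE = "http://library.wellcome.ac.uk/player/"  # scheme assumed


def player_url(catalogue_id, page=None, region=None):
    """Build a player URL, optionally deep-linking to a page and a region."""
    url = BASE + catalogue_id
    if page is not None:
        url += f"#/{page}"
        if region is not None:
            x, y, width, height = region
            url += f"/{x},{y},{width},{height}"  # hypothetical region syntax
    return url


def parse_player_url(url):
    """Return (catalogue_id, page, region) from a player URL."""
    parts = urlsplit(url)
    catalogue_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    fragment = [piece for piece in parts.fragment.split("/") if piece]
    page = int(fragment[0]) if fragment else None
    region = None
    if len(fragment) > 1:
        region = tuple(int(value) for value in fragment[1].split(","))
    return catalogue_id, page, region


print(player_url("b123456", 35))
print(parse_player_url(player_url("b123456", 35, (120, 80, 400, 300))))
```

Because the page and region live in the URL fragment, the player can restore the view in the browser without the server needing to know which page was requested.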

Access Control

Much of the material being made available is still subject to copyright. Those works that are cleared for online publication by the Trust’s copyright clearance strategy still need some degree of access control applied to them; typically the user will be required to register before viewing them. This represents a significant architectural challenge, because we need to enforce access restrictions down to the level of individual tile requests. We don’t want anyone “scraping” protected content by making requests for the tiles directly, bypassing the player.
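One common way to protect individual tile requests is to attach a short-lived signed token to each tile URL, so that the image server can verify a request without looking up a session every time. The Python sketch below illustrates that general technique using an HMAC. It is not a description of how the Wellcome Digital Library enforces access; the secret key, the tile path format and the function names are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical shared secret


def sign_tile(path, user_id, expires, key=SECRET_KEY):
    """Create a token binding a tile path to a user and an expiry time."""
    message = f"{path}|{user_id}|{expires}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_authorised(path, user_id, expires, token, now=None, key=SECRET_KEY):
    """Check that the token is unexpired and matches this exact tile request."""
    now = time.time() if now is None else now
    if now > expires:
        return False
    expected = sign_tile(path, user_id, expires, key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, token)


token = sign_tile("b123456/11/3_2.jpg", "user42", expires=2000)
print(is_authorised("b123456/11/3_2.jpg", "user42", 2000, token, now=1000))  # True
print(is_authorised("b123456/11/3_3.jpg", "user42", 2000, token, now=1000))  # False
```

A token issued for one tile cannot be reused for another, and it stops working once it expires, which makes bulk scraping outside the player much harder.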

Performance and Scale

As well as the technical challenges involved in building the Player, we need to ensure that content is served to the player quickly. Ultimately the system will need to scale to serve millions of different book pages. Between the player and the back end files is a significant middle tier: the Digital Delivery System, of which the Player is a client. This layer is the Library’s API for Digital Delivery. The browser-based player interacts with it to retrieve data to display a book, highlight search results, generate navigation and so on. The Image Server is a key component of this system.

This post was written by Tom Crane, Lead Developer at Digirati, working with his colleagues on developing digital library solutions for the Wellcome Digital Library.

Will more data lead to different histories being told?

2012-04-30T10:46:00.001+01:00

New technology is making information more widely available and, when it launches later this year, the WDL will make it easier to access historical evidence about the foundations of modern genetics. Will this democratize our understanding of the history of genetics and lead to different versions of the history being told?

There is an African proverb which says that history is written by the hunter not the lion. History inevitably simplifies the past and the selection process can be subjective. When it launches the WDL will start to put 21 archive collections and around 2,000 books on-line. The project is to digitise as much as we can rather than cherry pick the highlights. This means that the building blocks used by historians to piece together the past will be made freely available to a wider audience. A lot of this material may seem like mundane workaday stuff. Users will have to wade through a lot of material to reach the bits they are interested in but this is probably a more accurate reflection of the scientific research process.

Flashes of genius are essential but they do not happen in isolation. Thomas Edison’s phrase about invention being 1% inspiration and 99% perspiration applies to scientific research too. The discovery process needs both.

Watson and Crick were extremely clever to work out the helical structure of DNA but they did not get there simply because they were lone geniuses. Before they made their discovery a lot of people had spent years experimenting, writing and thinking about DNA. There had even been flashes of insight which ended up being wrong. I recently read a letter from Gerald Oster sent to Aaron Klug after Rosalind Franklin’s death, in which he recalled his time working in London. He reflects that even though he had much of the relevant information by early 1950 he lacked the insight to work out the structure of DNA. This letter (FRKN/06/07/001-2) is held by the Churchill Archives Centre in Cambridge and a digitised version will become part of the WDL.

I am rather hoping that the WDL might help us to recognise that while flashes of inspiration are part of scientific discovery they are only possible because a team of other people paved the way.

Clearing copyright for books: preliminary ARROW results

2012-04-19T09:57:00.000+01:00

As part of the genetics books project, we are tackling issues of copyright clearance and due diligence head on. Up to 90% of this collection is in copyright, or is likely to be in copyright, so developing a copyright clearance strategy was one of our earliest considerations. This turned into a useful project to test-run the EC-funded ARROW system on a large scale. ARROW provides a workflow for libraries and other content repositories to determine whether books are in-commerce, in copyright, and whether the copyright holders can be identified and traced. This system has undergone small tests throughout Europe, including the UK (using collections and metadata from the British Library), but in order to determine whether ARROW is feasible on a large scale, a realistic large-scale project was needed.

The Wellcome's genetics books project provided this opportunity, and the challenge was taken up by the ALCS and the PLS jointly, as announced previously on our Library Blog. Results from ARROW, combined with the responses from contacted rights holders, determine whether the Wellcome Library will publish a work online.

The collection of (roughly) 1,700 potentially in-copyright books is not enormous, but it is diverse, and has already thrown up some interesting wrinkles in the copyright clearance workflow.

For example, according to the AACR2 standard used to catalogue these books, only up to three authors are included in the metadata record (followed by et al). Works with more than three authors, and collected works such as conference proceedings, had to be manually consulted in order to identify all the named contributors. This inflated the known number of contributors to nearly 7,000 (4 authors on average per book).

Embedded below is a presentation I gave at the London Book Fair earlier this week, which provides an overview of the process, and preliminary statistics from the first 500 books to complete the ARROW workflow.

Copyright Clearance for Genetics Books, A pilot project at the Wellcome Library


Learning lessons on the Genetics Books digitisation project

2012-04-02T12:25:00.002+01:00

A key component of the theme of our digitisation pilot programme - "Foundations of Modern Genetics" - is a set of printed textbooks and secondary sources published between 1850 and 1990 that shed light on the development of genetic and genomic research. The total collection identified is around 2,000 books. The goal is to digitise these texts in full, and make them freely available online via the Wellcome Digital Library (we are of course dealing with copyright clearance).

Digitisation of books often looks and sounds straightforward. It is not always straightforward of course - but the new book scanners on the market these days do make it quick. There are standard ways of book scanning - you put the book on a cradle, and either turn the pages (by hand), or use a "robotic" contraption that turns the pages automatically. You can use scanning technology, or one-shot dSLR cameras; panes of glass to hold the pages down, or small grips on the outer margins of the pages. The choice depends on the physical nature of the books and how quickly you want to digitise. Even when outsourcing it is useful to understand how book scanning really works. Our Genetics Books digitisation project - a pilot project - is giving us this opportunity.

We commissioned local digitisation company Bespoke Archive Digitisation to carry out the digitisation work for this pilot project. As the digitisation is carried out on site, we have been involved to some extent in all aspects of the digitisation, including the setup and use of new types of equipment, the QA process involved in book digitisation, and the workflow of image conversion and delivery. As we have never carried out high-throughput book digitisation at the Wellcome Library before, this has been a huge learning curve for us, allowing us to gain knowledge that will come in very useful in the future with new (and hopefully larger) projects.

Bespoke Archive Digitisation uses a robotic book scanner and a manual book scanning unit (for books that are not robust enough for the robotic scanner, are outsized, etc.). Both of these "scanners" use Canon 5D Mark II cameras, two per unit, to capture each page of an opening simultaneously. The robotic book scanner is the latest version from Kirtas, the Kabis III. Richard Keenan, owner of Bespoke Archive Digitisation, explains: "this unit has a number of time-saving features such as 'fluffers', a 'snubber', and a self-adjusting book cradle which moves to keep the book at the correct angle to be photographed. This is accomplished through various sensors and lasers, which monitor the book throughout imaging to keep it in the correct position, but must also be monitored by the operator."

A key lesson, according to Richard, is that "although all robotic book scanners include a published throughput (2,890 pages per hour for this particular unit), it is important to understand that the published throughputs do NOT mean that you can do 2,890 pages per hour, hour after hour without stopping. Each book must be set up on the cradle, the cameras may need some adjustment/focusing, and page turning does require manual intervention, every time, to ensure the pages are flat, and to prevent page curvature and glare (especially on sealed paper).

"Also, it is very important to remember that this is just the image capture stage, the pages then have to be batch processed, edited and rigorously quality assessed which can take the same, or more time than imaging. Depending on the book's structure - page thickness, binding type, size of the book etc - you will find that speeds vary considerably, a realistic estimate of throughput over a significant period of time is approximately 1000 pages per hour, but this can be much lower with some books.

"Although these figures differ by a large margin from those published, the Kabis III from Kirtas is still probably the fastest way to digitize books, and the important thing is that the quality of output produced is excellent if operated correctly. The on board editing software 'Book Scan Editor' is very handy, offering the usual cropping, image adjustment and sharpening options, but also deskewing and xml conversion and even OCR. I would say that another thing to bear in mind here, is that there is a large learning curve with this technology, so for anyone thinking of using one - particularly those who have no experience with robotic book scanners - plan plenty of time in the project for training and testing periods."

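To put Richard's figures in planning terms, here is a rough sketch in Python that compares imaging time at the published rate with the realistic rate he quotes. The book count comes from our project estimate; the average of 300 pages per book is purely an assumption for illustration, and the calculation covers image capture only, not the processing and QA time that Richard says can take as long again.

```python
def imaging_hours(total_pages: int, pages_per_hour: float) -> float:
    """Hours of image capture needed at a steady rate (capture only, no QA)."""
    return total_pages / pages_per_hour


BOOKS = 2_000            # approximate size of the genetics books collection
PAGES_PER_BOOK = 300     # ASSUMPTION: illustrative average, not a project figure
PUBLISHED_RATE = 2_890   # pages per hour, as published for the scanner
REALISTIC_RATE = 1_000   # pages per hour, Richard's sustained estimate

total_pages = BOOKS * PAGES_PER_BOOK
published_hours = imaging_hours(total_pages, PUBLISHED_RATE)
realistic_hours = imaging_hours(total_pages, REALISTIC_RATE)

print(f"Total pages (assumed): {total_pages}")
print(f"Hours at published rate: {published_hours:.0f}")
print(f"Hours at realistic rate: {realistic_hours:.0f}")
print(f"Realistic / published: {realistic_hours / published_hours:.2f}x")
```

Whatever page count is assumed, the ratio between the two estimates stays the same, which is why it is worth planning around the realistic rate from the start.
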
Three Geneticists from the University of Glasgow

2012-03-23T16:42:00.001+00:00

The Wellcome Digital Library isn’t just about collections held in the Wellcome Library. We are working with a number of other organisations that hold material on the history of modern genetics. One of the contributing partners, the University of Glasgow Archives Service, have just started to digitise the archives of three men who worked at the University’s Department of Genetics - Guido Pontecorvo (1907-1999), James (Jim) Harrison Renwick (1926-1994) and Malcolm Ferguson-Smith (1931 – ).

You can see photos of their brand new digitising suite here.

Wellcome Digital Library update

2012-03-12T12:36:00.000+00:00

The Wellcome Digital Library pilot has been underway for 18 months with 6 months to go before we launch the new Library website. This will provide access to a wide range of digital content related to the Foundations of Modern Genetics theme. All of the work done so far has been behind the scenes: digitising content, procuring and developing our digital library systems, and designing a new website. We are looking forward to displaying the product of all this work to the public - but we're not quite there yet!

So where are we now and what will we be doing in 2012? Here is a snapshot of progress so far. Further details on some of these projects can be found on this blog, and we will continue to explain our activities in more detail in future posts.

Digitisation

Archives: With our in-house team of two photographers, we have digitised around 380,000 pages from the collections of Crick, Mourant, Medawar, Sanger, Wyatt, Grueneberg, and the Blood Group Unit. We have just started the Eugenics Society collection, which will carry on throughout the spring and summer.
Genetics Books: This project has just begun, with up to 2,000 books to be digitised this spring by an external supplier, Bespoke Archive Digitisation, working on-site.
MOH reports: A successful JISC funding bid meant we could add the Greater London Medical Officer of Health reports to the pilot project. Conservation is underway, and digitisation will begin in a few months' time.
ProQuest: We have partnered with ProQuest to digitise our pre-1700 printed books for Early European Books online, with over 1,000 books now digitised and around 13,000 to go. Those with subscriptions and anyone in the UK can view our first 400 books on the EEB website with more to come shortly (search for "Wellcome").
External content: We have had the first delivery from one of our external partners, Cold Spring Harbor Laboratory, including correspondence from the James Watson archive. This adds around 50,000 images to our digital archive collections, with more to come throughout 2012 and early 2013 from all partners.
Copyright and sensitivity: Hand in hand with digitisation, we are assessing our content for sensitivity and copyright issues where necessary. Sensitive items (containing certain types of private information as defined by the Data Protection Act) are identified and flagged as unsuitable for online dissemination. Copyright clearance of in-copyright works is underway with the help of the Authors' Licensing and Collecting Society, and the Publishers' Licensing Society.

Systems development

Digital Asset Management & Storage: Safety Deposit Box 4.1, our digital asset management system, was extended to provide extra functionality for large sets of digital assets in 2011. This system is now in production. Our storage system, Pillar, now includes a Write Once Read Many (WORM) backup drive to ensure that our files are secure in the long term.
Workflow system: We procured Goobi (Intranda Version) with bespoke modifications in 2011 to act as a workflow system, enabling us to track project progress, and to automate a number of activities (including ingest of content into Safety Deposit Box). This has recently been put into use in production, particularly for the Genetics books digitisation project. Soon we will be using Goobi for all digitisation projects, and to ingest our backlog of images.
JPEG 2000: We now archive all our images in the JPEG 2000 (Part 1) format, and have an automated batching process set up with LuraWave. Soon, we will be implementing JPEG 2000 validation as part of this process to ensure all JPEG 2000s meet the correct standards before ingest.
Digital delivery: A new digital delivery system is currently under development that will interoperate with Safety Deposit Box and our new website content management system, Alterian CM7. We have commissioned CM7 developers Digirati to carry out this development, which will be completed at the end of the summer. So far they have produced a proof of concept system that demonstrates an end-to-end sequence from retrieval of images from Safety Deposit Box using METS files created by Goobi, to displaying images online. They are adapting Seadragon, the MS viewer used by several other digital libraries, to meet our specific needs and design criteria.
Search and discovery: We are also making changes to our single search system, Encore. This work is looking at providing better representation of archival metadata in Encore, and also options for incorporating a full-text index. The purpose is to provide access to all Library content - the catalogues as well as the digitised materials - via a single interface.

New website and user experience

User experience-led design: Last year the Library brought on board external suppliers Clearleft - user experience and web design experts - to help redesign the information architecture and visual appearance of the new website. New designs are already visible on the internal web development environment, so further user testing of a real website can soon be done.
Transferring content: The Library has carried out a full content audit of the current website, and prioritised content to carry across to the new site. The current site contains over 2,000 pages; this will be considerably reduced. The content carried across to the new site will be thoroughly edited to ensure it is up-to-date and consistent with the new site "style".
Creating new content: New content will also be created once the content management system is in action, with a focus on the Foundations of Modern Genetics. This is a major part of the Library's aim to provide interpretative content to both researchers and the "curious public".

Filling the MOH Gaps

2012-03-05T10:09:00.023+00:00

The Wellcome Library has a great collection of Medical Officer of Health (MOH) Reports. These reports are stuffed full of grim and useful information from the 19th and 20th centuries, such as statistics on infant mortality. JISC is, very wisely, funding the digitisation of the London reports. There are some gaps in the Wellcome Library’s collection so, in order to make a really useful digital resource, we have been working out what is missing. This has not been straightforward.

First we needed to check what we held and then we needed to make sure that our gaps really were gaps. We didn’t want to waste time looking for reports that were never created. The very first MOH report in Britain was produced in Liverpool in 1847. The first London report was produced in the following year but the early reports do not cover the whole of London. The Public Health Act of 1848 permitted local authorities to employ MOHs but, since it was not obligatory, only a minority did. The Metropolis Management Act of 1855 required MOHs to be appointed in central London but the big change came with the 1875 Public Health Act. From then until 1972 the production of MOH reports was pretty solid.

Another challenge was getting to grips with the boundary changes. Over the years the administrative boundaries of London have altered several times. The current 32 London Borough boundaries date from 1965 when Greater London was established. Before that there were 28 metropolitan boroughs plus various boroughs, urban and rural district councils in what is now outer London. Before 1899 much of what we now think of as London was part of Kent, Middlesex, Essex or Surrey. The City of London has long gone its own distinctive way and the tangle of parish boundaries there is particularly confusing. Old maps and a book on administrative units by Frederick A. Youngs helped us to make sense of all these changes.

We have decided to start from the centre and try to create as complete a record as possible for the 12 inner London Boroughs. We’ve got to the stage where we have a list of reports that we want to find. The next step is to track them down in other collections and ask if we can get them digitised.

Watson and Crick Letters

2012-03-05T10:03:00.006+00:00

I've just had the privilege of reading a fantastic series of letters written in 1954 by James Watson and Francis Crick. They were written a year after they published their seminal article on the structure of DNA. In the letters the two men are exchanging ideas and their excitement shines through. They write about all sorts of things: for example, the importance of building space-filling three-dimensional models, confusion over how thymine fits into the helical structure, and what the researchers at KCL are up to. In March 1954 Watson also expresses his frustration with the research process, “The whole thing is puzzling and paradoxical (for could DNA be wrong) and is slowly driving me to despair and to loath nucleic acids.” (PP/CRI/D/2/45)

I got to read them because last month the first batch of digitised material arrived from Cold Spring Harbor Laboratory in New York, one of the five external organisations contributing digitized material to the WDL pilot project. The James Watson archive is held at Cold Spring Harbor and contains the letters written to him by Francis Crick. The letters Watson wrote to Crick are held by the Wellcome Library.

Later this year, when the WDL is launched, Watson and Crick’s correspondence will be digitally united. Lots of people will be able to read these letters (and lots of other stuff) online while the originals stay safely tucked away in their archival homes. I am excited about that!

The Medical Officer of Health reports project begins.

2012-02-21T16:09:00.000+00:00

I recently started working on the Medical Officer of Health digitisation project.

Spine with methylcellulose applied

Since the beginning of December 2011, I have been spending the majority of my time in the conservation studio. I have been carrying out disbinding, cleaning and rehousing of the late 19th century Medical Officer of Health (MOH) reports that are bound by year. This is so that the digitisers can scan or photograph them for the project.

The MOH reports in our collection are shelved in different sequences: Main, London and Provincial. However, all three sequences contain reports for London areas. The Main and Provincial sequences are bound by geographical area and are generally in good condition, but the London sequence reports are bound by year and tend to be in a poorer state due to their heavy use. There are about 80 bound volumes in the London sequence and they are all being disbound. Eventually we will be able to house all of the London reports together by geographical area.

Lining coming off

In order to take the bound reports apart, I first need to remove the cloth case and spine linings whilst keeping the pamphlets intact. I use a 4% methylcellulose solution to break down the binding animal glue. A major challenge with this is that each volume’s binding breaks down at a different rate so it requires constant checking to avoid damaging the paper. On average it takes a couple of hours just to remove the spine linings. We recently purchased a pink portable clothes steamer to try to speed up the process. I haven’t tried this yet but we are hopeful that this will work faster.

The aftermath of removal

Another important part of the project is creating separate bibliographic records for each report. We have decided to catalogue these as monographs in order to improve searching and allow users to find reports by fields such as geographic area, Medical Officer's name and date of the report.

Whilst I am going through the reports, I have been finding some very interesting snippets. I will include some as I continue to blog so look out for them!

Guest blog post: Digitising early printed books at the Wellcome Library

2011-11-28T09:34:00.001+00:00

Karine Larose and Jeremy Uhl are working on the Early European Books digitisation project at the Wellcome Library. They are employees of Diadeis, a digitisation company based in France, and work on site alongside the rest of the Wellcome's digitisation team. In this guest post, Karine and Jeremy explain what they do here at the Wellcome Library:

Hello, Karine and Jeremy here from Diadeis. We are the production team currently working on the digitization project for ProQuest at the Wellcome Library. The following will be a sort of 'day in the life' of our department. We hope you enjoy it and would love any comments, questions, or suggestions.

Each day, when we start, new books are delivered to us by Matt Brack, the Digitisation Support Officer, for digitisation. We then register them in the database (also known as the check-in process). First we assess the books to determine if they are suitable for scanning. We might need to refuse a book if:

The book is too big or thick for the scanning machines to handle (usually higher than 42 cm and wider than 30 cm)
The book cannot open more than 100 degrees (later in the project these books will be scanned using different equipment)
The book is too fragile and we feel that we cannot scan it without causing more damage

Also during the check-in process we might notice other minor problems with the books but not so severe as to refuse them. In that case we usually put them on hold. An example would be if most of the pages are stuck/glued together. ‘Problem books’ on hold are placed on a separate shelf, and we later consult with Matt as to what action has to be taken. Simultaneously, the books we scanned on the previous days are given back to Matt and we update the database to show that they have been returned to the library.

Before scanning the pages of the books, the spine and edges are photographed first. This step is to show the condition of the book before we start the scanning. We do our best during the scanning process to be careful with the books, to return them in the state they were delivered. The books are scanned using two main scanners, one for restricted angles and one for opened books. A few problems we sometimes have when scanning books are:

Huge fold-outs that cannot be scanned with the current equipment
Missing pages
Pieces of a book coming apart (usually the spine)

Our most common problem is dust! This includes tiny book pieces and occasionally hair; once we even found tiny bits of cereal in a book. There are lots of surprises and we never know what we may find in between the pages of a book. Sadly we have yet to find any money in any of the books ;). Several times during the day we have to clean the glass panel; for that we use a special cloth and a vacuum cleaner. The archive office kindly supplied us with alcohol wipes, which have a disinfecting effect. Also, because some of the books have leather covers, our hands can become oily after handling them and we have to wash our hands regularly.

Once the scanning day is over, we bundle all the images using packaging software and send them off to the indexing team for quality control, cropping and indexing.

Working on this project at the Wellcome Library has been fun and exciting because we get to see rare books every day and provide digital access to them to people all over the world. We are pleased to be given the chance to explain our daily routine to the followers of this blog and look forward to any comments they might have.

We would like to thank Matt for being so flexible with our timetable and bringing us lots of books every day, and Christy for her support and giving us the opportunity to write for the blog. For those of you who work for the Wellcome Library, do come along and say "hello" to us; the room may be dark but I assure you we are here!

Authors: Karine Larose and Jeremy Uhl

NB: The first batch of 400 books can now be seen on the EEB website (do a Quick Search for Wellcome). These are freely available to Wellcome Library members and everyone in the UK.

Preserving our digital assets #2 - WORM storage

2011-09-21T09:14:00.002+01:00

This post follows on from my earlier description of our DAM system (Safety Deposit Box or SDB), which manages long-term preservation and access to our digital assets. Here we turn to the back-end of the back-end: storage.

There are a number of requirements that must be met to store digital assets safely. The storage solution must be:

Secure (behind a firewall)
Robust (able to manage points of failure in the disks)
Replicable (multiple copies on multiple sites)
Scalable (able to handle tens of millions of files)
Quick to access (the archived files are also used for delivery)

After considering different systems and suppliers – including robotic tape back-up – we settled on a solution that gives us the confidence we need for long-term preservation, and fits well with the Trust’s existing storage infrastructure. Our existing storage suppliers, Pillar Data Systems, have extended the existing RAID5 enterprise storage system used for all the Wellcome Trust’s business needs by incorporating a “Write Once, Read Many” (WORM) back-up storage server for use by the Wellcome Digital Library. Associated management software copies files from the main storage server to the WORM, and monitors the main server for file errors that can be “healed” using the WORM copy.

To explain this in a bit more detail, related to our primary requirements:

Security means that only authorised users or systems are able to access the files, and that unauthorised deletions or changes are guarded against. Locking master files behind a firewall is the main form of defence from unauthorised external access (i.e. hackers, or the ability to download files by "guessing" the network path and filenames). However, we still faced the prospect of file deletion, changes or corruption due to system failures, and accidental or malicious actions by otherwise authorised users. In order to eliminate this possibility, the Trust IT Department recommended permanent, on-line WORM storage as our back-up solution. Files stored on the WORM can be accessed, but they cannot be overwritten or deleted. This means that we have a permanent back-up of every digital file that cannot be tampered with.

Robustness is tied up with the WORM system - although the RAID5 storage we normally use is highly robust. It distributes the bit stream of digital files in such a way that points of failure generally do not damage the entire file, allowing it to be reconstituted, while short-term back-ups allow complete recovery of damaged or lost files. The WORM system adds a further element of confidence. Once a new file is stored on the main servers, it is copied to the WORM drive. The software managing this process checks the files on the main servers periodically, and if it finds a mis-match with the WORM drive (a lost or corrupted file on the main server), it pulls a copy of the WORM'ed file to the main server, thus "self-healing" the damage.
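As a rough illustration of the "self-healing" idea (not the vendor's actual management software, whose internals we have not described here), the periodic check could be sketched in Python as below. The function name `heal`, the two directory arguments and the use of a byte-for-byte comparison are all assumptions made for the example.

```python
import filecmp
import shutil
from pathlib import Path


def heal(main_dir: Path, worm_dir: Path) -> list[str]:
    """Restore files on the main store that are missing or differ from the WORM copy.

    Hypothetical sketch: treats worm_dir as the trusted, read-only reference
    and returns the relative paths of every file that had to be restored.
    """
    healed = []
    worm_files = sorted(p for p in worm_dir.rglob("*") if p.is_file())
    for worm_file in worm_files:
        relative = worm_file.relative_to(worm_dir)
        main_file = main_dir / relative
        # shallow=False forces a comparison of file contents, not just metadata.
        if not main_file.exists() or not filecmp.cmp(main_file, worm_file, shallow=False):
            main_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(worm_file, main_file)  # pull the WORM copy back
            healed.append(relative.as_posix())
    return healed
```

Only the main store is ever written to in this sketch; the WORM copy is read but never modified, which mirrors the one-way relationship described above.
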

Replicability means that there is more than one copy of the file. One lives on the main servers at the Wellcome Trust offices, and the other is stored outside of London. If either server is damaged in a serious accident, one server remains to keep the content safe.

Scalability is important, as we are creating millions of images during the pilot project, and will create up to 30 million images over the longer term as we digitise the Wellcome Library. All the systems that are in place must be able to increase capacity - both in terms of hardware and processing software. The system we selected is scalable – we simply need to add storage “bricks” and “racks” (additional hardware), and processing units to manage that additional hardware, as our data store grows over time.

Speed of access is also key. In order to keep our storage footprint as small as possible, we are using one file as both the master archive file (preserved in the long-term, and backed-up to the WORM) and as the dissemination file. Our DAM, SDB, will make copies of the archived files available to the front-end delivery system as the user requests images via the Wellcome Digital Library (or via back-end administrative systems), and this must be handled as quickly as possible – something RAID5 is particularly well suited for.

Preserving our digital assets #1 – SDB4

2011-07-04T10:14:00.001+01:00

The successful long term management of digital assets is a key concern for us as we build our digital resource. Until recently, we did not have a dedicated system in use for storing and managing master files for any digitised content, and our file backup system was not ideal for dealing with large sets of data. Files were managed via simple filesharing on dedicated storage servers. Backups were only created for some content, and even then the backups were not permanent.

When the Wellcome Digital Library was initiated it quickly became clear that we needed a dedicated system to manage our digital masters - something that was scalable, robust, and could handle all the digital formats we create or procure - including born digital material. We also needed a secure storage system with offsite, permanent backup capability. This blog post describes our digital asset management system; a future blog post will provide details of our secure storage solution.

We already had an existing digital asset management system in place to manage born digital archives: Safety Deposit Box (SDB), developed by Tessella. This system incorporates a suite of tools designed to manage and preserve digital files. It provides a context in which administrative and descriptive metadata is associated with all ingested content. SDB can be combined with tools that can "migrate" files from one format to another to counter format obsolescence and therefore ensure the longevity of the data. In other words, when Word 2010 is no longer supported by Microsoft, SDB can help migrate these files to a current format that is supported by software available at that time.

However, in order to manage preservation of large sets of digitised content, SDB “out of the box” was not entirely able to meet the needs of the Wellcome Digital Library. We carried out a Feasibility Study in 2010 to determine whether it would be possible to use a modified version of the system. The research and prototype system we commissioned proved that it was indeed feasible to use SDB with certain software extensions.

In the spring of 2011, we commissioned Tessella to extend and install the newest version of the software (SDB4). This work is now complete and in the testing phase.

Key preservation functionality that SDB4 provides is listed here (some of it was “out of the box”, some was developed as extensions to the core system or as modules):

Automated ingest: automated SDB workflow to create and ingest a “submission information package” (SIP) – a bundle of content files and metadata forming a complete “object”

Multiple manifestations: ability to associate all the different manifestations of an object together (e.g. master video file, broadband versions, narrowband versions, transcripts, etc.)

SQL Server database: stores and indexes administrative and descriptive metadata describing objects stored in SDB

Characterisation: use of the JHOVE, DROID and PRONOM characterisation tools to extract essential technical information about digital files to be stored in the database

Format Migration: SDB4 builds on the PLANETS framework to support preservation planning for mitigation of obsolescence

Integrity checking: creates and stores a unique SHA-1 hash code for each file that can be used to test validity of the file over time

Provide access to content: delivers content to external systems using an API

Automatic export of administrative metadata: allows us to store metadata such as unique SDB identifiers in our workflow system in order to deliver files to the user

Administrative interface: allows administrators access to the content and the database, and to generate reports
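To illustrate the integrity checking item in the list above, here is a minimal Python sketch of how a SHA-1 value recorded at ingest can be used to test a file later. It is not SDB code: the function names and the in-memory dictionary standing in for stored files are assumptions made for the example.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of some file content as a hex string."""
    return hashlib.sha1(data).hexdigest()


def record_fixity(files: dict[str, bytes]) -> dict[str, str]:
    """At ingest: store one hash per file name (hypothetical manifest)."""
    return {name: sha1_hex(content) for name, content in files.items()}


def failed_fixity(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Later: list files that are missing or whose hash no longer matches."""
    return sorted(
        name
        for name, expected in manifest.items()
        if name not in files or sha1_hex(files[name]) != expected
    )


# Record hashes at ingest, then check a store where one file has changed.
ingested = {"b1_0001.jp2": b"abc", "b1_0002.jp2": b"def"}
manifest = record_fixity(ingested)
later = {"b1_0001.jp2": b"abc", "b1_0002.jp2": b"dEf"}
print(failed_fixity(later, manifest))
```

Any change to a file's bytes, however small, produces a different hash, so a mismatch is a reliable signal that the stored copy is no longer the file that was ingested.
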

This is by no means a complete description of SDB’s functionality, but the above provides a flavour of the most important features for long-term asset management and access to master files. Much of the development work that was done to extend SDB4 is likely to have an application for other users, not just the Wellcome Digital Library. Where possible, extensions are generic – although system requests/commands and metadata mappings may be highly specific to us.

We hope that our efforts in specifying our own needs will benefit other SDB users who see the value in long term preservation for digitised materials, and the efficiencies of combining born digital and digitised content into a single system strategy.

In future, SDB will interface with our workflow system and the image/media servers employed by our digital delivery system. These latter two systems have not yet been implemented.

Arabic manuscripts online - Fihrist launch

2011-04-04T09:54:00.017+01:00

Fihrist, an online catalogue for Islamic manuscripts held at the Bodleian and Cambridge University Library (CUL), was launched on 28 March. Developed by the OCIMCO (Oxford & Cambridge Islamic Manuscripts Catalogue Online) project, the catalogue uses the TEI/XML schema created as part of the WAMCP (Wellcome Arabic Manuscript Cataloguing Partnership) project to structure the descriptions and to provide for future enhancement of those records.

Fihrist contains around 10,000 catalogue entries, retroconverted from hard copy lists and card catalogues at the Bodleian and CUL, providing an integrated online search tool for these large collections. Searching for a manuscript or a work shows you a list of works contained in the relevant manuscript, with separate descriptions provided for each work. The catalogue records are currently very brief, but the OCIMCO project "will eventually provide detailed manuscript descriptions that will include digital representations of the manuscripts themselves. The TEI/XML schema provides an extensible framework that will allow for these future enhancements" (About us).

The project received funding from the JISC’s Digital Resources for Islamic Studies programme - a scheme that also part-funded the WAMCP project.

At the Fihrist launch at Clare College, Cambridge, a series of presentations from the OCIMCO project managers and staff provided further details of the collections and the technical implementation of the project; a presentation from the JISC on the Islamic Studies programme provided background to the funding scheme; a joint presentation was given by Richard Aspin and Nikolai Serikoff of the Wellcome Library, and Gerhard Brey of King's College London on the WAMCP project; and we heard about the like-minded Yale/SOAS Islamic Manuscript Gallery.

The day then turned to a follow up project that Oxford is carrying out over the summer to create a union catalogue of Islamic manuscript catalogues (the Islamic Studies Gateway). This project was recently awarded JISC funding, and will seek to develop Fihrist further to "provide cross searching of existing online manuscript resources ... that at present do not have a significant internet presence."

"In addition to creating the gateway itself, the aim of the Fihrist is to create a sustainable user community of Islamic manuscript metadata standards and cataloguing tools to ensure a long term commitment by stakeholders to supporting and developing the Fihrist beyond the lifetime of the project."

The WAMCP partnership (comprising the Wellcome Library, Bibliotheca Alexandrina, and King's College London) will collaborate on this project by providing open access to metadata for 500 manuscripts. This metadata, with digitised manuscripts, is expected to be publicly available online via the WAMCP website from Summer 2011.

The problem with orphans…

2011-03-23T14:30:00.007+00:00

Sooner or later, anyone planning to digitise works created in the last hundred years or so has to face the problem of what to do with those which are still in copyright but for which no copyright holder can be traced: so-called "orphan works". On 17 March, three of us travelled a few hundred yards down the road to attend a workshop on this very issue. The workshop was arranged as part of the "Who owns the Orphans?" project from the Beyond Text programme, sponsored by the UK Arts and Humanities Research Council. The morning focused on the highlights of the interim report which has been produced while the afternoon was given over to panel discussions on different aspects of the problem. The attendees came from a range of perspectives, including representatives from libraries, archives, museums and galleries, as well as creators of content – authors and photographers – who brought a different angle to the discussions.

The day began with an admission that it was tricky even just to define orphan works, and it quickly became evident that there is no easy solution to the problem. Various approaches were proposed, including compulsory registration and blanket licensing schemes, but all have their drawbacks and none really address the problem of the historical orphan works which exist in all collections. Many institutions admitted to taking a risk-managed approach to orphan works in order to maximise what could be made available to the public online. The sense from the participants was that as long as the material in question was not of high commercial value, then making it available after a due diligence search for the copyright holder had drawn a blank might be infringing copyright but was preferable to not making these cultural works widely available.

Compulsory registration of creative works is already the norm for the music industry, hence they do not have the same problems with orphan works faced by much of the cultural sector. However, imposing compulsory registration on other creators such as authors and photographers could be more difficult: the records of copyright registration which ended in 1912 show that few people actually bothered to register their work last time compulsory registration existed. In any case, this still leaves the problem of tracing subsequent copyright holders after the death of the original creator and it does not alter the position of historical orphan works.

Licensing schemes are favoured by the EU, but they also have their problems, both for authors who wish to retain control of their work and for cultural institutions who are reluctant to pay collecting societies to license works where they have no contact with the author or their heirs. Some collecting societies might be happy to license orphan works, setting aside money to pay out in the event of a copyright holder making themselves known. However, although this would in effect act to indemnify the institution who licenses the material, it was pointed out that this would really be no different to taking out insurance.

While no-one wants to infringe copyright, if a copyright holder cannot be traced after a due diligence search has taken place, most people working in the cultural sector consider that making material available to the public online is part of their obligation as guardians of the material. Some orphan works may have low commercial value, or their value may come from their position as part of a whole collection. It was also noted that people who have inherited copyright in these works are often unaware that they own rights in items, but once the work becomes commercially valuable, these rightsholders magically appear.

It seems likely that current practices will continue for the time being, due to a lack of viable alternatives. However, the outcome of the Hargreaves Review may have an impact on how orphan works are dealt with in the future, and perhaps the final report from the “Who Owns the Orphans?” project will be able to suggest practical solutions to what is acknowledged to be an issue across the cultural sector.

Wellcome Library joins Strategic Content Alliance

2011-02-28T19:27:00.003+00:00

The Wellcome Library is pleased to announce that we’re the latest partners to join the Strategic Content Alliance. You can read the press release here.

Why join? Well, as our digitisation programme gathers pace, one of the key issues we face is making sure that our plans are not developed in isolation. We’re in a lucky position: unlike many libraries and archives, we have the financial resources of the Wellcome Trust to draw upon to support our work. But one of our aims is to make sure that our digitised content is made available to – and is useful to – as wide a range of users as possible, and to make sure that we achieve best value for, and embody best practice in, our programme.

The aim of the Strategic Content Alliance is to bring together key public and not-for-profit sector organisations involved in the creation, management and exploitation of digital content for the common good. By facilitating high-level discussion between partners, the SCA aims to ensure the maximum return on public sector and charitable investment in digitisation. Being part of the Strategic Content Alliance enables us to keep abreast of what others – including the BBC, JISC and the British Library – are up to, and to identify and pursue opportunities for partnership. Some of these are around the sharing of expertise (such as Digipedia, a pilot project developed by SCA), tools to make digital content accessible, and business models that ensure the sustainability of the content we create and the repositories that house it (see for example this post on the SCA blog about work being undertaken by Ithaka S+R for the SCA). Others are about aggregating content and finding ways to complement each other’s work (as with the recent JISC meeting on the First World War Commemoration). The SCA also acts as a coordinating body to provide sound business intelligence and ensure effective advocacy on issues that cut across our shared interests, such as copyright and orphan works. Nor is the benefit limited to our in-house programmes: our involvement with the SCA will also help to inform the Wellcome Trust’s strategy towards the funding of digitisation projects elsewhere.

Workflow Tools Workshop in Bath, 30 November 2010

2010-12-06T08:33:00.004+00:00

On a freezing cold November day I travelled down to Bath to attend a one-day event run by UKOLN as part of the DevCSI program. Mahendra Mahey, Research Officer with UKOLN, was our host for the day. What may seem like an esoteric activity for the few is actually a fascinating subject for those with a logical and ordered approach to things. We’ve used some business modelling techniques in developing the basic model for our Workflow Tracking System so the workshop was very relevant to us.

The aim of the day was to introduce a small range of open source workflow tools to a wider audience and to set out the role and value of using software to model business processes. The day consisted of a series of presentations and product demonstrations.

The event brought together vendors, users and project managers who are interested in learning more about workflow and business process modelling tools. In some way or another all attendees were involved in planning or developing workflows within their own institutions.

The day started with a presentation by Tammo van Lessen, who gave an overview of the basics of business process modelling and of languages such as BPEL (Business Process Execution Language) and BPMN 2.0 (Business Process Model and Notation). Business process modelling is just a formal way of describing the processes that a business uses to do what it does. It can be used to describe each step in a process and the interdependencies between those steps, and to set out the sequence in which steps must occur. Tammo talked about the history of business process modelling, how the tools have become more interactive, and how they can now be used to directly build web-based interactive services.
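The idea of steps, interdependencies and a required sequence can be sketched in a few lines of Python. This is only an illustration and not BPEL or BPMN: the step names and their dependencies are hypothetical, and the ordering uses the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical digitisation steps, each mapped to the steps it depends on.
dependencies = {
    "photograph": set(),
    "quality_check": {"photograph"},
    "attach_metadata": {"photograph"},
    "convert_to_jpeg2000": {"quality_check"},
    "ingest": {"convert_to_jpeg2000", "attach_metadata"},
}


def execution_order(deps):
    """Return one valid sequence in which every step follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Several orderings can be valid for the same model, so the sketch returns just one of them. A circular dependency would raise graphlib.CycleError, which is the kind of modelling mistake worth catching before a process is implemented.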

Amol Vedak gave a demonstration of the Intalio Business Process Management System and discussed how it could be used as a tool for business transformation. Intalio is designed to combine the skills of business analysts and IT people. Groups of experts can use their respective skills to design and build business processes using a common language and set of tools. The simplicity of the tools means that processes can be quickly designed, tested and implemented.

The WSO2 Business Process Server, discussed by Paul Fremantle, is an open-source BPEL engine. Paul demonstrated the ease with which a workflow could be built diagrammatically. At the same time the software builds real web services that can be used by people or systems and integrated into an organisation’s business.

Of particular interest was the Taverna suite of scientific workflow tools, demonstrated by Stian Soiland-Reyes. This tool can be used to build the workflows that query public data sources. Taverna can use local tools or third party scripts to query services such as EBI’s BioMart. Taverna recognises that many scientists have data or their own tool sets that are held locally and it provides a means by which these can be used by others.

At the end of the day there was a series of ‘Lightning talks’ given by some of the workshop attendees looking at workflow-related projects that they are currently working on. This provided a set of practical real-world examples of how business process modelling could be used.

The lessons of the day were quite clear. Using business process modelling tools is an effective way to help us better understand the processes that we use in our business, and it can help us design better, more efficient processes. In better understanding how we do what we do, we can better understand the risks to our business and redesign our processes relatively quickly and easily. We’re also able to use these techniques to run ‘what if’ scenarios and to test processes under different conditions, all of which can be done in a virtual world before implementation.

Wellcome Library releases an ITT for a Workflow Tracking System

2010-11-29T15:33:00.003+00:00

If you’ve been reading our blog regularly you’ll know about how the Library plans to transform itself into a groundbreaking digital resource, allowing access to much of the Library’s material in digital form.

As part of this program we’ve just released an ITT for a Workflow Tracking System (WTS). We’re looking for a system that will track and manage the processes around creating digital content – whether that content is digitised by us, digitised externally or born-digital archival material – and automate that activity as much as possible.

Within the Library, staff who want to add content to our Digital Library will do so using the Workflow Tracking System. This means using the WTS to record that all digital content, e.g. digitised books or archival collections, has been created correctly, has had its descriptive metadata attached, has been converted to JPEG2000 (or some other appropriate format) and has been ingested into our digital object repository. The WTS will also create Metadata Encoding and Transmission Standard (METS) files. These will be used by the front end system to deliver digital content to our users.

Expressed simply, the WTS will play a critical central role in ensuring that all digital content that is destined for our Digital Library is created, quality controlled and ingested accurately and efficiently into the Library’s repository.

Wellcome Trust hosts JPEG 2000 seminar

2010-11-24T14:57:00.002+00:00

The JPEG 2000 for the Practitioner seminar was hosted by the Wellcome Trust on 16 November, 2010. This was organised by the Wellcome Library, and achieved a sell-out crowd of over 80 people.

The aim of the seminar was to look at specific case studies of JPEG 2000 use, to explain technical issues that have an impact on practical implementation of the format, and explore the context of how and why organisations might choose to use JPEG 2000. The Wellcome Library started investigating JPEG 2000 as a strategy for storing its archival master images in 2009, and has recently started converting its backlog of images into JPEG 2000.

The programme for the event was posted to our JPEG 2000 blog, and further blogs provide edited highlights of the varied and informative talks given on the day. All the presentations are available online, hosted by the Digital Preservation Coalition. The Twitter stream from the day can be seen by searching the hashtag #jp2k10.

100,000 image milestone, and how we did it

2010-10-29T11:42:00.011+01:00

A significant milestone was reached this week, with 100,000 images of Crick papers photographed. The Crick collection, previously described, includes around 285,000 images - over half the total amount to be digitised as part of the Archives digitisation project over a period of 20 months by Laurie Auchterlonie and Tom Cox in the Imaging department.

Laurie and Tom use Canon Mark II digital SLR cameras mounted to columns attached to two copy stands. These allow the cameras to be automatically raised and lowered according to the size of the items being digitised. This is a fairly typical imaging set-up for this type of material. However, with a project this large, it is important to ensure that the minimum amount of time is taken to photograph, edit and manage the images. In spring 2010, the photographers spent time developing a workflow that would allow them to digitise the highest number of items possible, whilst not compromising on quality or care in handling. It was in fact found that time-saving measures actually resulted in higher quality images, and minimised the amount of handling required.

For example, "live view" screens allow the photographers to easily see and adjust the alignment of each item on the copy stand, and the degree to which it fits the frame of view. This saves time as the photographers do not have to look through the viewfinder (difficult when the camera is 6 or more feet above the ground), or take multiple shots to get it right. Post-processing work has also been almost completely eliminated, as has the need to reshoot items at a later date.

Purchasing higher columns limited the number of times lenses had to be changed. Larger items require the cameras to be raised quite high, and if the column isn't high enough, the photographer has to change to a different, shorter or wide-angle lens. The flexibility built into the workflow by these measures is highly advantageous when dealing with heterogeneous materials such as personal archives.

Another aspect of the workflow that had to be specifically tailored to archival collections was the storage and foldering of images so that they could easily be found and identified. Using the existing archive catalogue hierarchies, the foldering system allows the user to pinpoint the exact file or item to be viewed to create copies for users, or to carry out QA against the original items (a sample of images is checked against the originals by Julia Nurse, who prepares the items before photography, to ensure that filenaming is accurate and that items aren't being missed). Eventually, these folders will be rendered obsolete, as we implement a digital asset management system that will restructure the archive storage of our images on ingest. But it is important not to underestimate the need to access images during the pre-ingest process of digitisation and QA. And if you do not have a digital asset management system, it is even more important that ease of access is factored in from an early stage.
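As a rough illustration of foldering that mirrors the catalogue hierarchy, the sketch below turns an archive reference such as PP/CRI/E/1/13/10 into a nested folder path and a filename. The root folder name, the filename pattern and the function name image_path are assumptions made for this example and are not the conventions actually used in the Imaging department.

```python
from pathlib import PurePosixPath


def image_path(reference, sequence, root="images", extension="tif"):
    """Build a storage path that mirrors the catalogue hierarchy of a reference.

    The root, the zero-padded sequence number and the extension are
    hypothetical choices for illustration only.
    """
    parts = [p for p in reference.strip().split("/") if p]
    if not parts:
        raise ValueError("empty catalogue reference")
    filename = "{}_{:04d}.{}".format("_".join(parts), sequence, extension)
    return PurePosixPath(root, *parts, filename)


path = image_path("PP/CRI/E/1/13/10", 1)
print(path)
```

Because every level of the reference becomes a folder, anyone who knows the catalogue can browse straight to an item, and the reference repeated in the filename keeps an image identifiable even if it is copied out of its folder.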

A bit of preparation and testing makes a huge difference when setting up a new workflow. Even a minute saved per item means a large overall time saving when spread over hundreds of thousands of images.

Papers, papers and yet more papers … preparing the Crick Archive for digitisation

2010-10-13T09:00:00.002+01:00

When first faced with preparing around 300 boxes of Francis Crick’s personal papers for digitisation, I have to confess my heart sank. A far cry from the last very visual digitisation project of 3000 AIDS posters, I was daunted, not only by the very different content of this collection, but the sheer size of it – an estimated half a million items this time. How wrong I was. I feel privileged to have been given the opportunity to delve into one of the most incredible minds of our lifetime.

Although we tend to associate him only with his (and Jim Watson’s) discovery of the double-helix structure of DNA – and there is plenty of fascinating correspondence within the archive related to this – it is his research on the mind and consciousness in his latter years that is truly ‘astonishing’ as he would put it. Through his endless correspondence with both fellow scientists and the general public, we get a real sense of his probing analysis of what makes our brains tick.

It is very easy to get side-tracked from such a collection but I have to remember that my main task is to ensure the papers are in a suitable condition for the photographers to shoot: a daily scour of the collection is required to remove existing staples and flatten pages but occasionally a conservator is required. For example, Crick’s heavily folded (since 1955) tracing sketches and calculations of Collagen Long Spacings required specialist equipment to flatten out.

Once a particular batch has been checked, data spreadsheets are then produced for the photographers so that they know what to expect in each box – included in this data is an estimate of the percentage of OCR’able (Optical Character Recognition) text, a record of the current location of a particular batch and notes for the archivists’ attention. While doing this I cannot help but siphon off particularly interesting information which has and will continue to be used for publicity about the project – see the recent BBC Audio Slideshow.

Further blogs providing updates on the digitisation project will follow in due course.

Top image: Crick's sketch of genetic code, 1965 (PP/CRI/E/1/13/10)
Bottom image: Francis Crick lecturing at Cambridge University (PP/CRI/A/1/2/9)

Preparing and planning a large archives digitisation project

2010-10-01T11:50:00.007+01:00

Archives digitisation is currently underway in our Imaging Studio, with two full-time members and two part-time members of Library staff dedicated to preparing and digitising the items. We will talk more specifically about the work being carried out on these materials on this blog in the near future, but first we present an introduction to the setup and planning of the project.

Once the theme was chosen (Modern Genetics and its Foundations), and relevant collections identified (see previous blog post), we realised that we had quite a large job on our hands. The scope of the project was bigger than anything we had done before: 620 boxes of material, containing around 800 pages each, adds up to around half a million pages to be digitised.
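The scale estimate is simple arithmetic, sketched here using only the two figures quoted above. The function name estimate_pages is just for illustration.

```python
def estimate_pages(boxes, pages_per_box):
    """Multiply the number of boxes by the approximate pages in each box."""
    return boxes * pages_per_box


total = estimate_pages(620, 800)
shortfall = 500_000 - total

print(total)                          # 496000
print(f"{shortfall / 500_000:.1%}")   # 0.8%
```

The product comes to 496,000 pages, within one per cent of the round figure of half a million.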

Based on a series of tests, we estimated the project would take 2 years to complete – starting with the preparation of the material in advance, with photography coming into play a few months later. Two full-time staff would be focused on imaging the material, with two part-time members of staff preparing, tracking and assessing the items.

There are a range of logistical issues to bear in mind when planning and starting up a project of this nature. The boxes are stored in the basement stores, and had to be retrieved for a period of some months while the material was being worked on. We divided the collections into batches of a size that could be imaged in a period of 4-6 weeks, and we retrieve and return each batch as a unit, tracking all movements on a spreadsheet. The tracking spreadsheet also records information such as the location of each box in the batch, notes from the preparer for the archivists, photographers, and/or conservation staff, and the percentage of items in each box that can be OCR’d, among other things.
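A tracking spreadsheet of this kind can be pictured as one row per box. The sketch below is a guess at a minimal layout rather than our actual spreadsheet: the column names and the sample rows are hypothetical, and it writes CSV with the Python standard library.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BoxRecord:
    """One row of a hypothetical tracking sheet."""
    batch: int
    box: str
    location: str
    ocr_percent: int  # estimated share of items in the box that can be OCR'd
    notes: str = ""   # notes for archivists, photographers or conservators


def to_csv(records):
    """Serialise the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BoxRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Invented sample data, for illustration only.
records = [
    BoxRecord(1, "Box 1", "Imaging Studio", 80, "Remove staples"),
    BoxRecord(1, "Box 2", "Basement store", 40),
]
sheet = to_csv(records)
print(sheet)
```

Keeping the batch number on every row means a whole batch can be filtered out when it is retrieved or returned as a unit.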

We put a notice on our website of the entire schedule of archives to be digitised, so readers could see at a glance what would be unavailable and when. The catalogue records are also amended to show where material cannot be reserved. Each time a batch is retrieved, checked out, checked in, or dates altered, this has an impact on the website and two different cataloguing systems (the Archives and Manuscripts Catalogue, and the Library Catalogue), so communication with the departments responsible for retrieval and metadata was key.

The preparation staff were trained in advance by the conservation team so they could carry out basic stabilisation and first-aid work on the materials if required for digitisation. The photographers ran multiple tests on different equipment and with different cameras to ensure the workflow was efficient and appropriate to the formats of the material and its anticipated end use, and to ensure proper QA could be accommodated. Preparation and imaging take place in the Imaging Studio - ensuring that all staff are in close proximity and able to communicate easily with each other. The Imaging Studio was refitted with desks, shelving and equipment to make sure all the boxes in process at any one time could be accommodated. A further planning issue was determining how to assess and record different levels of sensitivity of information contained in the archives. We are currently developing a policy for access to archives that takes account of online display, and this has informed the workflow for assessment.

This project required liaison between several different departments and stakeholders in the Library in order to set up a suitable workflow. In future, we hope that workflow issues will be streamlined further by procuring a Workflow Tracking System that will serve to centralise tracking and monitoring of all digitisation projects. We anticipate that this pilot project will enable us in future to plan effectively for much larger digitisation projects as we work towards the digitisation of all suitable material held in the Wellcome Library.