Wellcome Digital Library: 2011

Monday, November 28, 2011

Guest blog post: Digitising early printed books at the Wellcome Library

Karine Larose and Jeremy Uhl are working on the Early European Books digitisation project at the Wellcome Library. They are employees of Diadeis, a digitisation company based in France, and work on site alongside the rest of the Wellcome's digitisation team. In this guest post, Karine and Jeremy explain what they do here at the Wellcome Library:

Hello, Karine and Jeremy here from Diadeis. We are the production team currently working on the digitisation project for ProQuest at the Wellcome Library. What follows is a sort of 'day in the life' of our department. We hope you enjoy it and would love any comments, questions, or suggestions.

Each day, when we start, new books are delivered to us for digitisation by Matt Brack, the Digitisation Support Officer. We then register them in the database (also known as the check-in process). First we assess the books to determine whether they are suitable for scanning. We might need to refuse a book if:

the book is too big or thick for the scanning machines to handle (usually higher than 42 cm and wider than 30 cm)
the book cannot open more than 100 degrees (later in the project these books will be scanned using different equipment)
the book is too fragile and we feel that we cannot scan it without causing more damage.
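
As a rough illustration only, the refusal criteria above could be written down as a simple check. This is a sketch, not the database or software actually used on the project: the `Book` fields and the `refusal_reasons` function are hypothetical names, and treating either oversized dimension as enough to refuse a book is an assumption. Fragility stays a human judgement, so it is passed in as a flag.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in the post. Refusing when EITHER dimension is exceeded
# is an assumption made for this sketch.
MAX_HEIGHT_CM = 42
MAX_WIDTH_CM = 30
MIN_OPENING_DEGREES = 100


@dataclass
class Book:
    """Hypothetical check-in record for one book."""
    height_cm: float
    width_cm: float
    opening_degrees: float
    too_fragile: bool = False  # operator's judgement, not measured


def refusal_reasons(book: Book) -> List[str]:
    """Return the reasons to refuse a book; an empty list means it can be scanned."""
    reasons = []
    if book.height_cm > MAX_HEIGHT_CM or book.width_cm > MAX_WIDTH_CM:
        reasons.append("too large for the scanners")
    if book.opening_degrees <= MIN_OPENING_DEGREES:
        reasons.append("cannot open more than 100 degrees")
    if book.too_fragile:
        reasons.append("too fragile to scan safely")
    return reasons


if __name__ == "__main__":
    print(refusal_reasons(Book(height_cm=45, width_cm=20, opening_degrees=90)))
```

A book that passes this check may still be put on hold for the minor problems described next.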

During the check-in process we might also notice other, minor problems with the books that are not severe enough for us to refuse them. In that case we usually put them on hold. An example would be if most of the pages are stuck or glued together. ‘Problem books’ on hold are placed on a separate shelf, and we later consult with Matt as to what action has to be taken. At the same time, the books we scanned on the previous days are given back to Matt and we update the database to show that they have been returned to the library.

Before scanning the pages of a book, we photograph its spine and edges. This step records the condition of the book before we start scanning. We do our best to be careful with the books during the scanning process, so that we return them in the state in which they were delivered. The books are scanned using two main scanners, one for restricted angles and one for opened books. A few problems we sometimes have when scanning books are:

huge fold-outs that cannot be scanned with the current equipment
missing pages
pieces of a book coming apart (usually the spine)

Our most common problem is dust! This includes tiny book pieces, occasionally hair, and once we even found tiny bits of cereal in a book. There are lots of surprises and we never know what we may find in between the pages of a book. Sadly we have yet to find any money in any of the books ;). Several times during the day we have to clean the glass panel; for that we use a special cloth and a vacuum cleaner. The archive office kindly supplied us with alcohol wipes, which have a disinfecting effect. Also, because some of the books have leather covers, our hands can become oily after handling them and we have to wash our hands regularly.

Once the scanning day is over, we pack all the images using packaging software and send them off to the indexing team for quality control, cropping and indexing.

Working on this project at the Wellcome library has been fun and exciting because we get to see rare books every day and provide digital access to them to people all over the world. We are pleased to be given the chance to explain our daily routine to the followers of this blog and look forward to any comments they might have.

We would like to thank Matt for being so flexible with our timetable and bringing us lots of books every day, and Christy for her support and for giving us the opportunity to write for the blog. For those of you who work for the Wellcome Library, do come along and say "hello" to us: the room may be dark but we assure you we are here!

Authors: Karine Larose and Jeremy Uhl

NB: The first batch of 400 books can now be seen on the EEB website (do a Quick Search for Wellcome). These are freely available to Wellcome Library members and everyone in the UK.

Wednesday, September 21, 2011

Preserving our digital assets #2 - WORM storage

This post follows on from my earlier description of our DAM system (Safety Deposit Box or SDB), which manages long-term preservation and access to our digital assets. Here we turn to the back-end of the back-end: storage.

There are a number of requirements that must be met to store digital assets safely. The storage solution must be:

Secure (behind a firewall)
Robust (able to manage points of failure in the disks)
Replicable (multiple copies on multiple sites)
Scalable (able to handle tens of millions of files)
Quick to access (the archived files are also used for delivery)

After considering different systems and suppliers – including robotic tape back-up – we settled on a solution that gives us the confidence we need for long-term preservation, and fits well with the Trust’s existing storage infrastructure. Our existing storage suppliers, Pillar Data Systems, have extended the existing RAID5 enterprise storage system used for all the Wellcome Trust’s business needs by incorporating a “Write Once, Read Many” (WORM) back-up storage server for use by the Wellcome Digital Library. Associated management software copies files from the main storage server to the WORM, and monitors the main server for file errors that can be “healed” using the WORM copy.

To explain this in a bit more detail, related to our primary requirements:

Security means that only authorised users or systems are able to access the files, and that unauthorised deletions or changes are guarded against. Locking master files behind a firewall is the main form of defence from unauthorised external access (i.e. hackers, or the ability to download files by "guessing" the network path and filenames). However, we still faced the prospect of file deletion, changes or corruption due to system failures, and accidental or malicious actions by otherwise authorised users. In order to eliminate this possibility, the Trust IT Department recommended permanent, on-line WORM storage as our back-up solution. Files stored on the WORM can be accessed, but they cannot be overwritten or deleted. This means that we have a permanent back-up of every digital file that cannot be tampered with.

Robustness is tied up with the WORM system - although the RAID5 storage we normally use is highly robust. It distributes the bit stream of digital files in such a way that points of failure generally do not damage the entire file, allowing it to be reconstituted, while short-term back-ups allow complete recovery of damaged or lost files. The WORM system adds a further element of confidence. Once a new file is stored on the main servers, it is copied to the WORM drive. The software managing this process checks the files on the main servers periodically, and if it finds a mismatch with the WORM drive (a lost or corrupted file on the main server), it pulls a copy of the WORM'ed file to the main server, thus "self-healing" the damage.
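
The check-and-restore cycle described above might be sketched as follows. This is an illustration only and not the vendor's management software, whose internals are not described here: the function name `heal_from_worm`, the two-directory layout and the use of SHA-256 digests to detect a mismatch are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()  # digest algorithm is an assumption
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def heal_from_worm(main_dir: Path, worm_dir: Path) -> List[str]:
    """Restore every main-server file that is missing or differs from its WORM copy.

    The WORM directory is only ever read; it is treated as the trusted copy.
    Returns the relative paths of the files that were restored.
    """
    healed = []
    worm_files = sorted(p for p in worm_dir.rglob("*") if p.is_file())
    for worm_file in worm_files:
        relative = worm_file.relative_to(worm_dir)
        main_file = main_dir / relative
        lost = not main_file.is_file()
        if lost or file_digest(main_file) != file_digest(worm_file):
            main_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(worm_file, main_file)  # pull the WORM copy back
            healed.append(relative.as_posix())
    return healed
```

Running such a function on a schedule would give the periodic check described in the post. Note the one-way flow: files only ever move from the WORM copy to the main server during healing, never the other way.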

Replicability means that there is more than one copy of the file. One lives on the main servers at the Wellcome Trust offices, and the other is stored outside of London. If either server is damaged in a serious accident, one server remains to keep the content safe.

Scalability is important, as we are creating millions of images during the pilot project, and will create up to 30 million images over the longer term as we digitise the Wellcome Library. All the systems that are in place must be able to increase capacity - both in terms of hardware and processing software. The system we selected is scalable – we simply need to add storage “bricks” and “racks” (additional hardware), and processing units to manage that additional hardware, as our data store grows over time.

Speed of access is also key. In order to keep our storage footprint as small as possible, we are using one file as both the master archive file (preserved in the long-term, and backed-up to the WORM) and as the dissemination file. Our DAM, SDB, will make copies of the archived files available to the front-end delivery system as the user requests images via the Wellcome Digital Library (or via back-end administrative systems), and this must be handled as quickly as possible – something RAID5 is particularly well suited for.

Monday, July 4, 2011

Preserving our digital assets #1 – SDB4

The successful long-term management of digital assets is a key concern for us as we build our digital resource. Until recently, we did not have a dedicated system in use for storing and managing master files for any digitised content, and our file backup system was not ideal for dealing with large sets of data. Files were managed via simple filesharing on dedicated storage servers. Backups were only created for some content, and even then the backups were not permanent.

When the Wellcome Digital Library was initiated it quickly became clear that we needed a dedicated system to manage our digital masters - something that was scalable, robust, and could handle all the digital formats we create or procure - including born digital material. We also needed a secure storage system with offsite, permanent backup capability. This blog post describes our digital asset management system; a future blog post will provide details of our secure storage solution.

We already had an existing digital asset management system in place to manage born digital archives: Safety Deposit Box (SDB), developed by Tessella. This system incorporates a suite of tools designed to manage and preserve digital files. It provides a context in which administrative and descriptive metadata is associated with all ingested content. SDB can be combined with tools that can "migrate" files from one format to another to counter format obsolescence and therefore ensure the longevity of the data. In other words, when Word 2010 is no longer supported by Microsoft, SDB can help migrate these files to a current format that is supported by software available at that time.

However, in order to manage preservation of large sets of digitised content, SDB “out of the box” was not entirely able to meet the needs of the Wellcome Digital Library. We carried out a Feasibility Study in 2010 to determine whether it would be possible to use a modified version of the system. The research and prototype system we commissioned proved that it was indeed feasible to use SDB with certain software extensions.

In the spring of 2011, we commissioned Tessella to extend and install the newest version of the software (SDB4). This work is now complete and in the testing phase.

Key preservation functionality that SDB4 provides is listed here (some of this was “out of the box”, and some was developed as extensions to the core system or as modules):

Automated ingest: automated SDB workflow to create and ingest a “submission information package” (SIP) – a bundle of content files and metadata forming a complete “object”

Multiple manifestations: ability to associate all the different manifestations of an object together (e.g. master video file, broadband versions, narrowband versions, transcripts, etc.)

SQL Server database: stores and indexes administrative and descriptive metadata describing objects stored in SDB

Characterisation: use of the JHOVE, DROID and PRONOM characterisation tools to extract essential technical information about digital files to be stored in the database

Format Migration: SDB4 builds on the PLANETS framework to support preservation planning for mitigation of obsolescence

Integrity checking: creates and stores a unique SHA-1 hash code for each file that can be used to test validity of the file over time

Provide access to content: delivers content to external systems using an API

Automatic export of administrative metadata: allows us to store metadata such as unique SDB identifiers in our workflow system in order to deliver files to the user

Administrative interface: allows administrators access to the content and the database, and to generate reports
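
To make the integrity-checking item above more concrete, here is a minimal sketch of recording a SHA-1 hash for each file in a package and re-testing the files later. It does not use or imitate the SDB API: `build_manifest` and `failed_checks` are hypothetical names, and keeping the hashes in a plain dictionary (rather than in the SDB database) is an assumption made for brevity.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha1_of(path: Path) -> str:
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> Dict[str, str]:
    """Record a hash for every file in a package at ingest time."""
    files = sorted(p for p in package_dir.rglob("*") if p.is_file())
    return {p.relative_to(package_dir).as_posix(): sha1_of(p) for p in files}


def failed_checks(package_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the files that are missing or no longer match their stored hash."""
    failures = []
    for relative, expected in manifest.items():
        path = package_dir / relative
        if not path.is_file() or sha1_of(path) != expected:
            failures.append(relative)
    return failures
```

The manifest would be built once, when the package is ingested, and `failed_checks` run at intervals afterwards. Any file it reports has changed or disappeared since ingest.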

This is by no means a complete description of SDB’s functionality, but the above provides a flavour of the most important features for long-term asset management and access to master files. Much of the development work that was done to extend SDB4 is likely to have an application for other users, not just the Wellcome Digital Library. Where possible, extensions are generic – although system requests/commands and metadata mappings may be highly specific to us.

We hope that our efforts in specifying our own needs will benefit other SDB users who see the value in long term preservation for digitised materials, and the efficiencies of combining born digital and digitised content into a single system strategy.

In future, SDB will interface with our workflow system and the image/media servers employed by our digital delivery system. These latter two systems have not yet been implemented.

Monday, April 4, 2011

Arabic manuscripts online - Fihrist launch

Fihrist, an online catalogue for Islamic manuscripts held at the Bodleian and Cambridge University Library (CUL), was launched on 28 March. Developed by the OCIMCO (Oxford & Cambridge Islamic Manuscripts Catalogue Online) project, the catalogue uses the TEI/XML schema created as part of the WAMCP (Wellcome Arabic Manuscript Cataloguing Partnership) project to structure the descriptions and to provide for future enhancement of those records.

Fihrist contains around 10,000 catalogue entries, retroconverted from hard copy lists and card catalogues at the Bodleian and CUL, providing an integrated online search tool for these large collections. Searching for a manuscript or a work shows you a list of works contained in the relevant manuscript, with separate descriptions provided for each work. The catalogue records are currently very brief, but the OCIMCO project "will eventually provide detailed manuscript descriptions that will include digital representations of the manuscripts themselves. The TEI/XML schema provides an extensible framework that will allow for these future enhancements" (About us).
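
To give a flavour of how a TEI/XML manuscript description can hold one entry per work, the sketch below parses a made-up record and lists the works it contains. The element names (`msDesc`, `msIdentifier`, `idno`, `msContents`, `msItem`) and the namespace come from the general TEI guidelines; the record itself is invented and heavily trimmed, and the real WAMCP schema and Fihrist records may well be structured differently. `list_works` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, trimmed record for illustration; not a real catalogue entry
# and not a complete, schema-valid TEI document.
SAMPLE_RECORD = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <msDesc>
          <msIdentifier>
            <repository>Example Library</repository>
            <idno>MS Example 1</idno>
          </msIdentifier>
          <msContents>
            <msItem n="1">
              <author>Example Author A</author>
              <title>First example work</title>
            </msItem>
            <msItem n="2">
              <author>Example Author B</author>
              <title>Second example work</title>
            </msItem>
          </msContents>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def list_works(tei_xml: str):
    """Return the shelfmark and one small dictionary per work in the manuscript."""
    root = ET.fromstring(tei_xml)
    shelfmark = root.findtext(
        ".//tei:msDesc/tei:msIdentifier/tei:idno", default="", namespaces=TEI_NS
    )
    works = []
    for item in root.iterfind(".//tei:msContents/tei:msItem", TEI_NS):
        works.append({
            "title": item.findtext("tei:title", default="", namespaces=TEI_NS),
            "author": item.findtext("tei:author", default="", namespaces=TEI_NS),
        })
    return shelfmark, works


if __name__ == "__main__":
    shelfmark, works = list_works(SAMPLE_RECORD)
    print(shelfmark, works)
```

Because each work sits in its own `msItem`, fuller descriptions can later be added to an item without changing the rest of the record.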

The project received funding from the JISC’s Digital Resources for Islamic Studies programme - a scheme that also part-funded the WAMCP project.

At the Fihrist launch at Clare College, Cambridge, a series of presentations from the OCIMCO project managers and staff provided further details of the collections and the technical implementation of the project; a presentation from the JISC on the Islamic Studies programme provided background to the funding scheme; a joint presentation was given by Richard Aspin and Nikolai Serikoff of the Wellcome Library, and Gerhard Brey of King's College London on the WAMCP project; and we heard about the like-minded Yale/SOAS Islamic Manuscript Gallery.

The day then turned to a follow up project that Oxford is carrying out over the summer to create a union catalogue of Islamic manuscript catalogues (the Islamic Studies Gateway). This project was recently awarded JISC funding, and will seek to develop Fihrist further to "provide cross searching of existing online manuscript resources ... that at present do not have a significant internet presence."

"In addition to creating the gateway itself, the aim of the Fihrist is to create a sustainable user community of Islamic manuscript metadata standards and cataloguing tools to ensure a long term commitment by stakeholders to supporting and developing the Fihrist beyond the lifetime of the project."

The WAMCP partnership (comprising the Wellcome Library, Bibliotheca Alexandrina, and King's College London) will collaborate on this project by providing open access to metadata for 500 manuscripts. This metadata, with digitised manuscripts, is expected to be publicly available online via the WAMCP website from Summer 2011.

Wednesday, March 23, 2011

The problem with orphans…

Sooner or later, anyone planning to digitise works created in the last hundred years or so has to face the problem of what to do with those which are still in copyright but for which no copyright holder can be traced: so-called "orphan works". On 17 March, three of us travelled a few hundred yards down the road to attend a workshop on this very issue. The workshop was arranged as part of the "Who owns the Orphans?" project from the Beyond Text programme, sponsored by the UK Arts and Humanities Research Council. The morning focused on the highlights of the interim report which has been produced while the afternoon was given over to panel discussions on different aspects of the problem. The attendees came from a range of perspectives, including representatives from libraries, archives, museums and galleries, as well as creators of content – authors and photographers – who brought a different angle to the discussions.

The day began with an admission that it was tricky even just to define orphan works, and it quickly became evident that there is no easy solution to the problem. Various approaches were proposed, including compulsory registration and blanket licensing schemes, but all have their drawbacks and none really address the problem of the historical orphan works which exist in all collections. Many institutions admitted to taking a risk-managed approach to orphan works in order to maximise what could be made available to the public online. The sense from the participants was that as long as the material in question was not of high commercial value, then making it available after a due diligence search for the copyright holder had drawn a blank might be infringing copyright but was preferable to not making these cultural works widely available.

Compulsory registration of creative works is already the norm for the music industry, hence it does not have the same problems with orphan works faced by much of the cultural sector. However, imposing compulsory registration on other creators such as authors and photographers could be more difficult: the records of copyright registration, which ended in 1912, show that few people actually bothered to register their work the last time compulsory registration existed. In any case, this still leaves the problem of tracing subsequent copyright holders after the death of the original creator and it does not alter the position of historical orphan works.

Licensing schemes are favoured by the EU, but they also have their problems, both for authors who wish to retain control of their work and for cultural institutions who are reluctant to pay collecting societies to license works where they have no contact with the author or their heirs. Some collecting societies might be happy to license orphan works, setting aside money to pay out in the event of a copyright holder making themselves known. However, although this would in effect act to indemnify the institution who licenses the material, it was pointed out that this would really be no different to taking out insurance.

While no-one wants to infringe copyright, if a copyright holder cannot be traced after a due diligence search has taken place, most people working in the cultural sector consider that making material available to the public online is part of their obligation as guardians of the material. Some orphan works may have low commercial value, or their value may come from their position as part of a whole collection. It was also noted that people who have inherited copyright in these works are often unaware that they own rights in items, but once the work becomes commercially valuable, these rightsholders magically appear.

It seems likely that current practices will continue for the time being, due to a lack of viable alternatives. However, the outcome of the Hargreaves Review may have an impact on how orphan works are dealt with in the future, and perhaps the final report from the “Who Owns the Orphans?” project will be able to suggest practical solutions to what is acknowledged to be an issue across the cultural sector.

Monday, February 28, 2011

Wellcome Library joins Strategic Content Alliance

The Wellcome Library is pleased to announce that we’re the latest partners to join the Strategic Content Alliance. You can read the press release here.

Why join? Well, as our digitisation programme gathers pace, one of the key issues we face is making sure that our plans are not developed in isolation. We’re in a lucky position: unlike many libraries and archives, we have the financial resources of the Wellcome Trust to draw upon to support our work. But one of our aims is to make sure that our digitised content is made available to – and is useful to – as wide a range of users as possible, and to make sure that we achieve best value for, and embody best practice in, our programme.

The aim of the Strategic Content Alliance is to bring together key public and not-for-profit sector organisations involved in the creation, management and exploitation of digital content for the common good. By facilitating high-level discussion between partners, the SCA aims to ensure the maximum return on public sector and charitable investment in digitisation. Being part of the Strategic Content Alliance enables us to keep abreast of what others – including the BBC, JISC and the British Library – are up to, and to identify and pursue opportunities for partnership. Some of these are around the sharing of expertise (such as Digipedia, a pilot project developed by SCA), tools to make digital content accessible, and business models that ensure the sustainability of the content we create and the repositories that house it (see for example this post on the SCA blog about work being undertaken by Ithaka S+R for the SCA). Others are about aggregating content and finding ways to complement each other’s work (as with the recent JISC meeting on the First World War Commemoration). The SCA also acts as a coordinating body to provide sound business intelligence and ensure effective advocacy on issues that cut across our shared interests, such as copyright and orphan works. Nor is the benefit limited to our in-house programmes: our involvement with the SCA will also help to inform the Wellcome Trust’s strategy towards the funding of digitisation projects elsewhere.