Wellcome Digital Library: 2010

Monday, December 6, 2010

Workflow Tools Workshop in Bath, 30 November 2010

On a freezing cold November day I travelled down to Bath to attend a one-day event run by UKOLN as part of the DevCSI program. Mahendra Mahey, Research Officer with UKOLN, was our host for the day. What may seem like an esoteric activity for the few is actually a fascinating subject for those with a logical and ordered approach to things. We’ve used some business modelling techniques in developing the basic model for our Workflow Tracking System so the workshop was very relevant to us.

The aim of the day was to introduce a small range of open source workflow tools to a wider audience and to set out the role and value of using software to model business processes. The day consisted of a series of presentations and product demonstrations.

The event brought together vendors, users and project managers who are interested in learning more about workflow and business process modelling tools. In some way or another all attendees were involved in planning or developing workflows within their own institutions.

The day started with a presentation by Tammo van Lessen, who gave an overview of the basics of business process modelling and of languages such as BPEL (Business Process Execution Language) and BPMN 2.0 (Business Process Model and Notation). Business process modelling is just a formal way of describing the processes that a business uses to do what it does. It can be used to describe each step in a process, the interdependencies between those steps and the sequence in which the steps must occur. Tammo talked about the history of business process modelling, how the tools have become more interactive and how they can now be used to directly build web-based interactive services.

Amol Vedak gave a demonstration of the Intalio Business Process Management System and discussed how it could be used as a tool for business transformation. Intalio is designed to combine the skills of business analysts and IT people. Groups of experts can use their respective skills to design and build business processes using a common language and set of tools. The simplicity of the tools means that processes can be quickly designed, tested and implemented.

The WSO2 Business Process Server, discussed by Paul Fremantle, is an open-source BPEL engine. Paul demonstrated the ease with which a workflow could be built diagrammatically. At the same time the software builds real web services that can be used by people or systems and integrated into an organisation’s business.

Of particular interest was the Taverna suite of scientific workflow tools, demonstrated by Stian Soiland-Reyes. This tool can be used to build the workflows that query public data sources. Taverna can use local tools or third party scripts to query services such as EBI’s BioMart. Taverna recognises that many scientists have data or their own tool sets that are held locally and it provides a means by which these can be used by others.

At the end of the day there was a series of ‘Lightning talks’ given by some of the workshop attendees, looking at workflow-related projects that they are currently working on. This provided a set of practical real-world examples of how business process modelling could be used.

The lessons of the day were quite clear. Using business process modelling tools is an effective way to help us better understand the processes that we use in our business, and it can help us design better, more efficient processes. In better understanding how we do what we do, we can better understand the risks to our business and redesign our processes relatively quickly and easily. We’re also able to use these techniques to run ‘what if’ scenarios and to test processes under different conditions, all of which can be done in a virtual world before implementation.

Monday, November 29, 2010

Wellcome Library releases an ITT for a Workflow Tracking System

If you’ve been reading our blog regularly you’ll know about how the Library plans to transform itself into a groundbreaking digital resource, allowing access to much of the Library’s material in digital form.

As part of this programme we’ve just released an invitation to tender (ITT) for a Workflow Tracking System (WTS). We’re looking for a system that will track and manage the processes around creating digital content – whether that content is digitised by us, digitised externally or born-digital archival material – and automate that activity as much as possible.

Within the Library, staff who want to add content to our Digital Library will do so using the Workflow Tracking System. This means using the WTS to record that all digital content, e.g. digitised books or archival collections, has been created correctly, has had its descriptive metadata attached, has been converted to JPEG 2000 (or some other appropriate format) and has been ingested into our digital object repository. The WTS will also create Metadata Encoding and Transmission Standard (METS) files. These will be used by the front-end system to deliver digital content to our users.
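The sequence of checks described above can be pictured as an ordered list of steps that each item must pass through. The following is only a minimal sketch of that idea and not a description of any system we have specified or procured; the step names, the `DigitalItem` class and the example identifier are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical step names, loosely based on the stages listed above.
STEPS = ["created", "metadata_attached", "converted_to_jpeg2000", "ingested"]


@dataclass
class DigitalItem:
    identifier: str
    completed: list = field(default_factory=list)

    def next_step(self):
        """Return the next outstanding step, or None when all are done."""
        for step in STEPS:
            if step not in self.completed:
                return step
        return None

    def complete(self, step):
        """Record a step, refusing to skip ahead in the sequence."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


item = DigitalItem("example-book-001")  # hypothetical identifier
item.complete("created")
item.complete("metadata_attached")
print(item.next_step())  # converted_to_jpeg2000
```

Enforcing the order in one place means an item cannot be marked as ingested before it has been converted, which is the kind of control a tracking system is meant to provide.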

Expressed simply, the WTS will play a critical central role in ensuring that all digital content that is destined for our Digital Library is created, quality controlled and ingested accurately and efficiently into the Library’s repository.

Wednesday, November 24, 2010

Wellcome Trust hosts JPEG 2000 seminar

The JPEG 2000 for the Practitioner seminar was hosted by the Wellcome Trust on 16 November, 2010. It was organised by the Wellcome Library and drew a sell-out crowd of over 80 people.

The aim of the seminar was to look at specific case studies of JPEG 2000 use, to explain technical issues that have an impact on practical implementation of the format, and explore the context of how and why organisations might choose to use JPEG 2000. The Wellcome Library started investigating JPEG 2000 as a strategy for storing its archival master images in 2009, and has recently started converting its backlog of images into JPEG 2000.

The programme for the event was posted to our JPEG 2000 blog, and further blogs provide edited highlights of the varied and informative talks given on the day. All the presentations are available online, hosted by the Digital Preservation Coalition. The Twitter stream from the day can be seen by searching the hashtag #jp2k10.

Friday, October 29, 2010

100,000 image milestone, and how we did it

A significant milestone was reached this week, with 100,000 images of Crick papers photographed. The Crick collection, previously described, includes around 285,000 images. That is over half the total that Laurie Auchterlonie and Tom Cox in the Imaging department will digitise over a period of 20 months as part of the Archives digitisation project.

Laurie and Tom use Canon Mark II digital SLR cameras mounted to columns attached to two copy stands. These allow the cameras to be automatically raised and lowered according to the size of the items being digitised. This is a fairly typical imaging set-up for this type of material. However, with a project this large, it is important to ensure that the minimum amount of time is taken to photograph, edit and manage the images. In spring 2010, the photographers spent time developing a workflow that would allow them to digitise the highest number of items possible, whilst not compromising on quality or care in handling. In fact, the time-saving measures resulted in higher-quality images and minimised the amount of handling required.

For example, "live view" screens allow the photographers to easily see and adjust the alignment of each item on the copy stand, and the degree to which it fits the frame of view. This saves time as the photographers do not have to look through the viewfinder (difficult when the camera is 6 or more feet above the ground), or take multiple shots to get it right. Post-processing work has also been almost completely eliminated, as has the need to reshoot items at a later date.

Purchasing higher columns limited the number of times lenses had to be changed. Larger items require the cameras to be raised quite high, and if the column isn't high enough, the photographer has to change to a different, shorter or wide-angle lens. The flexibility built into the workflow by these measures is highly advantageous when dealing with heterogeneous materials such as personal archives.

Another aspect of the workflow that had to be specifically tailored to archival collections was the storage and foldering of images so that they could easily be found and identified. Using the existing archive catalogue hierarchies, the foldering system allows the user to pinpoint the exact file or item to be viewed, whether to create copies for users or to carry out QA against the original items (a sample of images is checked against the originals by Julia Nurse, who prepares the items before photography, to ensure that filenaming is accurate and that items aren't being missed). Eventually, these folders will be rendered obsolete, as we implement a digital asset management system that will restructure the archive storage of our images on ingest. But it is important not to underestimate the need to access images during the pre-ingest process of digitisation and QA. And if you do not have a digital asset management system, it is even more important that ease of access is factored in from an early stage.
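To illustrate the idea of foldering that mirrors the catalogue hierarchy, here is a minimal sketch. It is not our actual convention: the root folder, the underscore-joined filename, the zero-padded image number and the `.tif` extension are all assumptions made for the example. Only the form of the catalogue reference (such as PP/CRI/E/1/13/10) comes from the archive itself.

```python
from pathlib import PurePosixPath


def image_path(root, reference, sequence):
    """Build a storage path that mirrors the catalogue hierarchy.

    reference: a catalogue reference such as "PP/CRI/E/1/13/10"
    sequence:  1-based position of the image within that item
    """
    levels = [part for part in reference.split("/") if part]
    # Hypothetical filename convention: reference levels joined by
    # underscores, followed by a zero-padded image number.
    filename = "_".join(levels) + f"_{sequence:04d}.tif"
    return PurePosixPath(root, *levels, filename)


print(image_path("/images/archives", "PP/CRI/E/1/13/10", 3))
# /images/archives/PP/CRI/E/1/13/10/PP_CRI_E_1_13_10_0003.tif
```

Because each folder level corresponds to a level in the catalogue, anyone who knows the reference of an item can navigate straight to its images without a separate index.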

A bit of preparation and testing makes a huge difference when setting up a new workflow. Even a minute saved per item means a large overall time saving when spread over hundreds of thousands of images.
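As a rough worked example of that saving, the calculation below uses the figure of around 285,000 images for the Crick collection. It assumes, purely for illustration, that one minute is saved on every image and that a working day is seven hours long.

```python
IMAGES = 285_000                # approximate size of the Crick collection
MINUTES_SAVED_PER_IMAGE = 1     # assumption: one minute saved on every image
HOURS_PER_WORKING_DAY = 7       # assumption: length of a working day

hours_saved = IMAGES * MINUTES_SAVED_PER_IMAGE / 60
working_days = hours_saved / HOURS_PER_WORKING_DAY

print(f"{hours_saved:.0f} hours, about {working_days:.0f} working days")
# 4750 hours, about 679 working days
```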

Wednesday, October 13, 2010

Papers, papers and yet more papers … preparing the Crick Archive for digitisation

When first faced with preparing around 300 boxes of Francis Crick’s personal papers for digitisation, I have to confess my heart sank. A far cry from the last very visual digitisation project of 3000 AIDS posters, I was daunted, not only by the very different content of this collection, but the sheer size of it – an estimated half a million items this time. How wrong I was. I feel privileged to have been given the opportunity to delve into one of the most incredible minds of our lifetime.

Although we tend to associate him only with his (and Jim Watson’s) discovery of the double-helix structure of DNA – and there is plenty of fascinating correspondence within the archive related to this – it is his research on the mind and consciousness in his later years that is truly ‘astonishing’ as he would put it. Through his endless correspondence with both fellow scientists and the general public, we get a real sense of his probing analysis of what makes our brains tick.

It is very easy to get side-tracked by such a collection, but I have to remember that my main task is to ensure the papers are in a suitable condition for the photographers to shoot. A daily scour of the collection is required to remove existing staples and flatten pages, and occasionally a conservator is required. For example, Crick’s heavily folded (since 1955) tracing sketches and calculations of Collagen Long Spacings required specialist equipment to flatten out.

Once a particular batch has been checked, data spreadsheets are then produced for the photographers so that they know what to expect in each box. Included in this data are an estimate of the percentage of OCR’able (Optical Character Recognition) text, a record of the current location of a particular batch and notes for the archivists’ attention. While doing this I cannot help but siphon off particularly interesting information which has been and will continue to be used for publicity about the project – see the recent BBC Audio Slideshow.

Further blogs providing updates on the digitisation project will follow in due course.

Top image: Crick's sketch of genetic code, 1965 (PP/CRI/E/1/13/10)
Bottom image: Francis Crick lecturing at Cambridge University (PP/CRI/A/1/2/9)

Friday, October 1, 2010

Preparing and planning a large archives digitisation project

Archives digitisation is currently underway in our Imaging Studio, with two full-time members and two part-time members of Library staff dedicated to preparing and digitising the items. We will talk more specifically about the work being carried out on these materials on this blog in the near future, but first we present an introduction to the setup and planning of the project.

Once the theme was chosen (Modern Genetics and its Foundations), and relevant collections identified (see previous blog post), we realised that we had quite a large job on our hands. The scope of the project was bigger than anything we had done before: 620 boxes of material, containing around 800 pages each, adds up to around half a million pages to be digitised.

Based on a series of tests, we estimated the project would take 2 years to complete – starting with the preparation of the material in advance, with photography coming into play a few months later. Two full-time staff would be focused on imaging the material, with two part-time members of staff preparing, tracking and assessing the items.

There is a range of logistical issues to bear in mind when planning and starting up a project of this nature. The boxes are stored in the basement stores, and had to be retrieved for a period of some months while the material was being worked on. We divided the collections into batches of a size that could be imaged in a period of 4-6 weeks, and we retrieve and return each batch as a unit, tracking all movements on a spreadsheet. The tracking spreadsheet also records information such as the location of each box in the batch, notes from the preparer for the archivists, photographers and/or conservation staff, and the percentage of items in each box that can be OCR’d.
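A tracking spreadsheet of this kind can be thought of as one row per box. The sketch below shows one possible shape for such a row and how it might be written out as CSV. The column names, box references and sample values are invented for illustration and are not taken from our actual spreadsheet.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class BoxRecord:
    # Hypothetical columns, based on the information described above.
    batch: int
    box_reference: str
    location: str
    ocr_percentage: int
    notes: str = ""


rows = [
    BoxRecord(1, "BOX-001", "Imaging Studio", 80, "staples removed"),
    BoxRecord(1, "BOX-002", "Basement store", 45),
]


def boxes_not_in(records, location):
    """List the boxes that are currently somewhere other than `location`."""
    return [r.box_reference for r in records if r.location != location]


# Write the records to an in-memory CSV file, one row per box.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(asdict(rows[0])))
writer.writeheader()
for record in rows:
    writer.writerow(asdict(record))

print(buffer.getvalue())
print(boxes_not_in(rows, "Imaging Studio"))  # ['BOX-002']
```

Keeping the location of every box in a single table makes it straightforward to answer the question that matters most when a batch is returned: is anything still missing?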

We put a notice on our website of the entire schedule of archives to be digitised, so readers could see at a glance what would be unavailable and when. The catalogue records are also amended to show where material cannot be reserved. Each time a batch is retrieved, checked out, checked in, or dates altered, this has an impact on the website and two different cataloguing systems (the Archives and Manuscripts Catalogue, and the Library Catalogue), so communication with the departments responsible for retrieval and metadata was key.

The preparation staff were trained in advance by the conservation team so they could carry out basic stabilisation and first-aid work on the materials if required for digitisation. The photographers ran multiple tests on different equipment and with different cameras to ensure the workflow was efficient and appropriate to the formats of the material and its anticipated end use, and to ensure proper QA could be accommodated. Preparation and imaging take place in the Imaging Studio, ensuring that all staff are in close proximity and able to communicate easily with each other. The Imaging Studio was refitted with desks, shelving and equipment to make sure all the boxes in process at any one time could be accommodated. A further planning issue was determining how to assess and record different levels of sensitivity of information contained in the archives. We are currently developing a policy for access to archives that takes account of online display, and this has informed the workflow for assessment.

This project required liaison between several different departments and stakeholders in the Library in order to set up a suitable workflow. In future, we hope that workflow issues will be streamlined further by procuring a Workflow Tracking System that will serve to centralise tracking and monitoring of all digitisation projects. We anticipate that this pilot project will enable us in future to plan effectively for much larger digitisation projects as we work towards the digitisation of all suitable material held in the Wellcome Library.

Friday, September 24, 2010

Digitising the archives: the Wellcome Library approach

Like most research libraries and archives repositories, the Wellcome Library is currently planning to digitise quantities of its unique holdings and provide remote access to the digitised content over the Web. Among the many challenges that such plans present, perhaps the most fundamental is the decision about what to digitise, or where to start. With almost limitless potential in the holdings but limited resources, what do we prioritise?

Some institutions have chosen to select their most popular collections, others those for which they can obtain commercial funding (which are often the same of course). The Wellcome Library has opted for a thematic approach: we aim to digitise a substantial proportion of our holdings by looking at various broad subject areas and creating integrated online resources to support research and discovery in those fields. Since digitisation and the internet enable the creation of virtual online archives by providing a single point of access to widely dispersed content, we intend to explore the integration of relevant content from the holdings of other institutions into the online resources that we eventually create.

The first theme, ‘Modern Genetics and its Foundations’, will focus on the development of the science of biological inheritance from the later 19th century onwards, and the growing understanding of its role in human health and disease during the 20th century. Arguably, this will represent the fundamental meta-narrative of modern medicine: the gradual integration of genetics into the clinic. Content relevant to this theme ranges from relatively early documentation on the basic science of heredity and on the study of inherited diseases, to material on the elucidation of the molecular basis of inheritance in the mid-20th century and the subsequent development of genomics.

Preparations for developing the theme are underway: over 600 boxes of personal and institutional papers held by the Wellcome Library’s archives department will be imaged to provide the substrate or bedrock of the theme. These include:

the papers of Francis Crick (1916-2004), molecular biologist and Nobel Prize winner
the notebooks of Fred Sanger (b.1918), biochemist and double Nobel Prize winner
the papers of Arthur Mourant (1904-1994), haematologist and geneticist
the papers of Hans Grüneberg (1907-1982), geneticist
the records of the MRC Blood Group Unit, 1935-95.

This material will form a core of documentation on some of the most important research on the theoretical underpinnings of the biology of inheritance, on genetics and gene sequencing in post-war Britain. To this we will add:

the papers of Sir Ernst Chain (1906-1979), biochemist and Nobel Prize winner
the papers of Norman Heatley (1911-2004), biochemist
the papers of Sir Peter Medawar (1915-1987), biologist and Nobel Prize winner
the papers of Dame Honor Fell (1900-1980), medical scientist.

Although more loosely connected with the theme, this material will help to document the contemporary scientific, intellectual and institutional context in which genetics and allied research took place.

More archival collections will be added as they become available for digitisation in future years. The selected collections will be digitised ‘cover to cover’ so their historical research potential will not be limited exclusively to questions around the given theme. We do, however, feel that the thematic approach both helps us address the issue of prioritisation in a more creative way than merely responding to perceived current user demand, and provides more potential for eventual integration of third-party content and thus the development of online virtual archives. It is in the elimination not only of geographical distance for the current researcher but also of the vagaries of historical dispersal of papers that the technologies of digitisation really come into their own.

Author: Richard Aspin, Head of Research and Scholarship, Wellcome Library

Monday, September 20, 2010

What will the Wellcome Digital Library offer?

The overall strategy of the Wellcome Digital Library is to support three activity layers aimed at different user behaviours:

Engage – by highlighting the range of material available in the Library, and using actively curated content to encourage visitors to investigate further;
Discover – allow users to investigate our holdings by searching or browsing on subject themes and names, and retrieving a mixture of actively curated and automatically generated content;
Research – enable users to conduct a single search which will identify all relevant material in the Library, including digitised and non-digitised holdings, and allow users to facet, select and manipulate this content as needed.

To support these activities the following IT systems will be procured and developed over the next 2 years:

Search and discovery – to encourage users to engage with our content;
Delivery - to provide access to the content;
Workflow – to manage all aspects of the digitisation processes;
Storage – to ensure that digitised content can be preserved securely;
Digital asset management – to manage the digital objects that are created.

Through these systems we will seek to provide our users with the ability to:

Find relevant materials through fast, accurate, and comprehensive search functions, including full-text search;
View, download and reuse content under a range of licenses, including Creative Commons licenses where appropriate;
Engage with the content through a variety of Web 2.0 and other tools that will include the ability to comment on and tag content and provide transcriptions.

Not only will the digital library be technically capable of supporting these activities, but there will be a wide range of resources on offer, with a critical mass of content from the Library's holdings. As much as possible, discrete collections will be digitised and made available in their entirety, with cover-to-cover imaging employed as standard (more soon!).

Monday, September 13, 2010

Wellcome Digital Library blog

In August 2010, the Wellcome Library announced an ambitious plan to develop a world-class digital resource for the History of Medicine. The core of this resource will be digitised content from the Library's own holdings, although funding will also be made available to others to digitise complementary collections for inclusion in the digital library.

As we move into the world of large-scale digitisation - with a short-term plan of 1m images online in the next 2 years - a number of questions, issues and opportunities await us. We have already started tackling some of the big questions, such as:

What should we digitise?
What content is of most value to researchers?
What online toolset should we offer researchers?
How can we use the digital library to encourage learning and discovery?

And of course there are the nitty-gritty technical issues, including:

Logistics of digitisation and workflows.
In-house vs. outsource options.
Metadata.
Long-term data management.
Delivery formats, speeds, and functions.

As we work through these issues, and progress with our digitisation programme, we will use our new Wellcome Digital Library blog as a real-time progress report, discussion outlet, and notification area.