Tuesday, May 15, 2012

Serving servers: a technical infrastructure plan


As we aim to provide a fast, efficient and robust technical architecture for the Wellcome Digital Library, the Wellcome Trust IT department has been working closely with our software suppliers to specify a suitable server architecture. This work is still in progress, but we now have the skeleton idea of how many servers we are likely to need and for what purposes. The scale of the architecture requirements shows that setting up and delivering digital content is a significant undertaking.

In order to serve up millions of images, plus thousands of A/V files, born-digital content and the web applications that make them accessible, we believe we’ll need around 17 (virtual) servers for the production environment (the “live” services), and a further 10 servers for our staging and development environments. In the production environment, nearly every server is duplicated to ensure redundancy and a smooth delivery service, which is why the numbers are so high. The content management system coupled with its SQL database requires four servers, for example. The image delivery environment needs six servers, covering data delivery, on-the-fly image conversion and tile creation, and the media proxy servers that create digital content URLs which, for security reasons, divorce the user-request mechanism from the actual content held on our servers.

Most of the servers run on Windows Server 2008, although our image server (IIPImage) will run on Ubuntu Linux. The virtual servers share CPUs, but the number of CPUs available means that each server gets the equivalent of either 2 or 4 CPUs, leading to a total requirement of 48 CPUs (288 cores, as each CPU has six cores). RAM varies from 2 GB to 8 GB depending on the anticipated usage of a particular application on that server. The total RAM requirement for the production architecture is estimated at 124 GB. These specifications are currently our best guess, and will be tested in the weeks to come as we start to deploy the hardware.

The staging environment allows system upgrades, patches or new development work to be applied and tested separately from the live production environment. This means that any changes can be tested thoroughly before they are made publicly visible and/or usable. Actual development work is carried out in the development environment, before deployment for final testing on the staging servers. This means that applications such as the web content management system and the delivery system applications must be replicated in these two additional environments, along with their server requirements.

With thanks to David Martin, IT Project Manager, as the source of my information.

Friday, May 11, 2012

Developing a player for the Wellcome Digital Library

Previous posts here have covered the digitisation of books and archives and the storage of the resulting files (mostly JPEG2000 images, but some video and audio too). Now it’s time to figure out how visitors to the Wellcome Library site actually view these materials via a web browser.

The digitisation workflow ends with various files being saved to different Library back-end systems:

  • The METS file is a single XML document that describes the structure of the book or archive, providing metadata such as title and access conditions. 
  • Each page of the book (or image of an archive) is stored as a JPEG 2000 file in the Library’s asset management system, Safety Deposit Box (SDB). Each image file in SDB has a unique filename (in fact a GUID), and this is referenced in the METS file. So, given the METS file and access to the asset management system, we can retrieve the correct JPEG 2000 images in the correct order (see the sketch after this list). 
  • Additional files might be created, such as METS-ALTO files containing information about the positions of individual words on a digitised page; we’ll want to use this information to highlight search results within the text. 
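To make this concrete, here is a minimal sketch (in TypeScript) of the kind of data model that falls out of these files. All of the type and field names are illustrative only, not taken from our actual schema.

```typescript
// Illustrative only: field names are hypothetical, not our real schema.

// One page (or archive image): the METS file gives us the order,
// and the SDB GUID tells us which JPEG 2000 file to fetch.
interface DigitisedImage {
  sdbGuid: string;          // unique filename of the JPEG 2000 in SDB
  label?: string;           // e.g. "p. 35" or "front cover"
  altoGuid?: string;        // optional METS-ALTO file with word positions
}

// The work as a whole, as described by its METS file.
interface DigitisedWork {
  catalogueNumber: string;  // e.g. "b123456"
  title: string;
  accessCondition: string;  // e.g. "open" or "requires registration"
  images: DigitisedImage[]; // in reading order
}
```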
So how do we use these files to allow a site visitor to read a book?

Rendering JPEG 2000 files

Our first problem is that we can’t just serve up a JPEG 2000 image to a web browser – the format is not supported. And even if it were, the archival JPEG 2000 files are large: several megabytes each. The solution to this problem is familiar from services like Google Maps – we break the raw image up into web-friendly tiles and serve them at different resolutions (zoom levels). When you use Google Maps, you can keep dragging the map around to explore pretty much anywhere on Earth – but your browser doesn’t load one single enormous map of the world. Instead, the map is delivered to you as 256x256 pixel image files called tiles, and your browser only requests the tiles needed to show the area of the map visible in your browser’s viewport. Each tile is quite small and hence very quick to download – here’s a Google map tile that shows the Wellcome Library:

http://mt1.google.com/vt/lyrs=m@176000000&hl=en&src=app&x=65487&s=&y=43573&z=17&s=Ga

Google Maps is a complex JavaScript application that causes your browser to load the right tiles at the right time (and in the right place). This keeps the user experience slick. We need that kind of user experience to view the pages of books.
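The arithmetic behind “the right tiles at the right time” is simple enough to sketch. The following is a rough illustration (not Google’s or Seadragon’s actual code) of how a viewer might work out which 256x256 tiles cover the visible viewport at a given zoom level:

```typescript
const TILE_SIZE = 256;

interface TileRange {
  minCol: number;
  maxCol: number;
  minRow: number;
  maxRow: number;
}

// Given the viewport's position and size in pixels at the current
// zoom level, work out which tile columns and rows are visible.
function visibleTiles(
  viewportX: number,      // left edge of the viewport, in pixels at this zoom level
  viewportY: number,      // top edge
  viewportWidth: number,
  viewportHeight: number
): TileRange {
  return {
    minCol: Math.floor(viewportX / TILE_SIZE),
    maxCol: Math.floor((viewportX + viewportWidth - 1) / TILE_SIZE),
    minRow: Math.floor(viewportY / TILE_SIZE),
    maxRow: Math.floor((viewportY + viewportHeight - 1) / TILE_SIZE),
  };
}

// A 1024x768 viewport only ever needs a handful of tiles at a time,
// however large the underlying map (or book page) is.
```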

There are several JavaScript libraries available that solve the difficult problem of handling the viewport and generating the correct tile requests in response to user pan and zoom activity. We’ve settled on Seadragon, because we really like the way it zooms smoothly (via alpha blending as you move from one zoom level’s tiles to another). A very nice existing example of this is at the Cambridge Digital Library’s Newton Papers project:

http://cudl.lib.cam.ac.uk/view/PR-ADV-B-00039-00001/

This site uses a viewer built around Seadragon; an individual tile looks like this:

http://cudl.lib.cam.ac.uk/content/images/PR-ADV-B-00039-00001-000-00105_files/11/3_2.jpg

The numbers on the end indicate that this JPEG tile is for zoom level 11, column 3, row 2. As you explore the image, your browser makes dozens, even hundreds of individual tile requests like this. It feels fast because each individual tile is tiny and downloads in no time.
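The Deep Zoom naming convention used in that URL is straightforward: the base path is whatever the site uses for the image, and the rest is just level/column_row. A small sketch, using the Cambridge URL above as the example base path:

```typescript
// Build a Deep Zoom style tile path: "<base>_files/<level>/<col>_<row>.jpg".
function deepZoomTileUrl(base: string, level: number, col: number, row: number): string {
  return `${base}_files/${level}/${col}_${row}.jpg`;
}

// Reproduces the Cambridge tile above: zoom level 11, column 3, row 2.
const url = deepZoomTileUrl(
  "http://cudl.lib.cam.ac.uk/content/images/PR-ADV-B-00039-00001-000-00105",
  11, 3, 2
);
// -> ".../PR-ADV-B-00039-00001-000-00105_files/11/3_2.jpg"
```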

For more about tiled zoomable images, these blog posts are an excellent introduction:

So how do we get from a single JPEG2000 image to hundreds (or even thousands) of JPG tiles? It’s possible to prepare your image tiles in advance, so that you process the source image once and store folders of prepared tiles on your web server. For small collections of images this is a simple way to go and doesn’t require anything special on the server. But for the Library, it’s not practical – we don’t want to prepare tiles as part of the digitisation workflow. They are not “archival”, and they take up a lot of extra storage space. We need something that can generate tiles on the fly from the source image, given the tile requests coming from the browser.

For this we need an Image Server, and we’ve chosen IIPImage for its performance and native Seadragon (Deep Zoom) support. The Image Server generates browser-friendly JPEG images from regions of the source image at particular zoom levels. When your browser makes a request to the image server for a particular tile, the image server extracts the required region from the source JPEG 2000 file and serves it up to you as an ordinary JPEG.
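To illustrate what “extracting a region” means in practice, here is a hedged sketch of the kind of request an IIP-compatible image server can answer, using the IIP protocol’s FIF (source image), RGN (region as fractions of the full image), WID (output width) and CVT (output format) parameters. The server URL and image path below are made up for the example.

```typescript
// A minimal sketch of an IIP protocol request for a region of a JPEG 2000,
// returned as an ordinary JPEG. Server URL and image path are examples only.
function iipRegionUrl(
  server: string,      // e.g. "http://images.example.org/fcgi-bin/iipsrv.fcgi"
  image: string,       // path of the source JPEG 2000 on the image server
  x: number,           // region left edge, as a fraction (0..1) of the full width
  y: number,           // region top edge, as a fraction of the full height
  w: number,           // region width, as a fraction
  h: number,           // region height, as a fraction
  outputWidth: number  // width in pixels of the JPEG to return
): string {
  return `${server}?FIF=${image}&RGN=${x},${y},${w},${h}&WID=${outputWidth}&CVT=jpeg`;
}

// Ask for the top-left quarter of a page image, scaled to 256 pixels wide.
const tile = iipRegionUrl(
  "http://images.example.org/fcgi-bin/iipsrv.fcgi",
  "/data/b123456/0035.jp2",
  0, 0, 0.25, 0.25, 256
);
```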

Viewer or Player? Or Reader? 

The next piece of the puzzle is the browser application that makes the requests to the server. A book or archive is a sequence of images along with a lot of other metadata. And it’s not just books – the Library also has video and audio content. All of these are described in detail by METS files produced during the digitisation/ingest workflow. In the world of tile-based imaging, the term “viewer” is often used to describe the browser component of the system, but we seem to have fallen naturally into using the term “Player” to describe it – it plays books, videos and audio, so “Player” it is. Our player needs to be given quite a lot of data to know what to play.

We could just expose the METS file directly, but it is large and complex and much of it is not required in the Player. So we’re developing an intermediate data format, which effectively acts as the public API of the Library. Given a Library catalogue number, the player requests a chunk of data from the server; this tells it everything it needs to know to play the work, in a much simpler format than the METS file. In the future other systems could make use of this API (at the moment it’s exposed as JSON).
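To give a flavour of what the player receives, here is an illustrative sketch of a request for that chunk of data. The endpoint and all the field names are hypothetical, not the Library’s actual format.

```typescript
// Illustrative shape of the data the Player asks for; the endpoint
// and field names are hypothetical, not the Library's real API.
interface PlayerPack {
  catalogueNumber: string;
  title: string;
  mediaType: "book" | "video" | "audio";
  accessCondition: string;
  images: {
    tileSource: string;   // where the Player fetches tiles for this page
    width: number;        // full-resolution pixel dimensions
    height: number;
    label?: string;       // e.g. "p. 35"
  }[];
}

// Fetch the pack for a given catalogue number.
async function loadWork(catalogueNumber: string): Promise<PlayerPack> {
  const response = await fetch(`/service/pack/${catalogueNumber}`);
  if (!response.ok) {
    throw new Error(`Could not load ${catalogueNumber}: ${response.status}`);
  }
  return (await response.json()) as PlayerPack;
}
```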

The user experience 

The user won’t just be viewing a sequence of images, like a slide show. It should be a pleasant experience to read a book from cover to cover. Many users will be using a tablet, reading pages in portrait aspect ratio. We aim to make this a good e-reading experience too, augmented by search and navigation tools.

The user experience might start with a search result from the Library’s main search tool. For books that have been digitised, the results page will provide an additional link directly to the player “playing” the digitised book. The URL of the book is an important part of the user experience, and we want to keep it simple. In future, library.wellcome.ac.uk/player/b123456 would be the URL of the work with catalogue reference number b123456; that URL would take you straight to the player.

We want to be able to link directly to a particular page of a particular book, just as a printed citation could. This deeper URL would be /player/b123456#/35. But we can do better than that; our URL structure should extend to describe the precise region of a page, so that one reader could line up a particular section of text on a page, or a picture, and send the URL to another reader; the second reader would see the work open at the same page, and zoomed in on the same detail.
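As a sketch of how the player might interpret such a URL, the fragment could carry the page index and, optionally, a zoom region. The region syntax below is an assumption for illustration; we haven’t settled on a final form.

```typescript
// Hypothetical deep-link format, for illustration only:
//   #/35                   -> page 35
//   #/35/0.2,0.3,0.4,0.25  -> page 35, zoomed to a region given as fractions
//                             of the page (x, y, width, height)
interface DeepLink {
  page: number;
  region?: { x: number; y: number; w: number; h: number };
}

function parseDeepLink(hash: string): DeepLink | null {
  const match = hash.match(/^#\/(\d+)(?:\/([\d.]+),([\d.]+),([\d.]+),([\d.]+))?$/);
  if (!match) {
    return null;
  }
  const link: DeepLink = { page: parseInt(match[1], 10) };
  if (match[2]) {
    link.region = {
      x: parseFloat(match[2]),
      y: parseFloat(match[3]),
      w: parseFloat(match[4]),
      h: parseFloat(match[5]),
    };
  }
  return link;
}

// parseDeepLink("#/35") -> { page: 35 }
```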

Access Control 

Much of the material being made available is still subject to copyright. Those works that are cleared for online publication by the Trust’s copyright clearance strategy still need some degree of access control applied to them; typically the user will be required to register before viewing them. This represents a significant architectural challenge, because we need to enforce access restrictions down to the level of individual tile requests. We don’t want anyone “scraping” protected content by making requests for the tiles directly, bypassing the player.
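As a rough sketch of the idea (not our actual implementation), every tile request passes through a check of the user’s session before any content is fetched and returned. The example below uses Express-style routing; the route shape and the isAuthorised helper are assumptions for illustration.

```typescript
import express from "express";

const app = express();

// Hypothetical helper: decides whether the current session may see this item,
// e.g. by checking a registration cookie against the work's access condition.
function isAuthorised(req: express.Request, catalogueNumber: string): boolean {
  // Placeholder logic for the sketch.
  return Boolean(req.headers.cookie && req.headers.cookie.includes("wdl_session="));
}

// Every tile request is checked before anything is fetched from the image
// server, so protected tiles cannot be scraped by bypassing the Player.
app.get("/tiles/:catalogueNumber/:level/:tile", (req, res) => {
  const { catalogueNumber, level, tile } = req.params;
  if (!isAuthorised(req, catalogueNumber)) {
    res.status(403).send("Please register and sign in to view this item.");
    return;
  }
  // Only now would we proxy the request on to the image server (details omitted).
  res.status(501).send(`Would fetch ${catalogueNumber} tile ${level}/${tile} here.`);
});

app.listen(3000);
```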

Performance and Scale 

As well as the technical challenges involved in building the Player, we need to ensure that content is served to the player quickly. Ultimately the system will need to scale to serve millions of different book pages. Between the player and the back end files is a significant middle tier: the Digital Delivery System, of which the Player is a client. This layer is the Library’s API for Digital Delivery. The browser-based player interacts with it to retrieve data to display a book, highlight search results, generate navigation and so on. The Image Server is a key component of this system.

This post was written by Tom Crane, Lead Developer at Digirati, working with his colleagues on developing digital library solutions for the Wellcome Digital Library.