AutExp09:web-native-notebook

If we are to consider a web-native approach to capturing the scientific record, we need to consider the laboratory notebook. The lab notebook is, at its core, a journal of events: an episodic record containing dates, times, and bits and pieces of often disparate material, cut and pasted into a paper notebook. There are strong analogies between this view of the lab notebook as a journal and the functionality of Blogs. Blogs contain posts which are dated, usually linked to a single author, and may contain embedded digital objects such as images or videos, or indeed graphs and charts generated from online datasets. In fact most people who use existing online services as laboratory notebooks use Wikis rather than Blogs (http://usefulchem.wikispaces.com, http://deferentialgeometry.org). There are a number of reasons for this: better user interfaces, sometimes better services and functionality, stronger versioning, and in some cases personal preference. At one level this distinction is important because it is a strong indicator of the functionality and interface requirements for a desirable online lab notebook. In another way, however, the distinction is unimportant. Wikis and Blogs are both date stamped, carry authorship information, and enable commenting. Most importantly, both create objects (posts or pages) that are individually addressable on the web via a URL.

A "semantic web ready" laboratory record

The creation of individually addressable objects is crucial because it enables these objects, whether they are datasets, protocols, or pointers to physical objects such as samples, to take part in the semantic web (Shadbolt et al 2006). The root concept of the semantic web is that the relationships between objects can be encoded and described. For this to be possible those objects must be uniquely addressable in some form on the web. By creating individual posts or pages the researcher is creating these individual objects, which again can represent physical objects, processes, or data. The relationships between these objects can be described separately, for example in a triplestore, or locally via statements within the posts. However, the simplest way to express relationships that directly leverages the existing toolset on the web is by linking posts together.
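As a rough illustration of relationships between addressable objects, the sketch below stores a few subject-predicate-object statements about posts and looks up everything connected to a given URL. The example.org URLs and the predicate names are invented for illustration; they are not drawn from any particular vocabulary or triplestore.

```python
from collections import namedtuple

# A triple states that `subject` is related to `obj` by `predicate`.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical post URLs; each stands for a sample, a procedure, or a dataset.
SAMPLE = "http://example.org/notebook/sample-12"
PROCEDURE = "http://example.org/notebook/procedure-7"
DATASET = "http://example.org/notebook/data-31"

# Hypothetical predicate names, not taken from a formal vocabulary.
triples = [
    Triple(PROCEDURE, "used_input", SAMPLE),
    Triple(PROCEDURE, "produced", DATASET),
]


def related_to(url, triples):
    """Return every (predicate, other_url) pair in which `url` appears."""
    found = []
    for t in triples:
        if t.subject == url:
            found.append((t.predicate, t.obj))
        elif t.obj == url:
            found.append((t.predicate, t.subject))
    return found


for predicate, other in related_to(PROCEDURE, triples):
    print(predicate, other)
```

A plain hyperlink between two posts is the degenerate case of such a statement, one in which the predicate is left unstated.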

Feeds change the lab notebook from a personal record to a collaborative document

The other key functionality of the web to focus on is that of the "feed". Feeds, whether they are RSS or Atom, are regularly updated XML documents that provide a stream of "events" which can then be consumed by various readers, Google Reader being one of the most popular. Along with the idea of hyperlinks between objects, the feed provides the crucial difference between the paper-based and the web-native lab notebook. A paper notebook (whether it is a physical object or "electronic paper") is a personal record. The web-native lab notebook is a collaborative notification tool that announces when something has happened, when a sample has been created, or when a piece of data has been analysed.
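To make the idea of a feed as a stream of events concrete, here is a minimal sketch that reads a small Atom document with Python's standard library and lists its entries, newest first. The feed content and the example.org URLs are made up for illustration; a real notebook feed would carry more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds.
NS = {"atom": "http://www.w3.org/2005/Atom"}

# A made-up two-entry feed standing in for a notebook's real feed.
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Lab notebook</title>
  <entry>
    <title>Sample 12 prepared</title>
    <link href="http://example.org/notebook/sample-12"/>
    <updated>2009-06-01T10:00:00Z</updated>
  </entry>
  <entry>
    <title>Data 31 analysed</title>
    <link href="http://example.org/notebook/data-31"/>
    <updated>2009-06-02T09:30:00Z</updated>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return (updated, title, href) tuples, newest entry first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", NS):
        updated = entry.findtext("atom:updated", namespaces=NS)
        title = entry.findtext("atom:title", namespaces=NS)
        link = entry.find("atom:link", NS)
        href = link.get("href") if link is not None else None
        entries.append((updated, title, href))
    # Timestamps in one consistent ISO 8601 form sort correctly as strings.
    return sorted(entries, key=lambda e: e[0] or "", reverse=True)


for updated, title, href in read_entries(FEED_XML):
    print(updated, title, href)
```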

Despite the historical tendency towards isolated research groups discussed above, these independent groups are banding together as research funders demand larger coordinated projects. Tasks are divided up by expertise and in many cases also divided geographically between groups that have in the past probably not even had good internal communication systems. Rapid and effective communication between groups on the details of ongoing projects is becoming more and more important, and its absence is increasingly a serious deficiency in the management of these collaborations. In addition, reporting back to sponsors via formal reports is an increasing burden. The notification systems enabled by the generation of feeds go a significant way towards providing a means of dealing with these issues. Within a group the use of feeds and feed readers can provide an extremely effective means of pushing information to those who need to either track or interact with it (Figure 1). It is not a major step from this to providing streams of information that provide highlights for project sponsors. The web-native lab notebook brings the collaborative authoring and discussion tools provided by the read-write web to bear on the problem of communicating research results.


Figure 1. Using feeds and feed readers to aggregate and push laboratory records. A) A screenshot of Google Reader showing an aggregated feed of laboratory notebook entries from http://biolab.isis.rl.ac.uk. Two buttons are highlighted which enable "sharing" to anybody who follows the user's feed or adding a tag. B) Sharing can also include annotating the entry with further information or tagging the entry to place it in a specific category. C) A new feed is created for each tag which can also be consumed by readers with a specific interest such as collaborators, regulatory agencies, or funders.

Integrating tools and services

With the general concept of the record as a blog in place, we can create a set of individually addressable objects, link them together, and provide feeds describing the creation of these objects. We can now consider what tools and services we need to author and to interact with these objects. Again blogs provide a good model here, as many widely used authoring tools can be used directly to create documents and publish them to blog systems. Recent versions of Microsoft Office include the option of publishing documents to blogs and other online services. A wide variety of web-based tools and plugins are available to make the creation and linking of blog posts easy. Particularly noteworthy are tools such as Zemanta, a plugin which automatically suggests appropriate links for concepts within a post (http://zemanta.com). Zemanta scans the text of a post and identifies company names and concepts that are described in Wikipedia and other online information sources, using an online database that is built up from the links created by other users of the plugin.

Sophisticated semantic authoring tools such as the Integrated Content Environment (ICE) developed at the University of Southern Queensland (Sefton, 2006) provide a means of directly authoring semantic documents that can then be published to the web. ICE can also be configured to incorporate domain-specific semantic objects that generate rich media representations such as three-dimensional molecular models. These tools are rapidly becoming very powerful and highly usable, and will play an important role in the future by making rich document authoring straightforward.

Where do we put the data?

With the authoring of documents in hand we can consider the appropriate way of handling data files. At first sight it may seem simplest to upload data files and embed them directly in blog posts. However, the model of the blog again points us in a different direction. On a blog, images and video are not generally uploaded directly; they are hosted on an appropriate, specialised external service and then embedded on the blog page. Issues about managing the content and providing a highly user-friendly viewer are handled by the external data service. Hosting services are optimized for handling specific types of content: Flickr for photos, YouTube (or Viddler or Bioscreencast) for video, Slideshare for presentations, Scribd for documents. In an ideal world there would be a trustworthy data hosting service, optimized for your specific type of data, that would provide cut and paste embed codes giving the appropriate visualizations in the same way that videos from YouTube can easily be embedded.

Some elements of these services exist for research data. Trusted repositories exist for structural data, for gene and protein sequences, and for chemical information. Large-scale projects are often required to put a specific repository infrastructure in place to make the data they generate available. And in most cases it is possible to provide a URL which points at a specific data item or dataset. It is therefore possible in many cases to provide a link directly to a data service that places a specific dataset in context, can be relied on to have some level of curation or quality control, and provides additional functionality appropriate to the data type. What is less prevalent is the type of embedding functionality provided by many consumer data repository services.

ChemSpider (recently purchased by the Royal Society of Chemistry) is one example of a service that does enable the embedding of both molecules and spectra into external web pages. This is still clearly an area for development, and there are discussions to be had about both the behind-the-scenes implementation of these services and the user experience, but it is clear that this kind of functionality could play a useful role in helping researchers to connect up information on the web. If multiple researchers use the ChemSpider molecule embedding service to reference a specific molecule then all of those separate documents can be unambiguously identified as describing the same molecule. This linking up of individual objects through shared identifiers is precisely what gives the semantic web its potential power.
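The value of shared identifiers can be sketched in a few lines: if each document records which identifiers it embeds, inverting that mapping immediately shows which documents describe the same object. The document URLs and the "molecule:" identifiers below are invented placeholders and do not reflect ChemSpider's actual identifier format or API.

```python
from collections import defaultdict

# Hypothetical documents, each listing the molecule identifiers it embeds.
documents = {
    "http://example.org/lab-a/post-3": ["molecule:1234"],
    "http://example.org/lab-b/post-9": ["molecule:1234", "molecule:5678"],
    "http://example.org/lab-c/post-1": ["molecule:5678"],
}


def index_by_identifier(documents):
    """Map each identifier to the set of document URLs that reference it."""
    index = defaultdict(set)
    for url, identifiers in documents.items():
        for identifier in identifiers:
            index[identifier].add(url)
    return index


index = index_by_identifier(documents)
for identifier in sorted(index):
    print(identifier, sorted(index[identifier]))
```

No coordination between the three laboratories is needed beyond their use of the same identifier.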

A more general question is the extent to which such repositories can or will be provided and supported for less common data types. The long-term funding of such data repositories is at best uncertain and at worst non-existent. Institutional repositories are starting to play a role in data archiving and some research funders are showing an interest. However there is currently little or no coordinated response to the problem of how to deal with archiving data in general. Piecemeal solutions and local archiving are likely to play a significant role. This does not necessarily make the vision of linked data impossible; all that is required is that the data be placed somewhere where it can be referenced via a URL. But the extent to which specialist data repositories can be resourced will determine the extent to which rich functionality to manipulate and visualize that data will be available. In our model of the Blog as a lab notebook a piece of data can be uploaded directly to a post within the blog. This provides the URL for the data, but will not in and of itself enable visualization or manipulation. Nonetheless the data will remain accessible and addressable in this form.

A key benefit of this way of thinking about the laboratory record is that items can be distributed in many places depending on what is appropriate. It also means that it is possible to apply PageRank-style algorithms, and link analysis more generally, to large quantities of posts. Most importantly it encodes the relationships between objects, samples, procedures, data, and analysis in the way the web is tooled up to understand: the relationships are encoded in links. This is a lightweight way of starting to build up a web of data; it doesn’t matter so much to start with whether this is in RDF as long as there is enough contextual data to make it useful. Some tagging or key-value pairs would be a good start. And it means that it doesn’t matter at all where our data files are as long as we can point at them with sufficient precision.
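As an indication of what link analysis over notebook posts might look like, the sketch below runs a basic PageRank-style power iteration over a small link graph. It is a simplified textbook variant, with rank from pages that have no outgoing links spread evenly over all pages, and not the algorithm of any particular search engine. The post names and the damping value of 0.85 are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by incoming links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical posts: a procedure links to a sample and a dataset,
# and an analysis post links to the same dataset.
links = {
    "procedure": ["sample", "data"],
    "analysis": ["data"],
    "data": [],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the dataset that two posts point at receives the highest score, which is the sense in which links carry information about which objects matter.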

Distributed sample logging systems

The same logic of distributing data according to where it is most appropriate to store it can also be applied to samples. In many cases, tools such as Laboratory Information Management Systems (LIMS) or sample databases will already be in place, although in most cases each is likely to be applied to a specific subset of the physical objects being handled: a LIMS for analytical samples, a spreadsheet for oligonucleotides, and a local database, often derived from a card index, for lab chemicals. As long as it is possible to point at each physical object independently with the required precision, these systems can be used directly. Although a local spreadsheet may not be addressable at the level of individual rows, Google Spreadsheets can be addressed in this way. Individual cells can be addressed via a URL for each cell and there is a powerful API that makes it possible to build services to make the creation of links easy. Web interfaces can provide the means of addressing databases via URL through any web browser or HTTP-capable tool.

Again, samples and chemicals can be represented by posts within a Blog. This provides the same functionality, a URL endpoint that represents that object, and may be appropriate for small laboratories. When samples involve a wide variety of different materials put to different uses, the flexibility of using an open system of posts rather than a database with a defined schema can be helpful. But for many other purposes this may not be the case. It may be better to use multiple different systems: a database for oligonucleotides, a spreadsheet for environmental samples, and a full-blown LIMS for barcoding and following the samples through preparation for sequencing. As long as it can be pointed at, it can be used. As in the data case, it is best to use a system that is designed for or best suited to that specific set of samples. These systems are better developed than those for data, but many of the existing systems don’t offer a good way of pointing at specific samples from an external document, and very few make it possible to do this via a simple HTTP-compliant URL.
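One way to picture "as long as it can be pointed at" is a small lookup that turns a sample kind and identifier into a URL, whichever system actually holds the record. Everything here is hypothetical: the three systems, their example.org URL patterns, and the function name are assumptions for illustration, not the addressing scheme of any real LIMS or spreadsheet service.

```python
from urllib.parse import quote

# Hypothetical URL patterns, one per system that holds sample records.
URL_TEMPLATES = {
    "oligo": "http://example.org/oligo-db/{id}",
    "environmental": "http://example.org/sheets/env-samples/rows/{id}",
    "sequencing": "http://example.org/lims/samples/{id}",
}


def sample_url(kind, sample_id):
    """Build the URL that points at one sample in the system that holds it."""
    try:
        template = URL_TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"no system registered for sample kind {kind!r}")
    # Percent-encode the identifier so spaces and slashes stay inside one path segment.
    return template.format(id=quote(str(sample_id), safe=""))


print(sample_url("oligo", "A 17"))
print(sample_url("sequencing", 4021))
```

The notebook entry then only needs to store the resulting URL; the choice of storage system stays a local decision.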

Full distribution of materials, data, and process: The lab notebook as a feed of relationships

At this point it may seem that the core remaining component of the lab notebook is the description of the actions that link material objects and data files: the record of process. However even these records could be passed to external services that might be better suited to the job. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). These may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where, once again, it is possible to unambiguously point at them.

What we are left with are the links that describe the relationships between materials, data, and process, arranged along a timeline. The laboratory record, the web-native laboratory notebook, is reduced to a feed which describes these relationships and notifies users when a new relationship is created or captured. This could be a simple feed containing plain hyperlinks or it might be a sophisticated and rich feed which uses one or more formal vocabularies to describe the semantic relationships between items. In principle it is possible to mix both, gaining the benefit of detailed formal information where it is available while still linking in relationships that are less clearly described. That is, this approach can provide a way of building up a linked web of data and objects piece by piece, even when the details of vocabularies are not yet agreed or in place.
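A feed of relationships that mixes plain and formally described links might be modelled as below: each entry records when a link was captured, what it connects, and optionally a named relation. This is a minimal sketch; the class name, field names, relation label, and URLs are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelationshipEntry:
    timestamp: str                   # ISO 8601, so string order equals time order
    source: str                      # URL of the object the link starts from
    target: str                      # URL of the object the link points at
    relation: Optional[str] = None   # None means a plain, untyped hyperlink


def timeline(entries):
    """Return the entries in the order the relationships were captured."""
    return sorted(entries, key=lambda e: e.timestamp)


def describe(entry):
    """Render one entry, falling back to a generic phrase for untyped links."""
    relation = entry.relation or "links to"
    return f"{entry.timestamp}: {entry.source} {relation} {entry.target}"


# Hypothetical entries: one typed relationship and one plain hyperlink.
entries = [
    RelationshipEntry("2009-06-02T09:30:00Z",
                      "http://example.org/notebook/analysis-4",
                      "http://example.org/data/31"),
    RelationshipEntry("2009-06-01T10:00:00Z",
                      "http://example.org/notebook/procedure-7",
                      "http://example.org/lims/samples/12",
                      relation="used_input"),
]
for entry in timeline(entries):
    print(describe(entry))
```

Leaving the relation optional is what allows typed and untyped links to sit side by side in the same feed.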

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution 3.0 License