Head In The Clouds - Automated Experimentation

A paper being prepared for the new BMC journal Automated Experimentation. Adapted from the blog posts "The integrated laboratory record" and "Capturing the record of research process", Parts I and II.

Head in the Clouds: Re-imagining the experimental laboratory record for the web-based networked world


Automated experimentation brings the promise of a much improved record of the research process. Where experiments are sufficiently well defined that they can be carried out by automated instrumentation or computational resources, an excellent record of process can and will be created. In "Big Science" projects from particle physics [##REF##] to genome sequencing (Batley and Edwards, 2009) the sharing of records about samples and objects, experimental conditions and outputs, and the processing of data is a central part of planning and infrastructure, and often a central part of justifying the investment of resources. As some segments of biological science have become industrialized, with greater emphasis on high-throughput analysis and the generation of large quantities of data, sophisticated systems have been developed to track the experimental process and to describe and codify the results of experiments through controlled vocabularies, minimal description standards (Taylor et al, 2008), and ontologies (Smith et al, 2007).

None of this has had a major impact on the recording process applied to the vast majority of research experiments, which are still carried out by single people or small teams in relative isolation from other research groups. The vast majority of academic research is still recorded in paper notebooks and even in industry the adoption of electronic recording systems is relatively recent and remains patchy. A paper notebook remains an excellent means of planning and recording experiments. In most modern laboratories, however, it is starting to fail as an effective means of recording, collating, and sharing data.

The majority of scientific data generated today is born digital. In some cases printouts make it into bound notebooks. In most cases the data remains distributed on a collection of laboratory and personal hard disks. The record of data analysis, that is, the conversion of that digital data into new digital objects and finally into scientific conclusions, is in most cases very poorly kept. It is noteworthy in this context that a number of groups have felt it necessary to take an active advocacy position in trying to persuade the wider community that the reproducibility of data analysis is a requirement, and not an added bonus (http://reproducibleresearch.org, Pedersen, 2008). The promise of digital recording of the research process is that it can create a reliable record that would support automated reproduction and critical analysis of research results. The challenge is that the tools for generating these digital records must outperform a paper notebook while simultaneously providing enough advanced and novel functionality to convince users of the value of switching.

At the same time, the current low level of adoption means that the field is wide open for a radical re-imagining of how the record of research can be created and used. It allows us to think deeply about what value the different elements of that record have for use and re-use, and to take inspiration from the wide variety of web-based data and object management tools that have been developed for the mass consumer market. This paper describes a new way of thinking about the research record that is rooted in the way the World Wide Web works, and considers the design patterns that will most effectively utilize existing and future infrastructure to provide a useful and effective record.

The distinction between capturing process and describing an experiment

In discussing tools and services for recording the process of research there is a crucial distinction to be made between capturing a record of the process as it happens and describing an experiment after the event. A large part of the tension between those researchers developing systems for describing the outputs of research in structured form and the research scientists carrying out the experiments in the laboratory derives from a misunderstanding about what is being recorded. The best way to maximise success in recording the important details of a research process is to capture those details as they happen, or in the case of plans, before they happen. However most controlled vocabularies and description systems are built, whether explicitly or implicitly, with the intention of describing the knowledge that results from a set of experiments, after the results have been considered. This is seen most clearly in ontologies that place a hypothesis at the core of the descriptive structure or assume that the "experiment" is a clearly defined entity.

These approaches work well for the highly controlled, indeed, industrialised studies that they were generally designed around. However they tend to fail when applied to small scale and individual research, and particularly in the situations where someone is "trying something out". Most of the efforts to describe or plan research start with the concept of an "experiment" that is designed to test a "hypothesis" (see e.g. Jones et al, 2007, King et al, 2009). Very often the concept of "the hypothesis" doesn't usefully apply to the detail of the experimental steps that need to be recorded. And the details of where a specific experiment starts and finishes are often dependent on the viewer, the state of the research, or the choices made in how to publish and present that research after the fact. Products or processes may be part of multiple projects or may be later used in multiple projects. A story will be constructed later, out of these elements, to write a paper or submit a database entry but at the time the elements of this story are captured the framework may be vague or non-existent. Unexpected results clearly do not fit into an existing framework but can be the launching point for a whole new programme. The challenge therefore is to capture the elements of the research process in such a way that the sophisticated and powerful tools developed for structured description of knowledge can be readily applied once the story starts to take form.

The Web Native Lab Notebook

If we are to consider a web-native approach to capturing the scientific record we need to consider the laboratory notebook. The lab notebook is, at its core, a journal of events, an episodic record containing dates, times, and bits and pieces of often disparate material, cut and pasted into a paper notebook. There are strong analogies between this view of the lab notebook as a journal and the functionality of blogs. Blogs contain posts which are dated, usually linked to a single author, and may contain embedded digital objects such as images or videos, or indeed graphs and charts generated from online datasets. In fact most people who use existing online services as laboratory notebooks use wikis rather than blogs (http://usefulchem.wikispaces.com, http://deferentialgeometry.org). This is for a number of reasons: better user interfaces, sometimes better services and functionality, stronger versioning, and in some cases personal preference. At one level this distinction is important because it is a strong indicator of the functionality and interface requirements for a desirable online lab notebook. In another way, however, the distinction is unimportant. Wikis and blogs are both date stamped, have authorship information, and enable commenting. Most importantly, both create objects (posts or pages) that are individually addressable on the web via a URL.

A "semantic web ready" laboratory record

The creation of individually addressable objects is crucial because it enables these objects, whether they are datasets, protocols, or pointers to physical objects such as samples, to take part in the semantic web (Shadbolt et al 2006). The root concept of the semantic web is that the relationships between objects can be encoded and described. For this to be possible those objects must be uniquely addressable in some form on the web. By creating individual posts or pages the researcher is creating these individual objects, which can represent physical objects, processes, or data. The relationships between these objects can be described separately, for example in a triplestore, or locally via statements within the posts. However the simplest way to express relationships, one that directly leverages the existing toolset on the web, is by linking posts together.
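As a sketch of this idea, the relationships between individually addressable objects can be captured as simple subject-relationship-object statements. All of the URLs and relationship names below are hypothetical placeholders, not references to real services:

```python
# Minimal sketch: relationships between individually addressable
# notebook objects, expressed as (subject, relationship, object)
# statements. URLs and relationship names are illustrative only.

SAMPLE = "http://example.org/notebook/sample-42"
PROCEDURE = "http://example.org/notebook/pcr-run-7"
DATA = "http://example.org/notebook/gel-image-3"

triples = [
    (SAMPLE, "isInputTo", PROCEDURE),
    (PROCEDURE, "generatedData", DATA),
]

def related(subject, triples):
    """Return every (relationship, object) pair linked from a subject."""
    return [(rel, obj) for s, rel, obj in triples if s == subject]

print(related(SAMPLE, triples))
```

Because each object is just a URL, the same statements could equally well be realised as plain hyperlinks between posts, or deposited in a triplestore.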

Feeds change the lab notebook from a personal record to a collaborative document

The other key functionality of the web to focus on is that of the "feed". Feeds, whether they are RSS or Atom, are XML documents that are regularly updated, providing a stream of "events" which can then be consumed by various readers, Google Reader being one of the most popular. Along with the idea of hyperlinks between objects, the feed provides the crucial difference between the paper-based and web-native lab notebook. A paper notebook (whether it is a physical object or "electronic paper") is a personal record. The web-native lab notebook is a collaborative notification tool that announces when something has happened, when a sample has been created, or a piece of data analysed.
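To illustrate, a blog-based notebook might expose its entries as an Atom feed like the (entirely illustrative) one below, which any reader or script can consume. This sketch uses only the Python standard library:

```python
import xml.etree.ElementTree as ET

# An illustrative minimal Atom feed, of the kind a blog-based
# notebook might publish. Titles and URLs are invented examples.
ATOM = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Lab notebook</title>
  <entry>
    <title>Sample 42 prepared</title>
    <link href="http://example.org/notebook/sample-42"/>
    <updated>2009-06-01T10:00:00Z</updated>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(ATOM)

# Extract each event as a (title, URL) pair, as a feed reader would.
events = [
    (entry.findtext("atom:title", namespaces=NS),
     entry.find("atom:link", NS).get("href"))
    for entry in root.findall("atom:entry", NS)
]
print(events)
```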

Despite the historical tendency to isolated research groups discussed above, independent groups are increasingly banding together as research funders demand larger coordinated projects. Tasks are divided up by expertise and in many cases also divided geographically between groups that may in the past not even have had good internal communication systems. Rapid and effective communication between groups on the details of ongoing projects is becoming more and more important, and its absence is increasingly a serious deficiency in the management of these collaborations. In addition, reporting back to sponsors via formal reports is an increasing burden. The notification systems enabled by the generation of feeds go a significant way towards providing a means of dealing with these issues. Within a group, the use of feeds and feed readers can provide an extremely effective means of pushing information to those who need to track or interact with it (Figure 1). It is not a major step from this to providing streams of information that offer highlights for project sponsors. The web-native lab notebook brings the collaborative authoring and discussion tools of the read-write web to bear on the problem of communicating research results.


Figure 1. Using feeds and feed readers to aggregate and push laboratory records. A) A screenshot of Google Reader showing an aggregated feed of laboratory notebook entries from http://biolab.isis.rl.ac.uk. Two buttons are highlighted which enable "sharing" the entry with anybody who follows the user's feed, or adding a tag. B) Sharing can also include annotating the entry with further information or tagging the entry to place it in a specific category. C) A new feed is created for each tag, which can also be consumed by readers with a specific interest such as collaborators, regulatory agencies, or funders.

Integrating tools and services

With the general concept of the record as a blog in place, enabling us to create a set of individually addressable objects and link them together, as well as providing feeds describing the creation of these objects, we can consider what tools and services we need to author and interact with these objects. Again blogs provide a good model here, as many widely used authoring tools can be used directly to create documents and publish them to blog systems. Recent versions of Microsoft Office include the option of publishing documents to blogs and other online services. A wide variety of web-based tools and plugins are available to make the creation and linking of blog posts easy. Particularly noteworthy are tools such as Zemanta, a plugin which automatically suggests appropriate links for concepts within a post (http://zemanta.com). Zemanta scans the text of a post and identifies company names and concepts described in Wikipedia and other online information sources, using an online database built up from the links created by other users of the plugin.

Sophisticated semantic authoring tools such as the Integrated Content Environment (ICE) developed at the University of Southern Queensland (Sefton, 2006) provide a means of directly authoring semantic documents that can then be published to the web. ICE can also be configured to incorporate domain-specific semantic objects that generate rich media representations such as three-dimensional molecular models. These tools are rapidly becoming very powerful and highly usable, and will play an important role in the future by making rich document authoring straightforward.

Where do we put the data?

With the authoring of documents in hand we can consider the appropriate way of handling data files. At first sight it may seem simplest to upload data files and embed them directly in blog posts. However, the model of the blog points us in a different direction here again. On a blog, images and video are not generally uploaded directly; they are hosted on an appropriate, specialised, external service and then embedded in the blog page. Issues of managing the content and providing a highly user-friendly viewer are handled by the external data service. Hosting services are optimized for handling specific types of content: Flickr for photos, YouTube (or Viddler or Bioscreencast) for video, Slideshare for presentations, Scribd for documents. In an ideal world there would be a trustworthy data hosting service, optimized for your specific type of data, that would provide cut-and-paste embed codes giving the appropriate visualizations, in the same way that videos from YouTube can easily be embedded.

Some elements of these services exist for research data. Trusted repositories exist for structural data, for gene and protein sequences, and for chemical information. Large-scale projects are often required to put a specific repository infrastructure in place to make the data they generate available. And in most cases it is possible to provide a URL which points at a specific data item or dataset. It is therefore possible in many cases to link directly to a data service that places a specific dataset in context, can be relied on to have some level of curation or quality control, and provides additional functionality appropriate to the datatype. What is less prevalent is the type of embedding functionality provided by many consumer data repository services.

ChemSpider (recently purchased by the Royal Society of Chemistry) is one example of a service that does enable the embedding of both molecules and spectra into external web pages. This is still clearly an area for development, and there are discussions to be had about both the behind-the-scenes implementation of these services and the user experience, but it is clear that this kind of functionality could play a useful role in helping researchers connect up information on the web. If multiple researchers use the ChemSpider molecule embedding service to reference a specific molecule then all of those separate documents can be unambiguously assigned as describing the same molecule. This linking up of individual objects through shared identifiers is precisely what gives the semantic web its potential power.

A more general question is the extent to which such repositories can or will be provided and supported for less common data types. The long term funding of such data repositories is at best uncertain and at worst non-existent. Institutional repositories are starting to play a role in data archiving and some research funders are showing an interest. However there is currently little or no coordinated response to the problem of how to deal with archiving data in general. Piecemeal solutions and local archiving are likely to play a significant role. This does not necessarily make the vision of linked data impossible, all that is required is that the data be placed somewhere where it can be referenced via a URL. But the extent to which specialist data repositories can be resourced will determine the extent to which rich functionality to manipulate and visualize that data will be available. In our model of the Blog as a lab notebook a piece of data can be uploaded directly to a post within the blog. This provides the URL for the data, but will not in and of itself enable visualization or manipulation. Nonetheless the data will remain accessible and addressable in this form.

A key benefit of this way of thinking about the laboratory record is that items can be distributed in many places depending on what is appropriate. It also makes it possible to apply PageRank-style algorithms, and link analysis more generally, to large quantities of posts. Most importantly it encodes the relationships between objects, samples, procedures, data, and analysis in the way the web is tooled up to understand: the relationships are encoded in links. This is a lightweight way of starting to build up a web of data; it does not matter so much to start with whether this is in RDF, as long as there is enough contextual data to make it useful. Some tagging or key-value pairs would be a good start. Most importantly, it means that it does not matter at all where our data files are as long as we can point at them with sufficient precision.

Distributed sample logging systems

The same logic of distributing data according to where it is most appropriate to store it can also be applied to samples. In many cases, tools such as a Laboratory Information Management System (LIMS) or sample databases will already be in place, although in most cases they are likely to be applied to a specific subset of the physical objects being handled: a LIMS for analytical samples, a spreadsheet for oligonucleotides, and a local database, often derived from a card index, for lab chemicals. As long as it is possible to point at each physical object independently with the required precision, these systems can be used directly. Although a local spreadsheet may not be addressable at the level of individual rows, Google Spreadsheets can be: individual cells can be addressed via a URL, and there is a powerful API that makes it possible to build services that make the creation of links easy. Web interfaces can provide the means of addressing databases via URL through any web browser or HTTP-capable tool.
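As a rough illustration of per-row addressing, a simple convention can mint a URL for each row of a shared sample sheet. The base URL and fragment scheme here are hypothetical; a real service such as Google Spreadsheets defines its own per-cell addressing:

```python
# Sketch: minting a stable URL for each row of a shared sample
# sheet, so that individual samples can be pointed at. The base URL
# and "#row=" fragment convention are invented for illustration.

BASE = "http://example.org/sheets/oligos"

def row_url(sheet_base, row):
    """Return a URL that addresses one row of the sheet."""
    return f"{sheet_base}#row={row}"

# A tiny registry mapping each row's URL to the sample it represents.
oligos = {row_url(BASE, i): name
          for i, name in enumerate(["oligo-F", "oligo-R"], start=1)}
print(oligos)
```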

Again, samples and chemicals can be represented by posts within a blog. This provides the same functionality, a URL endpoint representing the object, and may be appropriate for small laboratories. When samples involve a wide variety of different materials put to different uses, the flexibility of using an open system of posts rather than a database with a defined schema can be helpful. But for many other purposes this may not be the case. It may be better to use multiple different systems: a database for oligonucleotides, a spreadsheet for environmental samples, and a full-blown LIMS for barcoding and following samples through preparation for sequencing. As long as it can be pointed at, it can be used. As in the data case, it is best to use a system that is designed for, or best suited to, that specific set of samples. These systems are better developed for samples than they are for data, but many of the existing systems do not provide a good way of pointing at specific samples from an external document, and very few make it possible to do this via a simple HTTP-compliant URL.

Full distribution of materials, data, and process: The lab notebook as a feed of relationships

At this point it may seem that the core remaining component of the lab notebook is the description of the actions that link material objects and data files: the record of process. However even these records could be passed to external services that might be better suited to the job. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather, the record of running a workflow). These may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where, once again, it is possible to unambiguously point at them.

What we are left with is the links that describe the relationship between materials, data, and process, arranged along a timeline. The laboratory record, the web-native laboratory notebook, is reduced to a feed which describes these relationships; that notifies users when a new relationship is created or captured. This could be a simple feed containing plain hyperlinks or it might be a sophisticated and rich feed which uses one or more formal vocabularies to describe the semantic relationship between items. In principle it is possible to mix both, gaining the best of detailed formal information where it is available but linking in relationships that are less clearly described where possible. That is, this approach can provide a way of building up a linked web of data and objects piece by piece, even when the details of vocabularies are not yet agreed or in place.
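One possible shape for such a feed entry is sketched below: each entry records a single relationship between two web-addressable objects. The `rel` values and the vocabulary term used here are placeholders, not part of any agreed standard:

```python
import xml.etree.ElementTree as ET

def relationship_entry(subject, relation, obj):
    """Build one Atom-style entry recording a relationship between
    two web-addressable objects. The rel attribute values and the
    category term are illustrative placeholders, not a standard."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "title").text = f"{subject} {relation} {obj}"
    ET.SubElement(entry, "link", rel="subject", href=subject)
    ET.SubElement(entry, "link", rel="object", href=obj)
    ET.SubElement(entry, "category", term=relation)
    return entry

e = relationship_entry("http://example.org/sample-42",
                       "isInputTo",
                       "http://example.org/pcr-run-7")
print(ET.tostring(e, encoding="unicode"))
```

A consumer with no knowledge of the vocabulary still sees two plain hyperlinks; a consumer that understands the term can extract the full semantic statement.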

Implementation: Tools and services

At one level this framework of links and objects can be put together out of existing pieces from online tools and services, but on another the existing tools are totally inadequate. While there are a wide range of freely accessible sites for hosting data of different or arbitrary types, documents, and bookmarks, and these can be linked together in various ways, there are very few tools and services that manage links within the kind of user-friendly environment that would be likely to encourage adoption. Most existing web tools have also been built to be "sticky", so as to keep users on the site. This means they are often not good at providing functionality that links out to objects on other services.

The linked data web-native notebook described above could potentially be implemented using existing tools. A full implementation would involve a variety of structured documents placed on various services using specified controlled vocabularies. The relationships between these documents would then be specified by either an XML feed generated via some other service, or by a feed generated by depositions to a triple store. Either of these approaches would then require a controlled vocabulary or vocabularies to be used in description of relationships and therefore the feed.

In practice, while this is technically feasible, for the average researcher the vocabularies are often not available or not appropriate. The tools for generating semantic documents, whether XML- or RDF-based, are not, where they exist at all, designed with the general user in mind. The average lab is therefore restricted to a piecemeal approach based on existing, often general consumer, web services. This approach can go some distance, using wikis, online documents, and data visualization extensions. An example is described in (Bradley et al., 2009), where a combination of wikis, Google Docs spreadsheets, and visualization tools based on the Google Chart API were used to build a distributed set of pages connecting data representations to the underlying data, and the data to the procedures used to generate it, through simple links. However, this approach cannot currently exploit the full potential of a semantic approach.

Clearly there is a gap in the tool set for delivering a linked experimental record. But what is required to fill that gap? High quality data services are required, and are starting to appear in specific areas with a range of business models. Many of those that exist in the consumer space already provide the type of functionality that would be required, including RSS feeds, visualization and management tools, tagging and categorization, and embedding capabilities. Slideshare and Flickr are excellent models for scientific data repositories in many ways.

Sophisticated collaborative online and offline document authoring tools are available. Online tools include blogs and wikis; increasingly, offline tools including Microsoft Word and OpenOffice provide a rich and usable user experience that is well integrated with online publishing systems. Tools such as ICE can provide a sophisticated semantic authoring environment, making it possible to effectively link structured information together.

The missing link

What is missing are the tools that will make it easy to connect items together. Semantic authoring systems solve part of the problem by enabling the creation of structured documents and in some cases by assisting in the creation of links between objects. However these are usually inward looking. The key to the web-native record is that it is integrated into the wider web, monitoring sites and feeds for new objects that may need to be incorporated into the record. These monitoring services would then present these objects to the user within a semantic authoring tool, ideally in a contextual manner.

The conceptually simplest system would monitor appropriate feeds for new objects and present these to the user as possible inputs and outputs. The user would then select an appropriate set of inputs and outputs and select the relationship between them from a limited set of possible relationships (is an input to, is an output of, generated data). In its simplest form this could be implemented as three drop-down menus, but this would only apply after the event. Such a tool would not be particularly useful in planning experiments or as a first-pass recording tool, and would therefore add another step to the recording process.
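A minimal sketch of this three-menu connector might look like the following, with the set of relationships and the object names purely illustrative:

```python
# Sketch of the three-menu connector: the user picks a subject, a
# relationship from a fixed set, and an object; the tool appends the
# statement to the record. Names and relationships are illustrative.

RELATIONS = ("is an input to", "is an output of", "generated data")

record = []

def connect(subject, relation, obj):
    """Record one relationship chosen from the limited set."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation: {relation}")
    record.append((subject, relation, obj))

connect("sample-A", "is an input to", "imaging-step-1")
print(record)
```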


Figure 2. A conceptual tool for connecting web objects together via simple relationships. A range of web services are monitored to identify new objects, data, processes, documents, or controlled vocabulary terms that may be relevant to the user. In this simple tool these are presented as potential subjects and objects in drop-down menus. The relationship can be selected from a central menu. The output of the tool is a feed of relationships between web-accessible objects.

Capturing in silico process: A new type of log file

In an ideal world, the whole process of creating the linked record would be done behind the scenes, without requiring the intervention of the user. This is likely to be challenging in the experimental laboratory but is entirely feasible for in silico work and data analysis within defined tools. Many data analysis tools already generate log files, although as data analysis tools have become more GUI driven these have become less obvious to the user and more focussed on aiding in technical support. Within a given data analysis or computational tool objects will be created and operated on by procedures hard coded into the system. The relationships are therefore absolute and defined.

The extensive work of the reproducible research movement on developing approaches and standards for recording and communicating computational procedures has largely focussed on the production of log files and command records (or scripts) that can be used to reproduce an analysis procedure, as well as arguing for the necessity of providing running code and the input and intermediate data files. In the linked record world it will be necessary to create one more "logfile" that describes the relationships between all the objects created, by reference to some agreed vocabulary. This "relationships logfile", which would ideally use RDF or a similar framework, is implicit in a traditional log file or script, but making it explicit will make it possible to wire these computational processes into a wider web of data automatically. Thus computational analysis is the area where the most rapid gains can be expected, as well as where the highest standards are likely to be possible. The challenge of creating similar log files for experimental research is greater, but will benefit significantly from the experience built up in the in silico world.
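A toy version of such a relationships logfile might emit one N-Triples-style statement per analysis step. The predicate URIs below are placeholders rather than an agreed vocabulary:

```python
# Sketch of a "relationships logfile": alongside the ordinary
# command log, each step of an analysis emits one machine-readable
# statement. All URIs here are invented placeholders.

def ntriple(subject, predicate, obj):
    """Serialise one statement in N-Triples-style syntax."""
    return f"<{subject}> <{predicate}> <{obj}> ."

log = [
    ntriple("http://example.org/run/1/raw.csv",
            "http://example.org/vocab#wasInputTo",
            "http://example.org/run/1/normalise"),
    ntriple("http://example.org/run/1/normalise",
            "http://example.org/vocab#produced",
            "http://example.org/run/1/clean.csv"),
]
print("\n".join(log))
```

Because every identifier is a URI, such a log can be loaded directly into a triplestore and joined with statements from other tools.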

Systems for capturing physical processes

For physical experimentation it is possible to imagine an authoring tool that automatically tracks feeds of possible input and output objects as the researcher describes what they are planning or what they are doing. The authoring tool would trigger a plugin or internal system to identify points where new links should be made, based on the document that the researcher is generating as they plan or execute their experiment. For example, typing the sentence, "an image was taken of the sample for treatment A" would trigger the system to look at recent items from the feed (or feeds) of the appropriate image service(s), which in turn would be presented to the user in a drop-down menu for selection. Selecting the correct item would add a link from the document to the image. The "sample for treatment A" having already been defined, a statement would then be incorporated in machine-readable form within the document recording that "sample for treatment A" was the input for the process "an image was taken", which generated data (the image).
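A crude sketch of such a trigger: watch the text being typed and, when a recognisable phrase appears, offer recent items from a monitored feed as candidate links. The phrase pattern and feed items are illustrative only:

```python
import re

# Illustrative list of recent items from a monitored image service
# feed, as (label, URL) pairs. Names and URLs are invented examples.
RECENT_IMAGES = [
    ("gel-image-3", "http://example.org/images/gel-image-3"),
    ("gel-image-4", "http://example.org/images/gel-image-4"),
]

def suggest_links(sentence, recent_items):
    """When the typed sentence mentions taking an image, return
    recent feed items as candidates for a drop-down menu."""
    if re.search(r"\ban image was taken\b", sentence, re.IGNORECASE):
        return recent_items
    return []

print(suggest_links("An image was taken of the sample for treatment A",
                    RECENT_IMAGES))
```

A real implementation would need far more robust language handling; the point is only that the feed supplies the candidates and the document supplies the context.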

Such a system would consist of two parts. The first is an intelligent feed reader which monitors all the relevant feeds (laboratory sample management systems, data repositories, laboratory instruments) for new data. In a world in which it seems natural that the Mars Phoenix lander should have a Twitter account, and indeed the high throughput sequencers at the Sanger Centre send status updates via Twitter, the notion of an instrument automatically providing a status update on the production of data seems natural. Less prevalent are similar feeds generated by sample and laboratory information management systems, although the principle is precisely equivalent: when an object (data, sample, or material) is created or added to the system, a feed item is created and pushed out as a notification.

The second part of the system is more challenging. Providing a means of linking inputs to outputs, via for example drop-down menus, is relatively straightforward. Even the natural language processing required to automatically recognise where links should be created is feasible with today's technology. But capturing what those links mean is more difficult. There is a balance to be found between providing sufficient flexibility to describe a very wide range of possible connections ("is an input to", "is the data generated from") and providing enough information, via a rigorously designed controlled vocabulary, to enable detailed automated parsing by machines. Where existing vocabularies and ontologies are directly applicable to the problem in hand they should clearly be used and strongly supported by the authoring tools. In the many situations where they are not ideal, more flexibility needs to be supported, allowing the researcher to go "off piste" while still avoiding ambiguity. Thus such a system needs to "nudge" the user into using the most appropriate available structured description of their experiment, while allowing them the choice to use none, but at the same time helping them avoid terms or descriptions that could be misinterpreted. This is not a trivial system to build.

A wave to wash the problems away?

The majority of this article was prepared in the weeks leading up to the public demonstration of Google Wave, a federated protocol for preparing collaborative documents with real-time interactions (http://wave.google.com). Alongside this, Wave provides means of adding powerful functionality both within the document and acting over the document. At the time of writing only third-party reports and some documentation were available, but it is clear that Waves have a huge potential to provide a framework that will make it possible to deliver on many aspects of the programme described here.

The Wave protocol and framework is certain to have a large impact on communication online and may well replace email and instant messaging in the medium to longer term. The key functionality it provides for the current discussion, however, is a collaborative document authoring system in which multiple participants, either human or automated, can act at the same time. The Wave framework provides two types of automated functionality: Robots, which are automated participants in the authoring of the document, and Gadgets, which can be used to modify the display and content of specific parts of the document.

Automated participants in the research record: Instruments, databases, and vocabularies

In the context of the current discussion there are two main applications of this technology. The first is the use of robots to represent instruments or sample management systems. As a document that describes a procedure is being authored, the researcher can define the types of input or output that are expected by including specific participants in the document. For a PCR reaction these participants might be an oligonucleotide database, a thermal cycler, and a gel imager. By including these participants in the "conversation" the researcher flags the specific feeds that will contain the inputs and outputs of the process. This can be taken one step further by creating a robot participant that represents a minimal description requirement or controlled vocabulary for a specific type of experiment. The robot, interacting with a web service that provides the most up to date description of the required information, would then automatically add the other appropriate robot participants and provide a formatted interface for selecting from the possible inputs and outputs they provide. Because the document remains live, if the vocabulary or description requirements change, the robot can attempt to update the description by inferring it from existing information, by interacting with the other automated participants, or by returning to the human researcher to ask if they can update the record. The robots are all capable of adding machine readable descriptions of what has taken place to the wave; these need not be visible to the human participants but are available for automated parsing of the record.
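The pattern can be sketched without committing to the actual Wave robot API, which was not yet published at the time of writing. In this framework-agnostic sketch all names are invented: a robot registered for an instrument watches text as it is added and, when its instrument is mentioned, appends a machine-readable annotation to the document's metadata.

```python
# Framework-agnostic sketch of an "instrument robot" participant. This is
# NOT the real Wave robot API; every name here is invented for illustration.

class InstrumentRobot:
    def __init__(self, name, feed_url, vocab_term):
        self.name = name              # e.g. "thermal-cycler-bot"
        self.feed_url = feed_url      # where this instrument publishes output
        self.vocab_term = vocab_term  # controlled-vocabulary type for outputs

    def on_text_added(self, document, text):
        """Event handler: called whenever new text is added to the document.
        If the instrument is mentioned, record the expected output link."""
        instrument = self.name.split("-bot")[0].replace("-", " ")
        if instrument in text.lower():
            document["annotations"].append({
                "agent": self.name,
                "relation": "expected_output",
                "type": self.vocab_term,
                "source_feed": self.feed_url,
            })

doc = {"text": "", "annotations": []}
bot = InstrumentRobot("thermal-cycler-bot",
                      "http://lab.example.org/feeds/cycler1",  # hypothetical
                      "PCR_product")
bot.on_text_added(doc, "Set up PCR on the thermal cycler, 30 cycles.")
print(doc["annotations"])  # one machine-readable annotation added
```

The annotations list stands in for the invisible machine-readable layer of the wave: human participants see only the text, while automated parsers read the structured record.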

Data analysis procedures could be handled by taking a wave received from the output of an instrument and including a robot analysis tool as a participant on that wave. Again, the analysis robot might present a set of options as a formatted table or drop-down menu to the researcher, send the data and the parameters to a web service, and notify the researcher when the results are back simply by adding them to the wave. As before, all the automated participants can add appropriate markup to the local wave and/or to a global record or feed to describe what has happened in a machine readable form via the use of appropriate ontologies and vocabularies.
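The analysis-robot pattern differs from the instrument robot in that it submits a job and later appends results back onto the record. A minimal sketch, with an invented parameter menu and a local function standing in for the remote analysis web service:

```python
# Sketch of an analysis robot: it offers a small menu of parameter sets,
# runs the analysis (a local stand-in for a remote web service), and
# appends the result onto the record. All names are invented.

PARAMETER_MENU = {
    "quick":    {"smoothing": 0},
    "standard": {"smoothing": 3},
}

def run_analysis(data, params):
    """Stand-in for the remote service: a trailing moving-average smoother."""
    w = params["smoothing"]
    if w == 0:
        return list(data)
    return [sum(data[max(0, i - w):i + 1]) / len(data[max(0, i - w):i + 1])
            for i in range(len(data))]

def analysis_robot(wave, choice):
    """Pick parameters from the menu, run the job, append results as a new
    contribution ("blip") on the wave."""
    params = PARAMETER_MENU[choice]
    result = run_analysis(wave["data"], params)
    wave["blips"].append({"agent": "analysis-bot",
                          "params": params,
                          "result": result})

wave = {"data": [1.0, 2.0, 6.0, 2.0, 1.0], "blips": []}
analysis_robot(wave, "quick")
print(wave["blips"][0]["result"])  # unsmoothed copy of the input data
```

In the real system the call to the web service would be asynchronous, with the robot adding the result blip whenever the job completes; the wave itself is what makes that late arrival unobtrusive.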

Assisted authoring: Supporting the inclusion and effective use of controlled vocabularies and linking

This approach is conceptually extremely powerful where well structured and unambiguous requirements are available for a specific type of experiment. It is less appropriate where such descriptions do not exist, where an experiment is exploratory in nature and does not have a well defined expected outcome, or where the tools are simply not integrated. In these cases the researcher is most likely to write a free text description of what they are doing and what they plan. As part of the Wave introduction, two particular demos got a very positive, even adulatory, response: a real time, contextually sensitive spell checker that could, for instance, determine whether "bean" or "been" was the correct spelling in two adjacent sentences, and a real time translator that provided a running translation of an English text into French.

These were examples of programmatic objects within the wave that can modify content by referring objects and events, such as the addition of text, to outside services, such as a spell checking web service. It is not too large a leap from this to imagine exactly the kind of "inline help" described above, where the document offers alternative terms or references and automatically adds links to the appropriate objects or concepts online, complete with links carrying a full semantic load. Several such services could be acting within one document: one that checks a feed of recently printed sample labels to suggest which specific sample you are referring to, a similar service that checks for new data files, and one or more services that monitor your preferred set of controlled vocabularies to suggest the right terms for you to use.
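The simplest of these services, recognising a sample identifier in free text and rewriting it as a semantically typed link into the sample repository, can be sketched in a few lines. The identifier scheme, repository URL, and `rel` value are all hypothetical.

```python
# Sketch of an inline "linkifier": recognised sample identifiers in free
# text become links into the sample repository, carrying the object type
# as machine-readable metadata. ID scheme and URL are hypothetical.
import re

SAMPLE_REPO = "http://lab.example.org/samples/"   # hypothetical repository
SAMPLE_ID = re.compile(r"\b(S-\d{4})\b")          # e.g. S-0917

def linkify(text):
    """Replace bare sample IDs with semantically typed HTML links."""
    return SAMPLE_ID.sub(
        r'<a href="' + SAMPLE_REPO + r'\1" rel="lab:sample">\1</a>', text)

print(linkify("Loaded S-0917 onto the gel next to S-0918."))
```

Once an identifier is typed as a sample and linked to its repository entry, any machine parsing the document can follow the link to the full object description; the researcher simply keeps typing.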

None of these approaches is impossible with currently existing tools. Such a system could be put together using PHP, XML descriptions of vocabularies, and intelligent sample management systems. However such systems would be bespoke, taking the user away from their existing authoring and communication systems. What Wave provides is a framework in which all of the tooling that makes these systems easy to develop is already provided, in what may become the default communication and authoring environment for a significant proportion of global users. The future penetration and effectiveness of the Wave protocol remains to be seen, but with Google behind it, and with it being an open federated protocol, the prospects are positive.

Provenance and Security

A major concern with a distributed model of the research record is security and provenance. Security issues arise both because valuable content is distributed amongst systems that are not under the institution's control, and may therefore be lost to technical or financial failure, and because valuable intellectual property is potentially wandering free on the web. On top of this there are concerns about the reliability of provenance records: who deposited a specific piece of data or document, and when did they do so? Third party date stamps and identifiers might, at a naive level, seem a good record of priority, but the standards and reliability of these third party services are rarely validated against any specific standard. In terms of provenance trails, how a specific statement or datafile was arrived at, existing consumer focused tools are generally very poor at helping the user describe what derivative material they have used. Slideshare, while supporting Creative Commons licensing of slidesets, does not provide any easy means of describing where a given slide or image derives from. Given that it is only recently that an intellectual property case has been concluded with a contribution from an electronic laboratory notebook [###REF###], and given the lack of widely accepted standards for maintaining and describing security and provenance on the web, it will not be surprising if corporate users in particular are unconvinced by the idea of a distributed and openly available model of their research records.

There are three main answers to such criticism. The most watertight from a security perspective is that the whole network can be maintained behind a corporate firewall. The internal community thus has access to all the functionality provided by the networked record, but the content is "safe" from outside eyes. Such an arrangement loses the value of being wired into the wider information infrastructure of the public web, but for large enough companies this may be acceptable. Hybrid inside/outside systems may be feasible but are likely to be porous, especially given that the weak point in any security system is almost always the people who use it.

The second answer is that we need to develop the tool set and standards that let us describe and validate the security and provenance of objects on the web. At one level, the reliability of the description of an object at which multiple services and multiple people are pointing should be much greater than one for which one or two, often internal, signing authorities are being relied on. This distributed network of date stamps and comments, while more diffuse, is actually much more difficult to fake, as it requires multiple different records, often created by different people, on different services and servers, to be changed. It is, however, also more difficult to aggregate and document in a way that makes sense in the context of today's record keeping systems and legal records.

The final answer is to say that if you are attempting to build a collaborative research record that leverages networks and the web, and you are focussed on protecting your content from other people, then you are missing the point. The entire purpose of the exercise is to get the most out of the information, and the people manipulating it, on the web, by providing content that they can re-use and, in the process, pass useful information and connections back to you. If you do this in a partial way then you only get a partial benefit. It is only by allowing people outside your organization to wire your samples, data, and ideas into the wider linked open web that you gain benefits over and above what you can gain by simply using your own internal expertise more effectively. For very large companies an entirely internal web may be a viable strategy, but for most small and medium companies, and for public research organizations, the gains are potentially much greater than the losses. This does not mean there is a free for all. Provenance and identity remain important in this view, but here it is less about demanding that credit be assigned and more about gaining the most from the connections that other people make by understanding who they are and where they are coming from.

Again, the Wave protocol offers some interesting potential solutions to these problems. Parts of a given wave can have different participant lists and can even be hosted on different servers. It is possible for two researchers from Company A to have a conversation on a wave hosted on a public server run by Organization B without any of the other participants being able to see that conversation. In fact, their private conversation never needs to leave the Company A wave server and can stay behind the corporate firewall. This level of granularity of viewing rights on a distributed collaborative document is unprecedented and will have significant implications for how we think about the management of valuable documents and objects. There is also the suggestion that the full power of code version repositories will be applied to the tracking of waves, both within the history of a single wave, as demonstrated with the "play back" function in the demo, and when new waves are spawned from existing waves, tracking their antecedents and what was passed and when, as well as enabling "fork" and "merge" operations on multiple independent versions of the same wave.

Whether or not Wave offers a technical solution, the problems of attribution, provenance, and access rights are general ones facing the consumer web as users gradually gain a more sophisticated understanding of what privacy means in a networked world. If the view is accepted that, in wiring the research record into the wider web, the benefits gained outweigh the losses, then it follows that issues of access control with respect to viewing objects are less important. However, being able to reliably identify the author of a dataset or other object becomes much more important in determining how you respond to the way they have interacted with your materials. Provenance therefore becomes a key issue, and reliable access control mechanisms that provide strong identities are crucial. There will no doubt be much development over the next few years of mechanisms for tracking the network of citations between objects on the web and using it to assign priority and precedence to ideas and statements. All of these will improve the toolset available for working with distributed research records of the type discussed here.


The web and its organization fundamentally challenge the idea of the research record as a single document and remove the constraints created by a physical record, enabling a multifaceted, multifunctional, and multimedia laboratory notebook. The web itself has evolved from its original form of linked static documents to dynamic services and sites populated by objects created by their users, with content automatically aggregated and re-published in new contexts and new forms. The vision of the web as a dynamic network of addressable objects and their relationships can be traced right back to its earliest origins in the early 1990s, but it is only now being realized.

Fully exploiting the infrastructure and functionality of the web to create a web-native laboratory record requires re-thinking the traditional view of the laboratory notebook as a linear narrative of events. By creating individually addressable objects that refer to physical samples and materials, laboratory or computational processes, and data files, it is possible to create a dynamic record that can be directly linked into the growing semantic and linked data web. The web-native laboratory notebook is a combination of the web of data and the web of things. If the best available services are used for each kind of object, then the actual laboratory record can be reduced to a feed describing the relationships between these objects, and functionality that has been specifically designed around those objects can be fully exploited.

The beauty of this approach is that it does not require users to shift from the applications and services that they are already using, like, and understand. What it does require is intelligent and specific repositories for the objects they generate, repositories that know enough about the object type to provide useful information and context. It also requires good plugins, applications, and services to help people generate the lab record feed, together with a minimal and arbitrarily extensible way of describing the relationships. This could be as simple as HTML links with tagging of the objects (once you know an object is a sample and it is linked to a procedure, you know a lot about what is going on), but there is a logic in having a minimal vocabulary that describes relationships (what you don't know explicitly in the tagging version is whether the sample is an input or an output). It can also be fully semantic if that is what people want. And while the loosely tagged material won't be easily and tightly coupled to the fully semantic material, the connections will at least be there. A combination of both is not perfect, but it is a step on the way towards the global data graph.
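The reduced lab record, a feed of typed links between addressable objects, is simple enough to sketch directly. The relationship vocabulary and URIs below are illustrative only, but they show how little is needed before a machine can reconstruct the sample-to-procedure-to-data graph of an experiment.

```python
# Sketch of the minimal lab-record feed: each item is just a typed link
# between addressable objects. Vocabulary and URIs are illustrative only.

def record(subject, relation, obj):
    """One feed item: a single typed link between two web objects."""
    return {"subject": subject, "relation": relation, "object": obj}

record_feed = [
    record("http://lab.example.org/samples/S-0917",        # template DNA
           "is_input_to",
           "http://lab.example.org/procedures/pcr-88"),
    record("http://lab.example.org/procedures/pcr-88",     # the PCR run
           "has_output",
           "http://lab.example.org/data/gel-image-231.png"),
]

# With only these two links a machine can already reconstruct the
# experiment graph: sample -> procedure -> data file.
for item in record_feed:
    print(item["subject"], item["relation"], item["object"])
```

Each item maps naturally onto a subject-predicate-object triple, so the same feed could be serialized as RDF for the fully semantic case, or flattened into tagged HTML links for the loose case, without changing what it records.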

The technical challenges of implementing this vision are formidable. Authoring tools are needed, along with well designed repositories and a whole infrastructure of services and tools for pushing information between them. In this context the announcement of the Google Wave protocol is a very interesting development. The functionality described has enormous potential to make the implementation of many of these tools much easier. The proof will be in the development of useful functionality within the platform. Google has the brand awareness and expertise to make such a revolution in online communication technology possible. They have set the agenda and it will be a very interesting story to follow.

The view of the laboratory record as a distributed set of objects on the open web is closely linked with the agenda of the Open Research movement. The presumption is that the gains made by wiring your own ideas, samples, and results into the wider information graph far outweigh the losses. Only limited gains could be made by adopting this architecture while keeping the content closed off from the wider web. Its adoption and development therefore depend closely on the user's view of the future of scientific communication. If you accept the vision of open communication, and the idea that it will make your own research more competitive, then this is a path to follow. The web was built for the sharing of data amongst research scientists. We are only just learning how to do that effectively and efficiently.


This article draws heavily on conversations and discussions with too many people to name individually here. Particular thanks are due to FriendFeeders and Sciencetwists, members of the Frey group at the University of Southampton, and of the Unilever Centre for Molecular Informatics at Cambridge University, and other attendees and speakers of the 2009 SMi Electronic Laboratory Notebook conference (London) and Science Online 2009 (North Carolina).


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution 3.0 License