Provenance And Security

A major concern with a distributed model of the research record is security and provenance. Security issues are raised both because valuable content is distributed amongst systems that are not under the institution's control and therefore may be lost due to technical or financial failure, but also because valuable intellectual property is potentially wandering free on the web. On top of this there are concerns about the reliability of provenance records; who deposited a specific piece of data or document, and when did they do so? Third party date stamps and identifiers might, at a naive level, seem a good record of priority but the standards and reliability of these third party services is rarely validated to any specific standard. In terms of provenance trails, how a specific statement or datafile has been arrived at, existing consumer focused tools are generally very poor at assisting the user in describing what derivative material they have used. Slideshare, while supporting Creative Commons licensing of slidesets does not provide any easy means of describing where a given slide or image derives from. Given that it is only recently that an intellectual property case has been concluded with a contribution from an electronic laboratory notebook [###REF###], and the lack of widely accepted standards for maintaining and describing security and provenance on the web it will not surprising if corporate users in particular are unconvinced by the idea of a distributed and openly available model of their research records.

There are three main answers to such a criticism; the most water tight from a security perspective is that the whole network can be maintained behind a corporate firewall. Thus the internal community has access to all the functionality provided by the networked record but the content is "safe" from outside eyes. Such an arrangement loses the value of being wired into the wider information infrastructure of the public web but for large enough companies this may be acceptable. Hybrid inside/outside systems may be feasible but are likely to be porous, especially given the weak point in any security system is almost always the people who use it.

The second possible answer to criticism is that we need to develop the tool set and standards that lets us describe and validate security and provenance of objects on the web. At one level the reliability of the description of objects where multiple services and multiple people are point at them should be much greater than that where one or two, often internal, signing authorities are being relied on. This distributed network of date stamps and comments while more diffuse is actually much more difficult to fake as it requires multiple different records, often created by different people, on different services and servers to be changed. It is also however more difficult to aggregate and document in a way that makes sense in the context of today's record keeping systems and legal records.

The final answer is to say that if you're attempting to build a collaborative research record that leverages networks and the web and you are focussed on protecting your content from other people then you are missing the point. The entire purpose of the exercise is to get the most out the information and the people manipulating it on the web by providing content that they can re-use, and in the process, pass useful information and connections back to you. If you do this in a partial way then you only get a partial benefit. It is only by allowing people outside your organization to wire your samples, data, and ideas into the wider linked open web that you gain benefits over and above what you can gain by simply using your own internal expertise more effectively. For very large companies an entirely internal web may be a viable strategy but for most small and medium companies or public research organizations the gains are potentially much greater than the losses. This doesn't mean there is a free for all. Provenance and identity are also important in this view, but here it is less about demanding that credit be assigned and more about gaining the most from connections that other people make by understanding who they are and where they are coming from.

Again the Wave protocol offers some interesting potential solutions to these problem. Parts of a given wave can have different participant lists and even be hosted on different servers. It is possible for two researchers from Company A to have a conversation on a wave that is on a public server hosted by Organization B without any of the other participants being able to see their conversation. In fact, their private conversation never needs to leaves the Company A Wave Server and can stay behind the corporate firewall. This level of granularity of viewing rights on a distributed collaborative document is unprecedented and will have significant implications for how we think about the management of valuable documents and objects. There is also the suggestion that the full power of code version repositories will be applied to tracking of Waves, both within the history of a single wave, which was demonstrated with the "play back" function as part of the demo, but also when new waves are spawned from existng waves, tracking their antecedents and what was passed and when, as well as enabling "fork" and "merge" operations of multiple independent versions of the same wave.

Whether or not Wave offers a technical solution the problems of attribution, provenance, and access rights are general ones that are facing the consumer web as users gradually gain a more sophisticated understanding of what privacy means in a networked world. If the view is accepted, that in wiring the research record into the wider web that the benefits gained outweigh the losses, then it follows that issues of access control with respect to viewing objects are less important. However being able to reliably identify the author of a dataset or other object becomes much more important in determining how you will respond to the way they have interacted with your materials. Therefore provenance becomes a key issue and reliable access control mechanisms that provide strong identities are crucial. There will no doubt be much development over the next few years of mechanisms for tracking the network of citations between objects on the web and using this to assign priority and precedence of ideas and statements. All of these will improve the toolset available to work with distributed research records of the type being discussed here.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution 3.0 License