PRESTO - A WWW information architecture for Legislation and Public Information systems

This originally appeared on the O'Reilly blog, February 24, 2008.

PRESTO is not something new: its basic ideas are presupposed in a lot of people’s thinking about the web, and many people have given names to various parts, but I don’t know that anyone has given a name to this package. In any case, this combination of ideas, which seems to me to be the sweet spot of practicality for large public document sets, seems to have escaped the way that we approach many problems and systems. However, the question I ask is “How else are you going to do it?”

The elevator pitch for PRESTO is this:

“All documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs.”

I would see PRESTO as the kind of methodology that a government could adopt as a whole-of-government approach, in particular for public documents and of these in particular for legislation and regulations. The problem is not “What is the optimal format for our documents?” The question is “How can we link to the important grains of information in a robust, technology-neutral way that only needs today’s COTS tools?” The format wars, in this area, are asking exactly the wrong question: they focus us on the details of format A rather than format B, when we need to be able to name and link to information regardless of its format: supra-notational data addressing.

PRESTO is a combination of three ideas:

  • Permanent URLs
  • REST
  • Object-oriented

Legal documents such as legislation have three characteristics: they are highly structured, they are highly voluminous, but they have highly varying value. So many documents do benefit from the classic SGML treatment, with semantic Full Monty markup, but many others are accessed so rarely there is little benefit in having high-level markup for them. And in fact many documents may be scanned images with no text at all, and full markup entails re-keying.

So what PRESTO does (and people familiar with SGML PUBLIC identifiers will get the drift, and even more so people familiar with ISO Topic Maps) is to say that there is a real importance in being able to have permanent names even for resources that don’t have really brilliant representations available.

In fact, the legal document may not exist physically at all: it may be a base document and an amendment document. So we want a permanent URL for the idea of that document, and we want our system to deliver the best fit it can when we want to get the representation. And we want to allow multiple formats, because often the best representation may be client-dependent.
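To make “best fit” a little more concrete, here is a minimal Python sketch. It assumes the back-end knows which representations it holds for a permanent URL and that the client states its preferences in order. The function name `best_fit`, the media types and the stored values are hypothetical illustrations, not part of PRESTO itself.

```python
def best_fit(available, preferred):
    """Pick the first client-preferred media type the back-end actually holds.

    available: dict mapping media type -> stored representation (hypothetical)
    preferred: list of media types, most wanted first
    Returns (media_type, representation), or None if nothing matches.
    """
    for media_type in preferred:
        if media_type in available:
            return media_type, available[media_type]
    return None


# A rarely used Act that only exists as a scanned image (hypothetical data).
scanned_only = {"image/tiff": "scan of the printed Act"}
client_wants = ["application/xml", "text/html", "image/tiff"]

choice = best_fit(scanned_only, client_wants)
```

The permanent URL stays the same whichever representation comes back. Only the answer improves as better markup becomes available.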

Some people might understand it better if we say that PRESTO is about naming and structuring the configuration items for document sets, and that this forms a precondition for vendor-neutral implementations and for supporting plurality. What PRESTO does is say that when we drill down into a document, we do not want to drill down using media-dependent or presentation-dependent accidents, but according to the editorial/rhetorical (i.e. “semantic”) substance.

So why do I say “How else are you going to do it?”

The reason is that if you want to build a large information system for these kinds of documents, and you want to be truly vendor-neutral (which is not the same thing as saying that preferences and delivery capabilities will not still play their part), and you want to encourage incremental, decentralized, ad hoc and planned developments, in particular mash-ups, then you need three things. You need Permanent URLs (to prevent link rot). You need REST (for scale etc.). And you need object-orientation, in the sense of bundling the methods for an object with the object itself, rather than having separate verb-based web services which implement a functional programming approach. OO here also includes introspection, so that when you have a resource you can query it to find the various operations available.
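The object-oriented part can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Resource` class, its method names and the empty result are hypothetical, and a real system would sit behind HTTP rather than in memory.

```python
class Resource:
    """A document object that carries its own named views (hypothetical model)."""

    def __init__(self, url):
        self.url = url.rstrip("/")
        self._views = {}

    def add_view(self, name, producer):
        # Bundle the operation with the object, instead of exposing a
        # separate verb-based service that takes the object as a parameter.
        self._views[name] = producer

    def views(self):
        """Introspection: list the URLs of the operations this resource offers."""
        return [f"{self.url}/{name}" for name in sorted(self._views)]

    def get(self, name):
        """Run a named view, or return None when the resource does not offer it."""
        producer = self._views.get(name)
        return producer() if producer is not None else None


part4 = Resource("http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4")
part4.add_view("Referenced", lambda: [])  # no known references yet (placeholder)
```

Each view is named relative to the object it belongs to, so asking the object what it offers yields further hierarchical URLs rather than a catalogue of functions.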

What would a concrete example be? Let’s say we are a government and we have adopted PRESTO, so all our legislation is online with these kinds of permanent URLs, including every numbered thing inside the legislation. Then we want to be able to ask “What other laws reference Part 4 of this Act?” In PRESTO, we say “OK, the object here is Part 4, so we want to extend the URL for Part 4 to add a name which means the list of references.” So we would have a URL like http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4/Referenced. This gives a new URL, hierarchically based on the object it depends on. What we don’t do is http://www.eg.gov/functions/getReferences?to=/laws/ChildProtectionAct1904/1993/Part4 (which is procedural/functional), and not http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4?query=Referenced either (some people would think this is OK; I don’t have a particularly strong view at the moment).
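The naming rule in this example is simple enough to sketch in Python. The helper names `presto_url` and `parent_url` are my own, hypothetical ones. The point is only that every new name is one more path segment under the object it depends on.

```python
from urllib.parse import quote


def presto_url(base, *segments):
    """Join path segments onto a base URL, one hierarchical step per segment."""
    # Percent-encode each segment so that '/', '?' and '#' cannot sneak in.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base.rstrip('/')}/{path}"


def parent_url(url):
    """Return the URL of the object this one hangs off (drop the last segment)."""
    return url.rstrip("/").rsplit("/", 1)[0]


part4 = presto_url("http://www.eg.gov/laws", "ChildProtectionAct1904", "1993", "Part4")
referenced = presto_url(part4, "Referenced")
```

Here `referenced` is the URL for the list of references, and `parent_url(referenced)` leads back to Part 4 itself.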

Now what happens when we try to access this resource, using an HTTP GET for example? Well, that depends entirely on what information the back-end has to go on. It might be an HTTP 404 error. It might be an HTML file with a list of links. It might be an XML file of XPaths. It is up to the client to cope with the data that is sent, not the server to send in a standard, universal format. But if we allow introspection, we can then ask the resource for a list of the resources available (and HTTP content negotiation can be used too, potentially).
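A client that copes with whatever comes back might look roughly like the Python sketch below. It is an illustration under assumptions: the function name `interpret`, the result labels and the XML layout (`<references>` containing `<xpath>` elements) are all hypothetical, since PRESTO deliberately does not fix the format.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> element in an HTML body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def interpret(status, content_type, body):
    """Return (kind, items) for whatever the server chose to send."""
    if status == 404:
        return ("unavailable", [])
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        collector = _LinkCollector()
        collector.feed(body)
        return ("links", collector.links)
    if media_type in ("application/xml", "text/xml"):
        # Assumed layout: <references><xpath>...</xpath></references>
        root = ET.fromstring(body)
        return ("xpaths", [element.text for element in root.iter("xpath")])
    # Some other representation: hand it back unclassified.
    return ("unknown", [])
```

The same permanent URL can be served by very different back-ends, and the client degrades gracefully rather than failing when the richest format is not on offer.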

I guess a rule of thumb for a document system that conformed to this PRESTO approach would be that none of the URLs use # (which indicates that you are groping for information inside a system-dependent level of granularity rather than being system-neutral) or ? (which indicates that you are not treating every object you can think about as a resource in its own right that may itself have metadata and children.)
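That rule of thumb is easy to check mechanically. A minimal Python sketch, assuming the rule is applied to the raw URL string (the function name `follows_rule_of_thumb` is hypothetical):

```python
def follows_rule_of_thumb(url):
    """True when the URL uses neither a fragment marker (#) nor a query marker (?)."""
    return "#" not in url and "?" not in url


hierarchical = "http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4/Referenced"
functional = "http://www.eg.gov/functions/getReferences?to=/laws/ChildProtectionAct1904/1993/Part4"
query_style = "http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4?query=Referenced"
```

Of the three URL styles discussed earlier, only the hierarchical one passes. A check like this says nothing about whether the URL is permanent or well chosen, only that it avoids the two warning signs.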

As I have mentioned, I don’t want to claim that PRESTO is remotely new. However, for many people it is not remotely obvious until pointed out. We need to learn from the WWW that making information available is the pre-condition to building more interesting higher-level systems. In fact, the WWW has demonstrated this twice: first with HTML where as we got more HTML we started to get more imaginative use of links, systems like Google for example; but second with XML where it was only when there started to be enough information available in XML that AJAX took off.

Now I am not saying that there is no room for content management systems. But the ephemeral details of how and where a document is stored or composed should be hidden from the user, for this kind of public information.

You might say “But doesn’t your ‘best-fit’ approach mean that the system cannot guarantee to provide enough information to allow the automatic construction of effective legislation from amendment instructions?” And my answer is “Sure!”: however, all it means is that you can build amended document compilation systems only when you have the base and amendments in a suitable form, and the bottom line is that often you don’t. So you want an infrastructure that allows graceful degradation.

Anyone can make a perfect system with perfect data! PRESTO decouples representations from resource identification. The particular details of how to form the permanent URLs, which methods are available, etc., are the next level of question.