Towards reproducible URLs with provenance

This idea was proposed in 2024 as a Cambridge Computer Science Part II project and is available to be worked on. It will be supervised by Patrick Ferris and Anil Madhavapeddy as part of my Planetary Computing project.

Summary

Vurls are an attempt to add versioning to URI resolution. For example, what should happen when we request https://doi.org/10.1109/SASOW.2012.14? The prototype vurl library written in OCaml outputs the following:

# Eio_main.run @@ fun env ->
  Vurl_eio.with_default ~net:env#net env#cwd @@ fun () ->
  let vurl = Vurl.of_uri "https://doi.org/10.1109/SASOW.2012.14" in
  let vurl, file = Vurl.file vurl in
  Vurl.pp Format.std_formatter vurl;;

{
  "intentional_uri": "https://doi.org/10.1109/SASOW.2012.14",
  "segments": [
    {
      "uri": "file:./_data/document-6498375",
      "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq"
    },
    {
      "uri": "https://ieeexplore.ieee.org/document/6498375/",
      "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq"
    },
    {
      "uri": "http://ieeexplore.ieee.org/document/6498375/",
      "cid": "bag5qgerap5iaobunfnlovfzv4jeq2ygp6ltszlrreaskyh3mseky5osh2boq"
    }
  ]
}

The intentional_uri is the original URI, and the segments are the different versions of the document as tracked through HTTP redirects and similar indirections. The cid is a content identifier: a hash of the content retrieved in that snapshot. The file is the local file that the URI resolves to.
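As a rough illustration of the record shown above, here is a minimal Python sketch of the same shape: an intentional URI plus a list of segments, each pairing a URI with a hash of the content seen there. Python is used because the project text names it as a candidate language. The names Vurl, Segment, push and content_id are hypothetical and are not the OCaml library's API, and the identifier is a plain hex SHA-256 digest standing in for the real cid encoding, which this sketch does not attempt to reproduce.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


def content_id(body: bytes) -> str:
    """Stand-in content identifier: hex SHA-256 of the body (not a real cid)."""
    return hashlib.sha256(body).hexdigest()


@dataclass
class Segment:
    uri: str
    cid: str


@dataclass
class Vurl:
    intentional_uri: str
    segments: List[Segment] = field(default_factory=list)

    def push(self, uri: str, body: bytes) -> None:
        # Newest segment goes first, mirroring the ordering in the output above.
        self.segments.insert(0, Segment(uri, content_id(body)))

    def to_json(self) -> str:
        # asdict recurses into the nested Segment dataclasses.
        return json.dumps(asdict(self), indent=2)


# Placeholder bodies for illustration only; nothing is fetched here.
vurl = Vurl("https://doi.org/10.1109/SASOW.2012.14")
vurl.push("http://ieeexplore.ieee.org/document/6498375/", b"placeholder redirect body")
vurl.push("https://ieeexplore.ieee.org/document/6498375/", b"placeholder page body")
print(vurl.to_json())
```

Because each segment's identifier depends only on the bytes retrieved, two segments with the same identifier (as with the file: and https: entries above) can be recognised as the same content reached through different URIs.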

This project will build on the vurl concept to produce a practical implementation that integrates it into a popular HTTP library (in any language, though Python and OCaml are two good starting points), and will also build a simple proxy service that can resolve these URLs. The web service should be able to do three things: take a normal URL and return its content at that point in time; return a vurl representing the complete state of the protocol traffic; and take a vurl and return the diff between two versions of the content.
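The core of such a proxy could look something like the following Python sketch, which follows redirects by hand so that every hop is recorded, and diffs two snapshots of the content. It is only a sketch under stated assumptions: resolve, diff_versions and the Fetch type are hypothetical names, the fetcher is injected so the example runs without network access, every hop including the first request is recorded, header names are matched case-sensitively for brevity, and SHA-256 hex digests stand in for real cids.

```python
import difflib
import hashlib
from typing import Callable, Dict, List, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher type: maps a URL to (status, headers, body).
Fetch = Callable[[str], Tuple[int, Dict[str, str], bytes]]

REDIRECTS = (301, 302, 303, 307, 308)


def resolve(url: str, fetch: Fetch, max_hops: int = 10) -> Tuple[List[dict], bytes]:
    """Follow redirects manually, recording one segment per hop, newest first."""
    segments: List[dict] = []
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        segments.insert(0, {"uri": url, "cid": hashlib.sha256(body).hexdigest()})
        if status in REDIRECTS and "Location" in headers:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, headers["Location"])
            continue
        return segments, body
    raise RuntimeError("too many redirects")


def diff_versions(old: bytes, new: bytes) -> str:
    """Unified diff between two snapshots, treating both as UTF-8 text."""
    old_lines = old.decode("utf-8", "replace").splitlines(keepends=True)
    new_lines = new.decode("utf-8", "replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "old", "new"))


# In-memory stand-in for the network, using made-up example.org pages.
pages = {
    "http://example.org/doc": (301, {"Location": "https://example.org/doc"}, b""),
    "https://example.org/doc": (200, {}, b"version one\n"),
}
segments, body = resolve("http://example.org/doc", pages.__getitem__)
print([s["uri"] for s in segments])
print(diff_versions(body, b"version two\n"))
```

Injecting the fetcher keeps the redirect-recording logic independent of any particular HTTP library, which is one way the same code could later sit behind whichever client the project chooses to integrate with.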

Once successful, the project could also explore what more compact representations of the vurls would look like, and how to integrate them into existing web infrastructure.
