Towards reproducible URLs with provenance
This idea was proposed in 2024 as a Cambridge Computer Science Part II project and is available to be worked on. It will be supervised by Patrick Ferris and Anil Madhavapeddy.
Vurls are an attempt to add versioning to URI resolution. For example, what should happen when we request https://doi.org/10.1109/SASOW.2012.14, and how do we track the chain of events that leads to an answer coming back? The prototype vurl library, written in OCaml, outputs the following:
# Eio_main.run @@ fun env ->
Vurl_eio.with_default ~net:env#net env#cwd @@ fun () ->
let vurl = Vurl.of_uri "https://doi.org/10.1109/SASOW.2012.14" in
let vurl, file = Vurl.file vurl in
Vurl.pp Format.std_formatter vurl;;
{
  "intentional_uri": "https://doi.org/10.1109/SASOW.2012.14",
  "segments": [
    {
      "uri": "file:./_data/document-6498375",
      "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq"
    },
    {
      "uri": "https://ieeexplore.ieee.org/document/6498375/",
      "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq"
    },
    {
      "uri": "http://ieeexplore.ieee.org/document/6498375/",
      "cid": "bag5qgerap5iaobunfnlovfzv4jeq2ygp6ltszlrreaskyh3mseky5osh2boq"
    }
  ]
}
The intentional_uri is the original URI, and the segments are the different versions of the document as tracked through HTTP redirects and similar resolution steps. The cid is a content identifier: a hash of the content retrieved in that snapshot. The file value returned by Vurl.file is the local file that the URI resolves to.
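To make the shape of this record concrete, here is a minimal Python sketch of the same structure. It is not the vurl library's API: the names Segment, Vurl, content_id and record are hypothetical, the cid is a plain hex SHA-256 digest standing in for the real content identifier encoding, and the byte strings are made-up placeholders rather than real responses. It assumes, as the output above appears to show, that the most recent resolution step is listed first.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Segment:
    uri: str  # location contacted at this resolution step
    cid: str  # stand-in content identifier (hex SHA-256 of the bytes seen)


@dataclass
class Vurl:
    intentional_uri: str                          # what was originally requested
    segments: list = field(default_factory=list)  # most recent step first

    def to_json(self) -> str:
        # asdict recurses into the Segment dataclasses held in the list.
        return json.dumps(asdict(self), indent=2)


def content_id(body: bytes) -> str:
    """Hypothetical stand-in for a real CID: hex SHA-256 of the content."""
    return hashlib.sha256(body).hexdigest()


def record(vurl: Vurl, uri: str, body: bytes) -> Vurl:
    """Prepend one resolution step so the newest hop comes first."""
    vurl.segments.insert(0, Segment(uri, content_id(body)))
    return vurl


# Placeholder bodies: a redirect page, then the document, then its saved copy.
v = Vurl("https://doi.org/10.1109/SASOW.2012.14")
record(v, "http://ieeexplore.ieee.org/document/6498375/", b"redirect page")
record(v, "https://ieeexplore.ieee.org/document/6498375/", b"document bytes")
record(v, "file:./_data/document-6498375", b"document bytes")
print(v.to_json())
```

Because the identifier depends only on content, the saved local copy and the page it was fetched from get the same cid here, which matches the first two segments in the output above.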
This project will build on the vurl concept to produce a practical implementation integrated into a popular HTTP library (in any language, though Python and OCaml are two good starting points), along with a simple proxy service that can resolve these URLs. The web service should be able to take a normal URL and return its content at that point in time, together with a vurl representing the complete state of the protocol traffic. It should also be able to take a vurl and return the diff between two versions of the content.
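The diffing part of such a service could be sketched as follows, using Python's standard difflib. This is one possible design under stated assumptions, not the project's specification: SnapshotStore and diff_versions are hypothetical names, snapshots are held in memory and keyed by a hex SHA-256 digest as a stand-in for the cid, and the content is assumed to be line-oriented text.

```python
import difflib
import hashlib


class SnapshotStore:
    """Hypothetical in-memory store mapping a content hash to its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, body: bytes) -> str:
        cid = hashlib.sha256(body).hexdigest()  # stand-in for a real cid
        self._blobs[cid] = body
        return cid

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]


def diff_versions(store: SnapshotStore, old_cid: str, new_cid: str) -> list:
    """Return a unified diff (as a list of lines) between two stored snapshots."""
    old = store.get(old_cid).decode("utf-8", errors="replace").splitlines()
    new = store.get(new_cid).decode("utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            old, new, fromfile=old_cid[:12], tofile=new_cid[:12], lineterm=""
        )
    )


# Two made-up snapshots of the same resource taken at different times.
store = SnapshotStore()
first = store.put(b"title: Example\nyear: 2012\n")
second = store.put(b"title: Example\nyear: 2013\n")
for line in diff_versions(store, first, second):
    print(line)
```

Keying snapshots by content hash means that two segments with the same identifier need only one stored copy, and diffing a version against itself yields an empty result. Binary or geospatial content would need a format-aware diff in place of the line-based one used here.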
Once successful, the project could also explore what more compact representations of the vurls would look like, and how to integrate them into existing web infrastructure.
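As one illustrative starting point for a more compact form, the JSON record could be serialised without whitespace, compressed, and encoded so that it is safe to embed in a URL. The sketch below uses only the Python standard library; pack and unpack are hypothetical names, and this encoding is an assumption made for illustration, not something the prototype defines.

```python
import base64
import json
import zlib


def pack(vurl: dict) -> str:
    """Serialise a vurl record into a URL-safe token (illustrative encoding)."""
    raw = json.dumps(vurl, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode("ascii")


def unpack(token: str) -> dict:
    """Invert pack: decode, decompress, and parse the JSON record."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(token)))


# The record printed by the prototype, reproduced as a Python dict.
example = {
    "intentional_uri": "https://doi.org/10.1109/SASOW.2012.14",
    "segments": [
        {
            "uri": "file:./_data/document-6498375",
            "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq",
        },
        {
            "uri": "https://ieeexplore.ieee.org/document/6498375/",
            "cid": "bag5qgeraipjyvov4axsmb4pktfhmleqi4oc2lno5if6f6wjyq37w4ktncvxq",
        },
        {
            "uri": "http://ieeexplore.ieee.org/document/6498375/",
            "cid": "bag5qgerap5iaobunfnlovfzv4jeq2ygp6ltszlrreaskyh3mseky5osh2boq",
        },
    ],
}

token = pack(example)
print(token)
print(unpack(token) == example)
```

General-purpose compression benefits from the repeated URIs and identifiers in a record like this, but a purpose-built format (for example, one that stores each distinct cid once) may do better; comparing such options is part of what the project could explore.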
Related reading
- https://github.com/quantifyearth/vurl has some prototype code.
- Uncertainty at scale: how CS hinders climate research has relevant background reading on some of the types of diffs that would be useful in a geospatial context.
- Planetary computing for data-driven environmental policy-making covers the broader data processing pipelines we need to integrate into.