Composable diffing for heterogenous file formats
This is an idea proposed in 2024 as a Cambridge Computer Science Part III or MPhil project, and is available to be worked on. It will be supervised by Patrick Ferris and Anil Madhavapeddy.
When dealing with large-scale geospatial data, we also have to deal with a variety of file formats, such as CSV, JSON, GeoJSON, and GeoTIFF. Each of these formats has its own structure and semantics, and it is often necessary to compare and merge data across them. With source code, the conventional solution is to use a tool such as Git to compare and merge changes. However, this approach is not always feasible for data, as it requires the data to be in a text-based format (which rules out binary formats such as GeoTIFF) and to be structured in a way that can be compared line by line.
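As a small illustration of the line-by-line limitation, the sketch below compares two serialisations of the same record, first as text and then as parsed data. The record itself is a made-up GeoJSON-like example, not taken from any project dataset, and Python's `difflib` stands in for a generic line-based diff rather than for Git's own algorithm.

```python
import difflib
import json

# Two serialisations of the same (hypothetical) GeoJSON-like geometry.
# Only key order and whitespace differ; the data is identical.
a_text = '{"type": "Point", "coordinates": [0.12, 52.2]}\n'
b_text = '{\n  "coordinates": [0.12, 52.2],\n  "type": "Point"\n}\n'

# A line-based diff, as a text-oriented tool would see the two files.
line_diff = list(
    difflib.unified_diff(a_text.splitlines(), b_text.splitlines(), lineterm="")
)

# A structural comparison after parsing both documents.
same_data = json.loads(a_text) == json.loads(b_text)

print("line-based diff is non-empty:", len(line_diff) > 0)
print("parsed data is equal:", same_data)
```

The textual diff reports changes even though nothing in the data changed, which is the kind of false positive a format-aware diff would avoid.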
This project explores the design of a composable diffing specification that can compare and merge data across heterogenous file formats. It will involve designing a domain-specific language for specifying diffing rules, and implementing a prototype tool that uses those rules to compare and merge data. Crucially, the tool should be composable, meaning that it should be possible to combine different diffing rules to compare and merge data across different file formats.
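To make the idea of combining diffing rules more concrete, here is a minimal sketch written as plain Python functions rather than as a dedicated language. Everything in it is an assumption made for illustration: the names `exact`, `within`, `record` and `each`, the `(path, old, new)` change format, and the sample feature are all hypothetical, and the sketch covers only diffing, not merging. It is one possible shape for such rules, not the project's design.

```python
from typing import Any, Callable, Dict, List, Tuple

# A change is (path, old value, new value).
# A differ maps two values to the list of changes between them.
Change = Tuple[str, Any, Any]
Differ = Callable[[Any, Any], List[Change]]


def _join(prefix: str, rest: str) -> str:
    """Build a path such as 'geometry.coordinates[1]'."""
    if not rest or rest.startswith("["):
        return prefix + rest
    return prefix + "." + rest


def exact(a: Any, b: Any) -> List[Change]:
    """Leaf rule: report a change when the values are not equal."""
    return [] if a == b else [("", a, b)]


def within(tolerance: float) -> Differ:
    """Leaf rule: numbers differing by at most `tolerance` count as equal."""
    def diff(a: float, b: float) -> List[Change]:
        return [] if abs(a - b) <= tolerance else [("", a, b)]
    return diff


def record(fields: Dict[str, Differ]) -> Differ:
    """Combinator: apply one rule per named field; unlisted fields are ignored."""
    def diff(a: dict, b: dict) -> List[Change]:
        changes: List[Change] = []
        for name, rule in fields.items():
            if name in a and name in b:
                changes += [(_join(name, p), o, n) for p, o, n in rule(a[name], b[name])]
            elif name in a or name in b:
                # Field present on one side only; the absent side is shown as None.
                changes.append((name, a.get(name), b.get(name)))
        return changes
    return diff


def each(rule: Differ) -> Differ:
    """Combinator: apply the same rule to list elements, position by position."""
    def diff(a: list, b: list) -> List[Change]:
        changes: List[Change] = []
        for i in range(max(len(a), len(b))):
            if i < len(a) and i < len(b):
                changes += [(_join(f"[{i}]", p), o, n) for p, o, n in rule(a[i], b[i])]
            else:
                changes.append((f"[{i}]", a[i] if i < len(a) else None,
                                b[i] if i < len(b) else None))
        return changes
    return diff


# Compose small rules into a rule for a (hypothetical) GeoJSON-like feature.
point = record({"type": exact, "coordinates": each(within(1e-6))})
feature = record({"geometry": point, "properties": record({"name": exact})})

old = {"geometry": {"type": "Point", "coordinates": [0.12, 52.2]},
       "properties": {"name": "Cambridge"}}
new = {"geometry": {"type": "Point", "coordinates": [0.1200000001, 52.3]},
       "properties": {"name": "Cambridge"}}

changes = feature(old, new)
print(changes)  # [('geometry.coordinates[1]', 52.2, 52.3)]
```

The first coordinate differs by less than the tolerance and is not reported, while the second is. Because every rule has the same type, a rule for a whole file can be assembled from rules for its parts, and a parser for another format (say, CSV rows converted to dictionaries) could reuse the same rules.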
As an evaluation, the project will apply the composable diffing specification to a real-world dataset used in our Remote Sensing of Nature projects, and compare the results with a conventional approach using Git.
Related reading
- Uncertainty at scale: how CS hinders climate research has relevant background reading on some of the types of diffs that would be useful in a geospatial context.
- Planetary computing for data-driven environmental policy-making covers the broader data processing pipelines we need to integrate into.
- "Generic type-safe diff and patch for families of datatypes", Eelco Lempsink (2009) is a principled library in Haskell for constructing type-safe diff and patch functions using GADTs.
- diffi: diff improved; a preview, Gioele Barabucci (2018) is a comparison tool whose primary goal is to describe the differences between the content of two documents regardless of their formats.