Composable diffing for heterogenous file formats

This is an idea proposed in 2024 as a Cambridge Computer Science Part III or MPhil project, and is under discussion with a student but not yet confirmed. It will be supervised by Patrick Ferris and Anil Madhavapeddy as part of my Planetary Computing project.

Summary

Working with large-scale geospatial data means handling a variety of file formats, such as CSV, JSON, GeoJSON, and GeoTIFF. Each format has its own structure and semantics, and it is often necessary to compare and merge data across them. For source code, the conventional solution is a tool such as Git. However, that approach is not always feasible for data, as it requires a text-based format whose structure can be compared meaningfully line by line.
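To make the line-by-line limitation concrete, here is a minimal Python sketch. The record is made-up illustrative data, and the standard-library `difflib` stands in for the kind of line-based comparison Git performs on text files; it is not Git's actual diff implementation.

```python
import difflib
import json

# Two serialisations of the same GeoJSON-like record (hypothetical data):
# only key order and whitespace differ.
a_text = '{"type": "Point", "coordinates": [0.1, 52.2]}'
b_text = '{\n  "coordinates": [0.1, 52.2],\n  "type": "Point"\n}'

# A line-based diff sees every line as changed.
line_diff = list(difflib.unified_diff(
    a_text.splitlines(), b_text.splitlines(), lineterm=""))

# A format-aware comparison parses both sides first and finds no difference.
semantically_equal = json.loads(a_text) == json.loads(b_text)

print(f"line-based diff reports {len(line_diff)} lines of output")
print(f"parsed values equal: {semantically_equal}")
```

The two texts describe the same point, yet the line-based view reports a complete rewrite. Binary formats such as GeoTIFF fare worse still, since they have no meaningful lines at all.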

This project explores the design of a composable diffing specification that can compare and merge data across heterogenous file formats. It will involve designing a domain-specific language for specifying diffing rules and implementing a prototype tool that applies them. Crucially, the tool should be composable: it should be possible to combine different diffing rules to compare and merge data across different file formats.
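One way to picture what composable diffing rules might look like is as small functions combined by higher-order functions. The sketch below is purely illustrative and is not the project's design: every name in it (`exact`, `approx`, `via`, `record`, `keyed_rows`) is hypothetical, the data is made up, and it embeds the rules in Python rather than in a dedicated domain-specific language. It covers only comparison, not merging.

```python
import csv
import io
import json
from typing import Any, Callable, List, Tuple

# A change is (path, old_value, new_value); a Differ maps two values to changes.
Change = Tuple[str, Any, Any]
Differ = Callable[[Any, Any, str], List[Change]]

def exact(old: Any, new: Any, path: str = "") -> List[Change]:
    """Leaf rule: report a change unless the values are equal."""
    return [] if old == new else [(path, old, new)]

def via(convert: Callable[[Any], Any], rule: Differ = exact) -> Differ:
    """Combinator: normalise both sides with `convert` before applying `rule`."""
    def diff(old, new, path=""):
        return rule(convert(old), convert(new), path)
    return diff

def approx(tolerance: float) -> Differ:
    """Leaf rule for numbers: ignore differences within `tolerance`."""
    def diff(old, new, path=""):
        close = abs(float(old) - float(new)) <= tolerance
        return [] if close else [(path, old, new)]
    return diff

def record(rules: dict, default: Differ = exact) -> Differ:
    """Combinator: diff two mappings key by key, using a per-key rule."""
    def diff(old, new, path=""):
        changes = []
        for key in sorted(set(old) | set(new)):
            sub = f"{path}/{key}"
            if key not in old or key not in new:
                changes.append((sub, old.get(key), new.get(key)))
            else:
                changes += rules.get(key, default)(old[key], new[key], sub)
        return changes
    return diff

def keyed_rows(key: str, row_rule: Differ) -> Differ:
    """Combinator: diff two lists of records matched on `key`, not on position."""
    def diff(old, new, path=""):
        index = lambda rows: {str(r[key]): r for r in rows}
        return record({}, default=row_rule)(index(old), index(new), path)
    return diff

# Format-specific loaders map each file format onto plain lists of records.
def load_csv(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text: str) -> list:
    return json.loads(text)

# Compose a rule for a table of sites: ids compared as text, coordinates
# compared with a tolerance, everything else compared exactly.
site_rule = record({"id": via(str), "lat": approx(1e-6), "lon": approx(1e-6)})
sites_diff = keyed_rows("id", site_rule)

old = load_csv("id,lat,lon,label\n1,52.2000,0.1000,fen\n2,51.5,-0.1,park\n")
new = load_json('[{"id": 2, "lat": 51.5, "lon": -0.1, "label": "garden"},'
                ' {"id": 1, "lat": 52.2, "lon": 0.1, "label": "fen"}]')

changes = sites_diff(old, new)
print(changes)  # [('/2/label', 'park', 'garden')]
```

Here a CSV file and a JSON file are compared with the same composed rule. Row order, number formatting, and the string-versus-integer id are all ignored, and the only reported change is the relabelling of site 2. A new format would need only a new loader, and a new notion of equality only a new leaf rule.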

As an evaluation, the project will apply the composable diffing specification to real-world datasets used in our Remote Sensing of Nature projects and compare the results with a conventional Git-based approach.
