Distributed Task Scheduling Framework over Irmin

This idea was proposed in 2019 as a Cambridge Computer Science Part II project and was completed by Mohammed Daudali, supervised by Anil Madhavapeddy as part of the Unikernels project.

Summary

Distributed computation and task-scheduling frameworks can be decentralised with minimal cost to performance. This decentralisation significantly reduces the trusted computing base and the complexity of the system, giving end consumers greater confidence in the integrity of the results. Carefully designed persistent and transient data structures can augment this confidence by providing strong isolation guarantees in a multi-tenant system, whilst retaining full transparency over the dynamic data-flow graph. All of this can be achieved with an API that interfaces directly with conventional developer tools, enabling end users to easily verify that the computation aligns with their expectations. Detailed metadata that carefully tracks resource usage can ensure a fair and transparent pricing structure for both service providers and consumers. Together, this allows open-source communities to remain completely transparent whilst providing non-developer end users with a simpler, more accessible downloadable package that can be independently verified.

This project will investigate building a composable task scheduler over Irmin. The core of the project starts with a single-server model in which a large number of workers independently clone and interact with a persistent job-queue CRDT. Crucially, each worker schedules tasks using only local knowledge, so there is a high probability that at least two workers pick up the same task. This has a twofold benefit: first, completed work can be independently verified by several different workers; second, work left in progress by stragglers can be picked up by other workers, lowering the time to completion. By independently sampling and verifying work, we remove the need to implicitly trust individual workers. An adversary must now compromise every worker node to guarantee an undetected result: compromising N - 1 workers still leaves a non-zero probability that the attack is detected. Given a heterogeneous set of worker machines, each under the control of a different and independent entity, this attack becomes significantly harder. The project will investigate suitable sampling schedules for mapping out the Pareto frontier between over-committing work and cluster throughput.
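As a concrete starting point, here is a minimal sketch of the shared job-queue idea against the Irmin 2.x API (module and function names changed in later Irmin releases). The jobs/&lt;id&gt; key layout, the submit and claim helpers, and the worker names are illustrative assumptions of this sketch, not the project's actual queue CRDT design; a real deployment would use a replicated on-disk backend that each worker clones, rather than the in-memory one used here.

```ocaml
(* Sketch: a shared job queue in an in-memory Irmin store (Irmin 2.x).
   Each Irmin commit records who changed what, giving the transparent
   provenance trail the project relies on. *)
open Lwt.Infix

module Store = Irmin_mem.KV (Irmin.Contents.String)

(* Commit metadata identifying the worker that made the change. *)
let info worker msg () =
  Irmin.Info.v ~date:(Int64.of_float (Unix.time ())) ~author:worker msg

(* Publish a task under jobs/<id> with state "pending". *)
let submit t id payload =
  Store.set_exn t ~info:(info "submitter" ("submit " ^ id))
    [ "jobs"; id; "payload" ] payload >>= fun () ->
  Store.set_exn t ~info:(info "submitter" ("pending " ^ id))
    [ "jobs"; id; "state" ] "pending"

(* A worker claims a task using only its local view of the store.
   Two workers may claim the same task; that duplication is exactly
   what enables independent verification of results. *)
let claim t worker id =
  Store.set_exn t ~info:(info worker ("claim " ^ id))
    [ "jobs"; id; "claims"; worker ] "in-progress"

let () =
  Lwt_main.run begin
    Store.Repo.v (Irmin_mem.config ()) >>= fun repo ->
    Store.master repo >>= fun t ->
    submit t "job-1" "build opam package X" >>= fun () ->
    claim t "worker-a" "job-1" >>= fun () ->
    claim t "worker-b" "job-1" >>= fun () ->
    Store.get t [ "jobs"; "job-1"; "state" ] >|= print_endline
  end
```

Because claims live under distinct per-worker keys, concurrent claims by worker-a and worker-b merge cleanly rather than conflicting, which is the property that makes a grow-only claim set a natural CRDT encoding here.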
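To make the detection argument concrete, a back-of-envelope model (the symbols N, r, and c are assumptions of this sketch, not parameters fixed by the project): suppose each task is executed by r distinct workers sampled uniformly from N, and an adversary controls c of them. A forged result survives only if all r replicas land on compromised workers:

$$
\Pr[\text{undetected}] = \frac{\binom{c}{r}}{\binom{N}{r}},
\qquad
\Pr[\text{detected} \mid c = N-1] = 1 - \frac{\binom{N-1}{r}}{\binom{N}{r}} = \frac{r}{N}.
$$

Since useful throughput scales roughly as N/r, sweeping the replication factor r traces out precisely the trade-off between over-committing work and cluster throughput whose Pareto frontier the project sets out to characterise.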
