home Anil Madhavapeddy, Professor of Planetary Computing  

Effects based scheduling for the OCaml compiler pipeline

This is an idea proposed in 2025 as a good starter project, and is available for being worked on. It may be co-supervised with David Allsopp.

In order to compile the OCaml program foo.ml containing:

Stdlib.print_endline "Hello, world"

the OCaml compilers only require the compiled stdlib.cmi interface to exist in order to determine the type of Stdlib.print_endline. This separate compilation technique allows modules of code to be compiled before the code they depend on has necessarily been compiled. When OCaml was first written, this technique was critical to reduce recompilation times. As CPU core counts increased through the late nineties and early 2000s, separate compilation also provided a parallelisation benefit, where modules which did not depend on each other could be compiled at the same time as each other benefitting compilation as well as recompilation.

For OCaml, as in many programming languages, the compilation of large code bases is handled by a separate build system (for example, dune, make or ocamlbuild) with the compiler driver (ocamlc or ocamlopt) being invoked by that build system as required. In this project, we'll investigate how to get the OCaml compiler itself to be responsible for exploiting available parallelism.

Some previous work (parts of which are available on GitHub[1]) showed the benefits of sharing the typing information known to the compiler between each invocation. The hypothesis was during a sequential computation, a considerable amount of time is spent by the compiler searching for and reloading typing information, as well as the overheads of launching thousands of copies of the compiler in a given build.

Our test compiler with an adapted version of Dune showed as much as a halving of compilation time in sequential builds. However, in parallel builds, the results were not as impressive - although the many invocations of the compiler repeat the same loading operations, much of this cost is (quite predictably) masked by performing the work in parallel.

The previous investigation was carried out on OCaml 4.07. Although it shared the typing information between "invocations" of the compiler, the compiler pipeline itself was unaltered - a file only started to be processed when all of its dependencies were ready. Furthermore, it remained the responsibility of a build system to provide this dependency ordering.

Fast forward to the present day, and we have OCaml 5.x, with both first class support for parallelism and algebraic effects. Domains provide an obvious ability for a single compiler process to compile several files simultaneously. Effects should allow us to break the pipeline into stages, suspending the compilation whenever new type information is required by performing an effect. Using this model, it should be possible to start with the entry module for a program and allow the type checker itself to discover the dependency graph. it should be possible to see many files being progressively type-checked in parallel.

The hypothesis is that this will be both faster, but also considerably simpler. The "scheduler" required for handling the effects should be a considerably simpler program than a full-blown separate build system. Key challenges in this work:

  1. see dra27/ocaml#nandor-dune-work, dra27/dune#nandor-shmap, and nandor/offheap.

    ↩︎︎
# 1st Apr 2025   iconideas effects functional idea-available idea-beginner ocaml urop

Related News