AoAH Day 15: Porting a complete HTML5 parser and browser test suite / Dec 2025
After my success with
My question, though, is how difficult it is to go in the other direction and move towards a strongly typed interface like OCaml's. Could we ultimately distill the extremely complex set of rules around parsing HTML all the way into a proof assistant like Lean, hopping via OCaml and Haskell to provide convenient executable pitstops?
Today's task was to vibespile the Python into ocaml-html5rw, a pure OCaml HTML5 parser and serialiser that passes the browser test suite 100%.
Approach
I took a very similar approach to my earlier
I also used an earlier trick to get my freshly generated OCaml test suite (which is itself parsing the third-party html5lib expect tests) to output a standalone HTML report. This was extremely useful to see progress, but also for me to understand what sort of thing HTML5 parsing involves. You can browse a snapshot to judge for yourself.

I'd never actually realised before doing this that, unlike many parsers, HTML5 parsing never actually fails. The WHATWG specification is a living standard that defines error recovery rules for almost every possible malformed input, ensuring all HTML documents produce a valid DOM tree just like browsers do. So the biggest danger here is that we parse a minor syntax error into a nonsensical DOM that is "far away" from the author's intention.
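To make the "never fails" property concrete, here is a toy sketch of what it looks like in types: the builder is a total function that always returns a tree, with problems reported alongside it rather than raised. This is emphatically not the WHATWG tree construction algorithm, and none of the names below (`build`, `Stray_end_tag`, `Unclosed_element`) come from ocaml-html5rw; they are made up for illustration.

```ocaml
(* Toy illustration only: not the WHATWG algorithm, not the ocaml-html5rw API. *)
type node =
  | Text of string
  | Element of string * node list

type token =
  | Open of string
  | Close of string
  | Chars of string

type error =
  | Stray_end_tag of string      (* end tag with no matching open element *)
  | Unclosed_element of string   (* element closed implicitly *)

(* Total function: every token list yields a tree plus a list of errors. *)
let build (tokens : token list) : node list * error list =
  (* [stack] holds open elements, innermost first, with children reversed. *)
  let add_child node stack root =
    match stack with
    | [] -> ([], node :: root)
    | (n, kids) :: rest -> ((n, node :: kids) :: rest, root)
  in
  (* Pop the innermost open element and attach it to its parent. *)
  let pop stack root =
    match stack with
    | [] -> ([], root)
    | (n, kids) :: rest -> add_child (Element (n, List.rev kids)) rest root
  in
  (* Close elements until [name] is closed, recording the implicit closes. *)
  let rec close name stack root errs =
    match stack with
    | [] -> (stack, root, errs)
    | (n, _) :: _ ->
      let stack', root' = pop stack root in
      if n = name then (stack', root', errs)
      else close name stack' root' (Unclosed_element n :: errs)
  in
  let step (stack, root, errs) tok =
    match tok with
    | Chars s ->
      let stack, root = add_child (Text s) stack root in
      (stack, root, errs)
    | Open n -> ((n, []) :: stack, root, errs)
    | Close n ->
      if List.exists (fun (m, _) -> m = n) stack then close n stack root errs
      else (stack, root, Stray_end_tag n :: errs) (* recover by ignoring it *)
  in
  let stack, root, errs = List.fold_left step ([], [], []) tokens in
  (* End of input: implicitly close whatever is still open. *)
  let rec drain stack root errs =
    match stack with
    | [] -> (root, errs)
    | (n, _) :: _ ->
      let stack', root' = pop stack root in
      drain stack' root' (Unclosed_element n :: errs)
  in
  let root, errs = drain stack root errs in
  (List.rev root, List.rev errs)
```

Feeding this `[Open "p"; Chars "hi"; Close "b"; Open "i"; Chars "x"]` still produces a single `p` element containing the text and a nested `i`, plus three recorded errors. The real specification is vastly more intricate, but the shape of the result (a tree, always, with errors on the side) is the same idea.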
Results
The HTML5 port took a few hours, and the resulting library seems to pass the HTML5 tests without much drama. One footgun is that the test runner itself was quite complex, and hidden in there was some skipping of test cases. So my review of this library actually happens in reverse: I read the test runners first to figure out what's going on, only working backwards to the library itself. A very strange workflow...
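One way to defuse the skipped-test footgun is to make skips a first-class outcome so they cannot hide behind a pass rate. The following is a minimal sketch under my own assumptions; the type and function names are hypothetical and are not taken from the generated test runner.

```ocaml
(* Hypothetical sketch: names are illustrative, not from the real runner. *)
type outcome =
  | Pass
  | Fail of string   (* reason for the failure *)
  | Skip of string   (* reason the case was not run *)

type summary = { passed : int; failed : int; skipped : int }

(* Tally every outcome, keeping skips separate from passes. *)
let summarise (outcomes : outcome list) : summary =
  List.fold_left
    (fun s o ->
      match o with
      | Pass -> { s with passed = s.passed + 1 }
      | Fail _ -> { s with failed = s.failed + 1 }
      | Skip _ -> { s with skipped = s.skipped + 1 })
    { passed = 0; failed = 0; skipped = 0 }
    outcomes

(* Only claim a clean run when nothing failed and nothing was skipped. *)
let all_green (s : summary) : bool = s.failed = 0 && s.skipped = 0
```

With this shape, a report that says "100%" while `skipped` is non-zero is immediately visible in the summary rather than buried in the runner's control flow.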

There's nothing too surprising so far; after all
Avoid reinventing the wheel
A lot of the code inside the library looked suspiciously like it was reinventing the Unicode wheel, with lots of character-encoding specific manipulation. So I cloned some key libraries from the OCaml ecosystem that are both fairly standalone and well engineered: astring, uutf, and uunf by


I cloned a few candidate libraries, and the agentic analysis determined that only some of them were relevant to the problem at hand, so I discarded the rest and focussed on the top set.
The resulting diff crunched the size of the library down considerably. Since we had extensive test coverage, the 100% pass rate at the end built up some confidence that semantics weren't changed too badly. The existence of OCaml interface files also meant I could inspect those separately in the diff and be satisfied that only implementations had changed.
Types make browsing the specification fun
Modules and types are a defining feature of OCaml, so I pointed the agent at the WHATWG standard and asked it to introduce explanations directly into the interface files themselves. This is quite different from an informal specification; the guidelines can be browsed directly and navigated around via the odoc HTML output.
When I was browsing the types, I realised that there were too many strings involved in parsing errors, and so the agent helped synthesise them into an extensible OCaml variant that describes things much more precisely. This has gotten me thinking about 'doing verification in reverse'; just like AI for maths is shifting what it means to do math, this sort of thing is shifting what it means to do formal specification.
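As a rough sketch of the strings-to-variants move, here is what an extensible variant for parse errors could look like. I am assuming "extensible variant" means OCaml's `type t = ..` feature; the type name, constructor names and `to_code` function are hypothetical and not the actual ocaml-html5rw interface. The first three codes are names used in the WHATWG list of parse errors, while `unexpected-end-tag` is invented here to stand in for a tree-construction error.

```ocaml
(* Hypothetical names throughout; not the actual ocaml-html5rw interface. *)

(* An open type: constructors can be added later, even from other modules. *)
type parse_error = ..

(* Tokenizer-level errors, named after WHATWG parse error codes. *)
type parse_error +=
  | Eof_in_tag
  | Unexpected_null_character
  | Duplicate_attribute of string  (* the offending attribute name *)

(* A later stage can extend the same type (invented example). *)
type parse_error +=
  | Unexpected_end_tag of string   (* the tag name that was not expected *)

(* Map an error back to a string code. The wildcard case is needed because
   an extensible type can never be matched exhaustively. *)
let to_code : parse_error -> string = function
  | Eof_in_tag -> "eof-in-tag"
  | Unexpected_null_character -> "unexpected-null-character"
  | Duplicate_attribute _ -> "duplicate-attribute"
  | Unexpected_end_tag _ -> "unexpected-end-tag"
  | _ -> "unknown-parse-error"
```

Compared with raw strings, a misspelt constructor is a compile error and payloads such as the attribute name travel with the error. The price is the catch-all case, since any module may add constructors the matcher has never seen.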
Reflections
I'll end by asking the same questions that Simon did:
I’ll end with some open questions:
- Does this library represent a legal violation of copyright of either the Rust library or the Python one?
- Even if this is legal, is it ethical to build a library in this way?
- Does this format of development hurt the open source ecosystem?
- Can I even assert copyright over this, given how much of the work was produced by the LLM?
- Is it responsible to publish software libraries built in this way?
- How much better would this library be if an expert team hand crafted it over the course of several months?
-- I just ported JustHTML from Python to JavaScript, Simon Willison, 2025
I feel the last question is answered most easily: an expert team that has access to these tools and the domain knowledge about HTML5 should be able to do a good job. While I have no experimental evidence in this domain about that fact, we did find earlier this year that
The question of copyright and licensing is difficult. I definitely did some
editing by hand, and a fair bit of prompting that resulted in targeted code
edits, but the vast amount of architectural logic came from JustHTML. So I
opted to make the LICENSE a joint one
with
I'm also extremely uncertain about ever releasing this library to the central
opam repository, especially as there are excellent HTML5
parsers already available. I haven't
checked if those pass the HTML5 test suite, because this is wandering into the
agents vs humans territory that I ruled out in my
I note that throughout my entire AoAH adventure so far, most of the code I've generated has been spectacularly unoriginal. I've received tons of help and examples of how to use tools from colleagues as input, but the aggregate set of OCaml code output that I would class as refreshingly interesting from me is pretty minimal. So in the long term, I don't think this process is "helping the core" of the community by coming up with beautiful functional pearls.
However, the libraries are satisfying an important need for utility: some
things like Yaml and HTML5 are fundamentally such ugly formats that I find it
hard to argue for elegant solutions therein, and yet we need to manipulate them
to live in the real world. So I'm ending up with a utilitarian plea, with some
angst that this might be smothering the functional flame that makes OCaml so
much fun in the first place. I must ask
