HTML5 parsing

[2011-01-21] dev, html5, webdev
John Resig has written a post on HTML5 parsing (most people seem to write “HTML5” without a space before the “5”). Before HTML5, parsing was ad hoc. Netscape was the first popular web browser, and its lenient parsing led to many web pages containing all kinds of syntax errors (the HTML “tag soup”). Afterwards, other web browsers mainly tried to replicate the bugs and features of the Netscape parser. Bad news: HTML5 did not opt for a much cleaner syntax. Good news: There is now a specification of how to parse HTML. It makes the tag soup the de facto standard. While it is scarily technical (you didn’t expect a grammar, did you?), it does bring stability for browser implementers.
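To get a feel for what “tag soup” looks like to a lenient parser, here is a minimal sketch using Python’s standard-library html.parser module. Note the assumptions: html.parser is a forgiving tokenizer-style parser, not an implementation of the HTML5 parsing algorithm, and the names TokenLogger and tokenize as well as the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Records the tokens the parser reports (hypothetical helper, not part of any spec)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(markup):
    """Feed markup to the lenient parser and return the recorded tokens."""
    logger = TokenLogger()
    logger.feed(markup)
    logger.close()
    return logger.tokens


# Made-up tag soup: unclosed <p> elements and mis-nested <b>/<i>.
soup = "<p>One<p>Two <b>bold <i>both</b> italic</i>"
for token in tokenize(soup):
    print(token)
```

The parser raises no error. It simply reports the tags in the order they were written (two p start tags, then b and i opened, then b closed before i). Deciding what document tree such a token stream should produce is exactly the part that the HTML5 specification now pins down.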

Fun anecdote told by John Resig (check the post for more details and links):

“One of the first implementations of the HTML 5 parsing rules was actually created to power the HTML 5 validator. [...] This particular implementation is in Java [...] Henri Sivonen (the author of the validator) just recently landed [...] a brand new HTML 5 parsing engine in Gecko, destined for the next version of Firefox. What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.”

The WebKit blog has more on HTML5 parsing. It lists three main advantages:
  • Better interoperability between browsers.
  • Better compatibility with web pages. Apparently, lots of effort and web crawling went into designing the HTML5 parsing algorithm.
  • SVG and MathML can be embedded in HTML.
I’m a bit shocked at how long it took them to implement the parser:
“We’ve been implementing the HTML5 parsing algorithm in phases. Two months ago, we finished the first phase, which consisted of the tokenization algorithm. Late last night, we finished the second major piece: the tree builder algorithm. Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code. In the next phase, we’ll tackle fragment parsing (which is used by innerHTML and HTML5test.com).”
This is one important argument against this kind of lenient parsing: An HTML parser is hard to implement, even though it is a piece of software with many uses (browsers, crawlers, screen scrapers, transformers, ...). I wonder if we should introduce a strict mode for HTML5 for people who actually want to write clean, easily parsable HTML.