Fun anecdote told by John Resig (check the post for more details and links):
One of the first implementations of the HTML 5 parsing rules was actually created to power the HTML 5 validator. [...] This particular implementation is in Java [...] Henri Sivonen (the author of the validator) just recently landed [...] a brand new HTML 5 parsing engine in Gecko, destined for the next version of Firefox. What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.

The WebKit blog has more on HTML5 parsing. It lists three main advantages:
- Better interoperability between browsers.
- Better compatibility with web pages. Apparently, lots of effort and web crawling went into designing the HTML5 parsing algorithm.
- SVG and MathML can be embedded in HTML.

The post also reports on the implementation's progress:

We’ve been implementing the HTML5 parsing algorithm in phases. Two months ago, we finished the first phase, which consisted of the tokenization algorithm. Late last night, we finished the second major piece: the tree builder algorithm. Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code. In the next phase, we’ll tackle fragment parsing (which is used by innerHTML and HTML5test.com).

That size is one important argument against this kind of lenient parsing: an HTML parser is hard to implement, yet it is a piece of software with many uses (browsers, crawlers, screen scrapers, transformers, ...). I wonder if we should introduce a strict mode for HTML5 for people who actually want to write clean, easily parsable HTML.
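
To make the lenient-versus-strict contrast concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the standard library's `html.parser` is a lenient tokenizer, not an implementation of the HTML5 algorithm, and an XML parser stands in for the hypothetical strict mode, which does not exist in HTML5. The helper names `lenient_tags` and `is_well_formed` are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Lenient tokenizer: records start tags and never complains about nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Return the start tags seen by the lenient tokenizer (hypothetical helper)."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed(markup):
    """Stand-in for a 'strict mode': accept only markup that is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


sloppy = "<p>one<p>two <b>bold</p>"  # unclosed <p> and <b>
clean = "<div><p>one</p><p>two <b>bold</b></p></div>"

print(lenient_tags(sloppy))    # ['p', 'p', 'b']
print(is_well_formed(sloppy))  # False
print(is_well_formed(clean))   # True
```

The lenient side happily tokenizes the sloppy input and leaves all the repair work (which element closes where) to a tree builder, which is where much of the complexity lives. The strict side is trivial by comparison, because it simply rejects anything that is not properly nested.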