The other day Marco Rinck tweeted something that’s very confusing, almost disturbing. Google’s HTML style guide suggests to omit optional HTML tags. Taking in mind that the page rank of your website is influenced (among other things) by whether your page has a clear layout and a good coding quality, following these guidelines is almost mandatory. Granted, I didn’t check whether this particular style guide is linked to the results shown in the webmaster tools, but you get the idea: if Google publishes a style guide on HTML and CSS, it’s going to have an impact on the market. So let’s have a close look at it.
What are optional HTML tags?
At first glance, the idea of declaring certain HTML tags “optional” and suggesting to omit them makes a lot of sense. Putting it in a nutshell, the idea is to make the code more human-readable. More to the point, the suggestion is to write HTML code the way the average human would write it if they didn’t happen to be a developer. Developers have learned to love clear structures, but that’s not the way most people think. A particularly fascinating property of the human mind is its ability to recognize patterns. Even better, it’s able to infer patterns where there are none.
Basically, that’s the idea of optional HTML tags: you don’t have to write an HTML tag if you can easily infer it from the context.
For instance, there’s no point in writing the
<html> tag. The document is shown in the browser, so we already know it’s an HTML document. Similarly, there’s no point in wrapping the
<head> tag around the
<title> tag, and the
<body> around the
<p> tag. Both forms and paragraphs make only sense in the body of an HTML document, so it’s easy to infer the
Being a human being – and a particularly lazy one – Google’s style guide fills my heart with joy. Mind you: having to end each and every paragraph with a
</p> tag before starting a new paragraph is really silly. Actually, it’s the number one mistake when I add a new page to our BootsFaces showcase.
On the other hand, the new style guide is exactly the opposite of what web designers have been taught during the last couple of years. I’m not entirely sure about the version numbers, but as far as I remember, starting with HTML4, XHTML was promoted as the new way to go. Former versions of HTML – or rather, the browser rendering engines – tended to be very sloppy. Or, putting it positively, they mapped the pattern inference capabilities of the human brain. The (then) new standard was to get rid of inference. The idea was to define everything in a clear, concise manner.
What about parsers?
I don’t know why this clean coding style was propagated, but my theory (hey, that’s pattern inference at it’s best!) is that the idea was to make HTML machine-readable. Clean XHTML code can be parsed by an XML parser. In fact, that was what allowed me to write BabbageFaces. JSF usually generates very clean XML code which can be read and analyzed by an XML parser. That’s what I did to reduce the size of the HTML code sent to the browser.
However, I consider BabbageFaces a nice but failed project, basically because hardly any JSF page fulfills the strict requirements of XML. The average JSF page renders code that can almost, but not quite be read by an XML parser. Almost always optional HTML tags are used. Browser vendors always knew that HTML pages are written by human beings, so they introduced pattern inference from day one. So browsers don’t have trouble displaying the HTML code generated by JSF pages. The XML parser of BabbageFaces quickly ran into trouble with the same pages. Which is sort of remarkable, given that the HTML code generated by JSF frameworks usually is a lot cleaner than HTML code written by humans.
We need new parsers!
Far be it from me to resist the new guidelines. Quite the contrary. In recent years, compiler manufacturers have learned a lot. The semicolon that’s required at the end of each Java statement originally was introduced to make it easier to implement parsers analyzing the source code. It took compiler manufacturers a couple of years to recognize that the semicolon can be induced from context.
There are many programs out there analyzing HTML code. I don’t know whether they cope with tag inference or not. Thing is, HTML4 made it official that they don’t have to, while HTML5 officially makes tag inference part of the HTML language. As a consequence, you can’t simply use an XML parser to parse HTML pages. You need a full-blown HTML parser. A short investigation showed that Jsoup may be such a parser, but please don’t take my word for it: neither did I try it, nor did I compare it with other parsers.
Wrapping it up
Actually, the Google HTML and CSS style guide give the best summary itself:
This approach may require a grace period to be established as a wider guideline as it’s significantly different from what web developers are typically taught.
As a human being, I welcome the new standard because it makes both reading and writing HTML pages easier. As someone who sometimes writes programs to analyze HTML pages, I’m not so sure. At least, the idea to simply use an XML parser to analyze an HTML page is doomed.