Web Site Validation: The Fundamentals
Quick Pop Quiz:
- Do you have a web site?
- Does it validate?
- Against which DOCTYPE?
- Do you know how to tell?
- Do you know what it means?
- Do you know why you should care?
By Gareth Powell
Web site validation is an issue that will become increasingly important with the emergence of newer, more standards-compliant browsers: browsers which, for the first time, will treat ill-formed Web sites with contempt and possibly even reject them. Upcoming versions of both Firefox and Internet Explorer promise rigorous standards-compliant rendering. Are you up to the challenge?
So how bad does a Web site have to be to be “invalid”? Validity turns out to be a surprisingly hard test to pass. For example, three well-known Web sites (GE, Ford and Google) were all invalid at the time of writing; the appendix summarizes their errors.
This article reviews the W3C definition of valid documents, as well as common problems that Web sites experience trying to be valid, ways to test a Web site for validity, and, most importantly, how to make your own Web sites valid.
What is Validation?
With apologies to those who want to get to the meat of this article, I’d like to spend a few moments reviewing the fundamentals of HTML, and how it applies to Web site layout and design.
HTML is not, in fact, SGML
HTML was originally envisaged as yet another SGML “application”. SGML [1] is a standard means of specifying ways of specifying document content. That is why, at least from the SGML point of view, an HTML document is all but completely unreadable without the HTML decoder “codebook”: the HTML Document Type Definition (DTD), which a document names in its DOCTYPE declaration. The DTD is, quite literally, the specification of all the HTML elements, and how they should be used.
However, generic SGML parsers are neither small nor trivial – they are, after all, programs designed to understand documents in a language which is itself described in a document in a language which the SGML parser does natively understand. In other words, SGML parsers are complex beasts, heavy on system resources.
In 1994, as companies rushed to provide Web browsers that would work on small 486-based PCs running Windows 3.1, the idea of using a full SGML parser – let alone building one – just to interpret a tag language that was very simple to understand and render was simply out of the question.
The consequence of this was that from 1994 to 2000, the definition of what HTML actually was, as well as what it meant, was not the standard HTML DTD, but rather the implicit, formally undefined parsing and rendering code inside Mosaic, Netscape Navigator and Internet Explorer. The DTD barely managed to keep up, being updated to reflect the latest features (forms, tables, frames) that were added to the increasingly proprietary browsers.
Missing In Action: The DOCTYPE declaration
The DOCTYPE declaration is required in all valid SGML documents. After all, without it, it would be impossible for an SGML parser to know the language in which the document is written. However, because early Web browsers had their knowledge of HTML hard-coded, the declaration was almost invariably omitted as useless. Even today, Web books and sites [2] that introduce people to building Web pages leave it out as an “advanced detail” to be introduced later. [3]
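Checking whether a page declares a DOCTYPE at all is easy to automate. The following is a minimal sketch in Python using the standard library's html.parser module; the names DoctypeFinder and find_doctype are made up for this illustration, and the check only reports what the page claims, not whether the page actually conforms to it.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Records the first DOCTYPE declaration seen while parsing."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # handle_decl receives the text between "<!" and ">".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html_text):
    """Return the DOCTYPE declaration text, or None if the page has none."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype


with_doctype = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    '"http://www.w3.org/TR/html4/loose.dtd">\n'
    "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
)
without_doctype = "<html><body><p>Hello</p></body></html>"

print(find_doctype(with_doctype) is not None)  # True
print(find_doctype(without_doctype))           # None
```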
Will the real HTML please step forward...
Along with the explosion of different browsers and proprietary extensions to HTML on the Web came the impossibility of predicting the exact behavior in all browsers of pages using advanced features, especially those involving interaction, or styling and layout. In response, the World Wide Web Consortium (W3C), a standards body, proposed a standard feature set for content (HTML), styling (CSS), and interaction (DOM and ECMAScript) [4]. At the same time, Microsoft added to IE5 (initially only in the Mac version) the ability to control the behavior of the rendering engine by specifying an appropriate DOCTYPE at the top of the page (a behavior now known as DOCTYPE switching). However, for a variety of reasons, that browser did not simply reject documents that failed to match their DOCTYPE; instead it made guesses about what was intended. Subsequently, all the popular browsers have copied this approach, and most documents on the Web today claim to match a particular DOCTYPE without necessarily doing so.
Testing for document correctness
Therefore, in order to determine whether or not your page is valid according to its DOCTYPE, it is not sufficient to load it in a browser: the browser will try to render even blatantly invalid documents. Because of this, even when the browsers are trying to render in “standards” mode, they must make allowances for what web designers might have meant. Again, we find ourselves in a vicious circle of defining standards that are interpreted in non-standard ways.
Instead, it is necessary to use a tool that actually contains an SGML parser and compares the structure of the document against the declared DOCTYPE. One such tool is the W3C’s Validator (http://validator.w3.org), which can operate directly on a given URL, an uploaded file, or merely on pasted text.
The validator reports obvious errors (such as missing or misplaced tags) and subtler errors (such as non-standard or proprietary attributes), and it provides warnings about other usages that may be incorrect or simply confusing to browsers.
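If you have many pages to check by URL, it can help to generate the validator links rather than type them. The sketch below only builds a link and does not contact the service. It assumes the validator's classic interface, in which the page address is passed as a "uri" query parameter to the "check" path; that is an assumption about the service rather than something stated in this article, and the function name validator_url is invented for the example.

```python
from urllib.parse import urlencode

# Assumption: the validator's classic "check" endpoint and "uri" parameter.
VALIDATOR_CHECK = "http://validator.w3.org/check"


def validator_url(page_url):
    """Build a link that asks the validator to check the given page."""
    return VALIDATOR_CHECK + "?" + urlencode({"uri": page_url})


print(validator_url("http://www.example.com/"))
# http://validator.w3.org/check?uri=http%3A%2F%2Fwww.example.com%2F
```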
Looking forward
The theory, of course, is that with a set of standards that all browsers follow, and documents that can be proved correct against those same standards, Web sites will very soon work consistently on all browsers, on all platforms. It’s important to realize that it’s not just the everyday browsers you need to concern yourself with. Special users, such as a search engine’s indexing bot, need to be considered in this analysis too. [5]
There is more to standards than just validation, of course. There are many ways of generating the same visual effects on the screen, but not all of them have the same “meaning” when analyzed automatically [6], and it’s important for search engine optimization that your pages convey their intended meaning to those analyzers.
But this article is just about the fundamentals of web validation, and the semantics of a web page is too complex to get into here.
How do I get there?
All this standards talk sounds great, but if you’re like many developers, it also sounds a million miles away. For a start, you’ve got to keep a dozen Web sites up and running, there’s a handful of upgrades that your users urgently need, and you don’t really know what these standards are after all, let alone want to have a tool tell you that your web pages have hundreds of “errors”. Besides, everyone you know uses IE6 and everything works fine [7]. Doesn't it?
The answer, of course, is to make changes a little bit at a time. There’s no need to take a couple of weeks overhauling every single site you’re responsible for. In fact, the W3C provides two DOCTYPEs – called “transitional” – especially to help people like you make the transition. Clearly this transition could be hugely accelerated with automated tool assistance.
Let’s look at some issues
These examples are taken from a real random Web site, namely www.hotelpuntaleona.com, a popular destination in Costa Rica:
Line 246, Column 58: end tag for element "P" which is not open.
<h3>Costa Rica Hotel Recognition</h3></p>
In this case, the error is caused by an unnecessary closing tag, which needs to be removed.
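To see why the validator considers the P element “not open” here, it helps to remember that in HTML a heading implicitly closes any paragraph that is still open. The following Python sketch imitates that bookkeeping with a simple stack. It is a deliberately simplified approximation, not a DTD-driven validator: the CLOSES_P and VOID sets are incomplete lists chosen for this example, and the class name is invented.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements whose start tag
# implicitly closes an open <p>.
CLOSES_P = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div", "ul", "ol", "table"}
# Assumption: a small, incomplete set of elements that never take an end tag.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class StrayEndTagFinder(HTMLParser):
    """Reports end tags for elements that are not currently open."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and self.open_elements[-1:] == ["p"]:
            self.open_elements.pop()  # the new block ends the paragraph
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in self.open_elements:
            # Close everything up to and including the matching element.
            while self.open_elements.pop() != tag:
                pass
        else:
            line, _column = self.getpos()
            self.problems.append((line, tag))


finder = StrayEndTagFinder()
finder.feed("<p>Intro\n<h3>Costa Rica Hotel Recognition</h3></p>")
finder.close()
for line, tag in finder.problems:
    print(f'line {line}: end tag for element "{tag}" which is not open')
# line 2: end tag for element "p" which is not open
```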
Line 232, Column 5: character data is not allowed here.
<
This appears to just be a typo – an extra opening tag character has been put here, but has not been followed by the tag name. This is an obvious error in HTML, and should be removed.
Line 156, Column 10: element "NOEMBED" undefined.
<noembed>
The embed and noembed tags were proprietary browser extensions, widely supported by browsers but never part of the HTML 4 DTD, so they do not validate. In HTML 4 the embed tag is replaced by the object tag, and noembed is not needed at all: the fallback content it carried simply goes inside the object element.
Line 120, Column 79: there is no attribute "LEFTMARGIN".
<body text="#FFFFFF"
link="#FFFFFF" vlink="#F2F2F2" alink="#FFFFFF"
leftmargin="0" topmargin="0"
onLoad="MM_timelinePlay('Timeline1')">
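The leftmargin and topmargin attributes are proprietary and do not appear in any HTML 4 DTD; the text, link, vlink and alink attributes are deprecated and permitted only by the Transitional DTD. In HTML 4, styling is handled using CSS, so all of these should be removed and the equivalent style information added as a rule for the body element in a stylesheet file or style section.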
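As an illustration of what “the appropriate style information” might look like, here is a small Python sketch that turns the presentational body attributes from the example into CSS rules. The attribute-to-property mapping is my own assumption of the usual equivalents (margin values are treated as pixels), the function name is invented, and the onLoad handler is left alone because it is behavior, not styling.

```python
# Assumed mapping from presentational <body> attributes to CSS declarations.
BODY_PROPERTIES = {
    "text": "color: {}",
    "bgcolor": "background-color: {}",
    "leftmargin": "margin-left: {}px",
    "topmargin": "margin-top: {}px",
}
# Assumed mapping from link-color attributes to CSS selectors.
LINK_SELECTORS = {"link": "a:link", "vlink": "a:visited", "alink": "a:active"}


def body_attrs_to_css(attrs):
    """Return CSS rules equivalent to the given <body> attributes."""
    declarations = [
        template.format(attrs[name])
        for name, template in BODY_PROPERTIES.items()
        if name in attrs
    ]
    rules = []
    if declarations:
        rules.append("body { " + "; ".join(declarations) + "; }")
    for name, selector in LINK_SELECTORS.items():
        if name in attrs:
            rules.append(selector + " { color: " + attrs[name] + "; }")
    return "\n".join(rules)


example = {
    "text": "#FFFFFF", "link": "#FFFFFF", "vlink": "#F2F2F2",
    "alink": "#FFFFFF", "leftmargin": "0", "topmargin": "0",
}
print(body_attrs_to_css(example))
# body { color: #FFFFFF; margin-left: 0px; margin-top: 0px; }
# a:link { color: #FFFFFF; }
# a:visited { color: #F2F2F2; }
# a:active { color: #FFFFFF; }
```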
Line 95, Column 27: required attribute "TYPE" not specified.
<script language=JavaScript>
All script elements in HTML 4 need a “type” attribute, which is usually “text/javascript” for the JavaScript language. The older “language” attribute used here is deprecated and does not take its place.
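This particular error is easy to hunt down across a whole site before running the full validator. The following Python sketch lists the position of every script element that lacks a type attribute; the class and function names are invented for the example, and it checks only this one rule.

```python
from html.parser import HTMLParser


class ScriptTypeChecker(HTMLParser):
    """Collects the positions of <script> tags with no type attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "script" and "type" not in dict(attrs):
            self.missing.append(self.getpos())  # (line, column offset)


def scripts_missing_type(html_text):
    """Return (line, column) pairs for script tags lacking a type."""
    checker = ScriptTypeChecker()
    checker.feed(html_text)
    checker.close()
    return checker.missing


page = (
    "<script language=JavaScript>var a = 1;</script>\n"
    '<script type="text/javascript">var b = 2;</script>\n'
)
for line, _column in scripts_missing_type(page):
    print(f'line {line}: required attribute "TYPE" not specified')
# line 1: required attribute "TYPE" not specified
```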
But I don’t have time for this!
OK, so maybe that was just a little too overwhelming. Let’s see: that one page has 20 errors, and if that can be considered typical, and you have 1000 pages that you need to update, that’s 20,000 changes: and that’s once you’ve understood what the new rules are. You definitely don’t have time for that.
Fortunately, there are alternatives. Some HTML editors these days will help by generating corrected HTML once you specify the DOCTYPE. If you’re building complex sites using Microsoft’s ASPX controls in Visual Studio, you can turn on inline Validation Checking against a specific DOCTYPE and it will display the errors relative to that DOCTYPE. These also appear in the error browser. This functionality can be remarkably helpful when developing a site from scratch.
Aggiorno (www.aggiorno.com), an add-in for Microsoft Visual Studio, is also capable of automatically fixing many of these common issues. Using an extensible language, so-called Aggiornings automatically improve your Web pages, either as whole files or in sections.
Let’s wrap up, then
This article has covered a fair amount of territory at a high level. In summary, the way forward for the Web is conforming to standards: the browsers (IE7, IE8, Firefox 3) and the bots (Yahoo!, Google) are all going there, and web designers like you should take advantage of this to build sites to a single standard design, rather than tackling each browser individually.
It’s worth noting that these standards are mature these days: HTML 4.01 (1999), XHTML 1.0 (2000), and CSS 2 (1998). There are now tools to check the validity of documents, and increasing support for these standards is embedded in the authoring tools.
Until now, trying to use these standards has felt pointless: the lack of browser support made any efforts in this direction a frustrating distraction. But the promise of the new browsers to embrace these standards more thoroughly and universally makes now the time for all of us to make the switch.
We all have a lot of pages to deal with, often legacy ASPX pages, and it can be frustrating and time-consuming to trace back from validation errors to the source of the problem and to understand how the issue should be addressed, even before actually addressing it. Emerging tools such as Aggiorno can be of tremendous help by automating the move from legacy Web sites to standards.
[1] SGML is covered in a large number of resources. If you want to become an expert, you should probably google for references. As a starter, try:
[2] For example (not to pick on anyone), try http://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic
[3] The most common DOCTYPEs in use today are the HTML 4.01 and XHTML 1.0 “Transitional” DOCTYPEs, which are given by:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
and
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
[4] ECMAScript is the European standard “version” of JavaScript. See http://www.ecma-international.org/publications/standards/Ecma-262.htm
[5] See, for example http://www.codeproject.com/KB/server-management/Google_Indexing_Problem.aspx
[6] There is a wide range of information out there about making your pages intelligible to search bots, but consider for example http://dev.opera.com/articles/view/semantic-html-and-search-engine-optimiza/, and going up through the El Dorado of “the semantic web” (for example http://en.wikipedia.org/wiki/Semantic_Web).
[7] There’s a large body of information out there on the web about issues with transitioning to standards-based design. For example:
Appendix: Non-compliance with typical websites
In the introduction, three websites were listed with “typical” non-compliance issues. The following table summarizes how they are non-compliant:
| | GE (http://www.ge.com) | Ford (http://www.ford.com) | Google (http://www.google.com) |
|---|---|---|---|
| Doctype Specified | XHTML 1.0 Strict | XHTML 1.0 Transitional | None |
| Total Validation Errors | 1 | 29 | 51 |
| Missing DOCTYPE | | | 2 |
| Missing Open Tags | | | 3 |
| Missing Close Tags | | | |
| Invalid Tag Nesting | 1 | | 2 |
| Unquoted &, < or > | | 17 | 25 |
| Missing ALT attribute | | 6 | |
| Missing Type attribute | | 4 | 2 |
| Non-existent or proprietary tags used | | | 2 |
| Non-existent, proprietary or deprecated attributes used | | 2 | 3 |
| Missing Content | | | |
| Missing Quotes on Attributes | | | 11 |