There are many discussions on the web regarding the merits of using web standards.
The argument against using web standards can be summarized as: "who cares?!" If the browsers render my code correctly then I am accomplishing my goal.
The arguments in favor of using web standards can be summarized as better cross-browser compatibility and lower maintenance costs thanks to cleaner code.
I believe there is an important point that is only tangentially discussed and that should be addressed much more emphatically. What does Google do when it encounters HTML that does not comply with standards? Does it affect your search results? We need to constantly remember that search engine bots are "users" of our sites and that they are not necessarily as tolerant of our markup as normal browsers. An SEO expert gave me the best trick for understanding what Google sees and what it doesn't see while going through a page. It goes like this:
- Open the page you want to evaluate in your favorite browser
- Click Select All
- Click Copy
- Open Notepad
- Click Paste
Whatever text ends up in Notepad is what Google is indexing.
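The copy-and-paste trick above can be approximated in a few lines of Python. This is only a rough sketch, not how Google actually works: it assumes the crawler keeps text nodes and ignores images, scripts and styles. The names VisibleTextExtractor and visible_text are made up for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a text-only reader would get, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """<html><head><style>p {color: red}</style></head>
<body><h1>Hello</h1><img src="logo.png"><script>var x = 1;</script>
<p>Only this text is visible.</p></body></html>"""
print(visible_text(sample))
```

Note that the image contributes nothing to the output, which is exactly the point of the next section.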
But let's go into some examples of how the wrong HTML can affect your search results.
Missing alternate descriptions
The first example shows how following standards can improve your interaction with Google. Google is "blind": it only sees the text that is embedded in your page; no images, no JavaScript, no animations. One of the rules required by web standards is to provide an alternate description for non-textual information (the alt attribute in XHTML). If you do not include an alternate description, you are missing an opportunity to provide information to Google. Using alt descriptions is a best practice, enforced by web standards, that affects search results!
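A quick way to spot images without an alternate description is to scan the markup for img tags. The sketch below uses Python's standard html.parser module; MissingAltFinder and find_missing_alt are names invented for this example, and treating an empty alt as a problem is my assumption.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Records the src of every <img> whose alt is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # An empty alt is valid for decorative images, but it gives a
        # crawler nothing to index, so this sketch reports it too.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src", "(no src)"))


def find_missing_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


page = '<img src="a.png" alt="Logo"><img src="b.png"><img src="c.png" alt="">'
print(find_missing_alt(page))
```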
Wrong / Missing DOCTYPE
The second example relates to the use of DOCTYPE. The DOCTYPE tells the browser what kind of markup to expect. Is it HTML? Is it XHTML? No DOCTYPE? (Google replies: well... let's guess). And the last thing you need is for Google (or any browser for that matter) to guess how to interpret your source code. Chris Maunder from The Code Project has an excellent example of how Google can get confused if you specify a certain DOCTYPE and then write code following a different standard. In certain cases Google simply stops indexing the page and treats it as a 404 Page Not Found error. The example that Chris shows reflects how a simple mis-closed tag (ultimately a missing "/") can prevent a page from being indexed. Syntax correctness, which is enforced by using web standards, is important when you want Google to index your page! UPDATE: In general this is an example of the Tag Soup problem. The right thing to do is to make sure your web site validates according to a standard like XHTML Transitional.
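To make the missing "/" problem concrete, here is a small sketch that reads the DOCTYPE of a page and checks whether the markup is well-formed XML, which is what an XHTML DOCTYPE promises. It is not a substitute for a real validator: it does not check the page against the DTD, and the function names find_doctype and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class DoctypeReader(HTMLParser):
    """Remembers the first <!DOCTYPE ...> declaration it sees."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html):
    reader = DoctypeReader()
    reader.feed(html)
    reader.close()
    return reader.doctype


def is_well_formed(xhtml):
    """True if the document parses as XML (every tag properly closed)."""
    try:
        ET.fromstring(xhtml)
        return True
    except ET.ParseError:
        return False


GOOD = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><p>ok<br /></p></body></html>'
)
# The same page with the "/" dropped from the br tag.
BAD = GOOD.replace("<br />", "<br>")

print(find_doctype(GOOD))
print(is_well_formed(GOOD), is_well_formed(BAD))
```

A browser will happily render both versions, but a strict parser rejects the second one, which is the kind of mismatch described above.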
Lack of / Incorrect use of Entities
Ahh... entities... Isn't it painful having to follow the rule set by web standards and escape every special character? Well, it might be painful, but Google reacts to non-escaped characters in very peculiar ways. Let's first look at the most obvious one. If you write in a language that requires accented characters, for example Spanish, French or Italian, the search results may vary depending on how you encode those characters with entities.
Second, there are issues with escaped vs. unescaped characters in URLs. This WebmasterWorld article is an example of how the wrong usage of entities can cause confusion.
Third, when you use scripting to generate markup, the way in which you write your script can also confuse Google, as Chris Maunder also explains in his article. If you generate code without escaping the right characters, you can get into trouble. Web standards enforce the proper use of entities, which is another reason to follow them and avoid search engine confusion.
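As an illustration of the escaping discussed in this section, the sketch below escapes the markup-significant characters and turns accented characters into numeric character references. The helper name to_entities is made up, and producing ASCII-only output is an assumption for this example; a page that correctly declares its encoding may not need numeric references at all.

```python
from html import escape


def to_entities(text):
    """Escape &, <, > and quotes, then replace non-ASCII characters
    with numeric character references such as &#241;."""
    escaped = escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Accented text, as in Spanish, French or Italian content.
print(to_entities("Año & más"))

# A URL placed inside an href attribute: the ampersand must be escaped.
print(to_entities("page.php?id=1&lang=es"))
```

The same function can be applied to any text that a script writes into the page, so that generated markup is escaped consistently.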
Missing required page elements
There are a number of elements in a page that are either required or recommended by web standards and that can increase or decrease your page rank. Many SEO experts suggest making sure a page contains at least the following elements:
h1: Every page should have one and only one h1. This tag should be used to express the main idea described in the page. In general, heading tags should not be used only for styling but to semantically mark up the content in the page. Google pays special attention to h1 content when indexing.
title: Every page should have one and only one title. The title should be related to the h1. Google looks at the relationship between h1 and title when indexing.
meta tags: Every page should have a number of meta tags (description, keywords, etc.). These are taken into account by Google while indexing, and they also provide semantic information about the page that, when properly used, can improve the user experience while surfing the web.
Again, web standards remind you of the proper usage of these elements and therefore can help you improve your search results.
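The checklist above is easy to automate. The following sketch counts h1 and title elements and looks for the description and keywords meta tags; the class PageOutline and the function check_required_elements are hypothetical names, and the exact rules simply mirror the suggestions listed in this section.

```python
from html.parser import HTMLParser


class PageOutline(HTMLParser):
    """Counts h1 and title elements and collects meta tag names."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.title_count = 0
        self.meta_names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_count += 1
        elif tag == "title":
            self.title_count += 1
        elif tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.meta_names.add(name.lower())


def check_required_elements(html):
    """Return a list of human-readable problems (empty if none found)."""
    outline = PageOutline()
    outline.feed(html)
    outline.close()
    problems = []
    if outline.h1_count != 1:
        problems.append(f"expected exactly one h1, found {outline.h1_count}")
    if outline.title_count != 1:
        problems.append(f"expected exactly one title, found {outline.title_count}")
    for name in ("description", "keywords"):
        if name not in outline.meta_names:
            problems.append(f"missing meta {name}")
    return problems


page = "<html><body><h1>A</h1><h1>B</h1></body></html>"
for problem in check_required_elements(page):
    print(problem)
```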
Separation between Content and Style
Web standards teach you about separation between content and style, which is an incredibly useful practice in itself for improving maintainability. It also clearly has some advantages with respect to Google's behavior. The first one is bandwidth savings. If your styling information is in a separate CSS file then, since Google does not care about style, it will not crawl it and you will not spend bandwidth on it. But in addition to bandwidth savings (which can be major for high-traffic sites), there is a limit to the size of a page that is indexed by search engines. So, if your page is not "polluted" by styling, it can have more content! Additionally, if your style contains syntax errors it can confuse Google, and keeping it out of the page is a way to avoid that. UPDATE: A very good practice is to avoid HTML tables as a mechanism to lay out the information on a page. This should be done using style markup (CSS).
Web standards practices help you direct your efforts with respect to this separation.
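One way to see how much styling is still mixed into a page is to count the usual suspects. This is only a sketch under my own assumptions about what counts as a "leak": style attributes, embedded style blocks, and the old presentational tags font and center. The names StyleLeakCounter and count_style_leaks are invented for the example.

```python
from html.parser import HTMLParser

# Presentational tags chosen for this example; the list is an assumption.
PRESENTATIONAL_TAGS = {"font", "center"}


class StyleLeakCounter(HTMLParser):
    """Counts places where styling is embedded in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = {
            "style_attributes": 0,
            "style_blocks": 0,
            "presentational_tags": 0,
        }

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.counts["style_blocks"] += 1
        if tag in PRESENTATIONAL_TAGS:
            self.counts["presentational_tags"] += 1
        if any(name == "style" for name, _ in attrs):
            self.counts["style_attributes"] += 1


def count_style_leaks(html):
    counter = StyleLeakCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


page = ('<style>p{}</style><p style="color:red"><font size="2">x</font></p>'
        '<center>y</center>')
print(count_style_leaks(page))
```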
Unmarked text: no semantics
Many times web developers simply copy and paste text into a web page. The resulting markup is basically just text separated by BRs. As of today I do not believe search engines penalize this behavior, but moving forward it will be more and more important to make sure every piece of text carries as much semantic information as possible. For now, the minimum semantic information that a piece of text should carry is basic HTML markup like P, UL, Hx, etc. This information can help search engines understand the priority and context of the content. Additionally, unmarked text is very hard to style and maintain, so marking it up is a good practice anyway. UPDATE: There are some newer standards like microformats that can add semantic information to a page without affecting the rendering of the information. Even if at this moment it is not clear how microformats will affect search results, presumably they will be important in the near future.
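As a minimal sketch of adding basic semantics to pasted text, the snippet below turns text separated by BRs into paragraphs. It assumes the fragment contains nothing but text and br tags, and it assumes every br marks a paragraph boundary, which will not be true for all content; paragraphs_from_breaks is a hypothetical helper name.

```python
import re

# Matches <br>, <br/>, <BR /> and the whitespace around them.
BR = re.compile(r"\s*<br\s*/?>\s*", re.IGNORECASE)


def paragraphs_from_breaks(fragment):
    """Wrap each run of text between br tags in a <p> element."""
    parts = [part.strip() for part in BR.split(fragment)]
    return "\n".join(f"<p>{part}</p>" for part in parts if part)


pasted = "First line<br>Second line<BR />  <br/>Third"
print(paragraphs_from_breaks(pasted))
```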
Conclusions
It is clear from the examples above that not following web standards can have a huge impact on your search results! The consequences range from not providing the best information for indexing a page to Google not indexing a page at all because of syntax errors in the markup, even if the page looks good in browsers! UPDATE: Aarron Walter just published a very good findability strategy checklist that has a complete section on markup and additional sections on server and client side code.
It is true that you can avoid most of the mistakes shown here without the need to completely follow web standards, but they are super useful as a guideline and as best practices to follow when programming web pages. Next time you look at your page you can have Aggiorno by your side helping you with all the time-consuming tasks necessary to make a page XHTML compliant.
With Aggiorno we are promoting web standards by eliminating a lot of the tedious work that is required to make a page validate. By doing so we are helping pages improve their standing with search engines. In particular:
Aggiorno can help you find missing alternate descriptions
Aggiorno can help you make your code structure XHTML compliant
Aggiorno can help you convert special characters into appropriate entities
Aggiorno can help you with content-style separation
Aggiorno can help you add semantic markup to text