What are XML and XHTML?

This is the first installment in a series of blog posts discussing HTML, XML and XHTML: what they are, and why more and more people are moving from HTML to XHTML when producing web content.

This first article looks at the history of these markup languages.

In the beginning…

HTML and the Web have grown up together. Originally, HTML was a simple markup language, defined using SGML, intended for the creation of documents that referenced each other. With hyperlinks, these references could be followed easily, with no need to go back to the catalogs to search for the referenced articles.

As the Web grew in reach and popularity, more and more functions (such as fonts, tables, forms) were added to HTML by the browser manufacturers working in competition, delivering exciting new features that only worked on one browser. Designing web sites became more challenging, more fun and more frustrating at the same time.

Then came the standards…

As time moved on, it became clear that this way of working was hindering, rather than helping, the development of the web. The W3C decided to increase its involvement, and its HTML 4 and CSS 1 “recommendations” became the first standards to lead, rather than follow, the technology.

Defining a clear and clean “standard” way of expressing documents, these two standards are the core of what HTML content on the Web should be today. Unfortunately, the browsers have generally failed to fully adopt the W3C’s recommendations.

Then came XML…

HTML is an SGML application, specific to the generation of web page content. While HTML can be used to describe “anything” to other people, its document structure keeps that content from being immediately comprehensible to machines. The markup only enables browsers to understand the relative significance of the different pieces of text; it does not allow the creation of more complex structures, for example to describe all the properties of different food products, such as nutrition information.

XML was created to allow the definition of computer-comprehensible descriptions of artifacts. Using HTML’s heritage from SGML, XML appears superficially similar. But while the rules of HTML define the exact vocabulary of concepts that can be conveyed, XML was designed to be extensible: that is, the set of concepts which can be directly conveyed – through computer-comprehensible markup – is unlimited. As XML evolved, interoperable schemas were introduced, which allowed a single XML document to incorporate multiple distinct content types, all machine-readable, all machine-verifiable.

And XHTML …

With XML came the notion of XML languages to describe different types of content. One of the first to be adopted was the definition of hypertext documents – XHTML. With the same semantic description as HTML, XHTML documents appear similar to their HTML relatives, but nevertheless carry some key advantages.

Aggiorno Beta1 Available!

The Aggiorno Team is very happy to announce that Beta1 of our product is available as an add-in for Visual Studio 2005/2008.

The first beta version of Aggiorno is ready, and we want you to try it out. Squeeze it, bend it, criticize, suggest, laugh, and cry. The more you smash it against your daily sites, the more thankful we will be. Whenever you feel a button should be red, tell us. Whenever intuition whispers in your ear that our UI is not intuitive, please, please, please tell us.

Write to support@aggiorno.com, or browse to http://aggiorno.com/forums and post your question.

To download please go to: http://www.aggiorno.com/download.aspx

Thank you so much for being willing to help us shake the world of Web standards!

Clean, crisp, accessible HTML – why not?

Web developers and designers continually face choices about the web sites that they build. Often it seems that obtaining a desired look forces the design away from standards, or that making a website accessible automatically makes it “boring”. In fact, these are false choices: especially with the advent of more standards-compliant browsers, it’s possible to construct fully standards-compliant Web sites with great accessibility.

We just released a white paper by Gareth Powell that addresses the main issues in making a page accessible. You can find it at the following link:

White Paper: Clean, crisp, accessible HTML – why not? Send us your comments in this thread!

Enjoy!

Web Site Validation: The Fundamentals

Web site validation is an issue that will become increasingly important with the emergence of newer, even more standards-compliant browsers: browsers which, for the first time, will treat ill-formed websites with contempt and possibly even reject them. Upcoming versions of both Firefox and Internet Explorer promise rigorous standards-compliant rendering.

Are you up to the challenge? Gareth Powell explores the fundamentals of web site validation, DOCTYPE validation and its consequences in the white paper "Web Site Validation: The Fundamentals". Check it out; it is worth it!

 

Aggiorno Facebook

We have recently started a Facebook page for Aggiorno. On this page you will find a community of users sharing their thoughts and experiences with the product.

In the meantime, you can find some information about the dev team and a short video on the product.

Register and become an Aggiorno fan on Facebook.

ASP/ASP.NET preserving parsing challenges

By Cesar Muñoz

As web developers, we have all used an HTML editor at some time. Some of these editors make extensive use of an HTML/XHTML parser. The more advanced the functionality provided, the more complex and flexible the underlying parser needs to be. If you add ASP.NET to the equation, the parsing complexity increases even more. This post discusses some of the main challenges that we encountered when creating an ASP parser that allows full control of the source code.

Parsing ASP/ASP.NET or HTML source code is necessary to perform tasks like the following:

  • Region coloring.
  • Document content analysis.
  • Executable code analysis.
  • Problem identification.
  • Code transformation.

Some of these tasks require the parser to work on fragments of source code and preserve the following information which is normally discarded by a parser whose only goal is to run the code:

  • Tag case.
  • Literal text and spaces.
  • Line breaks.
  • Comments.
  • ASP inline code blocks.
  • Attribute and element order.
  • Element location.
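To make the list above concrete, here is a minimal sketch in Python (not the parser discussed in this post) of a tokenizer that discards nothing: joining the token texts gives back the original source character for character. The token kinds and the regular expression are assumptions made for illustration; a real ASP/ASP.NET parser has to handle many more cases.

```python
import re

# Hypothetical token kinds; the names are illustrative only.
TOKEN_RE = re.compile(
    r"(?P<asp><%.*?%>)"               # ASP inline code block
    r"|(?P<comment><!--.*?-->)"       # HTML comment
    r"|(?P<tag></?[A-Za-z][^<>]*>)"   # opening or closing tag, original case kept
    r"|(?P<text>[^<]+|<)",            # literal text, spaces, line breaks; a stray '<' stays text
    re.DOTALL,
)

def tokenize(source):
    """Return (kind, text, offset) triples that cover the source without gaps."""
    return [(m.lastgroup, m.group(), m.start()) for m in TOKEN_RE.finditer(source)]

sample = '<TD Class="x">  Hello\n<%# Eval("Title") %><!-- note --></TD>'
tokens = tokenize(sample)

# Nothing was discarded: tag case, spaces, line breaks, comments and ASP blocks survive.
print("".join(text for _, text, _ in tokens) == sample)
```

Because every token keeps its offset, element location can be reported against the original file, and attribute order is preserved simply because the tag text is never rebuilt.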

As if this were not enough, another complication comes from the lack of strict standards in most HTML documents and, to some extent, in ASP/ASP.NET documents. A considerable percentage of these documents are syntactically incorrect: they have missing or wrong elements, missing or wrong attributes and attribute values, and structural problems such as overlapped or misplaced blocks of code.

All these considerations must be taken into account for a parser to be useful in an ASP/HTML development environment. Additional complications and incompatibilities between XML and HTML are mentioned by Jeff Heaton, http://www.developer.com/net/csharp/article.php/10918_2230091_1. It is also important to take into account that the correct parsing and interpretation of a document depends on the specified doctype (http://www.alistapart.com/articles/doctype/).

Now that we have listed some of the challenges, let’s explore a situation that will happen when parsing ASP/ASP.NET source code; it illustrates why a specifically developed parser is necessary.

ASP controls with mixed content

New ASP.NET control tags can define nested sections that contain normal HTML tags. This type of structure has the following pattern:

     <aspTag>

          <section1>

               HTMLContent1

          </section1>

          <section2>

               HTMLContent2

          </section2>

          …

          <sectionN>

               HTMLContentN

          </sectionN>

     </aspTag>

If HTMLContent1, HTMLContent2, etc. are independent, they can be parsed and processed without additional preprocessing. This is not the case when there are dependencies between the content sections, for example:

     <asp:Repeater runat="server">

          <HeaderTemplate>

               <table>

          </HeaderTemplate>

          <ItemTemplate>

               <tr><td>

               <%# DataBinder.Eval(Container.DataItem, "Title") %>

               <hr>

               <%# DataBinder.Eval(Container.DataItem, "Abstract") %>

               </td></tr>

          </ItemTemplate>

          <FooterTemplate>

               </table>

          </FooterTemplate>

     </asp:Repeater>

This type of dependency breaks XML well-formedness: the <table> opened in one section is closed in another, so no section is well-formed on its own, and a normal XML or HTML parser will fail.
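The failure is easy to reproduce. The sketch below uses Python's standard xml.etree.ElementTree parser, chosen only for illustration, on the header section from the example above taken on its own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# <table> is still open when </HeaderTemplate> arrives, so the parser rejects it.
print(is_well_formed("<HeaderTemplate><table></HeaderTemplate>"))

# The same section with the table closed locally is accepted.
print(is_well_formed("<HeaderTemplate><table></table></HeaderTemplate>"))
```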

Solution strategy

The solution involves one preparation step and additional considerations when working with the parser output.

Preparation step

The section opening and closing tags are flattened: each one is converted to a special self-closing tag (UnknownTag) that users of the parser component can identify. In this way the different HTML content sections are reunited for parsing and transformation purposes.

Example:

[SOURCE]

     <asp:Repeater runat="server">

          <HeaderTemplate>

               <table>

          </HeaderTemplate>

          <ItemTemplate>

               <tr><td>Hello world!</td></tr>

          </ItemTemplate>

          <FooterTemplate>

               </table>

          </FooterTemplate>

     </asp:Repeater>

[PREPARED]

     <asp:Repeater runat="server">

          <UnknownTag "@AISHeaderTemplate" />

               <table>

          <UnknownTag "@AISHeaderTemplateClose" />

          <UnknownTag "@AISItemTemplate" />

               <tr><td>Hello world!</td></tr>

          <UnknownTag "@AISItemTemplateClose" />

          <UnknownTag "@AISFooterTemplate" />

               </table>

          <UnknownTag "@AISFooterTemplateClose" />

     </asp:Repeater>

This flattened structure can be parsed and the result can be the input for other processes.
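As a rough illustration of the preparation step, the Python sketch below rewrites section tags into UnknownTag markers with a regular expression. It is not the code used in our parser: the list of section names is an assumption, and section tags that carry attributes are not handled.

```python
import re

# Assumed list of section names; a real tool needs the template
# sections of every ASP.NET control it supports.
SECTION_NAMES = ("HeaderTemplate", "ItemTemplate", "FooterTemplate")

_SECTION_TAG = re.compile(
    r"<(?P<close>/?)\s*(?P<name>" + "|".join(SECTION_NAMES) + r")\s*>"
)

def flatten_sections(source):
    """Replace section open/close tags with self-closing UnknownTag markers."""
    def to_marker(match):
        suffix = "Close" if match.group("close") else ""
        return '<UnknownTag "@AIS%s%s" />' % (match.group("name"), suffix)
    return _SECTION_TAG.sub(to_marker, source)

print(flatten_sections("<HeaderTemplate><table></HeaderTemplate>"))
```

Only the section tags change; the HTML between them is left untouched, so the <table> opened in the header and closed in the footer now sit at the same nesting level.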

 

Transformation considerations

The tags generated in the preparation step must be ignored by all transformations and must remain in the final transformation result.

Pretty-printing step

The tags generated in the preparation step will need to be restored to the original tag.
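The restore can be sketched as the inverse substitution. Again this is an illustrative Python sketch rather than our implementation; it assumes the marker spelling shown in the example above, and that no real section name ends in "Close".

```python
import re

# Matches markers such as <UnknownTag "@AISItemTemplate" /> and
# <UnknownTag "@AISItemTemplateClose" />.
_MARKER = re.compile(r'<UnknownTag "@AIS(?P<name>[A-Za-z]+?)(?P<close>Close)?" />')

def restore_sections(prepared):
    """Turn UnknownTag markers back into the original section tags."""
    def to_tag(match):
        slash = "/" if match.group("close") else ""
        return "<%s%s>" % (slash, match.group("name"))
    return _MARKER.sub(to_tag, prepared)

prepared = '<UnknownTag "@AISHeaderTemplate" /><table><UnknownTag "@AISHeaderTemplateClose" />'
print(restore_sections(prepared))
```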

We have seen the types of problems that need to be considered when writing and using a parser for real-life ASP and HTML documents, which are probably incomplete and not XML-compliant. In future discussions we will consider more specific problems, such as error recovery.

Happy parsing!

Benchmarking DOCTYPE validation in Fortune 500's Web sites

By Federico Zoufaly

Maybe the first thing you ask yourself when you begin to learn about the world of Web Standards is just what their current state of use is.  How many companies are currently following Web standards?  Given the size of the Web, we decided to try and address this question by analyzing the home pages of the complete Fortune 500 list.  These companies are used as a benchmark for many aspects of the business world, so why not use them as a benchmark regarding the adoption of Web standards?

The results are very interesting and, I would say, even somewhat surprising.

Here is the first set of results from the 498 pages that were analyzed (the remaining two were not available at the time of the analysis).

1) DOCTYPE Declaration

For a browser to attempt to interpret a Web page using a certain standard, the DOCTYPE declaration of the page must be analyzed. If a page does not specify a DOCTYPE, or if the DOCTYPE it specifies is incorrect, the browser will render it in quirks mode. Of the Fortune 500 companies’ home pages, 169 (34%) do not declare a DOCTYPE, and 60 (12%) declare a DOCTYPE that is incorrect or that points to an outdated type. The remaining 268 home pages (54%) specify a correct DOCTYPE. That's almost half of the entrance doors to the world's most powerful companies simply being left to be rendered in a hope-you-can-guess-my-original-design-intentions way!
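A check of this kind can be sketched in a few lines of Python. This is not the tool used for the survey: the set of identifiers treated as recognized is an assumption made for illustration, and the sketch looks only at the public identifier, not at the rest of the declaration.

```python
import re

# Public identifiers this sketch accepts (an assumed list, for illustration).
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01 Frameset//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML 1.1//EN",
}

_DOCTYPE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)
_PUBLIC_ID = re.compile(r'PUBLIC\s+"([^"]*)"', re.IGNORECASE)

def classify_doctype(html):
    """Return 'missing', 'unrecognized' or 'recognized' for a page's DOCTYPE."""
    doctype = _DOCTYPE.search(html)
    if doctype is None:
        return "missing"
    public = _PUBLIC_ID.search(doctype.group(1))
    if public and public.group(1) in KNOWN_PUBLIC_IDS:
        return "recognized"
    return "unrecognized"

print(classify_doctype("<html><body></body></html>"))
```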

2) Rendering Mode

Because some DOCTYPEs are designed as a transition towards stricter standards, common browsers really support three types of rendering: quirks, almost standard and full standard modes. Of the Fortune 500 home pages, 229 (46%) render in quirks mode, 245 (49%) in almost standard mode and only 23 (5%) in full standard mode.

[Figure: Doctype use: Fortune 500 home pages]
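How a DOCTYPE maps to a rendering mode can be sketched as a small lookup. The Python below is a deliberately simplified illustration covering a handful of common W3C DOCTYPEs; it is an assumption-laden approximation, not the algorithm of any particular browser, and it returns None for anything outside its table.

```python
import re

_DOCTYPE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)
_QUOTED = re.compile(r'"([^"]*)"')

# Simplified table (assumption): strict DOCTYPEs give full standard mode,
# XHTML transitional gives almost standard mode, and HTML 4.01 transitional
# gives almost standard mode only when a system identifier (the DTD URL) is present.
FULL = {"-//W3C//DTD HTML 4.01//EN", "-//W3C//DTD XHTML 1.0 Strict//EN",
        "-//W3C//DTD XHTML 1.1//EN"}
ALMOST = {"-//W3C//DTD XHTML 1.0 Transitional//EN",
          "-//W3C//DTD XHTML 1.0 Frameset//EN"}
ALMOST_IF_SYSTEM_ID = {"-//W3C//DTD HTML 4.01 Transitional//EN",
                       "-//W3C//DTD HTML 4.01 Frameset//EN"}

def rendering_mode(html):
    """Return 'quirks', 'almost standard', 'full standard', or None if not covered."""
    doctype = _DOCTYPE.search(html)
    if doctype is None:
        return "quirks"                      # no DOCTYPE at all
    quoted = _QUOTED.findall(doctype.group(1))
    public_id = quoted[0] if quoted else ""
    has_system_id = len(quoted) > 1
    if public_id in FULL:
        return "full standard"
    if public_id in ALMOST:
        return "almost standard"
    if public_id in ALMOST_IF_SYSTEM_ID:
        return "almost standard" if has_system_id else "quirks"
    return None                              # outside this simplified table

print(rendering_mode('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html></html>'))
```

The last line shows the case that catches many sites out: a transitional HTML 4.01 DOCTYPE without the DTD URL falls back to quirks mode in this table.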

3) XHTML vs HTML

There are two equally valid and common sets of standards used for markup in Web pages: XHTML and HTML. Both provide transitional and strict DOCTYPEs. Of the Fortune 500 set, only 328 (66%) specify a DTD: 169 (34%) HTML and 159 (32%) XHTML. Furthermore, 59 of the 169 HTML pages render in quirks mode, while only 1 XHTML page does. Additionally, 10 HTML pages render in full standard mode and 100 in almost standard mode, while 13 XHTML pages render in full standard mode and 145 in almost standard mode. In other words, 110 HTML pages and 158 XHTML pages render in either full or almost standard mode.
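The figures in sections 1 to 3 can be cross-checked with a short tally. The Python sketch below uses only the counts quoted in this post, and assumes (as section 1 states) that the 169 pages without a DOCTYPE all render in quirks mode.

```python
# Pages per markup family and rendering mode, as quoted above.
counts = {
    "no DOCTYPE": {"quirks": 169, "almost standard": 0,   "full standard": 0},
    "HTML":       {"quirks": 59,  "almost standard": 100, "full standard": 10},
    "XHTML":      {"quirks": 1,   "almost standard": 145, "full standard": 13},
}

def totals_by_mode(table):
    """Sum the pages in each rendering mode across all families."""
    modes = {}
    for row in table.values():
        for mode, pages in row.items():
            modes[mode] = modes.get(mode, 0) + pages
    return modes

def totals_by_family(table):
    """Sum the pages in each markup family across all modes."""
    return {family: sum(row.values()) for family, row in table.items()}

print(totals_by_mode(counts))    # {'quirks': 229, 'almost standard': 245, 'full standard': 23}
print(totals_by_family(counts))  # {'no DOCTYPE': 169, 'HTML': 169, 'XHTML': 159}
```

The per-mode totals match the rendering-mode figures in section 2, and the 60 DTD-declaring pages in quirks mode (59 HTML plus 1 XHTML) match the 60 incorrect or outdated declarations in section 1.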

From the above results we can draw some interesting conclusions:

As the above statistics show, there is definitely an interest among Fortune 500 companies in complying with Web standards. However, almost 50% of them still do not provide a correct DOCTYPE declaration. Interestingly enough, 12% of them probably think they are working in a standard mode (since they try to declare a DOCTYPE), but fail to declare a correct type.

It is important to observe that the fact that a Web page contains a correct DOCTYPE declaration does not mean that the page is syntactically correct or that it validates against a Web standard (this is a topic for a future post); it only shows that there has been some effort from the developer side to move towards Web standards.

On the topic of XHTML vs HTML as a Web standard, there is no clear separation among Fortune 500 companies. However, the data above suggests that XHTML is used more consistently towards standards adoption than HTML: 158 of 159 XHTML pages render in full or almost standard mode, against 110 of 169 HTML pages.

The discussion of DOCTYPE sniffing on the Mozilla Developer Center gives detailed information on how Mozilla attempts to figure out a page's DOCTYPE. It provides an insight into how complex the job of a Web browser is, and how this complexity derives directly from organizations' lack of support for Web standards.

The fact that rendering web pages is such a complex task because standards are not used is probably the most important conclusion: to make a better Web, we need more standards adoption by the industry!

Comments?  What DOCTYPE declaration are you using?  Are you sure it is an accepted full or almost standard one?

Please let us know if you found some inconsistency in your site.  How do you compare to the Fortune 500?

TAG SOUP: The 10 most common scenarios for shuffled tags

By Daniel Alvarez.

In a previous post about the basics of shuffled tags, we talked about different ways of looking at the same problem: placing “end tags in the wrong order”, invalid or malformed markup, or HTML “tag soup”.

Is this a frequent issue on the web? Take a look at some statistics: in a sample of 1,132 random pages, we found that 22.6% had at least one problem of shuffled tags.

Why are they so common? We humans are not good at closing tags. Also, if browsers don’t care about it, why should we? Our stand is that we should care.

Moving on, a closer look at the statistics reveals that the most frequent problems are:

· table structures: people are used to implementing extremely big html structures with tables;

· div and form structures: used for grouping and to organize page layout; and

· style family: composed of tags like p, b, i, and font.

Let me show an illustration of each one of the top 10 scenarios for shuffled tags in descending order of frequency. For each scenario, I will also show an equivalent non-shuffled version.

Note: **** appears where there is (potentially complex) markup that is immaterial to the shuffling.
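Before going through the scenarios, here is a minimal Python sketch of how shuffled tags can be detected with a stack of open elements. It is an illustration built on the standard html.parser module, not the analysis tool behind the statistics above. It applies the strict XHTML rule that every element must be closed explicitly, and its list of empty elements is an assumption.

```python
from html.parser import HTMLParser

# Elements that have no end tag; they never go on the stack (assumed list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col", "param"}

class ShuffleDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # (closed tag, tags that were still open inside it)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return            # stray end tag, usually the late half of an earlier shuffle
        skipped = []
        while self.open_tags[-1] != tag:
            skipped.append(self.open_tags.pop())   # left open when 'tag' was closed
        self.open_tags.pop()
        if skipped:
            self.problems.append((tag, skipped))

def find_shuffled(html):
    """Return a list of (closed tag, [tags still open]) for each shuffle found."""
    detector = ShuffleDetector()
    detector.feed(html)
    detector.close()
    return detector.problems

# The pattern of scenario #1: <tr> is closed while <td> is still open.
print(find_shuffled("<table><tr><td>Text1</tr></table></td>"))  # [('tr', ['td'])]
```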

#1 table, tr, td

Description: the principal problem is the misplaced closing of tags that are already open. In the most common pattern, <tr> and <table> were closed before closing <td>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 1 Example</title>
</head>
<body>
    ****
    <table border="true">
        <tr>
            <td>
            Text1
        </tr>
        ****
    </table>
            </td>
    ****
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 1 Solution</title>
</head>
<body>
    ****
    <table border="true">
        <tr>
            <td>
              Text1
            </td>
        </tr>
        ****
    </table>
    ****
</body>
</html>

#2 div, table, tr, td

Description: in this case the example shows that <tr> was closed before closing <td>, and <div> was closed before closing <table>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 2 Example</title>
</head>
<body>
    ****
    <div>
        ****
        <table border="true">
            ****
            <tr>
                <td>
                ****
            </tr>
            ****
                </td>
            ****
    </div>
        </table>
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 2 Solution</title>
</head>
<body>
    ****
    <div>
        ****
        <table border="true">
            ****
            <tr>
                <td>
                ****
                </td>
            </tr>
            ****
            ****
        </table>
    </div>
</body>
</html>

#3 font, p

Description: in this case the example shows that <font> was closed before closing <p>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 3 Example</title>
</head>
<body>
    <div>
        ****
        <font>
            ****
            <p>
                ****
        </font>
            </p>
        <font>
            ****
            <p>
                ****
        </font>
            </p>
    </div>
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 3 Solution</title>
</head>
<body>
    <div>
        ****
        <font>
            ****
            <p>
                ****
            </p>
        </font>
        <font>
            ****
            <p>
                ****
            </p>
        </font>
    </div>
</body>
</html>

Note that even after the shuffling problem is fixed, an illegal containment problem remains: p inside font. Illegal containment problems will be discussed soon in another post.

#4 td, table, tr, form

Description: in this case the example shows that <tr> and <table> were closed before closing <form>, and that <form> was then closed inside another <table>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 4 Example</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <table>
                    <tr>
                        <form>
                            ****
                    </tr>
                </table>
                <table>
                    ****
                        </form>
                </table>
            </td>
        </tr>
    </table>
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 4 Solution</title>
</head>
<body>
    <table>
        <tr>
            <td>
                ****
                <table>
                    <tr>
                        <form>
                        </form>
                    </tr>
                </table>
                ****
                <table>
                </table>
            </td>
        </tr>
    </table>
</body>
</html>

Note that even after the shuffling problem is fixed, an illegal containment problem remains: form inside tr. Illegal containment problems will be discussed soon in another post.

#5 ul, li

Description: in this case the example shows that <ul> was closed before closing <li>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 5 Example</title>
</head>
<body>
    <div>
        <ul>
            <li>
                ****
                <ul>
                    <li>
                        ****
                </ul>
                    </li>
                ****
            </li>
        </ul>
    </div>
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 5 Solution</title>
</head>
<body>
    <div>
        <ul>
            <li>
                ****
                <ul>
                    <li>
                        ****
                    </li>
                </ul>
                ****
            </li>
        </ul>
    </div>
</body>
</html>

#6 table, tr, td, form

Description: in this case the example shows that <tr> was closed before closing <table>, <form> and <td>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 6 Example</title>
</head>
<body>
    <table>
        ****
        <tr>
            <td>
                <form>
                    <table>
                        ****
        </tr>
        ****
                    </table>
                </form>
            </td>
        ****
    </table>
    ****
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 6 Solution</title>
</head>
<body>
    <table>
        ****
        <tr>
            <td>
                <form>
                    <table>
                        ****
                    </table>
                </form>
            </td>
            <td>
                <form>
                    ****
                </form>
            </td>
            ****
        </tr>
    </table>
    ****
</body>
</html>

#7 table, tr, td, div

Description: in the first example <td> was closed before closing <div>, and in the second example <td>, <tr> and <table> were closed before closing <div>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 7 Example</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <div>
                ****
            </td>
            ****
                </div>
        </tr>
    </table>
</body>
</html>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 7 Example 2</title>
</head>
<body>
    <form>
        ****
        <table>
            <tr>
                ****
                <td>
                    <div>
                    ****
                </td>
            </tr>
        </table>
                    </div>
                    ****
    </form>
    ****
</body>
</html>

Solution:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 7 Solution 1</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <div>
                ****
                </div>
            </td>
            ****
        </tr>
    </table>
</body>
</html>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 7 Solution 2</title>
</head>
<body>
    <form>
        ****
        <table>
            <tr>
                ****
                <td>
                    <div>
                    ****
                    </div>
                </td>
            </tr>
        </table>
        ****
    </form>
    ****
</body>
</html>

 

#8 form,table,tr,td

Description: in this case the example shows that <form> was closed before closing <td>, <tr> and <table>.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Top 8 Example</title>
</head>
<body>
    ****
    <form>
        ****
        <table>
            <tr>
                ****
                <td>
    </form>
    ****
                </td>
    ****
            </tr>
        </table>
    ****
</body>