By Cesar Muñoz
As web developers, we have all used an HTML editor at some time. Some of these editors make extensive use of an HTML/XHTML parser. The more advanced the functionality provided, the more complex and flexible the underlying parser needs to be. If you add ASP.NET to the equation, the parsing complexity increases even more. This post discusses some of the main challenges that we encountered when creating an ASP parser that allows for full control of the source code.
Parsing ASP/ASP.NET or HTML source code is necessary to perform tasks like the following:
- Region coloring.
- Document content analysis.
- Executable code analysis.
- Problem identification.
- Code transformation.
Some of these tasks require the parser to work on fragments of source code and to preserve the following information, which a parser whose only goal is to run the code normally discards:
- Tag case.
- Literal text and spaces.
- Line breaks.
- Comments.
- ASP inline code blocks.
- Attribute and element order.
- Element location.
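To make the idea of preserving this information concrete, here is a minimal sketch of a lossless tokenizer. It is an illustration under simplifying assumptions, not the parser discussed in this post: the `Token` class, the `tokenize` function and the regular expression are hypothetical names, and the patterns ignore many real-world cases (for example, quoted attribute values containing `>`). The point is only that each token keeps its exact source text and its location, so joining the tokens reproduces the document character for character.

```python
import re
from dataclasses import dataclass

# Order matters: comments and ASP inline blocks are tried before generic tags.
# The last alternative accepts a run of text or a stray "<", so every
# character of the input ends up in exactly one token.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<asp><%.*?%>)"
    r"|(?P<tag></?[A-Za-z][^<>]*>)"
    r"|(?P<text>[^<]+|<)",
    re.DOTALL,
)


@dataclass
class Token:
    kind: str  # "comment", "asp", "tag" or "text"
    raw: str   # exact source text: case, spaces and line breaks untouched
    line: int  # 1-based line of the first character
    col: int   # 1-based column of the first character


def tokenize(source):
    """Split source into tokens without discarding any character."""
    tokens = []
    line, col = 1, 1
    for match in TOKEN_RE.finditer(source):
        raw = match.group()
        tokens.append(Token(match.lastgroup, raw, line, col))
        # Advance the position past this token.
        newlines = raw.count("\n")
        if newlines:
            line += newlines
            col = len(raw) - raw.rfind("\n")
        else:
            col += len(raw)
    return tokens


sample = "<TABLE>\n  <!-- note -->\n<%= x %></TABLE>"
tokens = tokenize(sample)
# Round trip: nothing was lost, including tag case, spaces and comments.
assert "".join(t.raw for t in tokens) == sample
```

Because the raw text is kept on every token, attribute order and tag case never need to be reconstructed; they are simply never thrown away.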
As if this were not enough, another complication comes from the lack of strict standards in most HTML documents and, to some extent, in ASP/ASP.NET documents. A considerable percentage of these documents are syntactically incorrect: they have missing or wrong elements, missing or wrong attributes and attribute values, and structural problems such as overlapping or misplaced blocks of code.
All these considerations must be taken into account for a parser to be useful in an ASP/HTML development environment. Additional complications and incompatibilities between XML and HTML are mentioned by Jeff Heaton, http://www.developer.com/net/csharp/article.php/10918_2230091_1. It is also important to take into account that the correct parsing and interpretation of a document depends on the specified doctype (http://www.alistapart.com/articles/doctype/).
Now that we have listed some of the challenges, let’s explore a situation that arises when parsing ASP/ASP.NET source code; it illustrates why a purpose-built parser is necessary.
ASP controls with mixed content
New ASP.NET control tags can define nested sections that contain normal HTML tags. This type of structure has the following pattern:
<aspTag>
<section1>
HTMLContent1
</section1>
<section2>
HTMLContent2
</section2>
…
<sectionN>
HTMLContentN
</sectionN>
</aspTag>
If HTMLContent1, HTMLContent2, etc. are independent, they can be parsed and processed without additional preprocessing. This is not the case when there are dependencies between the content sections, for example:
<asp:Repeater runat="server">
<HeaderTemplate>
<table>
</HeaderTemplate>
<ItemTemplate>
<tr><td>
<%# DataBinder.Eval(Container.DataItem, "Title") %>
<hr>
<%# DataBinder.Eval(Container.DataItem, "Abstract") %>
</td></tr>
</ItemTemplate>
<FooterTemplate>
</table>
</FooterTemplate>
</asp:Repeater>
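This type of dependency breaks XML well-formedness: the <table> element opened inside HeaderTemplate is closed inside FooterTemplate, so the elements overlap instead of nesting. A normal XML or HTML parser will fail on such input.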
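The failure is easy to reproduce with any strict XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified version of the Repeater example; the `asp:` prefix and the data-binding blocks are deliberately left out (an assumption made for the demonstration) so that the overlapping <table> element is the only problem in the input. The helper name `try_parse` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified Repeater: <table> is opened in one template and closed in another.
fragment = (
    "<Repeater>"
    "<HeaderTemplate><table></HeaderTemplate>"
    "<ItemTemplate><tr><td>Title</td></tr></ItemTemplate>"
    "<FooterTemplate></table></FooterTemplate>"
    "</Repeater>"
)


def try_parse(text):
    """Return None if text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)


error = try_parse(fragment)
print("parse error:", error)
```

The parser stops at `</HeaderTemplate>`, because at that point the innermost open element is still `table`.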
Solution strategy
The solution involves one preparation step and additional considerations when working with the parser output.
Preparation step
The section opening and closing tags will be flattened and converted to a special tag (UnknownTag) that can be identified by the parser component users. In this way the different HTML content sections are reunited for parsing and transformation purposes.
Example:
[SOURCE]
<asp:Repeater runat="server">
<HeaderTemplate>
<table>
</HeaderTemplate>
<ItemTemplate>
<tr><td>Hello world!</td></tr>
</ItemTemplate>
<FooterTemplate>
</table>
</FooterTemplate>
</asp:Repeater>
[PREPARED]
<asp:Repeater runat="server">
<UnknownTag "@AISHeaderTemplate" />
<table>
<UnknownTag "@AISHeaderTemplateClose" />
<UnknownTag "@AISItemTemplate" />
<tr><td>Hello world!</td></tr>
<UnknownTag "@AISItemTemplateClose" />
<UnknownTag "@AISFooterTemplate" />
</table>
<UnknownTag "@AISFooterTemplateClose" />
</asp:Repeater>
This flattened structure can be parsed and the result can be the input for other processes.
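The preparation step could be sketched roughly as follows. This is a hedged illustration rather than the actual implementation: the function name `flatten_sections`, the fixed `SECTION_NAMES` list and the regular expression are assumptions, the marker name is written between plain double quotes, and section tags that carry attributes are not handled.

```python
import re

# Hypothetical list of template sections to flatten; a real tool would need
# the full set of sections for every supported control.
SECTION_NAMES = ("HeaderTemplate", "ItemTemplate", "FooterTemplate")

# Matches <Name> and </Name>, tolerating stray spaces around the name.
SECTION_TAG_RE = re.compile(
    r"<(?P<slash>/?)\s*(?P<name>" + "|".join(SECTION_NAMES) + r")\s*>"
)


def flatten_sections(source):
    """Replace section open/close tags with self-closing placeholder tags."""
    def to_placeholder(match):
        suffix = "Close" if match.group("slash") else ""
        return '<UnknownTag "@AIS{}{}" />'.format(match.group("name"), suffix)

    return SECTION_TAG_RE.sub(to_placeholder, source)


source = (
    '<asp:Repeater runat="server">\n'
    "<HeaderTemplate>\n<table>\n</HeaderTemplate>\n"
    "<FooterTemplate>\n</table>\n</FooterTemplate>\n"
    "</asp:Repeater>"
)
print(flatten_sections(source))
```

After this step the `<table>` and `</table>` tags are siblings of the placeholders inside the Repeater, so they nest correctly again.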
Transformation considerations
The tags generated in the preparation step must be ignored by all transformations and must remain in the final transformation result.
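As an illustration of this rule, the sketch below shows a toy transformation that lowercases HTML tag names while leaving the placeholder tags exactly as they are. The function name and the regular expression are hypothetical, the placeholder is assumed to be named UnknownTag as in the example above, and tags with a prefix such as `asp:` are also left alone (a further assumption). A real transformation would work on the parser output rather than on raw text.

```python
import re

# Captures the opening "<" or "</" and the tag name that follows it.
TAG_NAME_RE = re.compile(r"(?P<open></?)(?P<name>[A-Za-z][\w:]*)")


def lowercase_html_tag_names(prepared):
    """Lowercase plain HTML tag names, skipping placeholders and asp: tags."""
    def rewrite(match):
        name = match.group("name")
        if name == "UnknownTag" or ":" in name:
            return match.group(0)  # generated or server tag: keep verbatim
        return match.group("open") + name.lower()

    return TAG_NAME_RE.sub(rewrite, prepared)


prepared = (
    '<UnknownTag "@AISHeaderTemplate" />\n'
    "<TABLE>\n"
    '<UnknownTag "@AISHeaderTemplateClose" />'
)
print(lowercase_html_tag_names(prepared))
```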
Pretty-printing step
The tags generated in the preparation step need to be restored to the original section tags.
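A minimal sketch of the restore step, assuming the placeholders have the form shown in the prepared example with the marker name between plain double quotes. The function name `restore_sections` and the regular expression are hypothetical, and a section whose own name ends in "Close" would be ambiguous under this naming scheme.

```python
import re

# The lazy name group lets the optional "Close" suffix be recognized.
PLACEHOLDER_RE = re.compile(
    r'<UnknownTag "@AIS(?P<name>\w+?)(?P<close>Close)?" />'
)


def restore_sections(prepared):
    """Turn placeholder tags back into the original section tags."""
    def to_section_tag(match):
        slash = "/" if match.group("close") else ""
        return "<{}{}>".format(slash, match.group("name"))

    return PLACEHOLDER_RE.sub(to_section_tag, prepared)


prepared = (
    '<UnknownTag "@AISFooterTemplate" />\n'
    "</table>\n"
    '<UnknownTag "@AISFooterTemplateClose" />'
)
print(restore_sections(prepared))
```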
We have seen the types of problems that need to be considered when writing and using a parser for real-life ASP and HTML documents, which are often incomplete and not XML-compliant. In future discussions we will consider more specific problems such as error recovery.
Happy parsing!