Processing XHTML 1.1 Documents with MSXML

Can anybody rewrite this document in valid English? Contact me.

For the Japanese version, see ./xhtml-msxml.ja.

Abstract

The current version of MSXML cannot process ordinary, strictly conforming XHTML 1.1 documents without errors. This document describes what must be done to process XHTML 1.1 documents with MSXML.

Processing of Ignore Sections (MSXML 3.0)

Section 3.4 of XML 1.0 (Third Edition) says:

The contents of an ignored conditional section MUST be parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references MUST NOT be recognized in this process.

Relying on this rule, the XHTML 1.1 DTD contains two undefined parameter entity references (PERefs) inside the IGNORE sections around the Modular Framework Module:

<!ENTITY % xhtml-prefw-redecl.module "IGNORE" >
<![%xhtml-prefw-redecl.module;[
%xhtml-prefw-redecl.mod;
<!-- end of xhtml-prefw-redecl.module -->]]>

(snip)

<!ENTITY % xhtml-framework.module "INCLUDE" >
<![%xhtml-framework.module;[
<!ENTITY % xhtml-framework.mod
     PUBLIC "-//W3C//ENTITIES XHTML Modular Framework 1.0//EN"
            "http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod" >
%xhtml-framework.mod;]]>

(snip)

<!ENTITY % xhtml-postfw-redecl.module "IGNORE" >
<![%xhtml-postfw-redecl.module;[
%xhtml-postfw-redecl.mod;
<!-- end of xhtml-postfw-redecl.module -->]]>

%xhtml-prefw-redecl.mod; and %xhtml-postfw-redecl.mod; are placeholders intended to simplify the addition of modules. These two PERefs are recognized only if %xhtml-prefw-redecl.module; or %xhtml-postfw-redecl.module;, respectively, is set to INCLUDE.
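The rule quoted from Section 3.4 can be sketched in a few lines. The following is only an illustration of how a conforming processor may skip an ignored section, not how any particular parser is written; the function name skip_ignore_section and the sample string are hypothetical, and the sample assumes that %xhtml-prefw-redecl.module; has already been replaced by the keyword IGNORE.

```python
def skip_ignore_section(dtd, start):
    """Return the index just past the ']]>' that closes an IGNORE section.

    `start` is the index of the first character after the '[' that follows
    the IGNORE keyword. Only '<![' and ']]>' are significant; everything
    else, including '%name;' references, is skipped without being resolved.
    """
    depth = 1
    i = start
    while i < len(dtd):
        if dtd.startswith("<![", i):      # nested conditional section start
            depth += 1
            i += 3
        elif dtd.startswith("]]>", i):    # conditional section end
            depth -= 1
            i += 3
            if depth == 0:
                return i
        else:
            i += 1                        # any other character is ignored
    raise ValueError("unterminated conditional section")


# Hypothetical sample modeled on the XHTML 1.1 DTD excerpt above.
sample = (
    "<![IGNORE[\n"
    "%xhtml-prefw-redecl.mod;\n"
    "<!-- end of xhtml-prefw-redecl.module -->]]>"
    "<!ELEMENT x EMPTY>"
)
body_start = sample.index("[", 3) + 1     # first character after 'IGNORE['
end = skip_ignore_section(sample, body_start)
remainder = sample[end:]
print(remainder)
```

The undefined reference %xhtml-prefw-redecl.mod; is never looked up in this model, which is why a conforming processor reports no error for it.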

However, MSXML 3.0 does not implement this rule correctly. It always recognizes PERefs, even when they are inside IGNORE sections. So if you process an XHTML 1.1 document with it, it tries to resolve the undefined PERef %xhtml-prefw-redecl.mod; and returns an error:

Parameter entity must be defined before it is used. Error processing resource 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'. Line 85, Position 2

%xhtml-prefw-redecl.mod;
-^

To avoid this error, add the following declarations to the DTD.

<!ENTITY % xhtml-prefw-redecl.mod "" >
<!ENTITY % xhtml-postfw-redecl.mod "" >

# FYI: Section 3.4 of XML 1.0 (First Edition) says nothing about PERefs in IGNORE sections. Following the practice of SGML (a superset of XML), however, they should be ignored entirely. The rule quoted above was therefore added in the Second Edition.

Resolving Relative Path (MSXML 2.x)

The XHTML 1.1 DTD contains the following declaration:

<!-- declare Document Model module instantiated in framework
-->
<!ENTITY % xhtml-model.mod
     PUBLIC "-//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN"
            "xhtml11-model-1.mod" >

What is the base of this relative system identifier? It is the location of the entity in which this parameter entity declaration occurs, i.e. http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd, because Section 4.2.2 of XML 1.0 says:

Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.

However, MSXML 3.0 resolves this system identifier relative to the location of the resource in which the parameter entity reference occurs, i.e. http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod. MSXML therefore returns the following error report:

The specified object is not found. Error processing resource 'http://www.w3.org/TR/xhtml-modularization/DTD/xhtml11-model-1.mod'. Line 89, Position 18

%xhtml-model.mod;]]>
-----------------^
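The two candidate bases can be compared with ordinary relative-URI resolution. This is a small sketch using Python's standard urllib.parse.urljoin; the variable names are chosen for illustration only, and the URLs are the ones given above.

```python
from urllib.parse import urljoin

SYSTEM_ID = "xhtml11-model-1.mod"

# Entity in which the declaration of xhtml-model.mod occurs (the correct base).
DECLARING_ENTITY = "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
# Entity in which the reference %xhtml-model.mod; occurs (the base MSXML uses).
REFERENCING_ENTITY = (
    "http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod"
)

expected = urljoin(DECLARING_ENTITY, SYSTEM_ID)
observed = urljoin(REFERENCING_ENTITY, SYSTEM_ID)

print(expected)  # http://www.w3.org/TR/xhtml11/DTD/xhtml11-model-1.mod
print(observed)  # http://www.w3.org/TR/xhtml-modularization/DTD/xhtml11-model-1.mod
```

The second URL is the one that appears in the MSXML error report, which is consistent with the reference-based resolution described above.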

To avoid this error, redeclare the system identifier of xhtml-model.mod:

<!ENTITY % xhtml-model.mod
     PUBLIC "-//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11-model-1.mod" >

However, with this declaration MSXML may report a "busy" condition because too many external resources have to be loaded. In that case, refer to the flattened version of the DTD (xhtml11-flat.dtd) instead of the original version (xhtml11.dtd):

<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11-flat.dtd" >

# FYI: If you use the flattened DTD, you cannot use any of the placeholders, e.g. %xhtml-qname.redecl;, because they have already been expanded in xhtml11-flat.dtd.

Processing Parameter Entity References (MSXML 2.x)

The XHTML 1.1 DTD contains the following declarations:

<!ENTITY % XHTML.pfx  "" >
<!ENTITY % caption.qname   "%XHTML.pfx;caption" >
<!ENTITY % table.content
     "( %caption.qname;?, ( %col.qname;* | %colgroup.qname;* ),
      (( %thead.qname;?, %tfoot.qname;?, %tbody.qname;+ ) 
      | ( %tr.qname;+ )))"
>
<!ELEMENT %table.qname;  %table.content; >

When processing an entity declaration, an XML processor must expand any PERefs in the entity value immediately. In this case, %caption.qname; is replaced when the processor reads the declaration of table.content, not when it later encounters the reference %table.content;. That is, the replacement text of %table.content; is simply ( caption?, ... ), not ( %caption.qname;?, ... ).

MSXML 2.x (but not 3.0 or later), however, apparently does not expand the PERefs in entity values. It seems to replace %table.content; with ( %caption.qname;?, ... ), then replace %caption.qname; with %XHTML.pfx;caption, and finally replace %XHTML.pfx; with the empty string.

According to Section 4.4.8 of XML 1.0, the replacement text of a PERef recognized in the DTD (outside entity values) MUST be enlarged by one leading and one following space (#x20) character. As a result, MSXML 2.x ends up replacing %table.content; with ( caption ?, ... ), in which a space separates the element name from its occurrence indicator, and it reports a well-formedness error.
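The difference between the two expansion strategies can be modeled in a few lines. The sketch below is a deliberately simplified model, not a description of MSXML internals: the "lazy" strategy is only the behaviour guessed at above, the content model is abbreviated to two children, and all function and variable names are hypothetical.

```python
import re

PEREF = re.compile(r"%([A-Za-z_][\w.\-]*);")

# Abbreviated declarations in document order (tr.qname added for the example).
DECLS = [
    ("XHTML.pfx", ""),
    ("caption.qname", "%XHTML.pfx;caption"),
    ("tr.qname", "%XHTML.pfx;tr"),
    ("table.content", "( %caption.qname;?, %tr.qname;+ )"),
]


def declare_eager(decls):
    """Conforming model: PERefs in an entity value are expanded, without
    padding, at the moment the entity is declared."""
    entities = {}
    for name, value in decls:
        if name not in entities:  # the first declaration is binding
            entities[name] = PEREF.sub(lambda m: entities[m.group(1)], value)
    return entities


def expand_lazy(text, raw):
    """Guessed MSXML 2.x model: values are stored unexpanded, and every
    reference is expanded on use with one space of padding on each side."""
    while PEREF.search(text):
        text = PEREF.sub(lambda m: " " + raw[m.group(1)] + " ", text)
    return text


def squeeze(text):
    """Collapse runs of whitespace so the results are easy to compare."""
    return " ".join(text.split())


raw = dict(DECLS)
eager_result = declare_eager(DECLS)["table.content"]
lazy_result = squeeze(expand_lazy("%table.content;", raw))

# The same model with redundant parentheses around each name.
raw_fixed = {**raw, "table.content": "( (%caption.qname;)?, (%tr.qname;)+ )"}
fixed_result = squeeze(expand_lazy("%table.content;", raw_fixed))

print(eager_result)  # ( caption?, tr+ )
print(lazy_result)   # ( caption ?, tr + )
print(fixed_result)  # ( ( caption )?, ( tr )+ )
```

In the lazy model the padding lands between each name and its occurrence indicator, which the content model grammar does not allow. Once each name is wrapped in its own parentheses, the padding falls inside the parentheses, where white space is permitted.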

To avoid this error, redeclare table.content and Ruby.content.complex with redundant parentheses around each name that carries an occurrence indicator:

<!ENTITY % table.content
     "( (%caption.qname;)?, ( (%col.qname;)* | (%colgroup.qname;)* ),
      (( (%thead.qname;)?, (%tfoot.qname;)?, (%tbody.qname;)+ )
      | ( (%tr.qname;)+ )))"
>
<!ENTITY % Ruby.content.complex 
     "| ( %rbc.qname;, %rtc.qname;, (%rtc.qname;)? )"
>

Errata of Modularization of XHTML

# Note: This error has been fixed.

The XML-compatible ISO Special Character Entity Set for XHTML (-//W3C//ENTITIES Special for XHTML//EN) in Modularization of XHTML (first edition) contains the following declarations:

<!ENTITY amp     "&#38;&#38;"> <!--  ampersand, U+0026 ISOnum -->
<!ENTITY lt      "&#38;&#60;"> <!--  less-than sign, U+003C ISOnum -->

These declarations contain two typos. The correct values for these entities are "&#38;#38;" and "&#38;#60;". According to Section 4.6 of XML 1.0:

If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result.

Unfortunately, these typos therefore cause well-formedness errors in all XHTML 1.1, XHTML Basic and WML 2.0 documents, so you should not use these document types with the original DTD.
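Why the double escaping matters can be shown with a simplified model of entity processing: character references in the literal entity value are expanded once when the entity is declared, and the resulting replacement text is parsed again when the entity is referenced. The helper names below are hypothetical, and the check ignores named entity references for brevity.

```python
import re

CHARREF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def replacement_text(literal):
    """Expand character references once, as is done at declaration time."""
    def to_char(match):
        if match.group(1):
            return chr(int(match.group(1), 16))
        return chr(int(match.group(2)))
    return CHARREF.sub(to_char, literal)


def reparses_cleanly(text):
    """True if the text, parsed again as content, contains no bare '&' or
    '<' outside character references (a simplified well-formedness check)."""
    leftover = CHARREF.sub("", text)
    return "&" not in leftover and "<" not in leftover


correct_amp = replacement_text("&#38;#38;")   # replacement text is '&#38;'
correct_lt = replacement_text("&#38;#60;")    # replacement text is '&#60;'
broken_amp = replacement_text("&#38;&#38;")   # replacement text is '&&'
broken_lt = replacement_text("&#38;&#60;")    # replacement text is '&<'

for text in (correct_amp, correct_lt, broken_amp, broken_lt):
    print(repr(text), reparses_cleanly(text))
```

With the correct values, the replacement text is itself a character reference and is parsed again without trouble. With the typos, the replacement text contains a bare ampersand or less-than sign, so every document that loads the entity set is affected.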

To avoid these errors, change the system identifier for -//W3C//ENTITIES Special for XHTML//EN to point to a correct version of the entity set (e.g. http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent).

<!ENTITY % xhtml-special
     PUBLIC "-//W3C//ENTITIES Special for XHTML//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent" >

# FYI: These errata will be fixed in the second edition of Modularization of XHTML.

# FYI: If you validate documents with the W3C Markup Validation Service, no such error is reported. The W3C validator probably uses an SGML catalog that binds the public identifier -//W3C//ENTITIES Special for XHTML//EN to the correct version of the entity set.

# FYI: These typos do not appear in the flattened XHTML DTD. That DTD was probably flattened using such a catalog.

# This error has been fixed in XHTML™ Modularization 1.1 (PR 2006-02-13), and the DTD modules of M12N 1.0 have been replaced by the modules of 1.1 PR.

XHTML 1.1 DTD for MSXML

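In summary, to process an XHTML 1.1 document with MSXML 2.x/3.0, declare the DOCTYPE as follows. Within the internal subset, parameter entity references may not appear inside markup declarations, so each % in the entity values is written as the character reference &#37;.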

Module-based
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" [
<!ENTITY % xhtml-prefw-redecl.mod "" >
<!ENTITY % xhtml-postfw-redecl.mod "" >

<!ENTITY % xhtml-model.mod
     PUBLIC "-//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11-model-1.mod" >

<!ENTITY % xhtml-special
     PUBLIC "-//W3C//ENTITIES Special for XHTML//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent" >

<!ENTITY % table.content
     "( (&#37;caption.qname;)?, ( (&#37;col.qname;)* | (&#37;colgroup.qname;)* ),
      (( (&#37;thead.qname;)?, (&#37;tfoot.qname;)?, (&#37;tbody.qname;)+ )
      | ( (&#37;tr.qname;)+ )))" >
<!ENTITY % Ruby.content.complex 
     "| ( &#37;rbc.qname;, &#37;rtc.qname;, (&#37;rtc.qname;)? )" >
]>
Flattened-based
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11-flat.dtd" [
<!ENTITY % xhtml-prefw-redecl.mod "" >
<!ENTITY % xhtml-postfw-redecl.mod "" >

<!ENTITY % table.content
     "( (&#37;caption.qname;)?, ( (&#37;col.qname;)* | (&#37;colgroup.qname;)* ),
      (( (&#37;thead.qname;)?, (&#37;tfoot.qname;)?, (&#37;tbody.qname;)+ )
      | ( (&#37;tr.qname;)+ )))" >
<!ENTITY % Ruby.content.complex 
     "| ( &#37;rbc.qname;, &#37;rtc.qname;, (&#37;rtc.qname;)? )" >
]>
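These overrides depend on two rules of XML 1.0: the internal subset is read before the external subset, and when the same entity is declared more than once, the first declaration is binding. The second rule can be observed with a conforming parser. The sketch below uses Python's standard xml.etree.ElementTree rather than MSXML; the document and the entity name e are made up for illustration, and a general entity stands in for the parameter entities used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "e" is declared twice in the internal
# subset. Only the first declaration should take effect.
DOC = """<!DOCTYPE r [
<!ENTITY e "first">
<!ENTITY e "second">
]>
<r>&e;</r>"""

resolved = ET.fromstring(DOC).text
print(resolved)
```

Under the first-declaration-binds rule the printed text is expected to be the value of the first declaration. For the same reason, a declaration placed in the internal subset wins over the declaration of the same entity in xhtml11.dtd.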

External DTDs are:

Sample documents are:

# FYI: You can use these DTDs freely, but you must comply with the W3C's license.

Reference

Status of This Document

URI
http://www.satoshii.org/markup/dtd/xhtml11-msxml (HTML, XHTML, Japanese version)
First Edition (English Version)
2004-09-30
Last Modified
2006-02-23
Editor
Satoshi ISHIKAWA
Copyright © 2001-2004, 2006 Satoshi ISHIKAWA, All Rights Reserved.