DTD to W3C XML Schema

Ayub Khan
11 January 2006

Purpose
Screens
Features
Implementation
Risks/Issues
Resources



Purpose


Screens

TODO

Features

Users of this tool have the following features available.



Implementation

This tool has the following components:

DTD Parser:

A DTD parser should parse a DTD and represent its content in an easily accessible form. There are four types of parsers:

1. Java-based (content accessible via an API)
2. Non-Java-based

These are
Because they are not Java-based, we are not interested in these.

3. XML representation (content represented in an XML format)

This is the approach we want to use, for several reasons. One is that an XSL stylesheet can be applied to this XML
representation to produce an XML Schema.

4. Other types of parsers


DTD syntax:

<!ELEMENT element-name category>

or

<!ELEMENT element-name (element-content)>

XML representation:

<element name="A">
  <content category="EMPTY|ANY|mixed|children">
    xyz
  </content>
</element>

where xyz is

<mixed>
  <choice>
    <sequence>
      <pcdata value="text content"/>
      <sequence>
        <element ref="t:X" cardinal="?|+|*"/>
        <element ref="t:Y" cardinal="?|+|*"/>
      </sequence>
    </sequence>
    <pcdata value="text content"/>
  </choice>
</mixed>

or

<children>
  <sequence>
    <element ref="t:X" cardinal="?|+|*"/>
    <element ref="t:Y" cardinal="?|+|*"/>
  </sequence>
</children>

Examples:

  • <!ELEMENT A EMPTY>

will be represented as

<element name="A">
  <content category="EMPTY"/>
</element>

  • <!ELEMENT A (X?|(Y+,Z*))>

will be represented as

<element name="A">
  <content category="children">
    <children>
      <choice>
        <element ref="t:X" cardinal="?"/>
        <sequence>
          <element ref="t:Y" cardinal="+"/>
          <element ref="t:Z" cardinal="*"/>
        </sequence>
      </choice>
    </children>
  </content>
</element>
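The element representation above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the Java/XSLT implementation this document proposes. The function names (tokenize, parse_particle, parse_group, represent_element) are hypothetical, and the sketch assumes well-formed input and handles only EMPTY, ANY and children content models, not mixed content.

```python
import re
import xml.etree.ElementTree as ET

CARDINALS = ("?", "*", "+")

def tokenize(model):
    """Split a content model into names and the punctuation ( ) | , ? * +."""
    return re.findall(r"[A-Za-z_][\w.\-]*|[()|,?*+]", model)

def parse_particle(tokens, pos):
    """Parse one name or parenthesised group, plus an optional cardinality."""
    if tokens[pos] == "(":
        node, pos = parse_group(tokens, pos + 1)
    else:
        # The "t:" prefix follows the convention used in this document.
        node = ET.Element("element", ref="t:" + tokens[pos])
        pos += 1
    if pos < len(tokens) and tokens[pos] in CARDINALS:
        node.set("cardinal", tokens[pos])
        pos += 1
    return node, pos

def parse_group(tokens, pos):
    """Parse the inside of '( ... )' into a <choice> or a <sequence>."""
    items, separator = [], ","
    while True:
        item, pos = parse_particle(tokens, pos)
        items.append(item)
        token, pos = tokens[pos], pos + 1
        if token == ")":
            break
        separator = token  # ',' means sequence, '|' means choice
    group = ET.Element("choice" if separator == "|" else "sequence")
    for item in items:
        group.append(item)
    return group, pos

def represent_element(name, model):
    """Build the <element> representation for EMPTY, ANY or a children model."""
    element = ET.Element("element", name=name)
    if model in ("EMPTY", "ANY"):
        ET.SubElement(element, "content", category=model)
        return element
    content = ET.SubElement(element, "content", category="children")
    group, _ = parse_particle(tokenize(model), 0)
    ET.SubElement(content, "children").append(group)
    return element

print(ET.tostring(represent_element("A", "(X?|(Y+,Z*))"), encoding="unicode"))
```

A single-item group such as (X) comes out as a sequence with one child in this sketch, which keeps the later mapping step uniform.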

DTD syntax:

<!ATTLIST element-name attribute-name attribute-type default-value>

XML representation:

<attribute element="A">
  <definition name="a">
    <type>
      abc
    </type>
    <default-declaration>
      xyz
    </default-declaration>
  </definition>
</attribute>

where abc is one of the following
  • <cdata/>
  • <token type="token-type"/>
  • <sequence><enumeration value="enum1"/>...</sequence>
where token-type is one of ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS

and xyz is one of the following
  • <required/>
  • <implied/>
  • <fixed value="some value"/>

Example:

  • <!ATTLIST A
    a (x|y|z) "x">
will be represented as

<attribute element="A">
  <definition name="a">
    <type>
      <sequence>
        <enumeration value="x"/>
        <enumeration value="y"/>
        <enumeration value="z"/>
      </sequence>
    </type>
    <default-declaration>
      <fixed value="x"/>
    </default-declaration>
  </definition>
</attribute>

General Entities

<!ENTITY Name Definition>

<entity name="Name" type="general">
  <definition>
    abc
  </definition>
</entity>

where abc is one of the following
  • <entity-value value="some value"/>
  • <external-id ndata="ndata val">
      xyz
    </external-id>

where xyz is one of the following
  • <system sysid=""/>
  • <public pubid="" sysid=""/>
Example:

Internal entities:

<!ENTITY Pub-Status "This is a pre-release of the specification.">

is represented as

<entity name="Pub-Status" type="general" >
  <definition>
     <entity-value value="This is a pre-release of the specification."/>
   </definition>
</entity>

External Entities:

<!ENTITY open-hatch
SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">

is represented as

<entity name="open-hatch" type="general">
  <definition>
    <external-id>
      <system sysid="http://www.textuality.com/boilerplate/OpenHatch.xml"/>
    </external-id>
  </definition>
</entity>

<!ENTITY open-hatch
PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
"http://www.textuality.com/boilerplate/OpenHatch.xml">

is represented as

<entity name="open-hatch" type="general">
  <definition>
    <external-id>
      <public pubid="-//Textuality//TEXT Standard open-hatch boilerplate//EN" sysid="http://www.textuality.com/boilerplate/OpenHatch.xml"/>
    </external-id>
  </definition>
</entity>

<!ENTITY hatch-pic
SYSTEM "../grafix/OpenHatch.gif"
NDATA gif >

is represented as

<entity name="hatch-pic" type="general">
  <definition>
    <external-id ndata="gif">
      <system sysid="../grafix/OpenHatch.gif"/>
    </external-id>
  </definition>
</entity>
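As an illustration of the general entity representation above, the following Python sketch turns a single general entity declaration into the corresponding <entity> element. It is only a sketch under stated assumptions: the regular expression ENTITY_RE and the function represent_general_entity are hypothetical names, the pattern is a simplification rather than a full DTD parser, and parameter entities (declared with %) are deliberately rejected.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for one general entity declaration (an assumption of this
# sketch): an internal value, a SYSTEM id, or a PUBLIC id, with optional NDATA.
ENTITY_RE = re.compile(r"""
    <!ENTITY \s+ (?P<name>[^\s%]+) \s+
    (?:
        (?P<q>["']) (?P<value>.*?) (?P=q)
      | SYSTEM \s+ (?P<q1>["']) (?P<system>.*?) (?P=q1)
      | PUBLIC \s+ (?P<q2>["']) (?P<pubid>.*?) (?P=q2)
               \s+ (?P<q3>["']) (?P<pubsys>.*?) (?P=q3)
    )
    (?: \s+ NDATA \s+ (?P<ndata>[^\s>]+) )?
    \s* >
""", re.VERBOSE | re.DOTALL)

def represent_general_entity(declaration):
    """Return the <entity type="general"> element for one declaration."""
    match = ENTITY_RE.search(declaration)
    if match is None:
        raise ValueError("not a general entity declaration")
    entity = ET.Element("entity", name=match["name"], type="general")
    definition = ET.SubElement(entity, "definition")
    if match["value"] is not None:
        # Internal entity: the replacement text is stored directly.
        ET.SubElement(definition, "entity-value", value=match["value"])
        return entity
    external = ET.SubElement(definition, "external-id")
    if match["ndata"]:
        external.set("ndata", match["ndata"])
    if match["pubid"] is not None:
        ET.SubElement(external, "public",
                      pubid=match["pubid"], sysid=match["pubsys"])
    else:
        ET.SubElement(external, "system", sysid=match["system"])
    return entity

declaration = '<!ENTITY hatch-pic\nSYSTEM "../grafix/OpenHatch.gif"\nNDATA gif >'
print(ET.tostring(represent_general_entity(declaration), encoding="unicode"))
```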
Parameter Entities

<!ENTITY % Name Definition>

<entity name="Name" type="parameter">
  <definition>
    abc
  </definition>
</entity>

where abc is one of the following
  • <entity-value value="some value"/>
  • <external-id>
      xyz
    </external-id>

where xyz is one of the following
  • <system sysid=""/>
  • <public pubid="" sysid=""/>

Example:

<!ENTITY % YN '"Yes"' >

will be represented as

<entity name="YN" type="parameter">
  <definition>
    <entity-value value='"Yes"'/>
  </definition>
</entity>

DTD2XSD Mapper:

The mapper has to provide mappings for elements, attributes, entities, and so on, as shown below.

Note: the table below shows the complete mapping from DTD constructs to XML Schema. The mapping is the same whether we
use the XML representation (see above) or another DTD parser.


Syntax:

<!ELEMENT element-name category>
or
<!ELEMENT element-name (element-content)>

DTD:
<!ELEMENT A (X,Y)>

XML Schema:
<element name="A">
  <complexType>
    <sequence>
      <element ref="t:X"/>
      <element ref="t:Y"/>
    </sequence>
  </complexType>
</element>

DTD:
<!ELEMENT A (X|Y)>

XML Schema:
<element name="A">
  <complexType>
    <choice>
      <element ref="t:X"/>
      <element ref="t:Y"/>
    </choice>
  </complexType>
</element>

DTD:
<!ELEMENT A (X|(Y,Z))>

XML Schema:
<element name="A">
  <complexType>
    <choice>
      <element ref="t:X"/>
      <sequence>
        <element ref="t:Y"/>
        <element ref="t:Z"/>
      </sequence>
    </choice>
  </complexType>
</element>

DTD:
<!ELEMENT A (X?,Y+,Z*)>

XML Schema:
<element name="A">
  <complexType>
    <sequence>
      <element ref="t:X" minOccurs="0"/>
      <element ref="t:Y" maxOccurs="unbounded"/>
      <element ref="t:Z" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
  </complexType>
</element>

DTD:
<!ELEMENT A EMPTY>

XML Schema:
  • no attributes
<element name="A">
  <complexType/>
</element>

or

<element name="A">
  <complexType>
    <complexContent>
      <restriction base="anyType"/>
    </complexContent>
  </complexType>
</element>
  • with attributes
<xs:element name="A">
 <xs:complexType>
   <xs:attribute name="a" type="simpletype"/>
 </xs:complexType>
...
</xs:element>

DTD:
<!ELEMENT A (#PCDATA)>

XML Schema:
This means A is an element with only character data, so the schema mapping
would be

<xs:element name="A">
 <xs:complexType>
   <xs:simpleContent>
     <xs:extension base="basetype">
       ....
     </xs:extension>
   </xs:simpleContent>
 </xs:complexType>
</xs:element>

or

<xs:element name="A">
 <xs:complexType>
   <xs:simpleContent>
     <xs:restriction base="basetype">
       ....
       ....
     </xs:restriction>
   </xs:simpleContent>
 </xs:complexType>
</xs:element>


Example:

Instance document:

<A a="something">A's value</A>

Schema:

<xs:element name="A">
 <xs:complexType>
   <xs:simpleContent>
     <xs:extension base="xs:string">
       <xs:attribute name="a" type="xs:string"/>
     </xs:extension>
   </xs:simpleContent>
 </xs:complexType>
</xs:element>

DTD:
<!ELEMENT A ANY>

XML Schema:
<element name="A">
  <complexType>
    <sequence minOccurs="1" maxOccurs="1">
      <any namespace="ns" minOccurs="mn" maxOccurs="mx"
           processContents="skip"/>
    </sequence>
  </complexType>
</element>

where ns is the default namespace, mn is 0, and mx is unbounded.

DTD:
<!ELEMENT A (X)>

XML Schema:
<element name="A">
  <complexType>
    <sequence>
      <element ref="t:X"/>
    </sequence>
  </complexType>
</element>

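The element mappings above follow one rule: sequence and choice groups carry over by name, and the DTD cardinality suffix becomes minOccurs/maxOccurs. A minimal Python sketch of that rule follows, for illustration only and not as the proposed XSLT stylesheet. The names OCCURS, to_particle and to_schema_element are hypothetical, and the input is assumed to be the intermediate XML representation described earlier, with a sequence or choice as the top-level group.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# DTD cardinality -> occurrence attributes; no suffix means exactly one,
# which is the XML Schema default and needs no attribute.
OCCURS = {
    None: {},
    "?": {"minOccurs": "0"},
    "+": {"maxOccurs": "unbounded"},
    "*": {"minOccurs": "0", "maxOccurs": "unbounded"},
}

def to_particle(node):
    """Convert <element>, <sequence> or <choice> from the intermediate form."""
    particle = ET.Element(f"{{{XS}}}{node.tag}")
    if node.tag == "element":
        particle.set("ref", node.get("ref"))
    particle.attrib.update(OCCURS[node.get("cardinal")])
    for child in node:
        particle.append(to_particle(child))
    return particle

def to_schema_element(element):
    """Map an intermediate <element> with category 'children' to xs:element."""
    declaration = ET.Element(f"{{{XS}}}element", name=element.get("name"))
    complex_type = ET.SubElement(declaration, f"{{{XS}}}complexType")
    group = element.find("content/children")[0]  # assumed sequence or choice
    complex_type.append(to_particle(group))
    return declaration

source = ET.fromstring(
    '<element name="A"><content category="children"><children><sequence>'
    '<element ref="t:X" cardinal="?"/>'
    '<element ref="t:Y" cardinal="+"/>'
    '<element ref="t:Z" cardinal="*"/>'
    '</sequence></children></content></element>'
)
print(ET.tostring(to_schema_element(source), encoding="unicode"))
```
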
Syntax:

<!ATTLIST element-name attribute-name attribute-type default-value>

DTD:
<!ATTLIST A
a CDATA #REQUIRED>

XML Schema:
<element name="A">
  <complexType>
    <attribute name="a" type="string" use="required"/>
  </complexType>
</element>

DTD:
<!ATTLIST A
a CDATA #IMPLIED>

XML Schema:
<element name="A">
  <complexType>
    <attribute name="a" type="string" use="optional"/>
  </complexType>
</element>

DTD:
<!ATTLIST A
a (x|y|z) "x">

XML Schema:
<element name="A">
  <complexType>
    <attribute name="a" default="x">
      <simpleType>
        <restriction base="string">
          <enumeration value="x"/>
          <enumeration value="y"/>
          <enumeration value="z"/>
        </restriction>
      </simpleType>
    </attribute>
  </complexType>
</element>

DTD:
<!ATTLIST A
a (x|y|z) #REQUIRED>

XML Schema:
<element name="A">
  <complexType>
    <attribute name="a" use="required">
      <simpleType>
        <restriction base="string">
          <enumeration value="x"/>
          <enumeration value="y"/>
          <enumeration value="z"/>
        </restriction>
      </simpleType>
    </attribute>
  </complexType>
</element>

DTD:
<!ATTLIST A
a CDATA #FIXED "x">

XML Schema:
<element name="A">
  <complexType>
    <attribute name="a" type="string" fixed="x"/>
  </complexType>
</element>

DTD:
<!ATTLIST A
a ID #FIXED "x1">

XML Schema:
Similar to CDATA (see above), with the built-in type ID in place of string.

DTD:
<!ATTLIST A
a NMTOKEN #FIXED "x1:1">

XML Schema:
Similar to CDATA (see above), with the built-in type NMTOKEN.

DTD:
<!ATTLIST A
a NMTOKENS #FIXED "x1:1 x1:2 x1:3">

XML Schema:
Similar to CDATA (see above), with the built-in type NMTOKENS.

DTD:
<!ATTLIST A
a IDREF #FIXED "x1">

XML Schema:
Similar to CDATA (see above), with the built-in type IDREF.

DTD:
<!ATTLIST A
a IDREFS #FIXED "x1 x2 x3">

XML Schema:
Similar to CDATA (see above), with the built-in type IDREFS.
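The attribute mappings above can be summarised in one function. The following Python sketch is illustrative only and is not the proposed implementation. The name map_attribute and its parameters are hypothetical, the xs prefix is assumed to be bound to the XML Schema namespace, and the default declaration is passed as a kind ('required', 'implied', 'fixed' or 'default') plus an optional value.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# DTD tokenized types that have same-named built-in XML Schema types.
TOKEN_TYPES = {"ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES",
               "NMTOKEN", "NMTOKENS"}

def map_attribute(name, attr_type, default_kind, default_value=None):
    """Map one attribute definition to an xs:attribute element.

    attr_type is 'CDATA', a tokenized type name, or a list of enumeration
    values. default_kind is 'required', 'implied', 'fixed' or 'default'.
    """
    attribute = ET.Element(f"{{{XS}}}attribute", name=name)
    if isinstance(attr_type, str):
        if attr_type == "CDATA":
            attribute.set("type", "xs:string")
        elif attr_type in TOKEN_TYPES:
            attribute.set("type", "xs:" + attr_type)
        else:
            raise ValueError(f"unknown attribute type: {attr_type}")
    else:
        # Enumerated type: an anonymous simpleType restricting string.
        simple_type = ET.SubElement(attribute, f"{{{XS}}}simpleType")
        restriction = ET.SubElement(simple_type, f"{{{XS}}}restriction",
                                    base="xs:string")
        for value in attr_type:
            ET.SubElement(restriction, f"{{{XS}}}enumeration", value=value)
    if default_kind == "required":
        attribute.set("use", "required")
    elif default_kind == "implied":
        attribute.set("use", "optional")
    elif default_kind in ("fixed", "default"):
        # XML Schema uses attributes named 'fixed' and 'default' for these.
        attribute.set(default_kind, default_value)
    else:
        raise ValueError(f"unknown default declaration: {default_kind}")
    return attribute

print(ET.tostring(map_attribute("a", ["x", "y", "z"], "default", "x"),
                  encoding="unicode"))
```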


Risks/Issues

1. Development and unit testing are estimated to take 3-4 man-weeks.
2. Some DTD constructs may not be trivial to map to XML Schema.
3. In general, using XSLT to transform large XML files requires a large amount of memory, but in this case I am assuming
that a DTD typically would not be that big.
4. Stylesheets need to be modular so that they can be extended.

Resources