DTD to W3C XML Schema

Ayub Khan
11 January 2006

Purpose
Screens
Features
Implementation
Risks/Issues
Resources



Purpose

  • To provide a tool for developers to convert from an existing DTD to W3C XML Schema

Screens

TODO

Features

Users of this tool have the following features available.

  • Allows conversion of complex, modularized XML DTDs and DTDs with namespaces to W3C XML Schemas
  • Converts a Document Type Definition (DTD) into an XML Schema (REC-xmlschema-1-20010502).
  • Maps DTD entities onto XML Schema constructs (simpleType, attributeGroup, group).
  • Support for XML 1.0 DTD (input document)
  • Generated W3C XML Schema file
  • One single wizard panel (for converting a DTD to W3C XML Schema) provides:
    • an option to specify the target namespace
    • a location for the generated schema file, with logic to validate that location
    • OK and Cancel buttons
    • a conversion summary area, shown within the panel
  • If errors occur during conversion, the summary area shows the affected lines with a red text background.
    • Some of these errors may be DTD parse-time errors, which the user can fix by correcting the DTD file.
    • Conversion-time errors are expected to be very rare.
  • Maps DTD comments onto XML Schema documentation nodes.
  • Adds the following annotation:
    •  <xs:annotation>
        --  (from file: file:///C:/tmp/hello.dtd) --
       </xs:annotation>


    Implementation

    This tool has the following components:
    • A DTD Parser
    • DTD2XSD Mapper

    DTD Parser:

    A DTD parser should parse a DTD and represent its content in an easily accessible form. The candidate parsers fall into the following categories:

    1. Java based (content accessible via an API)
    • MATRA (on SourceForge - http://matra.sourceforge.net)
      • It returns a model after parsing the DTD. There appears to be an issue: A|B and A,B are treated as the same in the generated DTD model.
    • DTD Parser (from WUTKA - http://www.wutka.com/dtdparser.html)
      • It provides a tokenizer, from which all the DTD constructs in the DTD file can be obtained.
      • DTDParser is licensed under either an Apache-style license or the Lesser GPL (LGPL) license. (http://www.wutka.com/dtdparserlicense.html)
    • TRANG (BSD license, see http://webhome.sfbay/OFR/BSDlicense.html for issues) http://lists.xml.org/archives/xml-dev/200301/msg00594.html

    2. Non-Java based

    These are
    Because they are not Java based, we are not interested in these.

    3. XML representation (content represented in an XML format)

    This is the approach we want to use, for several reasons. For example, one could apply an XSL stylesheet to this XML
    representation to obtain an XML Schema.

    4. Other types of parsers
    • JavaCC
    • Antlr


    Each entry below gives the DTD syntax, its XML representation (XML rep.), and an example.

    <!ELEMENT element-name category>

    or

    <!ELEMENT element-name (element-content)>
    <element name="A">
      <content category="EMPTY|ANY|mixed|children">
        xyz
      </content>
    </element>

    where xyz is

    <mixed>
      <choice>
        <sequence>
          <pcdata value="text content"/>
          <sequence>
            <element ref="t:X" cardinal="?|+|*"/>
            <element ref="t:Y" cardinal="?|+|*"/>
          </sequence>
        </sequence>
        <pcdata value="text content"/>
      </choice>
    </mixed>

    or

    <children>
      <sequence>
        <element ref="t:X" cardinal="?|+|*"/>
        <element ref="t:Y" cardinal="?|+|*"/>
      </sequence>
    </children>

    • <!ELEMENT A EMPTY>

    will be represented as

    <element name="A">
      <content category="EMPTY"/>
    </element>

    • <!ELEMENT A (X?|(Y+,Z*)) >

    will be represented as

    <element name="A">
      <content category="children">
        <children>
          <choice>
            <element ref="t:X" cardinal="?"/>
            <sequence>
              <element ref="t:Y" cardinal="+"/>
              <element ref="t:Z" cardinal="*"/>
            </sequence>
          </choice>
        </children>
      </content>
    </element>
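
The content-model example above can be sketched in code. The following is a minimal illustration, not the tool's implementation: it assumes a hand-written recursive-descent parser over the parenthesized content model, and the function name parse_content_model and the "t:" prefix argument are hypothetical. It only handles children content (no #PCDATA, EMPTY, or ANY).

```python
import xml.etree.ElementTree as ET


def parse_content_model(text, prefix="t:"):
    """Turn a DTD children content model such as '(X?|(Y+,Z*))' into the
    <choice>/<sequence>/<element> nodes used by the XML representation.
    Hypothetical helper for illustration only."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def cardinal(node):
        # Attach an optional ?, + or * suffix as the 'cardinal' attribute.
        nonlocal pos
        if peek() in ("?", "+", "*"):
            node.set("cardinal", text[pos])
            pos += 1
        return node

    def particle():
        nonlocal pos
        skip_ws()
        if peek() == "(":
            pos += 1
            items = [particle()]
            skip_ws()
            separator = None
            while peek() in ("|", ","):
                if separator is None:
                    separator = peek()
                elif separator != peek():
                    raise ValueError("mixed '|' and ',' in one group")
                pos += 1
                items.append(particle())
                skip_ws()
            if peek() != ")":
                raise ValueError("expected ')' at offset %d" % pos)
            pos += 1
            # A group without '|' (including a single item) becomes a sequence.
            group = ET.Element("choice" if separator == "|" else "sequence")
            group.extend(items)
            return cardinal(group)
        start = pos
        while pos < len(text) and (text[pos].isalnum() or text[pos] in "_-.:"):
            pos += 1
        if start == pos:
            raise ValueError("expected a name at offset %d" % pos)
        return cardinal(ET.Element("element", {"ref": prefix + text[start:pos]}))

    root = particle()
    skip_ws()
    if pos != len(text):
        raise ValueError("unexpected trailing text at offset %d" % pos)
    return root
```

Calling parse_content_model("(X?|(Y+,Z*))") yields a choice node whose children are the element for X and a sequence holding Y and Z, which is the body of the children node in the example. Keeping '|' and ',' as distinct node types also avoids the A|B versus A,B confusion noted for MATRA.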

    <!ATTLIST element-name attribute-name attribute-type default-value>
    <attribute element="A">
      <definition name="a">
        <type>
          abc
        </type>
        <default-declaration>
          xyz
        </default-declaration>
      </definition>
    </attribute>

    where abc is one of the following

    • <cdata/>
    • <token type="token-type"/>
    • <sequence><enumeration value="enum1"/>...</sequence>

    where token-type is one of ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS

    and xyz is one of the following

    • <required/>
    • <implied/>
    • <fixed value="some value"/>


    • <!ATTLIST A
        a (x|y|z) "x">

    will be represented as

    <attribute element="A">
      <definition name="a">
        <type>
          <sequence>
            <enumeration value="x"/>
            <enumeration value="y"/>
            <enumeration value="z"/>
          </sequence>
        </type>
        <default-declaration>
          <fixed value="x"/>
        </default-declaration>
      </definition>
    </attribute>
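
As a rough sketch of how one attribute definition could be turned into this representation, the snippet below builds the attribute node from already-tokenized parts. The function name attribute_rep and its arguments are hypothetical, and it follows the example above in recording a plain default literal as a fixed node; that choice is an assumption taken from the example, not a statement about the final design.

```python
import xml.etree.ElementTree as ET

# Tokenized attribute types listed in the representation above.
TOKEN_TYPES = {"ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES", "NMTOKEN", "NMTOKENS"}


def attribute_rep(element, name, attr_type, default):
    """Build the <attribute> representation for one ATTLIST definition.
    Hypothetical helper; inputs are the already-tokenized declaration parts."""
    attribute = ET.Element("attribute", {"element": element})
    definition = ET.SubElement(attribute, "definition", {"name": name})
    type_node = ET.SubElement(definition, "type")
    if attr_type == "CDATA":
        ET.SubElement(type_node, "cdata")
    elif attr_type in TOKEN_TYPES:
        ET.SubElement(type_node, "token", {"type": attr_type})
    elif attr_type.startswith("(") and attr_type.endswith(")"):
        # Enumerated type: one <enumeration> per alternative.
        sequence = ET.SubElement(type_node, "sequence")
        for value in attr_type[1:-1].split("|"):
            ET.SubElement(sequence, "enumeration", {"value": value.strip()})
    else:
        raise ValueError("unsupported attribute type: %r" % attr_type)

    declaration = ET.SubElement(definition, "default-declaration")
    if default == "#REQUIRED":
        ET.SubElement(declaration, "required")
    elif default == "#IMPLIED":
        ET.SubElement(declaration, "implied")
    else:
        # '#FIXED "v"' and, as in the example above, a plain '"v"' literal
        # are both recorded as <fixed value="v"/>.
        literal = default[len("#FIXED"):] if default.startswith("#FIXED") else default
        ET.SubElement(declaration, "fixed", {"value": literal.strip().strip("\"'")})
    return attribute
```

For instance, attribute_rep("A", "a", "(x|y|z)", '"x"') produces the same tree as the example above.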

    General Entities

    <!ENTITY Name Definition>

    <entity name="Name" type="general" >
      <definition>
        abc
      </definition>
    </entity>

    where abc is one of the following
    • <entity-value value="some value"/>
    • <external-id ndata="ndata val">
        xyz
      </external-id>

    where xyz is one of the following
    • <system sysid=""/>
    • <public pubid="" sysid=""/>

    Example:

    Internal entities:

    <!ENTITY Pub-Status "This is a pre-release of the specification.">

    is represented as

    <entity name="Pub-Status" type="general" >
      <definition>
         <entity-value value="This is a pre-release of the specification."/>
       </definition>
    </entity>

    External Entities:

    <!ENTITY open-hatch
    SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">

    is represented as
    <entity name="open-hatch" type="general" >
      <definition>
         <external-id>
           <system sysid="http://www.textuality.com/boilerplate/OpenHatch.xml"/>
         </external-id>
       </definition>
    </entity>

    <!ENTITY open-hatch
    PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
    "http://www.textuality.com/boilerplate/OpenHatch.xml">

    is represented as

    <entity name="open-hatch" type="general" >
      <definition>
         <external-id>
           <public pubid="-//Textuality//TEXT Standard open-hatch boilerplate//EN" sysid="http://www.textuality.com/boilerplate/OpenHatch.xml"/>
         </external-id>
       </definition>
    </entity>

    <!ENTITY hatch-pic
    SYSTEM "../grafix/OpenHatch.gif"
    NDATA gif >

    is represented as

    <entity name="hatch-pic" type="general" >
      <definition>
         <external-id ndata="gif">
           <system sysid="../grafix/OpenHatch.gif"/>
         </external-id>
       </definition>
    </entity>

    Parsed Entities (parameter entities, declared with %)

    <!ENTITY % Name Definition>

    <entity name="Name" type="parsed" >
      <definition>
        abc
      </definition>
    </entity>

    where abc is one of the following
    • <entity-value value="some value"/>
    • <external-id>
        xyz
      </external-id>

    where xyz is one of the following
    • <system sysid=""/>
    • <public pubid="" sysid=""/>

    Example:

    <!ENTITY % YN '"Yes"' >

    will be represented as

    <entity name="YN" type="parsed" >
      <definition>
         <entity-value value='"Yes"'/>
       </definition>
    </entity>
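
The entity forms above (internal value, SYSTEM, PUBLIC, optional NDATA, and the % variant) can be recognized with a single pattern. The sketch below is only an illustration under simplifying assumptions: it uses a regular expression rather than a real DTD parser, the names ENTITY_RE and entity_rep are hypothetical, and it does not expand entity references inside values.

```python
import re
import xml.etree.ElementTree as ET

# One pattern for the declaration forms shown above (hypothetical helper).
ENTITY_RE = re.compile(
    r"""<!ENTITY\s+(?P<pe>%\s+)?(?P<name>[^\s%]+)\s+
        (?: (?P<q>["'])(?P<value>.*?)(?P=q)
          | SYSTEM\s+(?P<q1>["'])(?P<system>.*?)(?P=q1)
          | PUBLIC\s+(?P<q2>["'])(?P<public>.*?)(?P=q2)\s+(?P<q3>["'])(?P<pubsys>.*?)(?P=q3)
        )
        (?:\s+NDATA\s+(?P<ndata>[^\s>]+))?\s*>""",
    re.VERBOSE | re.DOTALL,
)


def entity_rep(declaration):
    """Build the <entity> representation for one <!ENTITY ...> declaration."""
    match = ENTITY_RE.search(declaration)
    if match is None:
        raise ValueError("not a recognized entity declaration")
    # Following the document's naming: '%' declarations get type="parsed".
    kind = "parsed" if match["pe"] else "general"
    entity = ET.Element("entity", {"name": match["name"], "type": kind})
    definition = ET.SubElement(entity, "definition")
    if match["value"] is not None:
        ET.SubElement(definition, "entity-value", {"value": match["value"]})
        return entity
    external = ET.SubElement(definition, "external-id")
    if match["ndata"]:
        external.set("ndata", match["ndata"])
    if match["system"] is not None:
        ET.SubElement(external, "system", {"sysid": match["system"]})
    else:
        ET.SubElement(
            external, "public",
            {"pubid": match["public"], "sysid": match["pubsys"]},
        )
    return entity
```

Passing the hatch-pic declaration from the examples gives an external-id node with ndata="gif" and a system child; a production parser would also need to handle comments, conditional sections, and parameter-entity references, which this pattern ignores.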

    DTD2XSD Mapper:

    The mapper has to provide mappings for elements, attributes, entities, and so on; see below.

    Note: the mapping examples in the tables below show the complete mapping from DTD constructs to XML Schema. They apply whether we
    use the XML representation (see above) or another DTD parser.

    • Elements

    Syntax:

    <!ELEMENT element-name category>
    or
    <!ELEMENT element-name (element-content)>

    Each entry shows the DTD declaration followed by its XML Schema mapping.

    <!ELEMENT A (X,Y) >

    <element name="A">
      <complexType>
        <sequence>
          <element ref="t:X"/>
          <element ref="t:Y"/>
        </sequence>
      </complexType>
    </element>

    <!ELEMENT A (X|Y) >

    <element name="A">
      <complexType>
        <choice>
          <element ref="t:X"/>
          <element ref="t:Y"/>
        </choice>
      </complexType>
    </element>

    <!ELEMENT A (X|(Y,Z)) >

    <element name="A">
      <complexType>
        <choice>
          <element ref="t:X"/>
          <sequence>
            <element ref="t:Y"/>
            <element ref="t:Z"/>
          </sequence>
        </choice>
      </complexType>
    </element>

    <!ELEMENT A (X?,Y+,Z*) >

    <element name="A">
      <complexType>
        <sequence>
          <element ref="t:X" minOccurs="0"/>
          <element ref="t:Y" maxOccurs="unbounded"/>
          <element ref="t:Z" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </element>

    <!ELEMENT A EMPTY >

    • no attributes

    <element name="A">
      <complexType/>
    </element>

    or

    <element name="A">
      <complexType>
        <complexContent>
          <restriction base="anyType"/>
        </complexContent>
      </complexType>
    </element>

    • with attributes

    <xs:element name="A">
      <xs:complexType>
        <xs:attribute name="a" type="<simpletype>"/>
      </xs:complexType>
    ...
    </xs:element>

    <!ELEMENT A (#PCDATA) >

    This means A is an element containing only character data, so the schema mapping
    would be

    <xs:element name="A">
     <xs:complexType>
       <xs:simpleContent>
         <xs:extension base="basetype">
           ....
         </xs:extension>    
       </xs:simpleContent>
     </xs:complexType>
    </xs:element>

    OR

    <xs:element name="A">
     <xs:complexType>
       <xs:simpleContent>
         <xs:restriction base="basetype">
           ....
           ....
         </xs:restriction>    
       </xs:simpleContent>
     </xs:complexType>
    </xs:element>


    Example:

    Instance document:

    <A a="something">A's value</A>

    Schema:

    <xs:element name="A">
     <xs:complexType>
       <xs:simpleContent>
         <xs:extension base="xs:string">
           <xs:attribute name="a" type="xs:string" />
         </xs:extension>
       </xs:simpleContent>
     </xs:complexType>
    </xs:element>

    <!ELEMENT A ANY >

    <element name="A">
      <complexType>
        <sequence minOccurs="1" maxOccurs="1">
          <any namespace="ns" minOccurs="mn" maxOccurs="mx"
               processContents="skip"/>
        </sequence>
      </complexType>
    </element>

    where ns is the default namespace, mn is 0, and mx is unbounded

    <!ELEMENT A (X) >

    <element name="A">
      <complexType>
        <sequence>
          <element ref="t:X"/>
        </sequence>
      </complexType>
    </element>
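
The occurrence rules in the element mappings above (? to minOccurs="0", + to maxOccurs="unbounded", * to both) can be sketched as a small tree transformation. This is an illustrative sketch only: it assumes the input is the choice/sequence/element form of the XML representation, the names to_xsd_particle and to_xsd_element are hypothetical, output elements are left unprefixed as in the examples, and a lone element is wrapped in a sequence on the assumption that the target is the final XML Schema Recommendation.

```python
import xml.etree.ElementTree as ET

# DTD cardinality suffix -> XML Schema occurrence attributes.
OCCURS = {
    "?": {"minOccurs": "0"},
    "+": {"maxOccurs": "unbounded"},
    "*": {"minOccurs": "0", "maxOccurs": "unbounded"},
}


def to_xsd_particle(node):
    """Map one <choice>, <sequence> or <element ref=...> node of the
    XML representation to the corresponding schema particle."""
    attrs = dict(OCCURS.get(node.get("cardinal"), {}))
    if node.tag == "element":
        return ET.Element("element", {"ref": node.get("ref"), **attrs})
    if node.tag not in ("choice", "sequence"):
        raise ValueError("unexpected node: %s" % node.tag)
    group = ET.Element(node.tag, attrs)
    group.extend([to_xsd_particle(child) for child in node])
    return group


def to_xsd_element(name, content):
    """Wrap the mapped content model in <element><complexType>."""
    element = ET.Element("element", {"name": name})
    complex_type = ET.SubElement(element, "complexType")
    particle = to_xsd_particle(content)
    if particle.tag == "element":
        # A bare element reference is not allowed directly under
        # complexType, so wrap it in a sequence.
        wrapper = ET.Element("sequence")
        wrapper.append(particle)
        particle = wrapper
    complex_type.append(particle)
    return element
```

Feeding it the sequence of X?, Y+, Z* gives the same structure as the (X?,Y+,Z*) row: a sequence whose three element references carry the occurrence attributes listed there.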
    • Attributes
    Syntax:

    <!ATTLIST element-name attribute-name attribute-type default-value>

    Each entry shows the DTD declaration followed by its XML Schema mapping.

    <!ATTLIST A
      a CDATA #REQUIRED>

    <element name="A">
      <complexType>
        <attribute name="a" type="string" use="required"/>
      </complexType>
    </element>

    <!ATTLIST A
      a CDATA #IMPLIED>

    <element name="A">
      <complexType>
        <attribute name="a" type="string" use="optional"/>
      </complexType>
    </element>

    <!ATTLIST A
      a (x|y|z) "x">

    <element name="A">
      <complexType>
        <attribute name="a" default="x">
          <simpleType>
            <restriction base="string">
              <enumeration value="x"/>
              <enumeration value="y"/>
              <enumeration value="z"/>
            </restriction>
          </simpleType>
        </attribute>
      </complexType>
    </element>

    <!ATTLIST A
      a (x|y|z) #REQUIRED>

    <element name="A">
      <complexType>
        <attribute name="a" use="required">
          <simpleType>
            <restriction base="string">
              <enumeration value="x"/>
              <enumeration value="y"/>
              <enumeration value="z"/>
            </restriction>
          </simpleType>
        </attribute>
      </complexType>
    </element>

    <!ATTLIST A
      a CDATA #FIXED "x">

    <element name="A">
      <complexType>
        <attribute name="a" type="string" fixed="x"/>
      </complexType>
    </element>

    <!ATTLIST A
      a ID #FIXED "x1">

    Similar to CDATA (see above), with type="ID".

    <!ATTLIST A
      a NMTOKEN #FIXED "x1:1">

    Similar to CDATA (see above), with type="NMTOKEN".

    <!ATTLIST A
      a NMTOKENS #FIXED "x1:1 x1:2 x1:3">

    Similar to CDATA (see above), with type="NMTOKENS".

    <!ATTLIST A
      a IDREF #FIXED "x1">

    Similar to CDATA (see above), with type="IDREF".

    <!ATTLIST A
      a IDREFS #FIXED "x1 x2 x3">

    Similar to CDATA (see above), with type="IDREFS".
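
The attribute rows above follow a regular pattern, which the sketch below tries to capture. It is an assumption-laden illustration rather than the mapper itself: the function name to_xsd_attribute is hypothetical, it emits the attribute forms of the final XML Schema Recommendation (use, default="..." and fixed="..."), and enumerations are restricted from string as in the table.

```python
import xml.etree.ElementTree as ET

# DTD attribute type -> XML Schema built-in type of the same meaning.
XSD_TYPES = {
    "CDATA": "string",
    "ID": "ID", "IDREF": "IDREF", "IDREFS": "IDREFS",
    "ENTITY": "ENTITY", "ENTITIES": "ENTITIES",
    "NMTOKEN": "NMTOKEN", "NMTOKENS": "NMTOKENS",
}


def to_xsd_attribute(name, attr_type, default):
    """Map one ATTLIST definition to a schema <attribute> element.
    Hypothetical helper; inputs are the already-tokenized declaration parts."""
    attribute = ET.Element("attribute", {"name": name})
    if attr_type in XSD_TYPES:
        attribute.set("type", XSD_TYPES[attr_type])
    elif attr_type.startswith("(") and attr_type.endswith(")"):
        # Enumerated type: anonymous simpleType restricting string.
        simple = ET.SubElement(attribute, "simpleType")
        restriction = ET.SubElement(simple, "restriction", {"base": "string"})
        for value in attr_type[1:-1].split("|"):
            ET.SubElement(restriction, "enumeration", {"value": value.strip()})
    else:
        raise ValueError("unsupported attribute type: %r" % attr_type)

    if default == "#REQUIRED":
        attribute.set("use", "required")
    elif default == "#IMPLIED":
        attribute.set("use", "optional")
    elif default.startswith("#FIXED"):
        attribute.set("fixed", default[len("#FIXED"):].strip().strip("\"'"))
    else:
        # A plain quoted literal is a default value.
        attribute.set("default", default.strip().strip("\"'"))
    return attribute
```

For example, to_xsd_attribute("a", "NMTOKENS", '#FIXED "x1:1 x1:2 x1:3"') yields an attribute with type="NMTOKENS" and the fixed value, which is the "similar to CDATA" case with only the type changed.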


    Risks/Issues

    1. Development and unit testing are estimated to take 3-4 man-weeks.
    2. Some DTD constructs may not be trivial to map to XML Schema.
    3. In general, using XSLT to transform large XML files requires a lot of memory, but in this case I am assuming that a DTD
    is typically not that big.
    4. Stylesheets need to be modular so that they can be extended.

    Resources

    • XML 1.0 (Third Edition) - http://www.w3.org/TR/REC-xml
    • XML Schema - http://www.w3.org/XML/Schema.html
    • DTD Spec: http://www.w3.org/TR/REC-xml




