B

DocBook and XML

$Revision: 1.1 $

$Date: 2001/08/02 10:22:22 $

XML, the Extensible Markup Language, is a simple dialect of SGML. In the words of the XML specification, “the goal [of XML] is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.”

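XML raises two issues with respect to DocBook:

1. Can existing DocBook SGML instances be used as XML instances?

2. Can the DocBook DTD be expressed as an XML DTD?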

If you have an existing SGML system, and your primary goal is to serve DocBook documents over the Web as XML, only the first of these issues is relevant. As the popularity of XML grows, we will see more and more XML-aware tools that don't implement full ISO 8879 SGML. If your goal is to author DocBook documents with one of this new generation of tools, you will only be able to achieve validity with an XML DocBook DTD.

Although not yet officially adopted by the OASIS DocBook Technical Committee, an XML version of DocBook is available now and provided on the CD-ROM.

DocBook Instances as XML

Most DocBook documents can be made into well-formed XML documents very easily. With few exceptions, valid DocBook SGML instances are also well-formed XML instances. The following areas may need to be addressed.

System Identifiers

It is common for SGML instances to use only a public identifier in document type and parameter entity declarations:

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>

XML requires a system identifier:

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
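One way to see which identifiers a document declares is to ask an XML parser. The following is a minimal sketch using Python's standard xml.parsers.expat module; the helper name doctype_identifiers is an assumption made for illustration and is not part of any DocBook tool. It also shows that a declaration carrying only a public identifier is rejected as XML.

```python
import xml.parsers.expat

XML_CHAPTER = """<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
"""


def doctype_identifiers(text):
    """Return (public_id, system_id) from the document type declaration.

    Raises xml.parsers.expat.ExpatError if the text is not well-formed XML.
    The external DTD is not fetched; expat only reports the identifiers.
    """
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["ids"] = (public_id, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return found.get("ids")
```

Passing the SGML-style declaration (public identifier only) to the same function raises ExpatError, because the XML grammar requires a system literal after the public identifier.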

If you're used to using catalog files to resolve system identifiers, you may be dismayed to learn that system identifiers are required. Because most tools favor system identifiers over public identifiers, all of the portability that was gained by the use of catalog files seems to have been lost. In the long run, it'll be regained by the fact that XML system identifiers can be URNs, which will have a resolution scheme like catalogs, but what about the short run?

Luckily, there are a couple of options. First, you can tell your tools to use the public identifiers even though system identifiers are present. Simply add:

OVERRIDE YES

to your catalog files. Alternatively, you can remap system identifiers with the SYSTEM catalog directive. If you are faced with documents that don't use public identifiers at all, this is probably your only option.
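To make the effect of these two directives concrete, here is a toy resolver in Python. It is only a sketch under stated assumptions: it understands just the OVERRIDE, PUBLIC, and SYSTEM keywords, the function names parse_catalog and resolve are invented for this illustration, and the local path and the example.org URL are hypothetical. Real catalog processors support many more entry types and rules.

```python
import shlex

# Hypothetical catalog; the local path is an assumption for illustration.
CATALOG = '''
OVERRIDE YES
PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "docbook/xml/4.1.2/docbookx.dtd"
SYSTEM "http://example.org/old/docbookx.dtd" "docbook/xml/4.1.2/docbookx.dtd"
'''


def parse_catalog(text):
    """Parse a tiny subset of the catalog format into two lookup tables."""
    override = False
    public, system = {}, {}
    for line in text.splitlines():
        tokens = shlex.split(line)
        if not tokens:
            continue
        keyword = tokens[0].upper()
        if keyword == "OVERRIDE":
            override = tokens[1].upper() == "YES"
        elif keyword == "PUBLIC":
            # Remember the OVERRIDE setting in force when the entry was read.
            public[tokens[1]] = (tokens[2], override)
        elif keyword == "SYSTEM":
            system[tokens[1]] = tokens[2]
    return public, system


def resolve(public, system, public_id=None, system_id=None):
    """Pick a storage location for an external identifier."""
    # A SYSTEM entry remaps a system identifier directly.
    if system_id is not None and system_id in system:
        return system[system_id]
    # A PUBLIC entry applies if there is no system identifier,
    # or if the entry was read while OVERRIDE YES was in force.
    if public_id in public:
        target, override = public[public_id]
        if system_id is None or override:
            return target
    return system_id
```

With OVERRIDE YES, the public identifier wins even though the document also supplies a URL; with OVERRIDE NO, the URL from the document is used unchanged.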

Minimization

If you have used SGML minimization features in your instances:

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter id=chap1><title>Chapter Title</title>
<para>
This <emphasis>paragraph</> is important.
</para>
</chapter>

they will not be well-formed XML instances. In particular, XML:

1. Requires that all attribute values be quoted, so id=chap1 must become id="chap1".

2. Does not allow short tag minimization, so the empty end tag </> must be spelled out as </emphasis>.

XML also forbids tag omission, and there are probably a half dozen or so more exotic examples of minimization that you have used. They're all illegal. The easiest way to remove these minimizations is probably with a tool like sgmlnorm (included in the SP and Jade distributions, on the CD-ROM).

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter id="chap1"><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
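After normalizing, it is worth confirming that the result really is well-formed. The sketch below uses Python's standard expat binding; the function name well_formedness_error is an assumption for illustration. The document type declarations are left out of the two samples so that only the minimization issues are being tested.

```python
import xml.parsers.expat

MINIMIZED = """<chapter id=chap1><title>Chapter Title</title>
<para>
This <emphasis>paragraph</> is important.
</para>
</chapter>
"""

NORMALIZED = """<chapter id="chap1"><title>Chapter Title</title>
<para>
This <emphasis>paragraph</emphasis> is important.
</para>
</chapter>
"""


def well_formedness_error(text):
    """Return None if text is well-formed XML, else the parser's message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None
```

The minimized sample yields an error message, while the normalized one yields None.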

Attribute Default Values

Correct processing of this document may require access to the default attributes:

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
Write to us at:
<address>
90 Sherman Street
Cambridge, MA 02140
</address>
</para>
</chapter>

Here Address expresses that its content is line-specific with an attribute, Format, whose value comes from a default in the DTD rather than from the instance.

Some XML processing environments are going to ignore the doctype declaration in your document, even if it's present. This is relevant when your instance uses elements that have attributes with default values. The default values are expressed in the DTD, but may not be expressed in your instance. In the case of DocBook, there are relatively few of these, and your stylesheet can probably be constructed to do the right thing in either case. (It essentially treats the attributes as if they had implied values.)

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
Write to us at:
<address format="linespecific">
90 Sherman Street
Cambridge, MA 02140
</address>
</para>
</chapter>
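If the attribute is not written out in the instance, the processing code can supply the default itself. The following sketch, in Python with the standard xml.etree.ElementTree module, shows the idea; the table ASSUMED_DEFAULTS and the helper attribute are names invented for this illustration, and the table holds only the one default discussed above, copied by hand.

```python
import xml.etree.ElementTree as ET

# Assumed defaults, copied by hand; only the one discussed above is listed.
ASSUMED_DEFAULTS = {("address", "format"): "linespecific"}

DOC = """<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
Write to us at:
<address>
90 Sherman Street
Cambridge, MA 02140
</address>
</para>
</chapter>
"""


def attribute(element, name):
    """Return the attribute value, falling back to the assumed default."""
    return element.get(name, ASSUMED_DEFAULTS.get((element.tag, name)))


# ElementTree does not read the external DTD, so no defaults are applied.
root = ET.fromstring(DOC)
address = root.find("para/address")
```

Here address.get("format") returns None because the DTD was never consulted, while attribute(address, "format") returns the fallback value.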

Character and SDATA Entities

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>

The DocBook DTD defines all of the standard ISO entities automatically, but the ISO definitions use SDATA, which is not allowed in XML. Eventually, ISO (or someone else) will release official ISO standard entity sets that make reference to the appropriate Unicode character for each entity. Until then, the XML version of DocBook is distributed with an unofficial set.

If you use entities in your document, it may be wise to put declarations for them in the internal subset of each instance, because some XML browsers are going to parse the internal subset but not the external subset. If the entity declarations are in your DTD, and the browser does not parse the external subset, the browser won't know how to display the entities in your document.

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
<!ENTITY trade "&#x2122;">
]>
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>
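The effect of declaring the entity in the internal subset can be observed with a parser that never reads the external subset. Python's standard xml.etree.ElementTree behaves this way, so it makes a convenient stand-in for the kind of browser described above. This is a sketch; the helper name para_text is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

WITH_INTERNAL_SUBSET = """<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
<!ENTITY trade "&#x2122;">
]>
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>
"""


def para_text(document):
    """Return the text of the first para, with entities expanded."""
    root = ET.fromstring(document)
    return root.find("para").text


# The entity is expanded even though the external DTD is never fetched.
text = para_text(WITH_INTERNAL_SUBSET)
```

Without the declaration in the internal subset, the same parser would typically report an undefined entity instead of returning the text.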

Case-Sensitivity

<!DocType Book PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<book><title>Book Title</title>
<chapter><title>Chapter Title</Title>
<para>
Paragraph test.
</para>
<PARA>
A second paragraph.
</PARA>
</chapter>
</book>

With the standard DocBook SGML declaration, DocBook instances are not case-sensitive with respect to element and attribute names. XML is always case-sensitive. As long as you have used the same case consistently, your XML instances will be well-formed, but it may still be advantageous to do some case-folding because it will simplify the construction of stylesheets.

1. Keywords in XML are case-sensitive and must be in uppercase: <!DocType must be written <!DOCTYPE.

2. The name declared in the document type declaration, like all other names, is case-sensitive: Book does not match the root element book.

3. Start and end tags must use the same case: <title> cannot be closed by </Title>.

4. In XML, para is not the same as PARA. Note that this is a validity error (against the XML version of DocBook), but it is not an XML well-formedness error. The use of para and PARA as distinct names is as legitimate as using foo and bar, as long as they are properly nested.

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<book><title>Book Title</title>
<chapter><title>Chapter Title</title>
<para>
Paragraph test.
</para>
<para>
A second paragraph.
</para>
</chapter>
</book>
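Case-folding is easy to automate once the instance is well-formed. The sketch below lower-cases every element and attribute name with Python's standard xml.etree.ElementTree module; the function name fold_case and the sample document are assumptions made for illustration. It cannot repair mismatched start and end tags, which must be fixed before the document will parse at all.

```python
import xml.etree.ElementTree as ET

# Well-formed, but with inconsistent case between elements.
MIXED = """<Book><Title>Book Title</Title>
<Chapter ID="ch1"><TITLE>Chapter Title</TITLE>
<para>Paragraph test.</para>
<PARA>A second paragraph.</PARA>
</Chapter>
</Book>
"""


def fold_case(root):
    """Lower-case all element and attribute names in place."""
    for element in root.iter():
        element.tag = element.tag.lower()
        folded = {name.lower(): value for name, value in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(folded)
    return root


root = fold_case(ET.fromstring(MIXED))
```

After folding, a stylesheet or script only has to match one spelling of each name.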

No #CONREF Attributes

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<indexterm id="idx-bor"><primary>Something</primary></indexterm>
<para>
Paragraph test.
</para>
<indexterm startref="idx-bor">
</chapter>

The StartRef attribute on indexterm and the OtherTerm attribute on GlossSee and GlossSeeAlso are #CONREF attributes.

In SGML terms, this means that when these attributes are used, the content of the element is taken to be the same as the content of the element pointed to by the attribute.

If you have used these attributes, your instance will contain both empty and non-empty versions of these elements: in the example above, the first indexterm has content and the second, which uses StartRef, is empty.

Your best bet is to transform the #CONREF version into an empty tag and let your stylesheet deal with it appropriately.

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<indexterm id="idx-bor"><primary>Something</primary></indexterm>
<para>
Paragraph test.
</para>
<indexterm startref="idx-bor"/>
</chapter>
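What "deal with it appropriately" means depends on the stylesheet, but the core of it is a lookup by ID. The following Python sketch, using the standard xml.etree.ElementTree module, recovers the content that #CONREF used to supply; the function name index_entries is an assumption for illustration, and a dangling startref would raise KeyError in this simplified version.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter><title>Chapter Title</title>
<indexterm id="idx-bor"><primary>Something</primary></indexterm>
<para>
Paragraph test.
</para>
<indexterm startref="idx-bor"/>
</chapter>
"""


def index_entries(root):
    """Yield (primary_text, has_startref) for each indexterm.

    An empty indexterm with startref borrows the content of the
    indexterm whose id it names, mimicking the old #CONREF behavior.
    """
    by_id = {el.get("id"): el for el in root.iter("indexterm") if el.get("id")}
    for term in root.iter("indexterm"):
        source = term
        ref = term.get("startref")
        if ref is not None and len(term) == 0:
            source = by_id[ref]
        yield source.findtext("primary"), ref is not None


entries = list(index_entries(ET.fromstring(DOC)))
```

Both entries report the same primary term; the second is flagged as the one that refers back to the first.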

Only Explicit CDATA-Marked Sections Are Allowed

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
<!ENTITY % draft "IGNORE">
<!ENTITY % sourcecode "CDATA">
]>
<chapter><title>Chapter Title</title>
<![ %draft; [
<para>
Draft paragraph.
</para>
]]>
<para>
The following code is totally out of context:
<programlisting>
<![ %sourcecode; [
if (x < 3) {
  y = 3;
}
]]>
</programlisting>
</para>
</chapter>

Parameter entities are not allowed in the body of XML documents (they are allowed in the internal subset). That rules out both marked sections in this example:

1. XML instances cannot contain IGNORE, INCLUDE, TEMP, or RCDATA marked sections, so the %draft; section cannot be kept as a marked section.

2. CDATA marked sections must use the “CDATA” keyword literally because parameter entities are not allowed, so %sourcecode; must be replaced by CDATA.

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
<![CDATA[
if (x < 3) {
  y = 3;
}
]]>
</programlisting>
</para>
</chapter>

No SUBDOC or CDATA External Entities

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
<!ENTITY sourcecode SYSTEM "program.c" CDATA>
]>
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
&sourcecode;
</programlisting>
</para>
</chapter>

XML instances cannot use CDATA or SUBDOC external entities. One option for integrating external CDATA content into a document is to employ a pre-processing pass that inserts the content inline, wrapped in a CDATA marked section.

SUBDOC entities may be more problematic. If you do not require validation, it may be sufficient to simply put them inline. XML namespaces may offer another possible solution.

The result will be something like this:

<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter><title>Chapter Title</title>
<para>
The following code is totally out of context:
<programlisting>
<![CDATA[
int main () {
..
}
]]>
</programlisting>
</para>
</chapter>
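The pre-processing pass mentioned above can be quite small. Here is a sketch in Python; the function names cdata_section and inline_listing are assumptions for illustration, and reading the external file (program.c in the example) is left to the caller. The one subtlety is that the sequence ]]> cannot appear inside a CDATA marked section, so it has to be split across two adjacent sections.

```python
import xml.etree.ElementTree as ET


def cdata_section(text):
    """Wrap text in a CDATA marked section.

    The sequence "]]>" would end the section early, so it is split
    across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def inline_listing(template, source_text, placeholder="&sourcecode;"):
    """Replace the entity reference with the inlined, wrapped content."""
    return template.replace(placeholder, cdata_section(source_text))


TEMPLATE = "<programlisting>\n&sourcecode;\n</programlisting>"
SOURCE = "if (x < 3 && a[b[0]]> 1) {\n  y = 3;\n}"

listing = inline_listing(TEMPLATE, SOURCE)
# Parsing the result gives back exactly the original source text.
recovered = ET.fromstring(listing).text
```

Because the wrapped text is parsed as character data, markup characters such as < and & in the program source need no further escaping.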

No Data Attributes on Notations

They're not allowed in XML, so don't add any.

No Attribute Value Specifications on Entity Declarations

They're not allowed in XML, so don't add any.

The DocBook DTD as XML

Converting the DocBook DTD to XML is much more challenging than converting the instances. It is probably not possible to construct an XML DTD with exactly the same validation power as the DocBook SGML DTD. The list below identifies most of the issues that must be addressed, and describes how the DocBook XML DTD deals with them:

Comments are not allowed inside markup declarations

Most of them have been moved to comment declarations preceding the markup declaration that used to contain them. A few small, inline comments that seemed like they would be out of context if moved before the declaration were simply deleted.

Name groups are not allowed in element or attribute list declarations

The small number of places in which DocBook uses name groups have been expanded.

There's one downside: DocBook uses %admon.class; in a name group to define the content model and attribute lists for all of the elements in the admonitions class at once. In DocBook XML, this convenience cannot be expressed. If additional admonitions are added, the element and attribute list declarations will have to be copied for them.

No CDATA or RCDATA declared content

Graphic and InlineGraphic have been made EMPTY. The content model for SynopFragmentRef, the only RCDATA element in DocBook, has been changed to (arg | group)+.

No exclusions or inclusions on element declarations

They had to be removed.

In DocBook, exclusions are used to keep elements out of contexts where they make no sense.

Removing these exclusions from DocBook XML means that it is now valid, in the XML sense, to do some things that don't make a lot of sense (like put a Footnote in a Footnote). Be careful.
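Since the DTD no longer catches these cases, a small check outside the parser can. The sketch below, in Python with the standard xml.etree.ElementTree module, looks for an element nested inside another element of the same name, such as a Footnote in a Footnote. The function name nested and the sample document are assumptions for illustration; other exclusions would each need their own rule.

```python
import xml.etree.ElementTree as ET


def nested(root, name):
    """Return the elements called `name` that have an ancestor called `name`."""
    hits = []

    def walk(element, inside):
        if element.tag == name and inside:
            hits.append(element)
        for child in element:
            walk(child, inside or element.tag == name)

    walk(root, False)
    return hits


SAMPLE = (
    "<para>Text<footnote><para>Note"
    "<footnote><para>Nested</para></footnote>"
    "</para></footnote></para>"
)

problems = nested(ET.fromstring(SAMPLE), "footnote")
```

Any element returned here is valid against the XML DTD but would have been rejected by the SGML DTD's exclusions.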

Inclusions in DocBook are used to add the ubiquitous elements (indexterm and BeginPage) unconditionally to a large number of contexts. In order to make these elements available in DocBook XML, they have been added to most of the parameter entities that include #PCDATA. If new locations are discovered where these elements are desired, DocBook XML will be updated.

Elements with mixed content must have #PCDATA first.

The content models of many elements have been updated to make them a repeatable OR group beginning with #PCDATA.

Many declared attribute types (NAME, NUMBER, NUTOKEN, and so on) are not allowed

They have all been replaced by NMTOKEN or CDATA.

No #CONREF attributes allowed.

The #CONREF attributes on indexterm, GlossSee, and GlossSeeAlso were changed to #IMPLIED. The content model of indexterm was modified so that it can be empty.

Attribute default values must be quoted.

Quotes were added wherever necessary.