Read pages 162-165, and pages 172-177 of Learning XML.
All XML processors are required to recognize the following five entities:
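&amp; (the ampersand, &)
&lt; (the less-than sign, <)
&gt; (the greater-than sign, >)
&apos; (the apostrophe, ')
&quot; (the double quotation mark, ")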
Those are the only named entities that XML understands without a DTD.
For any other character, you can use a numeric character reference,
written in either decimal (&#241;) or hexadecimal (&#xF1;).
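As a quick check of the two numeric forms, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element name word is made up for the illustration; it is not part of any DTD in these notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document: the same character written as a
# decimal reference (&#241;) and as a hexadecimal reference (&#xF1;),
# followed by two of the five predefined entities.
doc = "<word>&#241; &#xF1; &amp; &lt;</word>"

# The parser replaces every reference with the character it stands for.
text = ET.fromstring(doc).text
print(text)
```

Both numeric forms come back as the same character, so the choice between decimal and hexadecimal is purely a matter of readability.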
This is less than satisfactory. Quick: what do these entities represent
when displayed? &#209; &#225; &#233; &#237; &#243; &#xF3; &#250; &#161; and &#191;.
All right, what about these:
&ntilde; &aacute; &eacute; &iacute; &oacute; &uacute; &iexcl; and &iquest;.
That's right - those are the entities for doing Spanish text.
Here's how you define them. (We'll do them all in decimal to
be consistent.)
<!ENTITY ntilde "&#241;">
<!ENTITY aacute "&#225;">
<!ENTITY eacute "&#233;">
<!ENTITY iacute "&#237;">
<!ENTITY oacute "&#243;">
<!ENTITY uacute "&#250;">
<!ENTITY iexcl "&#161;">
<!ENTITY iquest "&#191;">
To write the words ¡Acción en español!, we'd use the entities as follows:
&iexcl;Acci&oacute;n en espa&ntilde;ol!
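To see the entities expand, the sketch below puts three of the declarations into the internal subset of a small document and parses it with Python's xml.etree.ElementTree, which expands entities declared there. The element name greeting is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# The entity declarations sit in the internal subset of the DOCTYPE.
# "greeting" is a hypothetical element name used only for this sketch.
doc = """<!DOCTYPE greeting [
  <!ENTITY ntilde "&#241;">
  <!ENTITY oacute "&#243;">
  <!ENTITY iexcl  "&#161;">
]>
<greeting>&iexcl;Acci&oacute;n en espa&ntilde;ol!</greeting>"""

# Each &name; reference is replaced by the declared replacement text.
greeting = ET.fromstring(doc).text
print(greeting)
```

An entity that is used but never declared makes the parser raise an error, which is why the declarations have to be visible to every document that uses them.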
Of course, you may use entities for any abbreviation you want:
<!ENTITY dac "De Anza College">
<!ENTITY cis "Computer and Information Science">
<!ENTITY fhda "Foothill-De Anza">
In addition to general entities, which are used in an XML document,
there are also parameter entities, which are used as
“shortcuts” within the DTD itself. As long as we're talking
about the Spanish entities, they'd clearly be useful in many different
DTDs. Parameter entities let you “include” other files.
Let's say we put all the entities for the Spanish characters into a
file called spanish.ent. By declaring that file as a parameter entity
and referencing it at the top of the wrestling club DTD, we can then use
the easy-to-read entities for a Spanish translation of the database.
<!ENTITY % spanish SYSTEM "spanish.ent">
%spanish;

<!ELEMENT club-database (association+) >
<!ELEMENT association (club+) >
<!ATTLIST association
    id ID #REQUIRED >
Note: If you want to include files within an XML file rather than the DTD, use general entities. For example, if you have a book divided into chapters, you can put each chapter into a separate file, and use general entities to include them. This example uses the internal subset of the DTD, as described on pages 176-177 of Learning XML.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book SYSTEM "/usr/local/mybook/docbook.dtd" [
  <!ENTITY ch01 SYSTEM "ch01.xml">
  <!ENTITY ch02 SYSTEM "ch02.xml">
  <!ENTITY ch03 SYSTEM "ch03.xml">
]>
<book>
  <title>My Book</title>
  <subtitle>An Example of Including XML Files</subtitle>
  &ch01;
  &ch02;
  &ch03;
</book>
The other use of parameter entities is to modularize code. For example, if you have a genealogy DTD, there is quite a bit of duplicated markup:
<!ELEMENT birth (year, month, day)>
<!ELEMENT marriage (person-ref, year, month, day)>
<!ELEMENT death (year, month, day)>
Using a parameter entity eliminates the duplication and makes the DTD easier to read. You may also use a parameter entity for a repeated set of attributes, as shown on page 178 of Learning XML.
<!ENTITY % date "year, month, day">
<!ELEMENT birth (%date;)>
<!ELEMENT marriage (person-ref, %date;)>
<!ELEMENT death (%date;)>
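To make the substitution concrete, here is a toy Python sketch that expands the %date; references by plain text replacement. It only illustrates the idea: a real DTD parser does this expansion itself (and handles details such as whitespace padding that this sketch ignores), and the regular expressions are assumptions that cover just this simple case.

```python
import re

# Toy illustration only: a real DTD parser performs this expansion itself.
dtd = """<!ENTITY % date "year, month, day">
<!ELEMENT birth (%date;)>
<!ELEMENT marriage (person-ref, %date;)>
<!ELEMENT death (%date;)>"""

# Collect the parameter entity declarations: name -> replacement text.
entities = dict(re.findall(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)">', dtd))

# Replace each %name; reference in the remaining declarations.
expanded = [
    re.sub(r"%([\w.-]+);", lambda m: entities[m.group(1)], line)
    for line in dtd.splitlines()
    if not line.startswith("<!ENTITY")
]
print("\n".join(expanded))
```

The printed declarations are the same three ELEMENT declarations as the duplicated version, which is the point: the parameter entity changes how the DTD is written, not what it means.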
Some people invent parameter entities to make the content of elements or attributes clearer. For example, in the weather report, we might wish to let document writers know that temperatures can have decimals, but water reservoir information must be integers.
<!ENTITY % integer "#PCDATA">
<!ENTITY % float "#PCDATA">
<!ENTITY % text "CDATA">

<!ELEMENT report (temperatures, water-banks)>

<!ELEMENT temperatures (city+)>
<!ELEMENT city (max, min)>
<!ATTLIST city name %text; #REQUIRED>
<!ELEMENT max (%float;)>
<!ELEMENT min (%float;)>

<!ELEMENT water-banks (reservoir+)>
<!ELEMENT reservoir (current, capacity)>
<!ATTLIST reservoir name %text; #REQUIRED>
<!ELEMENT current (%integer;)>
<!ELEMENT capacity (%integer;)>
Validators that use DTDs can't enforce this; you could still write the following, and the validator would think everything is fine. Newer methods of writing grammars and their validators can do this enforcement; we'll see it later.
<current>five hundred</current>
<capacity>320.5</capacity>
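Until a grammar language does the checking for us, the only option is to test the content in application code. The following Python sketch shows one way that might look; the helper is_integer and the reservoir name Example are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment that a DTD validator would accept without complaint.
reservoir = ET.fromstring(
    "<reservoir name='Example'>"
    "<current>five hundred</current>"
    "<capacity>320.5</capacity>"
    "</reservoir>"
)

def is_integer(text):
    """Return True if text is an optionally signed run of ASCII digits."""
    return re.fullmatch(r"[+-]?[0-9]+", (text or "").strip()) is not None

# Collect the names of the child elements whose content is not an integer.
problems = [child.tag for child in reservoir if not is_integer(child.text)]
print(problems)
```

Both elements are reported, even though the DTD treats them as perfectly valid #PCDATA.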
The examples on pages 173-175 of Learning XML explain conditional
sections nicely. The only additional thing to note is that, in a DTD, the
first definition is the one that counts, and the internal
subset is parsed before any external DTD. Thus, in the example of the
disclaimer, if %use-disclaimer is set to INCLUDE, the DTD will use the
first definition of disclaimer, not the empty string. You may declare an
ATTRIBUTE or ENTITY more than once (only the first declaration takes
effect), but not an ELEMENT.
As mentioned before, DTDs are not the only way to specify an XML grammar. There are several other candidates, and I've decided to go with Relax NG (RNG) rather than the World Wide Web Consortium's XML Schema. In my opinion, XML Schema is the unfortunate result that occurs when a group of highly intelligent and well-intentioned designers try to create a notation that will be all things to all people.
The following material is not in Learning XML. Much of it has been derived from the Relax NG tutorial, which is online at http://www.oasis-open.org/committees/relax-ng/tutorial.html.
Let's take a very simple grammar: an address book consists of
zero or more cards, each of which consists of a name and
email address.
Here's the specification in RNG. The first line has an xmlns
attribute, which we will discuss in a future lecture. The <start>
element tells a validator where to start validating; i.e., which
element is the root element. The rest is a pattern that tells what
a valid document should look like. Relax NG works by
specifying a pattern for structure and content of valid documents.
Any document that matches the pattern is valid; any document that
doesn't, isn't.
If you've read the tutorial, you'll notice that they don't have the
<grammar> and <start> elements. They aren't necessary
for a “self-contained” example like this one, but
as soon as we start specifying more complex grammars, we'll need them.
So why not now?
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="addressBook">
      <zeroOrMore>
        <element name="card">
          <element name="name">
            <text/>
          </element>
          <element name="email">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
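To get a feel for which documents match this pattern, here is a hand-written Python check that mirrors it. This is only a sketch and not a Relax NG validator: the function name matches_address_book and the sample data are made up, and the check ignores attributes and stray text that a real validator would reject.

```python
import xml.etree.ElementTree as ET

def matches_address_book(xml_text):
    """Hand-written check mirroring the pattern; not a Relax NG validator."""
    root = ET.fromstring(xml_text)
    if root.tag != "addressBook":
        return False
    for card in root:
        # zeroOrMore: every child of the root must be a card.
        if card.tag != "card":
            return False
        # Each card holds exactly a name followed by an email.
        if [child.tag for child in card] != ["name", "email"]:
            return False
        # <text/> means name and email may not contain child elements.
        if any(len(child) for child in card):
            return False
    return True

one_card = ("<addressBook><card><name>John Smith</name>"
            "<email>js@example.com</email></card></addressBook>")
no_email = "<addressBook><card><name>John Smith</name></card></addressBook>"

print(matches_address_book(one_card))          # True
print(matches_address_book("<addressBook/>"))  # True: zero cards is allowed
print(matches_address_book(no_email))          # False: email is required
```

Note that the empty address book is accepted, because the pattern says zero or more cards.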
If you want to require at least one <card> element,
replace <zeroOrMore> with <oneOrMore>. An optional element is enclosed
in an <optional> element. So, if we want a card to be
able to contain an optional <note> element, we'd
have this specification (the added material is the <optional> element
and its contents):
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="addressBook">
      <zeroOrMore>
        <element name="card">
          <element name="name">
            <text/>
          </element>
          <element name="email">
            <text/>
          </element>
          <optional>
            <element name="note">
              <text/>
            </element>
          </optional>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
Let's say that we can contact someone either by email or by phone
number (but not both). We'd modify our specification with a
<choice> specification:
<choice>
  <element name="email">
    <text/>
  </element>
  <element name="phone">
    <text/>
  </element>
</choice>
Now let's say that we can either have a name (for example, a company
name) or a first name/last name pair. The first pattern below will
not work. It would allow only a name, a first name, or a last name,
but never a first name and a last name together. The second pattern
groups the first and last name together with the <group> specifier,
and it works great.
<choice>
  <element name="name">
    <text/>
  </element>
  <element name="firstname">
    <text/>
  </element>
  <element name="lastname">
    <text/>
  </element>
</choice>
<choice>
  <element name="name">
    <text/>
  </element>
  <group>
    <element name="firstname">
      <text/>
    </element>
    <element name="lastname">
      <text/>
    </element>
  </group>
</choice>
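The difference between the two patterns can be spelled out by listing the child sequences each one accepts. The Python sketch below does that by hand; it is not a Relax NG validator, and the wrapper element who and the sample names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Child sequences each pattern accepts, written out by hand.
WITHOUT_GROUP = (["name"], ["firstname"], ["lastname"])
WITH_GROUP = (["name"], ["firstname", "lastname"])

def child_tags(xml_text):
    """Return the tag names of the root element's children, in order."""
    return [child.tag for child in ET.fromstring(xml_text)]

# "who" is a hypothetical wrapper element used only for this sketch.
full_name = "<who><firstname>Ana</firstname><lastname>Ruiz</lastname></who>"
company = "<who><name>De Anza College</name></who>"

print(child_tags(full_name) in WITHOUT_GROUP)  # False: pair is rejected
print(child_tags(full_name) in WITH_GROUP)     # True: group allows the pair
print(child_tags(company) in WITH_GROUP)       # True: a single name still works
```

Without the group, each alternative in the choice is a single element, so the first name/last name pair never matches.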