Relax NG lets you create patterns of elements, attributes, and content and then test them against documents. Regular expressions let you create patterns of characters and test strings against them. For example, anything that matches three digits, a dash, three more digits, another dash, and four digits is considered to be a valid US phone number. But let's start out with simple regular expressions.
If I specify a pattern with plain characters in it, the pattern
will match that string, and only that string. Thus
<param name="pattern">bat</param> will be valid only if the content
or attribute value to which it's attached is exactly the word bat.
Note to Perl programmers: Relax NG patterns are presumed to be
anchored with ^ at the start and $ at the end.
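The implicit anchoring can be tried out away from a Relax NG validator. Here is a minimal sketch in Python, assuming that re.fullmatch is an acceptable stand-in for a Relax NG pattern check. Python's regular expression dialect is not identical to the one Relax NG uses, and the helper name is_valid is made up for this illustration.

```python
import re

def is_valid(pattern, value):
    """Return True only if the whole value matches the pattern."""
    # re.fullmatch requires the entire string to match, which is roughly
    # what the implied ^ and $ anchors do in a Relax NG pattern.
    return re.fullmatch(pattern, value) is not None

print(is_valid("bat", "bat"))      # True
print(is_valid("bat", "batman"))   # False
print(is_valid("bat", "acrobat"))  # False

# An unanchored search, by contrast, finds "bat" inside a longer string.
print(re.search("bat", "acrobat") is not None)  # True
```

The last line shows the behavior that a Perl programmer might expect by default, and that Relax NG does not give you.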
Note: from here on in, we'll leave out the <param> tags; they're the
same in all the examples. We'll just show the pattern itself.
Now let's say I wanted to match any of the words bat, bet, bit, or
but. I would specify this pattern: b[aeiu]t. The characters in the
square brackets are called a character class, and they tell the
pattern matcher to match any one of those characters. That's one
character. The preceding pattern will not match beat, beet, bait, or
beaut.
We'll see how to get around this problem later.
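To see the one-character rule in action, here is a small Python sketch. It assumes re.fullmatch as an approximation of Relax NG's anchored matching; the word list is simply the set of words mentioned above.

```python
import re

pattern = re.compile(r"b[aeiu]t")

words = ["bat", "bet", "bit", "but", "beat", "beet", "bait", "beaut"]

# Keep only the words that match the pattern from start to end.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['bat', 'bet', 'bit', 'but']
```

The four words with two or more vowels are rejected because the character class stands for exactly one character.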
Let's say I wanted a pattern that matched any uppercase letter
followed by any even digit. I could write this pattern:
[ABCDEFGHIJKLMNOPQRSTUVWXYZ][02468], but there's an easier way to
specify a contiguous range: [A-Z][02468]. You can have as many ranges
as you like within a single set of square brackets. The following
character class will match any uppercase letter A through G,
lowercase letter r through u, or the letter m: [A-Gr-um]. Order
doesn't matter; we could equally well have specified it as [mr-uA-G].
If you want to match a hyphen as part of a character class, put it at
the beginning or at the end. The following will match the letter A,
C, or a hyphen: [-AC], but this one matches A, B, or C and no hyphen:
[A-C].
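The range and hyphen rules can be checked with a short Python sketch. As before, re.fullmatch is assumed to be a reasonable approximation of Relax NG's anchored matching, and the helper name accepted and the candidate strings are invented for the example.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that match the pattern in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Uppercase letter followed by an even digit
print(accepted(r"[A-Z][02468]", ["A2", "Z8", "a2", "A3", "AB2"]))
# ['A2', 'Z8']

# Several ranges in one class; the order inside the brackets doesn't matter
letters = ["A", "G", "H", "m", "s", "v"]
print(accepted(r"[A-Gr-um]", letters))  # ['A', 'G', 'm', 's']
print(accepted(r"[mr-uA-G]", letters))  # ['A', 'G', 'm', 's']

# A leading hyphen is a literal hyphen; a hyphen in the middle makes a range
print(accepted(r"[-AC]", ["A", "B", "C", "-"]))  # ['A', 'C', '-']
print(accepted(r"[A-C]", ["A", "B", "C", "-"]))  # ['A', 'B', 'C']
```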
If you want to match anything except a vowel, you use a caret (^, the
up arrow) at the beginning of the character class: [^aeiou]. Note
that this will match any single character other than those five
lowercase letters; it will match the letter b as well as a comma or
the digit 7.
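A quick Python sketch makes the point about "anything except" concrete. It assumes re.fullmatch as a stand-in for Relax NG's anchored matching, and the test characters are chosen just for illustration.

```python
import re

pattern = re.compile(r"[^aeiou]")

# Map each sample to whether the negated class accepts it.
samples = ["b", ",", "7", "a", "E", "bc"]
results = {s: bool(pattern.fullmatch(s)) for s in samples}
print(results)
# {'b': True, ',': True, '7': True, 'a': False, 'E': True, 'bc': False}
```

Note that uppercase E gets through, since only the lowercase vowels are excluded, and that bc fails because the class still stands for a single character.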
Some character classes are so useful that abbreviations have been developed for them:
. - (dot) any character at all except newline
\s - a whitespace character
\d - a digit (effectively [0-9])
\w - a word character (not a punctuation, separator, or “other”)
\i - initial character of an XML name
\c - any character that can appear in an XML name
Their inverses are specified by \S, \D, \W, \I, and \C.
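Some of these abbreviations can be tried in Python, with caveats. The sketch below assumes Python's re module as an approximation: it has no \i or \c (those are specific to the XML-oriented dialect), and its \w and \d may differ from Relax NG's at the edges (for instance in how the underscore is treated). The helper name matches_one is made up for the example.

```python
import re

def matches_one(pattern, char):
    """Return True if the single-character string matches the class."""
    return re.fullmatch(pattern, char) is not None

print(matches_one(r"\d", "7"))   # True: a digit
print(matches_one(r"\d", "x"))   # False
print(matches_one(r"\s", "\t"))  # True: a tab is whitespace
print(matches_one(r"\w", "q"))   # True: a word character
print(matches_one(r"\w", "!"))   # False: punctuation

# The uppercase forms are the inverses.
print(matches_one(r"\D", "x"))   # True: not a digit
print(matches_one(r"\S", " "))   # False: a space is whitespace

# The dot matches any character except a newline.
print(matches_one(r".", "z"))    # True
print(matches_one(r".", "\n"))   # False
```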
We could write the pattern for that phone number as
\d\d\d-\d\d\d-\d\d\d\d, but we can also use quantifiers to make our
job easier.
You can specify a quantifier by following a character (or character
class) with one of:
{n} | exactly n occurrences |
{n,m} | at least n but no more than m occurrences |
{n,} | n or more occurrences |
+ | one or more occurrences (same as {1,}) |
? | zero or one occurrence (same as {0,1}) |
* | zero or more occurrences (same as {0,}) |
This information lets us rewrite the phone number as
\d{3}-\d{3}-\d{4}. The pattern b[aeiou]+t will match bat, beet,
beaut, and beaieueaot. Hey, it's not a perfect world.
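The two ways of writing the phone number pattern can be compared side by side. This is a minimal Python sketch, assuming re.fullmatch as an approximation of Relax NG's anchored matching; the phone numbers and variable names are invented for the example.

```python
import re

phone_long = r"\d\d\d-\d\d\d-\d\d\d\d"
phone_short = r"\d{3}-\d{3}-\d{4}"

numbers = ["408-555-1212", "408-5551-212", "4085551212"]

# Both spellings of the pattern should accept exactly the same strings.
long_ok = [n for n in numbers if re.fullmatch(phone_long, n)]
short_ok = [n for n in numbers if re.fullmatch(phone_short, n)]
print(long_ok)   # ['408-555-1212']
print(short_ok)  # ['408-555-1212']

# The + quantifier allows one or more vowels, sensible or not.
words = ["bt", "bat", "beet", "beaut", "beaieueaot", "best"]
vowel_words = [w for w in words if re.fullmatch(r"b[aeiou]+t", w)]
print(vowel_words)  # ['bat', 'beet', 'beaut', 'beaieueaot']
```

Here bt is rejected because + requires at least one vowel, and best is rejected because s is not in the character class.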
Let's say we want to make the first three digits and hyphen of a
phone number optional. We have to group those items together in
parentheses and follow the group with a question mark:
(\d{3}-)?\d{3}-\d{4}
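Here is a short Python sketch of the optional group at work, again assuming re.fullmatch as an approximation of Relax NG's anchored matching. The sample numbers are made up.

```python
import re

pattern = re.compile(r"(\d{3}-)?\d{3}-\d{4}")

samples = ["408-555-1212", "555-1212", "408-555", "-555-1212"]

# The group (\d{3}-) is taken as a unit: either all of it appears or none.
accepted = [s for s in samples if pattern.fullmatch(s)]
print(accepted)  # ['408-555-1212', '555-1212']
```

The string -555-1212 fails because the question mark applies to the whole group; a hyphen cannot appear without its three digits.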
You use the vertical bar | symbol to specify choices.
For example, let's say you want to specify that the src attribute of
an <img> element must end with .jpg, .gif, or .png:
<attribute name="src">
  <data type="string">
    <param name="pattern">.+\.(jpg|gif|png)</param>
  </data>
</attribute>
Analyzing the pattern one section at a time: the attribute must
match one or more of any kind of character (.+) followed by a period
(\.). Notice that we need a backslash to indicate that this isn't the
dot-that-means-any-character-at-all, but a real dot. If you want to
match an actual parenthesis, star, question mark, plus sign, vertical
bar, or backslash, those must also be preceded by a backslash.
Finally, we have a group of alternatives separated by vertical bars.
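The image pattern can be exercised on a few file names with a Python sketch. It assumes re.fullmatch as an approximation of how a Relax NG validator applies the pattern to the whole attribute value; the file names are invented for the example.

```python
import re

pattern = re.compile(r".+\.(jpg|gif|png)")

sources = [
    "logo.png",
    "photos/cat.jpg",
    ".gif",           # nothing before the dot, so .+ has nothing to match
    "diagram.svg",    # extension is not one of the alternatives
    "notes.jpg.txt",  # .jpg is not at the end
    "picpng",         # no real dot before png
]

accepted = [s for s in sources if pattern.fullmatch(s)]
print(accepted)  # ['logo.png', 'photos/cat.jpg']
```

The last rejected name shows why the backslash matters: with an unescaped dot in that position, picpng would have been accepted.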
This is by no means an exhaustive explanation of regular expressions, but it should be enough to get you started, and is certainly enough to let you construct useful and non-trivial patterns.