Regular-expression pattern matching is similar to glob pattern matching
in these ways:
- Both accept exact matches between the pattern string and the given string.
However, regular-expression matching also accepts exact matches between the
pattern string and a substring of the given string. Substrings, of course,
contain consecutive characters. When more than one substring matches, a
substring that begins earlier has precedence. Rules for breaking other ties
are explained later.
- Both accept those inexact matches that are accounted for by special
characters in the pattern string.
- Both will match square brackets against a single character, the point
of the brackets being to define a set of acceptable characters. For example,
[a-z] will match a single lowercase letter and [:;] will match
either a colon or a semicolon. (See
Glob Patterns.)
However, regular-expression matching provides another way of specifying the
set of characters that are acceptable matches. If the character immediately
after the [ is a ^, then the match will be with any ASCII
character which is not listed. For example, [^&] matches any
character except the ampersand and [^a-zA-Z] matches any character except
a letter. (Some glob pattern matchers accept this kind of pattern too,
but this feature is not universal in the glob world.)
Another small difference is that the strange acceptance of [z-a] by
Tcl's glob pattern matcher is not carried over to regular-expression pattern
matching.
- Both use backslash substitution, called backslash quoting, to permit
special characters to appear without their special meanings. For example,
\[ in both means the left square bracket itself and is not a special way
of matching a single character.
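The similarities and differences listed above can be tried out directly. The following is only a small sketch: the variable names Lower_ and NotLetter_ and the test strings are made up for illustration and do not come from the text.

```tcl
# Glob matching must cover the whole string; regexp may match a substring.
set Lower_ {[a-z]}
puts [string match $Lower_ abc]   ;# 0: the glob pattern must match all of abc
puts [regexp $Lower_ abc]         ;# 1: the regular expression matches the substring a

# A ^ immediately after the [ negates the set (regular expressions only).
set NotLetter_ {[^a-zA-Z]}
puts [regexp $NotLetter_ abc]     ;# 0: every character is a letter
puts [regexp $NotLetter_ ab7]     ;# 1: 7 is not a letter

# Backslash quoting removes the special meaning of [ in both matchers.
puts [string match {\[x} {[x}]    ;# 1
puts [regexp {\[x} {a[x}]         ;# 1
```

The patterns are written in braces so that Tcl itself does not try to treat the square brackets as command substitution.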
The special characters for regular-expression pattern matching are
a little different from those for glob pattern matching. Here they are:
?*[]-^.+$|()
You have seen how the square brackets and hyphen are used. These
are the only globlike symbols in the list. Others which appear to
be globlike have different meanings in the regular expression world.
Exercise 7.3a -
What will the regular expression "$Pre1_$Pre2_" match
after these preassignments?
set Pre1_ {\[0\-9\]}
set Pre2_ "\[0-9]"
Suppose the second assignment is
set Pre2_ "[0-9]"
What happens?
Solution
Tcl's regular-expression pattern matcher is invoked with the
regexp command. Here are two forms in which regexp can be
used. The next section explains a form in which more arguments are used.
Switches are described later in this section.
regexp ?SWITCHES? PATTERN STRING - This command returns 1 (true) or
0 (false) depending on whether PATTERN matches STRING.
regexp ?SWITCHES? PATTERN STRING VARIABLE_NAME - This form is like
the one above except that, when there is a match, the matched substring is
assigned to VARIABLE_NAME.
Also see the use of regexp with parentheses below in
Use Parentheses to Build more Complicated Patterns and
Use Parentheses to Extract Subpatterns.
You can try this command on some glob patterns that use square
brackets in a way acceptable for regular expressions:
% regexp {[a-z][A-Z]} aX Match
1
% set Match
aX
% regexp {[a-z][A-Z]} AbCdEf Match
1
% set Match
bC
However, this violates my convention for writing regular expressions and
so I would write it this way:
% set Letters_ {[a-z][A-Z]}
[a-z][A-Z]
% regexp $Letters_ aX Match
1
% set Match
aX
% regexp $Letters_ AbCdEf Match
1
% set Match
bC
The second example shows one of the differences between regular expressions
and globs: regular expressions will match substrings. When, as in this
case, more than one substring could match, the one which begins first
is chosen.
If you want to force a match with a whole string, it is possible. Two of
the special symbols help.
One of these is ^. Although ^ has the meaning described above
when it follows the special symbol [, the meaning is different when
^ appears at the beginning of a pattern. At the beginning of a
pattern, it matches an imaginary empty substring that appears just before the
beginning of the string to be matched. Here are some examples.
% set SmallLetter_ {[a-z]}
[a-z]
% regexp "^$SmallLetter_" AbCdEf Match
0
% regexp "^$SmallLetter_" ab Match
1
% set Match
a
% regexp "^" ab Match
1
% set Match
The first regexp returns 0 because the empty string before AbCdEf
is followed by the letter A, which does not match [a-z]. The last
regexp command returns 1 and assigns the empty string to Match. This is
the empty string found just before the ab in the string being matched.
Remark - This interpretation of what ^ matches and the
fact that "^" matches "ab," are not universal truths in the world of
regular-expression pattern matching. For example, my Perl interpreter
does not agree with the last of these examples.
Another special symbol that helps force a pattern to match a whole
string is $. At the end of a pattern, this matches an imaginary
empty string that appears immediately after the string to be matched.
When the symbols ^ and $ are used as just described, they are
called anchors because they have the effect of anchoring the
matching substring at the beginning or ending of the given string.
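As a rough illustration of how the two anchors change a match, here is a short sketch. The subpattern name Pair_ and the test strings are chosen only for this example.

```tcl
set Pair_ {[a-z][A-Z]}

# Unanchored: the earliest matching substring wins.
puts [regexp $Pair_ AbCdE Match]          ;# 1
puts $Match                               ;# bC

# A trailing $ anchors the match to the end of the string.
puts [regexp "$Pair_\$" AbCdE Match]      ;# 1
puts $Match                               ;# dE

# With both anchors the pattern must account for the whole string.
puts [regexp "^$Pair_\$" AbCdE]           ;# 0
puts [regexp "^$Pair_\$" aX]              ;# 1
```

Inside double quotes the $ that is meant as an anchor is written \$ so that Tcl does not look for a variable name after it.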
Exercise 7.3b -
Rewrite the following with regexp.
string match {Tcl} $Name
Solution
Another special symbol is the period . which matches any single
character. This is analogous to the use of ? in glob pattern
matching.
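A brief sketch of the period at work, next to its glob counterpart. The patterns and strings are invented for illustration; the anchors are included so that the whole string has to match.

```tcl
puts [regexp {^a.c$} abc]       ;# 1: the period matches b
puts [regexp {^a.c$} a.c]       ;# 1: the period also matches a literal period
puts [regexp {^a.c$} ac]        ;# 0: the period needs exactly one character
puts [string match a?c abc]     ;# 1: the glob equivalent uses ?

# Backslash quoting turns the period into an ordinary character.
puts [regexp {^a\.c$} abc]      ;# 0
puts [regexp {^a\.c$} a.c]      ;# 1
```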
Exercise 7.3c -
Preassign a subpattern, NoDot_, that matches
any character that is not a period.
Solution
To finish up this section, here are the switches for regexp:
-nocase - This causes letters in STRING to be converted to
lowercase before matching begins. The change only affects a copy of
STRING used in matching; STRING itself is unchanged. The
effect is that lowercase letters in your pattern will match letters
of either case in STRING.
-indices - This works with the second form of regexp. Its
effect is to cause a two-number list to be assigned to
VARIABLE_NAME: the first number is the first index in STRING of
the matching substring and the second number is the last index in STRING
of the matching substring.
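Both switches can be seen in a short sketch like the one below. The patterns, the strings, and the variable name Room are made up for this example.

```tcl
# Without -nocase, a lowercase pattern does not match uppercase letters.
puts [regexp {tcl} "I like TCL" Match]            ;# 0
puts [regexp -nocase {tcl} "I like TCL" Match]    ;# 1
puts $Match                                       ;# TCL, taken from the unchanged string

# With -indices, Match receives the first and last index of the match.
set Room "room 42b"
puts [regexp -indices {[0-9][0-9]} $Room Match]   ;# 1
puts $Match                                       ;# 5 6

# The indices can be used to recover the matched substring.
puts [string range $Room [lindex $Match 0] [lindex $Match 1]]   ;# 42
```

Indices count from 0, so the 4 in the string above sits at index 5.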
Exercise 7.3d -
Fill in the question marks.
% regexp -indices "\[a-z]ab" abab Match
1
% set Match
?
% regexp -indices t$ catbert Match
1
% set Match
?
Solution