7.3  Regular-Expression versus Glob Patterns

Regular-expression pattern matching is similar to glob pattern matching in these ways:

  • Both accept exact matches between the pattern string and the given string.

    However, regular-expression matching also accepts exact matches between the pattern string and a substring of the given string. Substrings, of course, contain consecutive characters. When more than one substring matches, a substring that begins earlier has precedence. Rules for breaking other ties are explained later.

  • Both accept those inexact matches that are accounted for by special characters in the pattern string.

  • Both will match square brackets against a single character – the point of the brackets being to define a set of acceptable characters. For example, [a-z] will match a single lowercase letter and [:;] will match either a colon or a semicolon. (See Glob Patterns.)

    However, regular-expression matching provides another way of specifying the set of acceptable characters. If the character immediately after the [ is a ^, then the match will be with any single character that is not listed. For example, [^&] matches any character except the ampersand and [^a-zA-Z] matches any character except a letter. (Some glob pattern matchers accept this kind of pattern too, but this feature is not universal in the glob world.)

    Another small difference is that the strange acceptance of [z-a] by Tcl's glob pattern matcher is not carried over to regular-expression pattern matching.

  • Both use backslash substitution, called backslash quoting, to permit special characters to appear without their special meanings. For example, \[ in both means the left square bracket itself and is not a special way of matching a single character.
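
The points in the list above can be tried side by side. The following is a minimal sketch, assuming a standard tclsh; the variable names (GlobWhole, RegexSub and so on) are illustrative only and do not come from the text.

```tcl
# Exact matches: both matchers accept them.
set GlobWhole  [string match {abc} abc]    ;# 1
set RegexWhole [regexp {abc} abc]          ;# 1

# Substring matches: only regexp accepts them.
set GlobSub  [string match {b} abc]        ;# 0, a glob must match the whole string
set RegexSub [regexp {b} abc]              ;# 1, b is a substring of abc

# Square brackets define a set of acceptable characters in both.
# The semicolon is braced so that Tcl does not read it as a command separator.
set GlobSet  [string match {[:;]} {;}]     ;# 1
set RegexSet [regexp {[:;]} {;}]           ;# 1

# A ^ immediately after the [ negates the set (regexp).
set NotAmp1 [regexp {[^&]} {&}]            ;# 0, the only character is an ampersand
set NotAmp2 [regexp {[^&]} {a&}]           ;# 1, the a is acceptable

# Backslash quoting removes the special meaning of [ in both.
set GlobQuote  [string match {\[} {[}]     ;# 1
set RegexQuote [regexp {\[} {[}]           ;# 1
```

The patterns are written in braces so that Tcl itself does not try to perform command substitution on the square brackets.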

The special characters for regular-expression pattern matching are a little different from those for glob pattern matching. Here they are:

?*[]-^.+$|()

You have seen how the square brackets and hyphen are used. These are the only globlike symbols in the list. Others which appear to be globlike have different meanings in the regular expression world.

Exercise 7.3a

What will the regular expression "$Pre1_$Pre2_" match after these preassignments?
set Pre1_ {\[0\-9\]}
set Pre2_ "\[0-9]"
Suppose the second assignment is
set Pre2_ "[0-9]"
What happens?

Solution

Tcl's regular-expression pattern matcher is invoked with the regexp command. Here are two forms in which regexp can be used. The next section explains a form in which more arguments are used. Switches are described later in this section.

regexp ?SWITCHES? PATTERN STRING
This command returns 1 (true) or 0 (false) depending on whether PATTERN matches STRING.

regexp ?SWITCHES? PATTERN STRING VARIABLE_NAME
This form is like the one above except that, when there is a match, the matched substring is assigned to VARIABLE_NAME.

Also see the use of regexp with parentheses below in Use Parentheses to Build more Complicated Patterns and Use Parentheses to Extract Subpatterns.

You can try this command on some glob patterns that use square brackets in a way acceptable for regular expressions:

% regexp {[a-z][A-Z]} aX Match
1
% set Match
aX
% regexp {[a-z][A-Z]} AbCdEf Match
1
% set Match
bC
However, this violates my convention for writing regular expressions and so I would write it this way:
% set Letters_ {[a-z][A-Z]}
[a-z][A-Z]
% regexp $Letters_ aX Match
1
% set Match
aX
% regexp $Letters_ AbCdEf Match
1
% set Match
bC

The second example shows one of the differences between regular expressions and globs: regular expressions will match substrings. When, as in this case, more than one substring could match, the one which begins first is chosen.

If you want to force a match with a whole string, it is possible. Two of the special symbols help.

One of these is ^. Although ^ has the meaning described above when it follows the special symbol [, the meaning is different when ^ appears at the beginning of a pattern. At the beginning of a pattern, it matches an imaginary empty substring that appears just before the beginning of the string to be matched. Here are some examples.

% set SmallLetter_ {[a-z]}
[a-z]
% regexp "^$SmallLetter_" AbCdEf Match
0
% regexp "^$SmallLetter_" ab Match
1
% set Match
a
% regexp "^" ab Match
1
% set Match

The first regexp returns 0 because the empty string before AbCdEf is followed by the letter A, which does not match [a-z]. The last regexp command matches the empty string, the one found just before the ab in the string being matched, so Match is set to the empty string.

Remark

This interpretation of what ^ matches, and the fact that "^" matches "ab", are not universal truths in the world of regular-expression pattern matching. For example, my Perl interpreter does not agree with the last of these examples.

Another special symbol that helps force a pattern to match a whole string is $. At the end of a pattern, this matches an imaginary empty string that appears immediately after the string to be matched.

When the symbols ^ and $ are used as just described, they are called anchors because they have the effect of anchoring the matching substring at the beginning or ending of the given string.
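
Used together, the two anchors force a pattern to account for the whole string. The following is a minimal sketch, assuming a standard tclsh; Digit_ and TwoDigits_ are hypothetical names chosen to follow the trailing-underscore convention used above.

```tcl
set Digit_ {[0-9]}

# Inside double quotes, \$ keeps Tcl from treating the dollar sign
# as the start of a variable name. The result is ^[0-9][0-9]$
set TwoDigits_ "^$Digit_$Digit_\$"

set Unanchored [regexp "$Digit_$Digit_" x42y Match1]  ;# 1, Match1 is 42
set Whole      [regexp $TwoDigits_ 42 Match2]         ;# 1, Match2 is 42
set Partial    [regexp $TwoDigits_ x42y]              ;# 0, x and y are not accounted for
set AtEnd      [regexp "$Digit_\$" ab7 Match3]        ;# 1, Match3 is 7
set NotAtEnd   [regexp "$Digit_\$" a7b]               ;# 0, the digit is not at the end
```

Without the anchors the pattern finds 42 inside x42y; with both anchors it matches only when the two digits are the entire string.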

Exercise 7.3b

Rewrite the following with regexp.
string match {Tcl} $Name

Solution

Another special symbol is the period (.), which matches any single character. This is analogous to the use of ? in glob pattern matching.
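
Here is a short sketch of that analogy, assuming a standard tclsh. The anchors make the regular expression account for the whole string, as the glob pattern does; the variable names are illustrative only.

```tcl
set GlobAny   [string match {a?c} abc]   ;# 1, ? stands for the b
set RegexAny  [regexp {^a.c$} abc]       ;# 1, . stands for the b
set RegexDot  [regexp {^a.c$} a.c]       ;# 1, . also matches a literal period
set RegexNone [regexp {^a.c$} ac]        ;# 0, . must match exactly one character
set Quoted    [regexp {^a\.c$} abc]      ;# 0, \. matches only a literal period
```

The last line relies on backslash quoting to turn the period back into an ordinary character.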

Exercise 7.3c

Preassign a subpattern, NoDot_, that matches any character that is not a period.

Solution

To finish up this section, here are the switches for regexp:

-nocase
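This causes letters in STRING to be converted to lowercase before matching begins. The change affects only a copy of STRING used in matching; STRING itself is unchanged. The effect is that lowercase letters in your pattern will match letters of either case in STRING. (In Tcl 8.1 and later, matching with -nocase is case-insensitive in both directions, so uppercase letters in the pattern also match letters of either case.)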

-indices
This works with the second form of regexp. Its effect is to cause a two-number list to be assigned to VARIABLE_NAME: the first number is the index in STRING of the first character of the matching substring and the second number is the index in STRING of its last character.
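
The two switches can be tried as follows. This is a minimal sketch, assuming a standard tclsh (Tcl 8.1 or later); the names Found, Text, Span, First and Last are hypothetical and chosen only for illustration.

```tcl
# -nocase: the lowercase pattern matches the uppercase letters in STRING.
set Found [regexp -nocase {tcl} "I like TCL" Match]
puts "$Found $Match"                      ;# prints: 1 TCL

# -indices: the variable receives the first and last index of the match
# instead of the matched characters themselves.
set Text x42y
set Found [regexp -indices {[0-9][0-9]} $Text Span]
set First [lindex $Span 0]
set Last  [lindex $Span 1]
puts "$Span"                              ;# prints: 1 2
puts [string range $Text $First $Last]    ;# prints: 42
```

Passing the two indices to string range recovers the substring that the form without -indices would have assigned directly.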

Exercise 7.3d

Fill in the question marks.
% regexp -indices "\[a-z]ab"  abab Match
1
% set Match
?
% regexp -indices t$ catbert Match
1
% set Match
?

Solution
