7.5  Use Parentheses to Build More Complicated Patterns

 

 

We now change the rules so that more complicated regular expressions can be written:

A quasichar may be replaced with an entire pattern if that pattern is placed inside parentheses and the resulting overall pattern does not apply a repeater to a pattern that can match an empty string.

In the previous section, we built regular expressions from quasichars, anchors, repeaters, and branches. The rules we gave for those regular expressions did not really require that quasichars only match single characters. That just made the rules easier to explain. All that mattered was that a quasichar could be tested to see if it matches a substring beginning at a definite place. A pattern, too, can be tested to see if it matches a substring beginning at a definite place. So, there is no reason not to let quasichars be patterns.

Therefore, we do let quasichars be patterns, but we insist that such quasichar patterns be surrounded with parentheses to keep things unambiguous.

Explaining why a quasichar pattern that can match an empty string cannot be followed by a repeater is more difficult. After all, the theory says that the * repeater is idempotent, which should mean that a** is the same as a*. Why, then, should the practice forbid a** or (a*)*? I have not looked at the code to see why, but I suppose it has something to do with avoiding infinite recursion or an infinite loop. Whatever the reason, theory and practice differ here. However, the divergence is not very consequential.
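
Whether a particular interpreter really rejects such a pattern can be probed directly. The following is only a sketch, not something taken from the matcher's code: it makes no assumption about the outcome, which may depend on the Tcl version in use, and the variable names Failed and Result are ours. It relies on catch, which returns 0 when the enclosed command succeeds and a nonzero code when it raises an error.

```tcl
# Probe a repeater applied to a pattern that can match the empty string.
# Braces keep Tcl from substituting anything inside the pattern.
set Failed [catch {regexp {(a*)*} aaa Match} Result]
if {$Failed} {
    # Result holds the error message from regexp.
    puts "rejected: $Result"
} else {
    # Result holds the value regexp returned (1 or 0).
    puts "accepted: regexp returned $Result"
}
```

Either branch is informative: an error message shows the restriction is enforced, while a returned value shows the interpreter tolerates the pattern.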

Now for an example. Consider this,

x*
which matches zero or more copies of the letter x and this,
cat|dog
which matches "cat" or "dog." If we replace the quasichar x with the pattern in parentheses, we get
(cat|dog)*
which matches zero or more consecutive substrings, each of which is "cat" or "dog."

To be even more concrete,

regexp "(cat|dog)*" catdogcatbert Match
will return true and set Match to catdogcat.
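
As a quick check of this example, here is a minimal sketch. The variable names Found and Empty are ours and are not part of the example above. The second call illustrates a consequence of "zero or more": the pattern also succeeds when no copy of "cat" or "dog" is present, and then matches an empty string.

```tcl
# The parenthesized branch is repeated as many times as possible.
set Found [regexp "(cat|dog)*" catdogcatbert Match]
puts "$Found <$Match>"        ;# prints: 1 <catdogcat>

# Zero copies also count as a match, so this succeeds with an empty match.
set Found [regexp "(cat|dog)*" bert Empty]
puts "$Found <$Empty>"        ;# prints: 1 <>
```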

Exercise 7.5a

Which of the following will return true? Of those that do, what is assigned to the variable Match? Of those that do not, why?

set NoLetter_ {[^A-Za-z]}
set OkChar_ {[a-z@\.]}
regexp "(cat | dog)*bert"  catdogbert Match
regexp "($NoLetter_+|nil) + ($NoLetter_+|nil)" "Answer: 2.6 + nillem" Match
regexp -nocase "^(From:|To:) *$OkChar_+$" \
       "From: jazimmer@acm.org\n" \
       Match

Solution

Here is a short example of the power of parentheses. Recall that the Tcl pattern matcher interprets ^ as an empty string just before the first character of the string you are trying to match. In other words, ^ is not just a control character the way ( is. Instead, ^ is seen as matching something. Now, consider the following,

set LineBrk_ "\n"
regexp "(^|$LineBrk_)To:" $Str Match
This will match the first occurrence of "To:" that is immediately preceded by the start of the given string or by a break between lines. In other words, it matches the first occurrence of "To:" at the beginning of a line.
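
To see this at work, here is a small sketch with a made-up value for Str. The header text is hypothetical and chosen so that the first "To:" sits in the middle of a line. Note that when the match follows a line break, the line break itself is part of what is assigned to Match.

```tcl
set LineBrk_ "\n"
# Hypothetical input: the first "To:" is mid-line, the second begins a line.
set Str "Cc: To: nobody\nTo: somebody\n"
set Found [regexp "(^|$LineBrk_)To:" $Str Match]
puts $Found                    ;# prints: 1
puts [string length $Match]    ;# prints: 4, the line break plus "To:"
```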

Exercise 7.5b

Finish implementing this procedure,
proc getSummary String { ... }
String is viewed as a sequence of lines. Lines are separated with the \n character. There may be any number of lines. The last line may, or may not, end with a \n.

The purpose of getSummary is to return the complete line that begins with the word "Summary" – not including any \n. "Summary" may be indented. If the word "Summary" begins more than one line, then the first one is returned. If the word "Summary" begins no lines, then the empty string is returned.

To discover that "Summary" begins a line, you have to make sure the "S" is the very first letter or follows an end-of-line character. This may get an unwanted \n into your match. You can get rid of it with a string action. (There is another way to accomplish this match, which is described in the next section. Use it if you like.)
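
One string action that could remove such a leading \n is string trimleft. The sketch below shows only that single step on a made-up value. It is not the exercise's solution, and the variable name Clean is ours.

```tcl
# Hypothetical match value that begins with an unwanted line break.
set Match "\nTo:"
# Strip any leading line-break characters; the rest is left untouched.
set Clean [string trimleft $Match "\n"]
puts $Clean     ;# prints: To:
```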

Solution

 

 
