7.2 Character Set, Quoting, and Style

Tcl regular expressions describe sets of strings of ASCII characters. You already know how to represent ASCII characters in Tcl; this is discussed above in "More about Substitution." For example, the statement

set Salutation "hi there\n"

assigns a string containing two words, a blank, and an end-of-line symbol to the variable Salutation. You can pass arbitrary ASCII characters to Tcl's regular-expression commands by writing them the Tcl way. Just make sure your arguments are in quotes and not in curly brackets. If your arguments are in curly brackets, the Tcl interpreter leaves the backslashes alone, and it is the regular-expression command that must do the backslash substitution. The first Tcl version whose regular-expression commands do backslash substitution is (the currently experimental) version 8.1.

Another point concerning character sets is that Tcl has special characters that have to be protected with backslashes if they are to appear in arguments surrounded by quotes. Regular expressions also have special characters, and these have to be protected with backslashes whenever they are passed to a regular-expression command without their special meaning. You can avoid the confusion of two different sets of special characters by simply not involving the Tcl interpreter, i.e., by placing all your arguments in curly brackets. However, if you do this and you are working with a version of Tcl earlier than 8.1, you cannot work with nonprintable characters.

This problem also exists with glob patterns. I chose to ignore it in the previous chapter by insisting that all glob arguments be placed in curly brackets. One reason I did so is that I prefer to use regular expressions when things start to get complicated.

The question of how to deal with two sets of special characters is more serious for those who use the Tcl extension named Expect. This is because Expect users do a lot of pattern matching on the strings of characters that computers send to terminals.
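As a brief aside, the difference between quoted and braced arguments can be seen in a short sketch. The variable names Quoted, Braced, and Price are made up for this illustration; only behavior that is the same in every Tcl version is shown.

```tcl
# In quotes, the Tcl interpreter performs backslash substitution
# before the command ever sees the argument.
set Quoted "hi there\n"
# In curly brackets, the characters are passed through untouched.
set Braced {hi there\n}

puts [string length $Quoted]   ;# 9: the \n became a single newline character
puts [string length $Braced]   ;# 10: the backslash and the n are two characters

# A literal dollar sign is special to both Tcl and the regular-expression
# parser, so a quoted pattern needs two levels of protection.
set Price {cost: $5}
puts [regexp "\\\$5" $Price]   ;# 1: Tcl turns \\\$ into \$ for the parser
puts [regexp {\$5} $Price]     ;# 1: the parser receives \$ directly
```

Both regexp calls hand the same two-character sequence, a backslash followed by a dollar sign, to the regular-expression parser. The quoted form simply needs more backslashes to get it there.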
Because the strings that Expect must match often contain nonprintable characters, Expect users very often cannot use curly brackets. (This will remain so until the experimental Tcl 8.1 is finished and has percolated its way into Expect.) Thus it is easy to understand why Don Libes, the creator of Expect, urges you to use quotes. He wants you to work in a consistent environment, even if that environment has two sets of special characters to contend with. Unfortunately, working exclusively with quotes can have you writing such commands as

expect -re "(%|$|\\\$a) $"

so that the regular-expression processor will see

(%|$|\$a) $

My own method is somewhat different. It is motivated by the observation that regular expressions tend to be messy and difficult to get right. There was a time when we thought the same thing about almost all programming. Then Edsger Dijkstra wrote a letter that said, essentially, "You know, I have noticed that programmers who organize their code into neat blocks get their work done faster and have fewer bugs than programmers who do not." He then pointed to the goto statement as a license to avoid neat blocks. It is my contention that programmers who structure their regular expressions by preassigning relevant subpatterns to variables will get their patterns done faster and with fewer bugs. A one-line regular expression is a license to avoid neat blocks.

Preassigning relevant subpatterns to variables is also a solution to the "brackets or quotes" dilemma. Follow Don Libes' rule: place all your regular expressions in quotes. But preassign any part of any regular expression that needs a backslash for any reason. When you do a preassignment, you will be using the set command, and you will have a choice of passing the subpattern to set surrounded by curly brackets or by quotes. Choosing quotes means your focus is on processing by the Tcl interpreter. Choosing curly brackets means your focus is on processing by the regular-expression parser.
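The following is a minimal sketch of that rule, not a definitive recipe. It builds a prompt-matching pattern of the same flavor as the Expect command above, but not the identical pattern. The names Dollar_, LineEnd_, Prompt_, and Done_ are made up for the illustration, and plain regexp stands in for expect so that the sketch runs without the Expect extension.

```tcl
# Focus on the regular-expression parser: curly brackets keep the
# backslash intact, so the parser sees \$ (a literal dollar sign).
set Dollar_ {\$}

# Focus on the Tcl interpreter: quotes let Tcl build the nonprintable
# carriage-return and line-feed characters.
set LineEnd_ "\r\n"

# The full patterns are written in quotes and contain no backslashes.
set Prompt_ "(%|#|$Dollar_) $"
set Done_ "done$LineEnd_"

# The same prompt pattern written entirely in quotes, for comparison.
puts [string equal $Prompt_ "(%|#|\\\$) $"]   ;# 1

puts [regexp $Prompt_ "user@host\$ "]         ;# 1: ends in "$ "
puts [regexp $Prompt_ "no prompt here"]       ;# 0
puts [regexp $Done_ "all done\r\n"]           ;# 1
```

Each backslash now lives in exactly one short assignment, where it is clear which of the two processors it is meant for.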
To judge the value of this solution, all you have to do is take some of the more complicated examples and exercises below and rewrite them without the preassigned subpatterns.

One last point: my style of writing regular-expression patterns causes a proliferation of variables that must be remembered. As I argue elsewhere, a proliferation of variable names is not a good idea unless your variable names are organized into neat blocks. One way to create these blocks would be to write a new procedure every time you want to use a regular expression. That seems excessive. Another way is to adopt a naming convention that makes it easy to see which variable names exist solely for use with regular-expression patterns. My naming convention is simple: I put an underscore at the end of the name of every variable that contains a regular-expression subpattern.

It is possible to look at an example now because regular-expression patterns accept the square bracket notation you have seen for globs. So you already know the pattern that means "digit." Here is a preassigned subpattern.
set Digit_ {[0-9]}
With this preassignment, you can write $Digit_ instead of \[0-9] to mean "digit" in your quoted regular expressions.
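As a small sketch of how such a subpattern might be used, the lines below build a pattern for a decimal number. Point_, Decimal_, and Match are names invented for this illustration; only Digit_ comes from the text above.

```tcl
set Digit_ {[0-9]}
# The period is special in regular expressions, so it needs a backslash
# and is therefore preassigned as well.
set Point_ {\.}

# The full pattern is in quotes and contains no backslashes.
set Decimal_ "$Digit_+$Point_$Digit_+"
puts $Decimal_                                       ;# [0-9]+\.[0-9]+

# The same pattern written in quotes without preassignment.
puts [string equal $Decimal_ "\[0-9]+\\.\[0-9]+"]    ;# 1

regexp $Decimal_ "pi is about 3.14" Match
puts $Match                                          ;# 3.14
```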