How to Match Common Things
Randal L. Schwartz
Used correctly, regular expressions are handy for picking out things of
interest from the strings in which they are hiding, and for rejecting strings
that don't belong. Because text manipulation and input validation come up so
often, the same handful of regular expressions gets written again and again.
However, I often see mistakes in selecting and applying a regular expression,
so let's take a look at some of the more common mistakes. As I go through the
examples, I'll presume the string to be validated is in $_ just to keep
the examples simple, and I'll also use the slash delimiters (except where otherwise
noted) for the regular expressions.
For example, one frequent check is to determine whether a string contains
a positive integer. If I weren't thinking properly, I might start with something
like /[0-9]+/ to say "one or more digits". I can simplify this to /\d+/,
but that's still wrong, because the match isn't anchored. This means
that the regular expression will match as long as the string contains the regular
expression, including things like "abc123de". Oops.
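Here's a minimal sketch of that failure. The sample strings are my own
illustrations, not taken from any real input:

```perl
use strict;
use warnings;

# Collect every sample string that the unanchored pattern accepts.
my @unanchored_hits;
for ("123", "abc123de", "abc") {
    # $_ holds the candidate string, as in the rest of the examples.
    push @unanchored_hits, $_ if /\d+/;
}

print "matched: $_\n" for @unanchored_hits;
```

Both "123" and "abc123de" are reported as matches, because the pattern only
asks whether some run of digits appears anywhere in the string.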
So, the next step is to add anchors. Locking the regular expression down to
both the beginning and ending of the string typically looks like /^\d+$/.
However, this is still wrong, even though I frequently see this solution. The
problem is that $ matches either at the very end of the string or just before
a newline at the very end, so this regular expression can match "123\n" as well
as "123".
Again, oops!
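To see the trailing-newline behavior for yourself, something like the
following sketch should do. The %dollar_result hash and the sample strings
are just names I picked for the illustration:

```perl
use strict;
use warnings;

# Record the verdict of /^\d+$/ for each sample, keyed by a printable
# form of the string (newlines shown as a literal backslash-n).
my %dollar_result;
for ("123", "123\n", "123\n\n") {
    (my $shown = $_) =~ s/\n/\\n/g;
    $dollar_result{$shown} = /^\d+$/ ? "match" : "no match";
    printf "%-10s %s\n", $shown, $dollar_result{$shown};
}
```

The first two strings match. The third does not, because $ tolerates only a
single newline at the very end of the string.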
Luckily, modern Perl versions provide the \z anchor, which really does
mean "end of string" always. So, the proper answer is /^\d+\z/.
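As a sketch, the stricter pattern can be wrapped in a small helper. The name
is_all_digits is my own invention for this example, and it takes its argument
explicitly rather than through $_ so that it is easy to call:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl or any module: returns 1 only
# when the whole string is one or more digits with nothing after them.
sub is_all_digits {
    my ($string) = @_;
    return $string =~ /^\d+\z/ ? 1 : 0;
}

for my $candidate ("123", "123\n", "abc123de") {
    # Show newlines as a literal backslash-n so the output stays readable.
    (my $shown = $candidate) =~ s/\n/\\n/g;
    printf "%-10s %s\n", $shown,
        is_all_digits($candidate) ? "accepted" : "rejected";
}
```

Only "123" is accepted. Note that a line read from a filehandle usually still
carries its newline, so it would need a chomp before a check like this one.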