Basic Regular Expressions
#!/usr/bin/perl -w
####################### Perl Regular Expressions #######################
# Expression       Meaning
# ----------       -------
# \d               One digit character.
# \D               One non-digit character.
# \s               One whitespace character. A whitespace
#                  character is a newline (\n), carriage
#                  return (\r), space, tab (\t) or formfeed (\f).
# \S               One non-whitespace character.
# \w               One "word" character (digit, letter or _).
# \W               One non-word character.
# *                Longest string of zero or more of the
#                  preceding character or character group.
# +                Longest string of ONE or MORE of the
#                  preceding character or character group.
# ?                Zero or one only of the preceding character
#                  or character group.
# (chars)          Tag characters for purpose of recalling via
#                  $1, $2, etc. or for grouping to use with
#                  *, +, or ? expressions.
# ? after + or *   Turn off greedy matching e.g., .*?X means
#                  shortest string of anything before first X.
# {3}              Three of preceding character (or group).
# {3,}             Three or more of preceding char. (or group).
# {3,5}            Between three and five of the preceding
#                  character or character group.
# \b               Zero-width word boundary.
# \B               Zero-width NON-boundary (the opposite of \b).
# |                "Or" bar -- used to list alternatives.
# (?:chars)        Group chars. but DO NOT TURN INTO A MEMORY
#                  VARIABLE i.e., $1, $2, etc.
# [chars]          One character which is a member of chars.
# [^chars]         One character which is NOT a member of chars.
# \                Take next character literally (NOT a regexp.).
# \Q....\E         Take ALL characters between \Q and \E literally.
# .                Match any one character (except newline!!!).
# ^ and $          Beginning (ending) of line.
################### Code Demo of Regular Expressions ####################
$var = "ABC123";
$var =~ s/\d//g; # Remove all digit characters.
print "$var\n\n";
$var = "ABC123";
$var =~ s/\D//g; # Remove all non-digit characters.
print "$var\n\n";
$var = "%ABC*34_";
$var =~ s/\w//g; # Remove all "word" characters (letters, digits, underscore)
print "$var\n\n";
$var = "%ABC*34_";
$var =~ s/\W//g; # Remove all non-word characters.
print "$var\n\n";
$var = "roughhouse millhouse housefly housecat";
$var =~ s/\bhouse/X/g; # Change "house" to X if word boundary before "house".
print "$var\n\n";
$var = "area careen arena bare";
$var =~ s/\Bare\B/X/g; # Change are to X if no word boundary before
print "$var\n\n"; # or after "are".
$var = "a:b:c:d:e";
$var =~ s/^(.*):([^:]*)/$2:$1/; # Swap tagged fields. If you want to swap
print "$var\n\n"; # 2 sides of first colon this is not right!
$var = "a:b:c:d:e";
$var =~ s/^(.*?):([^:]*)/$2:$1/; # Swap tagged fields. If you want to swap
print "$var\n\n"; # 2 sides of first colon this IS right!!
#### ? Turns off "greedy" matching caused by the "*" or "+" regexp characters.
$var = "123:45:67890:7654321";
$var =~ s/\d{3,}/X/g; # Three or more digits become one X.
print "$var\n\n";
$var = "123:45:67890:7654321"; # Each 3-digit group becomes one X.
$var =~ s/\d{3}/X/g;
print "$var\n\n";
$var = "123:45:67890:7654321";
$var =~ s/\d{2,4}/X/g; # Two to four digits (greedy!) become X.
print "$var\n\n";
$var = "1.2.3.4.5.6";
$var =~ s/\./X/; # Change literal period to X -- first occurrence only!!
print "$var\n\n";
$var = "1.2.3.4.5.6";
$var =~ s/\./X/g; # Change literal period to X -- all occurrences!!
print "$var\n\n";
$var = "*.?+][";
$var2 = ".?+";
$var =~ s/\Q$var2\E/X/; # Characters in $var2 taken LITERALLY!
print "$var\n\n";
$var = "a1b2c3X4Y5Z6";
$var =~ s/[a-z]/*/g; # Change each lowercase letter to an asterisk.
print "$var\n\n";
$var = "a1b2c3X4Y5Z6";
$var =~ s/[a-z]/*/gi; # Change each letter to an asterisk.
print "$var\n\n"; # i option means "case insensitive".
$var = "a1b2c3X4Y5Z6";
$var =~ s/[^a-z]/*/g; # Change each char that is NOT a lowercase letter to an asterisk.
print "$var\n\n";
$var = "a1b2c3X4Y5Z6";
$var =~ s/[^a-z]/*/gi; # Change each non-letter to an asterisk.
print "$var\n\n";
$var = "a1b2c3d4e5f6g7";
$var =~ s/[a-c5-7F]/X/g; # Change a through c, 5 through 7, and F to X.
print "$var\n\n";
$var = "abcdefg";
$var =~ s/^./X/; # Change first char of line to X.
print "$var\n\n";
$var = "abcdefg";
$var =~ s/.$/X/; # Change last char of line to X.
print "$var\n\n";
$var = "abcdefg";
$var =~ s/^(.)(.*)(.)$/$3$2$1/; # Swap first and last character.
print "$var\n\n";
print "Enter string: ";
while (defined($line = <STDIN>) && $line =~ /\d/)  # As long as $line has a digit keep going.
{
    print "$line\n";
    print "Enter string: ";
}
print "Enter string: ";
while (defined($line = <STDIN>) && $line !~ /\d/)  # As long as $line has NO digit keep going.
{
    print "$line\n";
    print "Enter string: ";
}
print "Enter string: ";
while (defined($line = <STDIN>) && $line !~ /^\s*quit\s*$/i)  # As long as $line is not "quit"
{                                                             # (any case with allowable
    print "$line\n";                                          # surrounding whitesp) keep going.
    print "Enter string: ";                                   # defined() also stops each loop
}                                                             # at end of input.
############################## Program Output ############################
$ regexp.pl
ABC # All digit chars removed.
123 # All non-digit chars removed.
%* # All word chars removed.
ABC34_ # All non-word chars removed.
roughhouse millhouse Xfly Xcat # "House" changed to X if it starts a word.
area cXen arena bare # "Are" changed to X if no word boundary on
# either side.
e:a:b:c:d # Bad swap!
b:a:c:d:e # Two sides of first colon swapped.
X:45:X:X # Each run of three or more digits changed to one X.
X:45:X90:XX1 # EXACTLY three digits changed to X's.
X:X:X0:XX # Two to four digits (greedy!!) changed to X's.
1X2.3.4.5.6 # First period changed to X.
1X2X3X4X5X6 # All periods changed to X.
*X][ # Substitute done with \Q...\E around pattern.
*1*2*3X4Y5Z6 # Lowercase letters changed to *'s.
*1*2*3*4*5*6 # All letters changed to *'s.
a*b*c******* # Each character that is not a lowercase letter changed to *.
a*b*c*X*Y*Z* # Each non-letter character changed to *.
X1X2X3d4eXfXgX # a thru c, 5 thru 7, and F changed to X.
Xbcdefg # First character of line changed to X.
abcdefX # Last character of line changed to X.
gbcdefa # First and last character of line swapped.
Enter string: abc3 # Start loop which stops if $line doesn't have a digit.
abc3
Enter string: x*7io
x*7io
Enter string: abc # Exits first while because has no digit.
Enter string: XYZ(*& # Start loop which stops if $line DOES have a digit.
XYZ(*&
Enter string: jh_)(
jh_)(
Enter string: gh7Y # Exits second while because has a digit.
Enter string: this # Start loop which stops when user enters "quit".
this
Enter string: is
is
Enter string: not
not
Enter string: QuiT # Exit third while loop.
############################# Some Important Notes ########################
1. The substitution expression itself has a value (the number of substitutions
made) which can sometimes be useful. For example:
    $str = "abc4rp67###8";
    $countdigits = $str =~ s/\d/X/g;
    print "$countdigits\n"; # Will print 4
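A minimal sketch of putting that return value to work, assuming a helper name (count_digits) that is not part of the original script. With the g option, s/// returns the number of substitutions made, and a false value (the empty string) when nothing matched, so the result can be used as a counter or as a test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_digits is a hypothetical helper name used only for illustration.
# It works on a copy of its argument, so the caller's string is unchanged.
sub count_digits {
    my ($text) = @_;
    my $count = ($text =~ s/\d/X/g);   # number of substitutions, or "" if none
    return $count || 0;                # normalize "no match" to 0
}

print count_digits("abc4rp67###8"), "\n";     # 4 (the string used in note 1)
print count_digits("no digits here"), "\n";   # 0
```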
2. You must remember that + and * are GREEDY matchers. Here is a common
mistake that comes from forgetting this:
    $str = "yes:no:maybe:perhaps";
    $str =~ s/^(.*):(.*)/$2:$1/;
    print "$str\n"; # Will print perhaps:yes:no:maybe
If you wanted to swap the two sides of the very first colon, the following
would have been correct:
    $str =~ s/^(.*?):([^:]*)/$2:$1/;
The ? after the .* says that you want the SHORTEST string of zero or more
characters leading up to a colon, NOT the LONGEST!! The second
tagged field must be [^:]* -- the longest string of zero or more
NON-COLONS. You don't want to go past the second colon if there is one!!
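The difference can be seen side by side in the sketch below. The two helper names (swap_greedy and swap_first_colon) are made up for this illustration; the patterns are the ones discussed in this note.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Both helper names are hypothetical, chosen for this illustration.

# Greedy: (.*) grabs as much as it can, so the split lands on the LAST colon.
sub swap_greedy {
    my ($text) = @_;
    $text =~ s/^(.*):(.*)/$2:$1/;
    return $text;
}

# Non-greedy: (.*?) stops at the FIRST colon; ([^:]*) stops before the next one.
sub swap_first_colon {
    my ($text) = @_;
    $text =~ s/^(.*?):([^:]*)/$2:$1/;
    return $text;
}

my $str = "yes:no:maybe:perhaps";
print swap_greedy($str), "\n";        # perhaps:yes:no:maybe
print swap_first_colon($str), "\n";   # no:yes:maybe:perhaps
```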
3. Do NOT use the old-fashioned Unix regular expressions if Perl has a better
one. For example, use \d rather than [0-9] for a digit, \D for a non-digit,
\s for a whitespace character, and \S for a non-whitespace character.
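As a rough comparison, the sketch below writes the same clean-up twice, once with spelled-out character classes and once with the Perl shorthands. The helper names and the sample string are assumptions made for the illustration; on plain ASCII input like this, both versions should give the same result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper names. Each collapses runs of whitespace to a single
# space and masks every digit with "#".
sub tidy_old_style {
    my ($text) = @_;
    $text =~ s/[ \t\n\r\f]+/ /g;   # spelled-out whitespace class
    $text =~ s/[0-9]/#/g;          # spelled-out digit class
    return $text;
}

sub tidy_perl_style {
    my ($text) = @_;
    $text =~ s/\s+/ /g;            # \s stands in for the whitespace class
    $text =~ s/\d/#/g;             # \d stands in for [0-9]
    return $text;
}

my $sample = "id 42\t\tzip  90210";
print tidy_old_style($sample), "\n";    # id ## zip #####
print tidy_perl_style($sample), "\n";   # id ## zip #####
```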
4. Remember that sometimes the ^ (beginning of line) and $ (end of line)
expressions are vital if you need to describe an ENTIRE LINE. For
example, if you want to match a line which is empty or contains only whitespace,
the regular expression is:
    /^\s*$/
The ^ must be present or you are allowing for the possibility that some
non-whitespace characters precede the \s*. The $ must be present or you
are allowing for the possibility that some non-whitespace characters
follow the \s*. Be alert! Also, remember that [^whatever] means one
character that is NOT a member of the set enclosed in the [^...].
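The effect of leaving the anchors off can be checked with a short sketch. The helper names (is_blank and is_blank_unanchored) are hypothetical. Because \s* is allowed to match zero characters, the unanchored pattern succeeds on every string, including ones that contain non-whitespace characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper names for this illustration.

# Anchored at both ends: the WHOLE line must be zero or more whitespace chars.
sub is_blank {
    my ($line) = @_;
    return $line =~ /^\s*$/ ? 1 : 0;
}

# No anchors: \s* can match zero characters anywhere, so this is always true.
sub is_blank_unanchored {
    my ($line) = @_;
    return $line =~ /\s*/ ? 1 : 0;
}

for my $line ("", "   ", "  x  ") {
    printf "%-8s anchored=%d unanchored=%d\n",
        "'$line'", is_blank($line), is_blank_unanchored($line);
}
```

Only the first two strings count as blank for the anchored version; the unanchored version reports 1 for all three.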