Special Characters
Regular expression notation includes the use of special characters (not to be confused with HTML special characters). Special characters in regular expressions enable you to specify more advanced matches in which portions of the match may be one of a number of characters, or where the match must occur at a certain position in the string.
As you've already seen, you can use a backslash to escape certain characters' special meanings. For example, to echo a double-quote character ", you have to use the escape sequence \" (just like the addslashes() function does for the database entry strings).
The following characters are given special meaning within a regular expression, so you need to escape them with a backslash if you want to match them literally:
. * ? + [ ] ( ) { } ^ $ | \
Any other characters automatically assume their literal meanings. For example, if you want to specifically match "..." in the preceding text samples, you'd have to say:
<?php
$words1 = "The bigdog is in the pound...";
$regexp = "pound\.\.\.";
if (ereg($regexp, $words1, $reg)) echo "Found string '$reg[0]'";
?>
If you used the regexp "pound...", you'd find it still matched the test string, but it would also match in the following case, because the dot is not considered just a dot in a regex; it is a special character that matches anything:
<?php
$words1 = "The bigdog is in the pound but the dog is in the cornfield.";
$regexp = "pound...";
if (ereg($regexp, $words1, $reg)) echo "Found string '$reg[0]'";
?>
This returns the following:
Found string 'pound bu'
As mentioned, this happens because the dot (.) is a special character. It matches against any single character except the newline. So the three dots match any three characters after "pound", not just three dots in a row.
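The difference between the escaped and the unescaped dot can be checked with a short script. This is a minimal sketch rather than the chapter's own code: it assumes a current PHP release, where ereg() is no longer available (it was removed in PHP 7.0), so it uses preg_match() with the same patterns wrapped in / delimiters. The helper name firstMatch() is made up for illustration.

```php
<?php
// Hypothetical helper: returns the matched text, or null when nothing matches.
// Assumption: preg_match() (PCRE) stands in for the chapter's ereg().
function firstMatch(string $pattern, string $subject): ?string
{
    // preg_match() returns 1 on a match and stores the matched text in $m[0].
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$dots   = "The bigdog is in the pound...";
$noDots = "The bigdog is in the pound but the dog is in the cornfield.";

// Escaped dots match only three literal periods.
var_dump(firstMatch('/pound\.\.\./', $dots));    // matches "pound..."
var_dump(firstMatch('/pound\.\.\./', $noDots));  // no match (null)

// Unescaped dots match any three characters.
var_dump(firstMatch('/pound.../', $noDots));     // matches "pound bu"

// preg_quote() adds the backslashes for you when the text is meant literally.
var_dump(preg_quote('pound...', '/'));           // pound\.\.\.
```

If the text to match comes from a variable, running it through preg_quote() first is usually safer than escaping the special characters by hand.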
A Few Shortcuts and Options
There are several options available to you for formulating patterns to match against. Go ahead and check them out.
Character Classes: [xyz]
A set of characters surrounded by square brackets is called a character class, and signifies that any one character from that set is acceptable. For example, the regexp "w[ao]nder" matches against both the words "wander" and "wonder".
To make a set of characters unacceptable, start the character class with a caret (^). For example, the regexp "[^1234567890]" will match against any character that isn't a digit.
You can also use the hyphen to specify a range of characters. For instance, the preceding example can be rewritten as [^0-9], and a lowercase letter can be matched with [a-z].
You can use one or more of these ranges alongside each other, so if you wanted to match a single hexadecimal digit, you could write [0-9A-F]. The brackets contain the whole expression, and represent just a single character to be matched against any of the characters specified by either of the ranges in the class. If you used [0-9][A-F], you'd match a digit followed by a letter from A to F.
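The character classes described above can be tried out as follows. This is an illustrative sketch, assuming preg_match() in place of ereg() (which is not available in PHP 7 and later); classMatches() is a hypothetical helper, not part of the chapter's scripts.

```php
<?php
// Hypothetical helper: true if $pattern matches anywhere in $subject.
function classMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(classMatches('/w[ao]nder/', 'wander'));   // true: "a" is in the class
var_dump(classMatches('/w[ao]nder/', 'winder'));   // false: "i" is not in the class
var_dump(classMatches('/[^0-9]/', '2024'));        // false: every character is a digit
var_dump(classMatches('/[^0-9]/', '2024a'));       // true: "a" is not a digit
var_dump(classMatches('/[0-9A-F]/', 'C'));         // true: one hexadecimal digit
var_dump(classMatches('/[0-9][A-F]/', 'C'));       // false: needs a digit, then a letter
var_dump(classMatches('/[0-9][A-F]/', '7C'));      // true: digit followed by A-F
```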
Some character classes such as digits, letters, and various types of whitespace are going to come up again and again. There are some neat shortcuts for these. Here are the most common ones, and what they represent:
| Shortcut | Expansion | Description |
|---|---|---|
| \d | [0-9] | Digits 0 to 9 |
| \w | [0-9A-Za-z_] | A "word" character |
| \s | [ \t\n\r] | A whitespace character (space, tab, newline or return) |
And here are the shortcuts' negative forms:
| Shortcut | Expansion | Description |
|---|---|---|
| \D | [^0-9] | Any non-digit |
| \W | [^0-9A-Za-z_] | A non-"word" character |
| \S | [^ \t\n\r] | A non-blank character |
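A quick way to get a feel for the shortcuts is to count or strip the characters they match. The sketch below assumes the PCRE functions preg_replace() and preg_match_all(), which support these shortcuts; the sample strings are invented for illustration.

```php
<?php
$phone = "(555) 123-4567";

// \D matches every non-digit, so removing those characters leaves only digits.
$digitsOnly = preg_replace('/\D/', '', $phone);
echo $digitsOnly, "\n";        // 5551234567

// \s matches each whitespace character; preg_match_all() returns the count.
$whitespaceCount = preg_match_all('/\s/', "a b\tc\nd");
echo $whitespaceCount, "\n";   // 3 (one space, one tab, one newline)

// \w matches letters, digits, and the underscore, but not the hyphen.
$wordChars = preg_match_all('/\w/', "user_name-42");
echo $wordChars, "\n";         // 11
```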
Anchors
Character classes match characters anywhere in a string, but certain symbols can be used to indicate the location in the string where the match must occur. These symbols are called anchors.
The two anchors are ^, which appears at the beginning of the pattern, anchoring a match to the beginning of the string, and $, which appears at the end of the pattern, anchoring it to the end of the string. To see if a string ends with a full stop (remember, the full stop is a special character) you could use a regexp like this: "\.$". Likewise, you can use "^B" to see if there's a capital "B" at the beginning of the string.
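The two anchor examples can be wrapped in small test functions. This is only a sketch, assuming preg_match() instead of ereg(); the function names are hypothetical.

```php
<?php
// True if the text ends with a full stop. The dot is escaped so it is literal.
function endsWithFullStop(string $text): bool
{
    return preg_match('/\.$/', $text) === 1;
}

// True if the text begins with a capital "B".
function startsWithCapitalB(string $text): bool
{
    return preg_match('/^B/', $text) === 1;
}

var_dump(endsWithFullStop("Bring the dog in."));    // true
var_dump(endsWithFullStop("in. Then leave"));       // false: the dot is not at the end
var_dump(startsWithCapitalB("Bring the dog in."));  // true
var_dump(startsWithCapitalB("A Big dog"));          // false: "B" is not at the start
```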
Word Boundaries
To help you properly search for words when the words may be preceded or followed by a variety of punctuation marks, there are special symbols called word boundaries. Word boundary symbols enable you to designate a pattern as having to match the beginning or ending of a word. They are required because words aren't always separated by spaces, but are sometimes separated by commas, periods, and other punctuation marks.
For example, you can use the special \b word boundary symbol to find one-letter words using the regexp "\b\w\b". Like the anchor symbols, \b doesn't actually match any character in particular, but matches the point between something that isn't a word character (\W or one end of the string) and something that is (hence \b for boundary).
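To see the one-letter-word pattern at work, the following sketch collects every match in a sentence. It assumes preg_match_all() from the PCRE extension, where \b is supported; oneLetterWords() is a hypothetical helper.

```php
<?php
// Returns every one-letter word found in $text.
function oneLetterWords(string $text): array
{
    // \b on each side means the single word character stands alone.
    preg_match_all('/\b\w\b/', $text, $matches);
    return $matches[0];
}

echo implode(' ', oneLetterWords("I saw a dog, a cat, and x.")), "\n";
// I a a x
```

The final "x" is found even though it is followed by a period rather than a space, which is exactly the case word boundaries are meant to handle.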
Alternatives
In some cases, you may want to use a symbol that causes an either/or condition. The "either-or" operator in a regular expression is the same as the bitwise "or" operator: |. For example, to match either "yes" or "maybe" you'd use the regexp "yes|maybe".
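A minimal sketch of the either/or operator, assuming preg_match() in place of ereg(); whichAnswer() is a hypothetical helper that reports which alternative was found.

```php
<?php
// Returns "yes" or "maybe" if either appears in $text, otherwise null.
function whichAnswer(string $text): ?string
{
    return preg_match('/yes|maybe/', $text, $m) === 1 ? $m[0] : null;
}

var_dump(whichAnswer("Well, maybe tomorrow"));  // "maybe"
var_dump(whichAnswer("yes, please"));           // "yes"
var_dump(whichAnswer("no thanks"));             // null
var_dump(whichAnswer("yesterday"));             // "yes": the pattern is not anchored
```

The last call shows that an unanchored alternative also matches inside longer words, so anchors or word boundaries are usually needed when only the whole word should count.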
Qualifiers
The qualifier symbols ?, +, and * enable you to create regexps that match against a set of characters that may occur once, may occur more than once, or may not occur at all. The simplest is ?, which matches the immediately preceding character or metacharacter if it appears either once or not at all. For instance, to match the word "he" or "she", you can use "s?he". Notice that the question mark (?) comes directly after the "s". That's what tells the regexp that the "s" is optional. If the "s" doesn't appear (as in the word "he"), a match is still found.
To make a series of characters (or metacharacters) optional, group them in parentheses: you can match either "man" or "woman" with the regexp "(wo)?man".
You can match something one or more times by using the plus sign. To match an entire word without specifying how long it should be, use "\w+".
You also may have something that could occur any number of times but might not be there at all (that is, zero or one or many). For that you need what's called "Kleene's star" (the * quantifier, which is simply called the star from here on out). So, for example, to find a capital letter after any (but possibly no) spaces at the start of the string, you'd use "^\s*[A-Z]".
The three qualifiers available are demonstrated by the following examples:
- hea?t Matches either "heat" or "het"
- hea+t Matches "heat", "heaat", "heaaat"...
- hea*t Matches "het", "heat", "heaat"...
Novice programmers tend to overuse combinations of dots (representing anything) and stars (representing anything or nothing), often with unexpected, and unwanted, results. Some rules of thumb:
- A regular expression should almost never start or finish with a starred character.
- Using .* or .+ in the middle of a regular expression will match as much of your string as they possibly can.
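The three qualifiers, and the greedy behavior mentioned in the rules of thumb, can be checked with the sketch below. It assumes preg_match() rather than ereg() and PHP 7.4 or later for the arrow function; matchesWhole() is a hypothetical helper that anchors the pattern at both ends.

```php
<?php
// True only if the whole word matches the pattern.
function matchesWhole(string $pattern, string $word): bool
{
    return preg_match('/^' . $pattern . '$/', $word) === 1;
}

$words = ['het', 'heat', 'heaat'];
foreach (['hea?t', 'hea+t', 'hea*t'] as $pattern) {
    $hits = array_filter($words, fn($w) => matchesWhole($pattern, $w));
    echo $pattern, ' matches: ', implode(', ', $hits), "\n";
}
// hea?t matches: het, heat
// hea+t matches: heat, heaat
// hea*t matches: het, heat, heaat

// A .* in the middle of a pattern takes as much of the string as it can.
preg_match('/<.*>/', 'a <b>bold</b> word', $m);
echo $m[0], "\n";   // <b>bold</b>, not just <b>
```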
Quantifiers
Quantifiers are used to set limits and ranges on the number (quantity) of characters to be matched. For example, \s{2,3} matches against 2 or 3 spaces (or the other characters mentioned previously in the shortcut table). \s informs the regexp to look for spaces, and the curly braces surrounding the 2,3 tell the regexp to use these figures as a range: from 2 to 3 characters. Think of the 2 as the minimum number and the 3 as the maximum number. Quantifier ranges work in a similar way as the ranges specified in character classes, but the syntax is a bit different: you use the symbol for the character to match first, then within curly braces (instead of brackets) you put the quantities you want separated by a comma.
Leaving out the maximum (but leaving in the comma) signifies "or more": for example, {2,} denotes "2 or more". For "3 or fewer", write the minimum explicitly as {0,3}, because the shorthand {,3} is not accepted by every regex engine. In these cases, the same warnings apply as for the star operator.
You can put a few of these special characters together, and do some cool things. For example, you can specify exactly how many things are to be in a row by putting just that number inside the curly brackets: "\b\w{5}\b" (matches a five-letter word).
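The quantifier examples above might be exercised like this. It is a sketch under the assumption that preg_match() and preg_match_all() replace ereg(); both function names are hypothetical.

```php
<?php
// Returns all words that are exactly five word characters long.
function fiveLetterWords(string $text): array
{
    preg_match_all('/\b\w{5}\b/', $text, $m);
    return $m[0];
}

// True if the text contains at least two whitespace characters in a row.
// A longer run also matches, because it contains a run of two or three.
function hasTwoOrThreeSpaces(string $text): bool
{
    return preg_match('/\s{2,3}/', $text) === 1;
}

echo implode(', ', fiveLetterWords("The quick brown fox jumps over the lazy dog")), "\n";
// quick, brown, jumps

var_dump(hasTwoOrThreeSpaces("a b"));    // false: only one space
var_dump(hasTwoOrThreeSpaces("a  b"));   // true: two spaces
```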
Summary of Metacharacters
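Here's a summary of the metacharacters you've just seen:
| Metacharacter | Meaning |
|---|---|
| \ | Escapes the special meaning of the character that follows |
| . | Any single character except the newline |
| [xyz] | Any one of the characters in the class |
| [^xyz] | Any one character that is not in the class |
| [a-z] | Any one character in the range |
| \d, \w, \s | A digit, a "word" character, a whitespace character |
| \D, \W, \S | A non-digit, a non-"word" character, a non-blank character |
| ^ | The beginning of the string |
| $ | The end of the string |
| \b | A word boundary |
| \| | Either the pattern on the left or the pattern on the right |
| ? | The preceding item zero or one time |
| + | The preceding item one or more times |
| * | The preceding item zero or more times |
| {n} | The preceding item exactly n times |
| {n,m} | The preceding item between n and m times |
| {n,} | The preceding item n or more times |
| ( ) | Groups a series of characters so a qualifier applies to the whole group |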
Validating Data Entry
One of the most common items entered by users in Web applications is the e-mail address, which just happens to provide an excellent subject for validation because of its unique format. First, it's not easy to define a length, because you never know what could be allowed as the "name" portion of the address. Second, it contains the @ sign, which is not typically found in other common data entry items. And third, there is a requirement for a .com, .net, .org, or some other acceptable ending after the domain name.
An e-mail address is requested on Web forms so frequently because it may be the only contact information you can get from a person, and if for any reason the e-mail address isn't valid, that's the end of it; you're not going to be able to contact that person.
So there's plenty of incentive to validate e-mail addresses. And there are plenty of e-mail address validation routines around. Some are highly complex and can validate e-mail addresses with a high degree of precision; others are very simple and have a lower percentage of success. One of the simplest routines just checks to make sure the @ sign is present once; obviously, if there's no @ sign, the e-mail address cannot be valid, and if the @ sign is present more than once the address is also invalid. But users could provide an @ sign and still not give a valid address.
Regexps come in handy for creating an e-mail address validation routine because they can do quite a bit of conditional checking. For example, there should be some characters before and after the @ sign in any e-mail address. And there should be a dot (.) preceded by and followed by more characters on the right side of the @ sign. And there are some characters that aren't permitted on either side of the @ sign (such as a blank space). Regexp patterns can be constructed to detect anomalies in e-mail address structure very precisely, such as this one:
^[^@ ]+@[^@ ]+\.[^@ \.]+$
This pattern may not make much sense right off the bat; let's spread it out so you can see what's really there. The following table identifies each symbol in order and describes what it's doing:
| Symbols | Matches |
|---|---|
| ^ | The beginning of the string... |
| [^@ ] | ...there is one character, which can be anything other than an at sign or a space |
| + | ...which is repeated one or more times |
| @ | There is then an at sign |
| [^@ ] | Next, there is one character that can be anything other than an at sign or a space |
| + | ...which is repeated one or more times |
| \. | There is then a period (which must be escaped) |
| [^@ \.] | There is one character that can be anything other than an at sign, a space, or a period |
| + | ...which is repeated one or more times. The last one must be followed immediately by... |
| $ | ...the end of the string |
As complicated as this pattern appears to be, it's really a simple one and by no means foolproof. There are much more complex e-mail address/URL parsing procedures out there, and they can be very longwinded.
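Before wiring the pattern into a form, it can help to run it against a few sample addresses. The sketch below assumes preg_match() in place of ereg() (which is unavailable in PHP 7 and later); looksLikeEmail() is a hypothetical helper and the sample addresses are invented.

```php
<?php
// True if $email fits the simple pattern: something@something.something,
// with no spaces and exactly one at sign.
function looksLikeEmail(string $email): bool
{
    return preg_match('/^[^@ ]+@[^@ ]+\.[^@ \.]+$/', $email) === 1;
}

$samples = [
    'name@example.com',   // fits the pattern
    'name@example',       // no dot after the at sign
    'na me@example.com',  // contains a space
    'a@@b.com',           // two at signs
    'name@*.com',         // fits the pattern, although * is not legal in a domain
];
foreach ($samples as $sample) {
    echo $sample, ': ', looksLikeEmail($sample) ? 'Valid' : 'Invalid', "\n";
}
```

The last sample is a reminder that the pattern checks the overall shape of an address only, not whether each part is legal.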
Next, you can use this regexp with ereg() to validate incoming e-mail addresses.
Try it Out: Validate E-mail Addresses
- Enter the following code into your Web editor:
<html>
<head><title></title></head>
<body>
<?php
//email_validation.php
if (isset($_POST['posted'])) {
    $email = $_POST['email'];
    $theresults = ereg("^[^@ ]+@[^@ ]+\.[^@ \.]+$", $email, $trashed);
    if ($theresults) {
        $isamatch = "Valid";
    } else {
        $isamatch = "Invalid";
    }
    echo "Email address validation says $email is " . $isamatch;
}
?>
<form action="email_validation.php" method="POST">
<input type="hidden" name="posted" value="true">
Enter your email address for validation:
<input type="text" name="email" value="name@example.com">
<input type="submit" value="Validate">
</form>
</body>
</html>
- Save the file as email_validation.php and close it.
- Run the script, and try out the validation process with "good" and "bad" e-mail addresses.
How it Works
The regexp pattern does the pattern matching via the ereg() function. When the form is submitted, the string supplied is captured in the $email variable (from the $_POST['email'] variable). Then, ereg() is used to check whether it is valid. If it is, the word "Valid" is placed inside the $isamatch variable. If not, the word "Invalid" is placed inside the $isamatch variable. The results are then echoed out to the user:
<?php
//email_validation.php
if (isset($_POST['posted'])) {
    $email = $_POST['email'];
    $theresults = ereg("^[^@ ]+@[^@ ]+\.[^@ \.]+$", $email, $trashed);
    if ($theresults) {
        $isamatch = "Valid";
    } else {
        $isamatch = "Invalid";
    }
    echo "Email address validation says $email is " . $isamatch;
}
?>
A default e-mail address is provided within the email form field to show the user what format his data entry must have (hopefully he'll understand that he needs to replace this value with a real one).
This script is a good start on validating e-mail addresses, but it's not perfect. It allows characters after the @ sign that aren't legal in a domain name (*, for example), and it doesn't actually verify that the domain entered exists. The best way to verify e-mail addresses is to see if the server you're sending to accepts them; even a properly formatted e-mail address is invalid if the server you're sending to doesn't accept it.
Using Regexps to Validate URLs
Domain names and full URLs (Uniform Resource Locators) provide great subjects for the pattern-matching abilities of regexps. The structure of a domain name is simply a name followed by a dot followed by a domain name extension (such as .com, .net, .org, and so on), but there are limitations on the characters that can be present in the name, and there's no need to prefix the domain name with www (or anything else for that matter). URLs include domain names, but are prefixed with a protocol such as http://, and a complete URL may include the path (the folder names separated by slashes), the filename, and even a query string attached to the end. There are quite a few variations, and as with e-mail addresses it's important to get the format right.
Here's the format of a Uniform Resource Locator (URL):
- Protocol (such as ftp or http)
- Domain or server name (such as wrox.com; the www is not required)
- Folder and file path (optional in some cases, includes folder and filename separated by slashes, such as images/myimage.gif)
- Querystring (optional, starts with ? and then one or more name/value pairs).
Again, regexps to the rescue! This time, try this helpful (and highly aesthetically pleasing) snippet:
^[a-zA-Z0-9]+://[^ ]+$
Similar to the key line in the last sample script, this expression can be used in a line like this:
$theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $url, $trashed);
In the following example, you'll check out URLs for correct formatting.
Try it Out: Check for Correctly Formatted URLs
- Start your Web page editor and type in the following:
<html>
<head></head>
<body>
<?php
//url_validate.php
if (isset($_POST['posted'])) {
    $url = $_POST['url'];
    $theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $url, $trashed);
    if ($theresults) {
        $isamatch = "Valid";
    } else {
        $isamatch = "Invalid";
    }
    echo "URL validation says $url is " . $isamatch;
}
?>
<form action="url_validate.php" method="POST">
<input type="hidden" name="posted" value="true">
Enter your URL for validation:
<input type="text" name="url" value="http://www.example.com" size="30">
<input type="submit" value="Validate">
</form>
</body>
</html>
- Save this file as url_validate.php and close it.
- Run the script and test it by trying to validate several URLs, some valid and some not.
How it Works
Similar to email_validation.php, this script is mostly powered by a single regular expression via the ereg() function:
$theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $url, $trashed);
The regexp pattern matches valid URLs, but unfortunately it also matches quite a few other things that are not valid URLs. Like many regexp patterns, you'll find that it is very difficult to get perfect matching without spending a lot of time working on it (some URL patterns are hundreds of characters in length, and still aren't perfect). The main idea is to try to eliminate the most common errors that people make, and be satisfied with that. After all, any slight change in what constitutes a valid URL (or e-mail address) may throw your regexp off, so what's perfect one day may be flawed the next, through no fault of your own.
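The point about false positives is easy to demonstrate. The following sketch assumes preg_match() instead of ereg() and uses # as the pattern delimiter so the slashes need no escaping; looksLikeUrl() is a hypothetical helper and the sample values are invented.

```php
<?php
// True if $url has the shape protocol://anything-without-spaces.
function looksLikeUrl(string $url): bool
{
    return preg_match('#^[a-zA-Z0-9]+://[^ ]+$#', $url) === 1;
}

$samples = [
    'http://www.example.com',   // fits the pattern
    'www.example.com',          // no protocol
    'http://exa mple.com',      // contains a space
    'abc://???',                // fits the pattern, but is not a usable URL
];
foreach ($samples as $sample) {
    echo $sample, ': ', looksLikeUrl($sample) ? 'Valid' : 'Invalid', "\n";
}
```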
Using Regexps to Check File Path Parameters
The file system and the functions PHP provides to work with files and folders are discussed in Chapter 7; but let's take a quick look at some regexp functions that you'll find helpful in protecting data you store in files (yes, files—databases aren't the only means of persistent data storage; common text files are often used to store application data for extended periods).
In persistent data storage the term persistent is used to signify data that lives on even after your application quits or the system is turned off.
These functions (and the example code you'll create) can be very useful for limiting access to certain folders and files because they make it easy to match things that are file or folder names. Although you may want to allow access to certain files and folders, there may be others that no one but the system administrator should have access to.
In the following example, you'll stop users from traversing the directory tree, by removing potentially sensitive information from the file path. A variation of the ereg() function named ereg_replace() does the heavy lifting.
To get the job done, you write patterns that, when run with the ereg_replace() function, remove any ".." sequences and any leading "/" or "\" from the path. UNIX systems use forward slashes to separate folders, whereas Windows systems use backslashes, and colons are used on Mac OS systems. The code also removes absolute-path prefixes, that is, paths starting with "/" or with a drive letter such as "C:".
Try it Out: Prevent Users from Accessing Sensitive Files
- Open your Web page editor and type in the following:
<html>
<head><title></title></head>
<body>
<?php
//clean_path.php
if (isset($_POST['posted'])) {
    $path = $_POST['path'];
    // Remove ".." sequences
    $outpath = ereg_replace("\.[\.]+", "", $path);
    // Remove leading slashes or backslashes
    $outpath = ereg_replace("^[\/]+", "", $outpath);
    // Remove a drive prefix such as "C:\"
    $outpath = ereg_replace("^[A-Za-z][:\|][\/]?", "", $outpath);
    echo "The old path is " . $path . " and the new path is " . $outpath;
}
?>
<form action="clean_path.php" method="POST">
<input type="hidden" name="posted" value="true">
Enter your file path for cleaning:
<input type="text" name="path" size="30">
<input type="submit" value="Clean">
</form>
</body>
</html>
- Save the file as clean_path.php and close it.
- Run the program.
As you can see, PHP can be used to filter out user input that is potentially dangerous as well as that which is simply incorrect.
How it Works
The first line of the program gets rid of ".." patterns (used to move up a level in the directory tree):
$outpath = ereg_replace("\.[\.]+", "", $path);
The second line eliminates leading slashes or backslashes:
$outpath = ereg_replace("^[\/]+", "", $outpath);
The third line gets rid of DOS/Windows-style prefixes (for example "C:\"):
$outpath = ereg_replace("^[A-Za-z][:\|][\/]?", "", $outpath);
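The three steps can also be collected into a single function and tested without a form. This is a rough sketch rather than the chapter's script: it assumes preg_replace() in place of ereg_replace() (which is unavailable in PHP 7 and later), the name cleanPath() is hypothetical, and the third pattern accepts only a colon or a pipe after the drive letter.

```php
<?php
// Applies the three cleaning steps in order and returns the cleaned path.
function cleanPath(string $path): string
{
    // 1. Remove runs of two or more dots ("..", "...", and so on).
    $out = preg_replace('/\.\.+/', '', $path);
    // 2. Remove leading slashes or backslashes.
    $out = preg_replace('#^[\\\\/]+#', '', $out);
    // 3. Remove a drive prefix such as "C:\" or "C:/".
    $out = preg_replace('#^[A-Za-z][:|][\\\\/]?#', '', $out);
    return $out;
}

echo cleanPath('../../etc/passwd'), "\n";         // etc/passwd
echo cleanPath('/var/www/index.php'), "\n";       // var/www/index.php
echo cleanPath('C:\\Windows\\system.ini'), "\n";  // Windows\system.ini
echo cleanPath('images/myimage.gif'), "\n";       // images/myimage.gif (unchanged)
```

Stripping patterns like this reduces the most obvious traversal attempts, but it should not be treated as a complete defense on its own.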