Sunday, February 28, 2010

Form Validation

Special Characters

Regular expression notation includes the use of special characters (not to be confused with HTML special characters). Special characters in regular expressions enable you to specify more advanced matches in which portions of the match may be one of a number of characters, or where the match must occur at a certain position in the string.


As you've already seen, you can use a backslash to escape certain characters' special meanings. For example, to echo a double-quote character ", you have to use the escape sequence \" (just like the addslashes() function does for the database entry strings).
The characters that have special meaning within a regular expression, and which you must escape with a backslash if you want to use them literally, are:


. * ? + [ ] ( ) { } ^ $ | \

Any other characters automatically assume their literal meanings. For example, if you want to specifically match "..." in the preceding text samples, you'd have to say:


<?php
$words1 = "The bigdog is in the pound...";
$regexp = "pound\.\.\.";
if (ereg($regexp, $words1, $reg)) echo "Found string '$reg[0]'";
?>

If you used the regexp "pound...", you'd find it still matched the test string, but it would also match in the following case, because a dot in a regex is not just a dot; it is a special character that matches anything:


<?php
$words1 = "The bigdog is in the pound but the dog is in the cornfield.";
$regexp = "pound...";
if (ereg($regexp, $words1, $reg)) echo "Found string '$reg[0]'";
?>

This returns the following:


Found string 'pound bu'

As mentioned, this happens because the dot (.) is a special character. It matches any single character except a newline. So the pattern matches any three characters after "pound", not just three dots in a row.
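The difference between the escaped and unescaped dot can be checked with a short sketch. It assumes a newer PHP release, where ereg() is no longer available, so it uses preg_match() with "/" delimiters instead; first_match() is a hypothetical helper written only for this illustration.

```php
<?php
// Hypothetical helper: returns the matched text, or null when nothing matches.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$withDots    = "The bigdog is in the pound...";
$withoutDots = "The bigdog is in the pound but the dog is in the cornfield.";

// Escaped dots match only three literal dots.
var_dump(first_match('/pound\.\.\./', $withDots));    // matches: pound...
var_dump(first_match('/pound\.\.\./', $withoutDots)); // NULL, no literal dots follow "pound"

// Unescaped dots match any three characters.
var_dump(first_match('/pound.../', $withoutDots));    // matches: pound bu
```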

A Few Shortcuts and Options

There are several options available to you for formulating patterns to match against. Go ahead and check them out.
Character Classes: [xyz]
A set of characters surrounded by square brackets is called a character class, and signifies that any one character from that set is acceptable. For example, the regexp "w[ao]nder" matches both "wander" and "wonder".


To make a set of characters unacceptable, start the character class with a caret (^). For example, the regexp "[^1234567890]" matches any character that isn't a digit.
You can also use a hyphen to specify a range of characters. For instance, the preceding example can be rewritten as [^0-9], and a lowercase letter can be matched with [a-z].
You can use one or more of these ranges alongside each other, so if you wanted to match a single hexadecimal digit, you could write [0-9A-F]. The brackets contain the whole expression and represent just a single character, which may be any of the characters specified by either range in the class. If you used [0-9][A-F], you'd match a digit followed by a letter from A to F.
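As a rough sketch of the difference between one class with two ranges and two classes in a row, the following uses preg_match() (an assumption, since ereg() is missing from current PHP). Both function names are hypothetical, and the ^ and $ anchors used here are explained later in this section.

```php
<?php
// Hypothetical helper: one class with two ranges, so a single character 0-9 or A-F.
function is_hex_digit(string $char): bool
{
    // The D modifier stops $ from matching just before a trailing newline.
    return preg_match('/^[0-9A-F]$/D', $char) === 1;
}

// Hypothetical helper: two classes in a row, so a digit followed by a letter A-F.
function is_digit_then_letter(string $text): bool
{
    return preg_match('/^[0-9][A-F]$/D', $text) === 1;
}

var_dump(is_hex_digit('7'));          // bool(true)
var_dump(is_hex_digit('G'));          // bool(false)
var_dump(is_digit_then_letter('7C')); // bool(true)
var_dump(is_digit_then_letter('C7')); // bool(false)
```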


Some character classes such as digits, letters, and various types of whitespace are going to come up again and again. There are some neat shortcuts for these. Here are the most common ones, and what they represent:


Shortcut   Expansion        Description
\d         [0-9]            Digits 0 to 9
\w         [0-9A-Za-z_]     A "word" character
\s         [ \t\n\r]        A whitespace character (space, tab, newline, or return)


And here are the shortcuts' negative forms:


Shortcut   Expansion        Description
\D         [^0-9]           Any non-digit
\W         [^0-9A-Za-z_]    A non-"word" character
\S         [^ \t\n\r]       A non-blank character
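Here is a small, hedged sketch of the shortcuts at work. It relies on the PCRE functions preg_replace(), preg_match(), and preg_split(), which understand \d, \w, and \s; the two helper names are made up for this example.

```php
<?php
// Hypothetical helper: \D matches any non-digit, so removing every \D keeps only digits.
function digits_only(string $text): string
{
    return preg_replace('/\D/', '', $text);
}

// Hypothetical helper: true when the whole string consists of "word" characters.
function is_single_word(string $text): bool
{
    return preg_match('/^\w+$/D', $text) === 1;
}

var_dump(digits_only('(555) 123-4567'));         // the digits 5551234567
var_dump(is_single_word('user_42'));             // bool(true)
var_dump(is_single_word('user 42'));             // bool(false), the space is not a \w
var_dump(preg_split('/\s+/', "one  two\tthree")); // one, two, three
```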
Anchors
Character classes match characters anywhere in a string, but certain symbols can be used to indicate the location in the string where the match must occur. These symbols are called anchors.


The two anchors are ^, which appears at the beginning of the pattern, anchoring a match to the beginning of the string, and $, which appears at the end of the pattern, anchoring it to the end of the string. To see if a string ends with a full stop (remember, the full stop is a special character) you could use a regexp like this: "\.$". Likewise, you can use "^B" to see if there's a capital "B" at the beginning of the string.
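The two anchor examples can be tried out with a sketch like this one. It assumes preg_match() in place of ereg(), and both function names are hypothetical.

```php
<?php
// Hypothetical helper: "\.$" requires a literal full stop at the end of the string.
function ends_with_full_stop(string $text): bool
{
    return preg_match('/\.$/', $text) === 1;
}

// Hypothetical helper: "^B" requires a capital B at the beginning of the string.
function starts_with_capital_b(string $text): bool
{
    return preg_match('/^B/', $text) === 1;
}

var_dump(ends_with_full_stop('Done.'));    // bool(true)
var_dump(ends_with_full_stop('Done'));     // bool(false)
var_dump(starts_with_capital_b('Bob'));    // bool(true)
var_dump(starts_with_capital_b('bob'));    // bool(false)
```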
Word Boundaries
To help you properly search for words when the words may be preceded or followed by a variety of punctuation marks, there are special symbols called word boundaries. Word boundary symbols enable you to designate a pattern as having to match the beginning or ending of a word. They are required because words aren't always separated by spaces, but are sometimes separated by commas, periods, and other punctuation marks.


For example, you can use the special \b word boundary symbol to find one-letter words using the regexp "\b\w\b". Like the anchor symbols, \b doesn't actually match any character in particular, but matches the point between something that isn't a word character (\W or one end of the string) and something that is (hence \b for boundary).
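To see the one-letter-word pattern at work, you might try a sketch like the following. It uses preg_match_all() (PCRE supports \b), and one_letter_words() is a hypothetical helper.

```php
<?php
// Hypothetical helper: collects every one-letter word found in $text.
function one_letter_words(string $text): array
{
    // \b on each side means the single \w cannot be part of a longer word.
    preg_match_all('/\b\w\b/', $text, $m);
    return $m[0];
}

// The comma and the full stop count as boundaries just as spaces do.
print_r(one_letter_words('I have a dog, a cat.')); // I, a, a
```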
Alternatives
In some cases, you may want to use a symbol that causes an either/or condition. The "either-or" operator in a regular expression is the same as the bitwise "or" operator: |. For example, to match either "yes" or "maybe" you'd use the regexp "yes|maybe".
Qualifiers
Qualifier symbols (?, +, and *) enable you to create regexps that match against a set of characters that may occur once, may occur more than once, or may not occur at all. The simplest is ?, which matches the immediately preceding character or metacharacter if it appears once or not at all. For instance, to match the word "he" or "she", you can use "s?he". Notice that the question mark (?) immediately follows the "s": that is what makes the "s" optional. If the "s" doesn't appear (as in the word "he"), a match is still found.


To make a series of characters (or metacharacters) optional, group them in parentheses: you can match either "man" or "woman" with the regexp "(wo)?man".


You can match something one or more times by using the plus sign. To match an entire word without specifying how long it should be, use "\w+".


You also may have something that could occur any number of times but might not be there at all (that is, zero, one, or many times). For that you need what's called the "Kleene star" (the * quantifier, which is simply called the star from here on out). So, for example, to find a capital letter after any (but possibly no) whitespace at the start of the string, you'd use "^\s*[A-Z]".
The three qualifiers available are demonstrated by the following examples:
  • hea?t Matches either "heat" or "het"
  • hea+t Matches "heat", "heaat", "heaaat"...
  • hea*t Matches "het", "heat", "heaat"...
Novice programmers tend to overuse combinations of dots (representing anything) and stars (representing anything or nothing), often with unexpected, and unwanted, results. Some rules of thumb:
  • A regular expression should almost never start or finish with a starred character.
  • Using .* or .+ in the middle of a regular expression will match as much of your string as they possibly can.
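The three qualifiers, the optional group, and the greediness of a starred dot can all be checked with a short sketch. It assumes preg_match() rather than ereg(), and full_match() is a hypothetical helper that wraps a regexp in anchors.

```php
<?php
// Hypothetical helper: true only when $regexp matches the whole of $text.
function full_match(string $regexp, string $text): bool
{
    return preg_match('/^(' . $regexp . ')$/D', $text) === 1;
}

var_dump(full_match('hea?t', 'het'));      // bool(true), the "a" is optional
var_dump(full_match('hea+t', 'het'));      // bool(false), at least one "a" is required
var_dump(full_match('hea*t', 'heaaat'));   // bool(true), any number of "a"s
var_dump(full_match('(wo)?man', 'woman')); // bool(true), the group "wo" is optional

// A starred dot is greedy: it runs on to the LAST ">" it can find.
preg_match('/<.*>/', '<b>bold</b> text', $m);
var_dump($m[0]);                           // <b>bold</b>, not just <b>
```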
Quantifiers
Quantifiers are used to set limits and ranges on the number (quantity) of characters to be matched. For example, \s{2,3} matches against 2 or 3 spaces (or the other characters mentioned previously in the shortcut table). \s informs the regexp to look for spaces, and the curly braces surrounding the 2,3 tell the regexp to use these figures as a range: from 2 to 3 characters. Think of the 2 as the minimum number and the 3 as the maximum number. Quantifier ranges work in a similar way as the ranges specified in character classes, but the syntax is a bit different: you use the symbol for the character to match first, then within curly braces (instead of brackets) you put the quantities you want separated by a comma.


Leaving out either the maximum or the minimum (but leaving in the comma) signifies "or more" and "or fewer", respectively. For example, {2,} denotes "2 or more", while {,3} is "3 or fewer". In these cases, the same warnings apply as for the star operator.


You can put a few of these special characters together, and do some cool things. For example, you can specify exactly how many things are to be in a row by putting just that number inside the curly brackets: "\b\w{5}\b" (matches a five-letter word).
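As an illustrative sketch of exact and open-ended quantifiers, the following uses the PCRE functions (an assumption, since ereg() is gone from newer PHP versions). Both helper names are hypothetical.

```php
<?php
// Hypothetical helper: collects every word that is exactly five characters long.
function five_letter_words(string $text): array
{
    preg_match_all('/\b\w{5}\b/', $text, $m);
    return $m[0];
}

// Hypothetical helper: {2,} means "2 or more" whitespace characters in a row.
function has_wide_gap(string $text): bool
{
    return preg_match('/\s{2,}/', $text) === 1;
}

print_r(five_letter_words('The quick brown foxes jumped')); // quick, brown, foxes
var_dump(has_wide_gap('a  b')); // bool(true), two spaces
var_dump(has_wide_gap('a b'));  // bool(false), one space
```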
Summary of Metacharacters
Here's a summary of the metacharacters you've just seen:


Metacharacter   Meaning
[abc]           any one of the characters a, b, or c
[^abc]          any one character other than a, b, or c
[a-z]           any one ASCII character between a and z
\d \D           a digit; a non-digit
\w \W           a "word" character; a "non-word" character
\s \S           a whitespace character; a non-whitespace character
\b              the boundary between a \w character and a \W character
.               any character (apart from a new line)
(abc)           the phrase "abc" as a group
?               preceding character or group may be present 0 or 1 times
+               preceding character or group is present 1 or more times
*               preceding character or group may be present 0 or more times
{x,y}           preceding character or group is present between x and y times
{,y}            preceding character or group is present at most y times
{x,}            preceding character or group is present at least x times
{x}             preceding character or group is present x times
^               the beginning of the string
$               the end of the string

Validating Data Entry

One of the most common items entered by users in Web applications is the e-mail address, which just happens to provide an excellent subject for validation because of its unique format. First, it's not easy to define a length, because you never know what could be allowed as the "name" portion of the address. Second, it contains the @ sign, which is not typically found in other common data entry items. And third, there is a requirement for a .com, .net, .org, or some other acceptable ending after the domain name.


An e-mail address is requested on Web forms so frequently because it may be the only contact information you can get from a person, and if for any reason the e-mail address isn't valid, that's the end of it; you're not going to be able to contact that person.


So there's plenty of incentive to validate e-mail addresses. And there are plenty of e-mail address validation routines around. Some are highly complex and can validate e-mail addresses with a high degree of precision; others are very simple and have a lower percentage of success. One of the simplest routines just checks to make sure the @ sign is present once; obviously, if there's no @ sign, the e-mail address cannot be valid, and if the @ sign is present more than once the address is also invalid. But users could provide an @ sign and still not give a valid address.


Regexps come in handy for creating an e-mail address validation routine because they can do quite a bit of conditional checking. For example, there should be some characters before and after the @ sign in any e-mail address. And there should be a dot (.) preceded by and followed by more characters on the right side of the @ sign. And there are some characters that aren't permitted on either side of the @ sign (such as a blank space). Regexp patterns can be constructed to detect anomalies in e-mail address structure very precisely, such as this one:


^[^@ ]+@[^@ ]+\.[^@ \.]+$

This pattern may not make much sense right off the bat; let's spread it out so you can see what's really there. The following table identifies each symbol in order and describes what it's doing:


Symbols    Matches
^          The beginning of the string...
[^@ ]      ...there is one character, which can be anything other than an at sign (@) or a space
+          ...which is repeated one or more times
@          There is then an at sign
[^@ ]      Next, there is one character that can be anything other than an at sign or a space
+          ...which is repeated one or more times
\.         There is then a period (which must be escaped)
[^@ \.]    There is one character that can be anything other than an at sign, a space, or a period
+          ...which is repeated one or more times. The last one must be followed immediately by...
$          ...the end of the string


As complicated as this pattern appears to be, it's really a simple one and by no means foolproof. There are much more complex e-mail address/URL parsing procedures out there, and they can be very longwinded.


Next, you can use this regexp with ereg() to validate incoming e-mail addresses.


Try it Out: Validate E-mail Addresses
  1. Enter the following code into your Web editor:
    <html>
    <head><title></title></head>
    <body>
    <?php
    //email_validation.php
    if (isset($_POST['posted'])) {
    
       $email = $_POST['email'];
       $theresults = ereg("^[^@ ]+@[^@ ]+\.[^@ \.]+$", $email, $trashed);
       if ($theresults) {
          $isamatch = "Valid";
       } else {
          $isamatch = "Invalid";
       }
    
       echo "Email address validation says $email is " .$isamatch;
    }
    ?>
    <form action="email_validation.php" method="POST">
    <input type="hidden" name="posted" value="true">
    Enter your email address for validation:
    <input type="text" name="email" value="name@example.com">
    <input type="submit" value="Validate">
    </form>
    </body>
    </html>
    
  2. Save the file as email_validation.php and close it.
  3. Run the script, and try out the validation process with "good" and "bad" e-mail addresses.



How it Works

The ereg() function does the pattern matching against the regexp. When the form is submitted, the string supplied is captured in the $email variable (from the $_POST['email'] variable). Then, ereg() is used to check whether it is valid. If it is, the word "Valid" is placed inside the $isamatch variable. If not, the word "Invalid" is placed inside the $isamatch variable. The results are then echoed out to the user:


<?php
//email_validation.php
if (isset($_POST['posted'])) {
  $email = $_POST['email'];
  $theresults = ereg("^[^@ ]+@[^@ ]+\.[^@ \.]+$", $email, $trashed);
  if ($theresults) {
     $isamatch = "Valid";
  } else {
     $isamatch = "Invalid";
  }
  echo "Email address validation says $email is " . $isamatch;
}
?>

A default e-mail address is provided within the email form field to show the user what format his data entry must have (hopefully he'll understand that he needs to replace this value with a real one).


This script is a good start on validating e-mail addresses, but it's not perfect. It allows characters after the @ sign that aren't legal in a domain name (*, for example), and it doesn't actually verify that the domain entered exists. The best way to verify e-mail addresses is to see if the server you're sending to accepts them; even a properly formatted e-mail address is invalid if the server you're sending to doesn't accept it.
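The pattern's strengths and its blind spot can be seen in a short sketch. It assumes preg_match() in place of ereg() (which is not available in newer PHP releases), and looks_like_email() is a hypothetical helper. PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL is shown alongside as one stricter alternative.

```php
<?php
// Hypothetical helper built on the pattern from this section.
function looks_like_email(string $email): bool
{
    // The D modifier stops $ from matching just before a trailing newline.
    return preg_match('/^[^@ ]+@[^@ ]+\.[^@ \.]+$/D', $email) === 1;
}

var_dump(looks_like_email('name@example.com')); // bool(true)
var_dump(looks_like_email('name@example'));     // bool(false), no dot after the @
var_dump(looks_like_email('a@b@example.com'));  // bool(false), two @ signs
var_dump(looks_like_email('name@ex*mple.com')); // bool(true), the blind spot described above

// filter_var() returns the address on success, or false on failure.
var_dump(filter_var('name@example.com', FILTER_VALIDATE_EMAIL) !== false); // bool(true)
```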

Using Regexps to Validate URLs

Domain names and full URLs (Uniform Resource Locators) provide great subjects for the pattern-matching abilities of regexps. The structure of a domain name is simply a name followed by a dot followed by a domain name extension (such as .com, .net, .org, and so on), but there are limitations of the characters that can be present in the name, and there's no need to prefix the domain name with www (or anything else for that matter). URLs include domain names, but are prefixed with http://, and a complete URL may include the path (the folder names separated by slashes), the filename, and even a query string attached to the end. There are quite a few variations, and like e-mail addresses it's important to get it right.
Here's the format of a Uniform Resource Locator (URL):


  • Protocol (such as ftp or http)
  • Domain or server name (such as wrox.com; the www is not required)
  • Folder and file path (optional in some cases, includes folder and filename separated by slashes, such as images/myimage.gif)
  • Querystring (optional, starts with ? and then one or more name/value pairs).
Again, regexps to the rescue! This time, try this helpful (and highly aesthetically pleasing) snippet:


^[a-zA-Z0-9]+://[^ ]+$

Similar to the key line in the last sample script, this expression can be used in a line like this:
$theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $intext, $trashed);

In the following example, you'll check out URLs for correct formatting.


Try it Out: Check for Correctly Formatted URLs
  1. Start your Web page editor and type in the following:
    <html>
    <head></head> <body>
    <?php
    //url_validate.php
    if (isset($_POST['posted'])) {
       $url = $_POST['url'];
       $theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $url, $trashed);
       if ($theresults) {
          $isamatch = "Valid";
       } else {
          $isamatch = "Invalid";
       }
      echo "URL validation says $url is " . $isamatch;
    }
    ?>
    <form action="url_validate.php" method="POST">
    <input type="hidden" name="posted" value="true">
    Enter your URL for validation:
    <input type="text" name="url" value="http://www.example.com" size="30">
    <input type="submit" value="Validate">
    </form>
    </body>
    </html>
    
  2. Save this file as url_validate.php and close it.
  3. Run the script and test the script by trying to validate several URLs, some valid and some not.


How it Works
Similar to email_validation.php, this script is mostly powered by a single regular expression via the ereg() function:


$theresults = ereg("^[a-zA-Z0-9]+://[^ ]+$", $url, $trashed);

The regexp pattern matches valid URLs, but unfortunately it also matches quite a few other things that are not valid URLs. Like many regexp patterns, you'll find that it is very difficult to get perfect matching without spending a lot of time working on it (some URL patterns are hundreds of characters in length, and still aren't perfect). The main idea is to try to eliminate the most common errors that people make, and be satisfied with that. After all, any slight change in what constitutes a valid URL (or e-mail address) may throw your regexp off, so what's perfect one day may be flawed the next, through no fault of your own.
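A quick sketch shows both sides of that trade-off. It assumes preg_match() instead of ereg(), and looks_like_url() is a hypothetical helper.

```php
<?php
// Hypothetical helper built on the URL pattern from this section.
function looks_like_url(string $url): bool
{
    // "#" is used as the delimiter so the slashes in "://" need no escaping.
    return preg_match('#^[a-zA-Z0-9]+://[^ ]+$#D', $url) === 1;
}

var_dump(looks_like_url('http://www.example.com')); // bool(true)
var_dump(looks_like_url('www.example.com'));        // bool(false), no protocol
var_dump(looks_like_url('http://exa mple.com'));    // bool(false), contains a space
var_dump(looks_like_url('x://y'));                  // bool(true), well formed but hardly a real URL
```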

Using Regexps to Check File Path Parameters

The file system and the functions PHP provides to work with files and folders are discussed in Chapter 7; but let's take a quick look at some regexp functions that you'll find helpful in protecting data you store in files (yes, files—databases aren't the only means of persistent data storage; common text files are often used to store application data for extended periods).
In persistent data storage the term persistent is used to signify data that lives on even after your application quits or the system is turned off.
These functions (and the example code you'll create) can be very useful for limiting access to certain folders and files because they make it easy to match things that are file or folder names. Although you may want to allow access to certain files and folders, there may be others that no one but the system administrator should have access to.


In the following example, you'll stop users from traversing the directory tree, by removing potentially sensitive information from the file path. A variation of the ereg() function named ereg_replace() does the heavy lifting.


To get the job done, you write patterns that, when run with the ereg_replace() function, remove any "..", leading "/", or leading "\" from the path. UNIX systems use forward slashes as path separators, whereas Windows systems use backslashes, and classic Mac OS systems use colons. The code also removes the prefixes of absolute paths, that is, paths starting with "/" or with a drive letter ([A-Za-z]) followed by a colon.


Try it Out: Prevent Users from Accessing Sensitive Files
  1. Open your Web page editor and type in the following:
    <html>
    <head><title></title></head>
    <body>
    <?php
    //clean_path.php
    if (isset($_POST['posted'])) {
       $path = $_POST['path'];
       $outpath = ereg_replace("\.[\.]+", "", $path);
       $outpath = ereg_replace("^[\/]+", "", $outpath);
       $outpath = ereg_replace("^[A-Za-z][:\|][\/]?", "", $outpath);
       echo "The old path is " . $path . " and the new path is " . $outpath;
    }
    ?>
    <form action="clean_path.php" method="POST">
    <input type="hidden" name="posted" value="true">
    Enter your file path for cleaning:
    <input type="text" name="path" size="30">
    <input type="submit" value="Clean">
    </form>
    </body>
    </html>
    
  2. Save the file as clean_path.php and close it.
  3. Run the program.


As you can see, PHP can be used to filter out user input that is potentially dangerous as well as that which is simply incorrect.

How it Works

The first line of the program gets rid of ".." patterns (used to move up a level in the directory tree):


$outpath = ereg_replace("\.[\.]+", "", $path);

The second line eliminates leading slashes or backslashes:

$outpath = ereg_replace("^[\/]+", "", $outpath);

The third line gets rid of DOS/Windows-style prefixes (for example "C:\"):


$outpath = ereg_replace("^[A-Za-z][:\|][\/]?", "", $outpath);
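The three replacements can be gathered into one function, as in this sketch. It assumes preg_replace() in place of ereg_replace(), with the patterns translated to PCRE syntax; clean_path() is a hypothetical name. It mirrors only the three steps described here and is not a complete defense on its own.

```php
<?php
// Hypothetical helper mirroring the three ereg_replace() calls in clean_path.php.
function clean_path(string $path): string
{
    // 1. Remove runs of two or more dots ("..").
    $out = preg_replace('#\.\.+#', '', $path);
    // 2. Remove leading slashes and backslashes
    //    ('\\\\' in this PHP string is one escaped backslash in the regex).
    $out = preg_replace('#^[\\\\/]+#', '', $out);
    // 3. Remove a drive prefix such as "C:\" or "C:/".
    $out = preg_replace('#^[A-Za-z][:|][\\\\/]?#', '', $out);
    return $out;
}

var_dump(clean_path('../../etc/passwd'));      // etc/passwd
var_dump(clean_path('C:\Windows\system.ini')); // Windows\system.ini
var_dump(clean_path('/var/log'));              // var/log
```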

Saturday, February 27, 2010

Form Validation

Here's a maxim for you: Users can, and will, enter anything and/or nothing in your HTML forms, no matter how easy you make it to use the forms or what you expect the users to enter. Likewise, malicious users will deliberately enter oddball data to try to break your applications. Form validation on both the client and the server is your main defense.
One way to deal with this is to limit the values permitted in a certain text box. In loan.php, for example, you'll recall that the program passed the person's age into a variable called $age, and the PHP page then validated that value against a more realistic range of age values:


if ($age < 1 or $age > 130)
{
     echo "Incorrect age value entered. Please enter an age between 1 and 130.";
     break;
}

Of course, you're not limited to simply informing the user he's entered an inappropriate age value. You can take any other steps you want at this point, now that you've identified an inappropriate value.

Using the exit Statement

Performing validation is great, but if you encounter incorrect data, what can you do? Sometimes you just need to quit processing, and that's what the exit statement is for. Although the break statement exits only the current structure, the exit statement ends all processing. You can use either method to stop, but exit is much more abrupt, so be careful where you use it. No further HTML, PHP code, or text is processed after an exit is encountered, and unless you're very careful, the user may get back an unexpected result, such as a partially completed page. So why not rebuild the loan application example and tighten it up a bit more against possible user errors by incorporating some form validation logic?


Try it Out: Form Validation
  1. Open loan.php (from Chapter 4), save it as loan_fv.php, and insert the following changes:
    <html>
      <head><title></title></head>
      <body>
    <b>Namllu credit bank loan form</b>
    <?php
    if (isset($_POST['posted'])) {
    
       $age = $_POST['age'];
       $first_name = $_POST['first_name'];
       $last_name = $_POST['last_name'];
       $address = $_POST['address'];
       $loan = $_POST['loan'];
       $month = $_POST['month'];
    
       //validation
       if ($age < 10 OR $age > 130)
       {
         echo "Incorrect Age entered - Press back button to try again";
         exit;
       }
       if ($first_name == "" or $last_name == "")
       {
         echo "You must enter your name - Press back button to try again";
         exit;
       }
       if ($address == "")
       {
         echo "You must enter your address - Press back button to try again";
         exit;
       }
       if ($loan != 1000 and $loan != 5000 and $loan != 10000)
       {
            echo "You must enter a loan value -- Press back button to try again";
            exit;
       }
       $duration = 0;
       switch ($loan) {
       case "1000":
          $interest = 5;
          break;
       case "5000":
          $interest = 6.5;
          break;
       case "10000":
          $interest = 8;
          break;
       default:
          echo "You didn't enter a loan package!<hr>";
          exit;
       }
       while ($loan > 0)
        {
          $duration = $duration + 1;
          $monthly = $month - ($loan * $interest / 100);
          if ($monthly <= 0)
           {
             echo "You need larger repayments to pay off your loan!<hr>";
             exit;
           }
       $loan = $loan - $monthly;
       }
        echo "This would take you $duration months to pay this off
      at the interest rate of $interest percent.<hr>";
    
      }
      ?>
      <form method="POST" action="loan_fv.php">
      <input type="hidden" name="posted" value="true">
      <br>
    
      First Name:
      <input name="first_name" type="text">
      Last Name:
      <input name="last_name" type="text">
      Age:
      <input name="age" type="text" size="3">
      <br>
      <br>
      address:
      <textarea name="address" rows="4" cols="40">
      </textarea>
      <br>
      <br>
      what is your current salary?
      <select name="salary">
      <option value=0>under $10000</option>
      <option value=10000>$10,000 to $25,000</option>
      <option value=25000>$25,000 to $50,000</option>
      <option value=50000>over $50,000</option>
       </select>
       <br>
       <br>
       How much do you want to borrow? <br><br>
       <input name="loan" type="radio" value="1000">Our $1,000 package at 5.0% interest
       <br>
       <input name="loan" type="radio" value="5000">Our $5,000 package at 6.5% interest
       <br>
       <input name="loan" type="radio" value="10000">Our $10,000 package at 8.0% interest
       <br>
       <br>
       How much do you want to pay a month?
       <input name="month" type="text" size="5">
       <br>
       <br>
       <input type="submit" value="calculate">
       </form>
       </body>
       </html>
    
  2. Save this file as loan_fv.php and then close it.
  3. Open the file in your browser and enter some values that are out of bounds or otherwise incorrect. 



How it Works

You could enter someone else's address, or an age other than your own, and there's no way PHP can check for incorrect entries of this type. But what we can do with our new code is make sure that the user hasn't simply forgotten to add a detail, or maliciously supplied obviously wrong information about their age. We use four if statements to do this. The first checks whether the age entered is between 10 and 130; if it isn't, we can be pretty sure that the person is lying:


if ($age < 10 OR $age > 130)
{
   echo "Incorrect Age entered - Press back button to try again";
   exit;
}

If the age falls outside that range, you display the error message and exit there. If it is within the range, you don't need to do anything further.


The second if statement checks for first and last names being present. The "" denotes an empty string (a string with no characters in it), and this is how you check for one:


if ($first_name == "" or $last_name == "")
{
    echo "You must enter your name - Press back button to try again";
    exit;
}
Do the same check for the $address variable:
if ($address == "")
{
   echo "You must enter your address - Press back button to try again";
   exit;
}
And then check the values of the radio buttons to validate that one was selected.
if ($loan != 1000 and $loan != 5000 and $loan != 10000)
{
   echo "You must enter a loan value - Press back button to try again";
   exit;
}

If the $loan variable is not equal to any of these values, you know that the user didn't select a value.
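If you'd rather report every problem at once instead of exiting at the first one, the same four checks can be collected into a function. This is only a sketch of that alternative: validate_loan_form() is a hypothetical helper, the field names are taken from loan_fv.php, and the ?? operator assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: returns a list of error messages (empty when all checks pass).
function validate_loan_form(array $post): array
{
    $errors = [];

    // A missing or non-numeric age becomes 0 and fails the range check.
    $age = (int) ($post['age'] ?? 0);
    if ($age < 10 || $age > 130) {
        $errors[] = 'Incorrect Age entered';
    }
    if (trim($post['first_name'] ?? '') === '' || trim($post['last_name'] ?? '') === '') {
        $errors[] = 'You must enter your name';
    }
    if (trim($post['address'] ?? '') === '') {
        $errors[] = 'You must enter your address';
    }
    // Only the three loan packages offered by the form are acceptable.
    if (!in_array((string) ($post['loan'] ?? ''), ['1000', '5000', '10000'], true)) {
        $errors[] = 'You must enter a loan value';
    }
    return $errors;
}

// With nothing filled in, all four checks fail.
print_r(validate_loan_form([]));
```

In a page such as loan_fv.php you could call it as validate_loan_form($_POST) and echo each message before redisplaying the form.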

Preventing User-Inserted HTML: HTMLSpecialChars()

Another way users can abuse your applications is by entering HTML code directly in as part of their data entry into your form fields. This works because HTML is plain text, and when it comes back out, it's processed by the browser just like any other plain text characters that form HTML code. For example, if you have created an application that accepts user input for directory listings, a slick user might insert <b> before his name, and </b> after his name, so that his listing would have his name in bold. Nice trick, but unfair to the other users.


While this exact situation might not be a serious problem, under other circumstances it could be used to break your HTML or otherwise thwart your intent. Fortunately, PHP provides the HTMLSpecialChars() function, which changes HTML tags into special characters (more on this shortly). It just requires a string argument to work, like this:


$String = HTMLSpecialChars("<b>The bold tags won't appear after processing </b>");

A variable name will also do the trick:


$String ="<B> The bold tags won't appear after processing </B>";

$String = HTMLSpecialChars($String);

The HTMLSpecialChars() function converts the characters that make up HTML tags into what are called special characters. Special characters in HTML are simply entities that represent the HTML characters they have been translated from. For example, <b> is translated into &lt; (for the less-than sign), the letter b, and &gt; (for the greater-than sign). When the browser receives these special characters, it displays them on the screen as the HTML characters they represent, instead of processing and rendering them as ordinary HTML tags.


This feature is often used when you want to make a Web page that discusses HTML tags (when you need to display the tags in plain text rather than letting the browser process them), but it certainly comes in handy for preventing users from entering their own HTML into your PHP application.
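A minimal sketch of the directory-listing scenario follows. The $listing value is made up for illustration, and the ENT_QUOTES flag and 'UTF-8' charset are passed explicitly as an assumption about how you'd want quotes and encoding handled.

```php
<?php
// A user tries to sneak bold tags into a directory listing.
$listing = "<b>Bob</b>";

// The angle brackets become entities, so the browser shows the tags as text.
$safe = htmlspecialchars($listing, ENT_QUOTES, 'UTF-8');

echo $safe; // prints &lt;b&gt;Bob&lt;/b&gt;
```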


Ultimately, there is no limit to the degree of error checking you can perform. In fact, you could preset all variable values in advance, but then what would be the point of providing a form? You could just automatically do everything in advance. Seriously, it really helps to try to think like a user. The ratio of users to hackers is pretty small, meaning that most users who break your application do so unintentionally. They'll either not understand what you meant, or simply make a mistake. So attempt to think like your users, try your applications out on family members and friends, and try to anticipate all the possible responses, accounting for them as comprehensively as possible. The extra work you put in will be well worth it.

String Validation and Regular Expressions

All the data your application receives from the user's browser is formatted as strings, as you know. So PHP's wealth of string manipulation functions come in very handy for validating data entered by the user, or reformatting string data into sequences acceptable as other data types. But PHP contains other functions, called regular expression functions, that are a quantum leap more powerful when it comes to manipulating data. You'll explore both of these subjects in the next few sections.

String Validation

You can use PHP's string manipulation functions in a variety of clever ways to perform basic validation of data being entered by users. In this section you'll see a few common ones, but remember, you can easily devise your own, using the functions covered here, other PHP functions, or combinations of all of them. You've already had a go at a few of these (like strlen() and substr()) in Chapter 2 and in some of our other example scripts, but it doesn't hurt to use them again in a validation context.
Using strlen()
Some data is always a certain length, such as U.S. ZIP codes, which are always 5 digits or, in the case of ZIP+4, 5 digits, a hyphen, and 4 more digits. So one way to validate data entered as postal codes is to use the strlen() function like this:
if (strlen($postal_code) == 5 or strlen($postal_code) == 10) {
   //if the length is 10, check that the hyphen is in spot 6
   //do something
} else {
   //send error message
}
Using strstr()
In the preceding example you needed to find out whether the 10-character string actually contains a hyphen. The strstr() function is useful for this because, as its name implies, it looks for a string within a string. In this example, you want it to look for a one-character string consisting of a hyphen. Note that strstr() only tells you that the hyphen appears somewhere in the string, not that it sits in the sixth spot. This code would work:
if (strlen($postal_code) == 5) {
    //do something with the five-digit ZIP code
} elseif (strlen($postal_code) == 10
  and strstr($postal_code, "-")) {
    //do something with the ZIP+4 code
} else {
    //wrong length, or 10 characters without a hyphen
    //send error message
}
You may also want to use the strstr() function to check for a space, just in case the user omitted the hyphen but entered a space instead.
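Because strstr() only reports whether the separator occurs somewhere in the string, a stricter check can look at the sixth character directly. The following is a minimal sketch rather than the only way to do it; the function name zip_separator_ok() is made up for this illustration, and it accepts a space in place of the hyphen, as suggested above.

```php
<?php
// Hypothetical helper: returns true when a 10-character postal code has a
// hyphen or a space as its sixth character.
function zip_separator_ok($postal_code)
{
    if (strlen($postal_code) != 10) {
        return false;
    }
    // String offsets start at 0, so the sixth character is at offset 5.
    $separator = substr($postal_code, 5, 1);
    return $separator === "-" || $separator === " ";
}

var_dump(zip_separator_ok("12345-6789")); // hyphen in the sixth spot
var_dump(zip_separator_ok("1234-56789")); // hyphen in the wrong spot
```

Checking the exact position rejects input such as 1234-56789, which a plain strstr() test would let through.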
Using substr()
Continuing with the same example, suppose you want to separate out the +4 portion of the ZIP+4. You know that this portion begins at the seventh character and should be four characters long. You could employ the substr() function for this purpose. As arguments, the substr() function takes the string in question, an integer giving the position to start at (counting from 0, so the seventh character is at position 6), and an optional integer specifying the number of characters to return. Code like this would work:


if (strlen($postal_code) == 5
 or strlen($postal_code) == 10) {
   if (strlen($postal_code) == 10) {
      //position 6 is the seventh character, just past the hyphen
      $plus4_portion = substr($postal_code, 6, 4);
   }
} else {
  //send error message
}
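The same idea can be wrapped in a small function that returns both parts of the code. This is only a sketch: split_zip() and its array keys are invented names, and it assumes the input has already passed whatever digit and separator checks you need. Keep in mind that substr() counts positions from 0, so the seventh character is at position 6.

```php
<?php
// Hypothetical helper: splits a ZIP or ZIP+4 string into its two parts.
// Returns false when the length is neither 5 nor 10.
function split_zip($postal_code)
{
    $length = strlen($postal_code);
    if ($length == 5) {
        return array("zip" => $postal_code, "plus4" => "");
    }
    if ($length == 10) {
        return array(
            "zip"   => substr($postal_code, 0, 5), // characters 1 to 5
            "plus4" => substr($postal_code, 6, 4), // characters 7 to 10
        );
    }
    return false;
}

$parts = split_zip("12345-6789");
echo $parts["zip"], " / ", $parts["plus4"], "\n";
```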
Using addslashes() and stripslashes()
For applications in which you allow the user to enter data that may be going into a database, it's a great idea to use the addslashes() function. This function adds a backslash in front of any character that might cause a problem in a database query (the ', ", and \ characters, and the NUL byte). When you output escaped data to the user again, use the stripslashes() function to remove, you guessed it, the backslashes that addslashes() inserted.
Why is there a function dedicated to protecting your database entries? If you've ever composed a SQL string for inserting a record (or for just about any database function, for that matter), you're aware that SQL tends to be very intolerant of misplaced apostrophes (if you've not used SQL yet, you'll get that experience in Chapters 9–11).


So perhaps your SQL string is supposed to look like this:


$query = "INSERT INTO clients (username) values('$username')";
mysql_query($query);

Now, if the user enters joeblow for his username, the query should run without a hitch. But if the user enters joe'blow, your query blows up because your database sees:


$query = "INSERT INTO clients (username) values('joe'blow')";

It might be a little hard to see, but if you look carefully you can see that the apostrophe in the user's username looks to the database like a broken set of delimiters. How can you ensure this doesn't happen? Apply the addslashes() function to the value entered by the user, and the username stored in the database will be correct. For example, using this would work:


$username = addslashes($username);
$query = "INSERT INTO clients (username) values('$username')";
mysql_query($query);

When this code runs, if the user enters joe'blow, addslashes() converts that value to joe\'blow. The backslash escapes the apostrophe, so your database treats the apostrophe as part of the value instead of as a closing delimiter, and the query no longer blows up. The database drops the backslash as it reads the query, so the value it stores is joe'blow.


But you must take care never to show the escaped value to the user without running it through stripslashes() first. Otherwise the user sees joe\'blow as his username, and if he doesn't remove the backslash, the next time he edits his username he'll end up with joe\\\'blow being sent because addslashes() will escape both characters, and this could cause problems.


And you also must make sure to use addslashes() anywhere that the user's username is placed in a query (like when he logs in), because a raw joe'blow breaks a SELECT in exactly the same way it breaks the INSERT. Only by escaping the value every time can you be sure the database will match up the values correctly.
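The effect of the two functions can be seen without a database at all. Here is a minimal sketch that uses only the built-in addslashes() and stripslashes() functions on a made-up sample value:

```php
<?php
$username = "joe'blow";

$escaped  = addslashes($username);  // one backslash before the apostrophe
$twice    = addslashes($escaped);   // escaped again: three backslashes
$restored = stripslashes($escaped); // back to the original value

echo $escaped, "\n";
echo $twice, "\n";
echo $restored, "\n";
```

The second line of output shows what happens when a value that is already escaped gets escaped again, which is why it pays to keep track of whether a variable holds the raw value or the escaped one.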

Regular Expressions

Finding a specific string within another one is quite helpful in some situations, and of course if you know exactly what string you're looking for, the strstr() function can handle the situation for you. But suppose you don't know exactly what string you'll be looking for. If you know even a little about the string, you can use regular expressions and PHP's regular expression functions to help you find it.


Suppose, for example, that you know you'll be looking for a string made up of all alphabetic characters and no numerical characters. In that case, you at least know the pattern you're looking for, and that's enough to start with. The simplest pattern is a word or a single character, as in the earlier strstr() example, which is looking for the hyphen:


if (strstr($postal_code, "-")) {

And to separate out data values in a string (a string of comma-separated values, for instance), you can use the explode() function, which splits a string on a delimiter and returns the pieces as an array. The function takes two arguments, first the delimiter to split on and then the string to split. You can use explode() with a simple if statement to test for the existence of a particular word within a string, as shown in the following code:


<?php
$words = "you, should, vote, happily";
$wordarray = explode(",", $words);
foreach ($wordarray as $word) {
   // explode() leaves the space after each comma attached, so trim it off
   if (trim($word) == "vote") {
      echo "Found string 'vote'";
   }
}
?>

But although you can use a simple PHP function like strstr() to find a matching character within a larger string of characters, or the explode() function for slightly more complex matching operations, you'll often find the need for much more complex patterns to match. That's where regular expressions come in handy.


Regular expressions, called regexps, are like a mini-programming language for creating very powerful patterns. They use a special notation to form the patterns that are used to match the values (or parts of values) that you provide. Certain characters take on special meanings in the context of a regexp, enabling you to broaden or narrow matches against sub-strings in the data. Some regexps will find characters belonging to a specified group; others find characters repeated a certain number of times. Regular expressions necessarily follow certain rules of syntax, which will be outlined as you read on.
Regular expressions are not limited to PHP. Languages such as Perl and Python, along with UNIX utilities like sed and egrep, use much the same notation for finding patterns in text. PHP's regular expression functions that follow Perl notation are called PCRE (Perl-Compatible Regular Expressions) functions and begin with preg, whereas the ordinary PHP regular expression functions are termed POSIX-extended regular expression functions. Don't use the ordinary (POSIX-extended) PHP regexp functions on binary data (they're not binary-safe); use the PCRE regexp functions instead.
So let's take a look at how to pattern match with some of the PHP regular expression functions, starting with the ereg() function.

Using ereg()

The explode() approach works after a fashion, but it's clunky, complicated, and hard-coded (the word "vote" is actually part of the code, instead of coming as input; indeed, the entire array is hard-coded). Worse still, explode() splits only on the comma itself, so every word after the first keeps its leading space: a plain comparison with "vote" fails unless that space is trimmed off first. This looks like a difficult problem, but it should be easy. Here's how it looks using a regular expression:


<?php
$words = "you, should, vote, happily";
if (ereg("vote", $words)) {
   echo "Found string 'vote'";
}
?>

Use the PHP function ereg() and just specify the pattern (the word you want to match that constitutes the actual regexp) and the string you want to match it against. It returns True if the pattern match was successful (in this case, on finding the character sequence "vote" in the string held by $words) and False if it wasn't.


You can also specify a third argument in ereg(): the name of an array, which is used to store successfully matched expressions. Here's the preceding example modified to make use of it:


<?php
$words = "you, should, vote, happily";
if (ereg("vote", $words, $reg)) echo "Found string '$reg[0]'";
?>
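Note that ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so on a current installation the same test has to be written with one of the PCRE functions mentioned earlier. Here is a rough equivalent using preg_match(); the main visible difference is that the pattern needs delimiters (slashes here), and the array name $matches is just a choice for this sketch.

```php
<?php
$words = "you, should, vote, happily";

// preg_match() returns 1 on a match and fills $matches;
// $matches[0] holds the text that matched the whole pattern.
if (preg_match("/vote/", $words, $matches)) {
    echo "Found string '$matches[0]'";
}
```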

Literal text written as a string is the simplest regular expression of all to look for, but you don't have to search for just the one word—you could look for any particular phrase. 


However, all the characters you're searching for must match exactly—words (with correct capitalization), numbers, punctuation, and even whitespace:


<?php
$words = "Vote twice or more if you can.";
if (ereg("twice if", $words, $reg)) echo "Found string '$reg[0]'";
?>
This pattern won't match, because the string never contains "twice if" exactly: the words "or more" sit between "twice" and "if". Similarly, spaces inside the pattern are significant:


<?php
$words1 = "The bigdog is in the pound...";
$words2 = "...but the dog is in the cornfield";
$regexp = " dog";
if (ereg($regexp, $words1, $reg)) echo "Found string '$reg[0]'";
if (ereg($regexp, $words2, $reg)) echo "Found string '$reg[0]'";
?>

This finds only the second dog because both ereg() calls are specifically looking for a space followed by the three letters "d", "o", and "g".
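The same behavior can be checked with preg_match(), which is the function to reach for on PHP 7.0 and later, where ereg() is no longer available. This is a sketch under that assumption; the $samples and $found arrays are names introduced only for this example.

```php
<?php
$samples = array(
    "The bigdog is in the pound...",
    "...but the dog is in the cornfield",
);
$regexp = "/ dog/"; // a space, then the letters d, o, g

$found = array();
foreach ($samples as $index => $text) {
    // Record whether each sample contains the pattern
    $found[$index] = preg_match($regexp, $text) === 1;
}
var_dump($found);
```

Only the second sample matches, because "bigdog" has no space in front of "dog".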