One key characteristic of databases that you haven't looked at so far is how a database organizes data for users to interact with. Whenever you use the database engine to retrieve, modify, or add new data, you need to know what to ask for—moreover, when you come to design a database for yourself, you need to tell the database engine what sort of data to expect where, and how it's all meant to hang together.
Most of the databases you considered earlier use what's known as a relational data model. As such, they're known as relational databases (or Relational Database Management Systems—RDBMS). They arrange data into tables, each of which is divided into rows and columns.
In database parlance, each row in a table (apart from the heading) represents a data record: a set of intrinsically connected pieces of data. Likewise, each column represents a field: a specific type of data that has the same significance for each record in the table.
For all practical purposes, the term row is synonymous with record, whereas column is synonymous with field. This is useful to bear in mind when visualizing tables.
Suppose that the manager of a touch-rugby team (don't worry if you aren't familiar with touch rugby —the sport is incidental to the topic and won't affect your understanding of databases in any way) sets up a database so that he can track the matches in which his players compete. Naively, he asks each player to enter his details into the database after each match. After the second round of matches, the manager's table looks like this:
| Player_Number | Name | Phone_number | Date_Played | Nickname |
|---|---|---|---|---|
| 42 | David | 555-1234 | 03/03/04 | Dodge |
| 6 | Nic | 555-3456 | 03/03/04 | Obi-d |
| 2 | David | 555-6543 | 03/03/04 | Witblitz |
| 14 | Mark | 555-1213 | 03/03/04 | Greeny |
| 2 | David | 555-6543 | 02/25/04 | Witblitz |
| 25 | Pads | 555-9101 | 02/25/04 | Pads |
| 6 | Nic | 555-3456 | 02/25/04 | Obi-d |
| 7 | Nic | 555-5678 | 02/25/04 | Nicrot |
He soon realizes that this is going to make for a huge table after everyone on the team has played an entire season's worth of games. As you can see, the structure of the table is inefficient because each player's details—number, name, phone number, and so on—are entered every time he plays a match.
Such redundancy is undesirable in a database. For example, say that the player with the number 6 keeps dropping the ball, and his teammates decide to give him a new nickname (which won't be mentioned here). To update the table, every one of this player's records would have to be modified to reflect his new nickname.
In addition, every time a player enters his details after a match, all of that information is consuming valuable space on the hard drive. Redundancy is terribly inefficient, wasting a great deal of time and space.
Fortunately, in the early 1970s, Dr. E. F. Codd came up with a unique and powerful way to alleviate this type of problem. He stipulated a set of rules that, when applied to data, ensure that your database is well designed. In truth, he stipulated no small number of rules, but for most purposes you need only concern yourself with the first few. These requirements can be divided up into what are called normal forms, and making sure your data complies with these normal forms goes a long way to ensuring good relational database design. Let's take a look at what normalization is all about.
Normalization
Normalization is defined as "the process of breaking up the data into several tables, so as to minimize the number of times you have to repeat the same data." The normal forms are degrees of normalization, and they are governed by an elegant set of rules that can be summarized as follows:
First Normal Form (1NF)
- Create a new table for each new set of related data you want to store.
- Eliminate repeating information in an individual table.
- Uniquely identify each record with a primary key.
For example, in 1NF you could split the preceding table into a player information table:
| Player_Id | Name | Phone_number | Nickname |
|---|---|---|---|
| 42 | David | 555-1234 | Dodge |
| 6 | Nic | 555-3456 | Obi-d |
| 14 | Mark | 555-1213 | Greeny |
| 2 | David | 555-6543 | Witblitz |
| 25 | Pads | 555-9101 | Pads |
| 6 | Nic | 555-3456 | Obi-d |
| 7 | Nic | 555-5678 | Nicrot |
And match log table:
| Player_Id | Date_Played |
|---|---|
| 42 | 03/03/04 |
| 6 | 03/03/04 |
| 2 | 03/03/04 |
| 14 | 03/03/04 |
| 2 | 25/02/04 |
| 25 | 25/02/04 |
| 6 | 25/02/04 |
| 7 | 25/02/04 |
Notice how the original table splits naturally into these two new tables. This is because the original table provided data about two distinct things (or entities): players and matches. Each new table contains data concerning just one entity. The split into entity tables is an important part of the normalization process.
Now you can take another step toward conforming to the first normal form: eliminating repeating information in an individual table. There's some redundant information in the player information table that can be removed to give you the following:
| Player_Id | Name | Phone_number | Nickname |
|---|---|---|---|
| 42 | David | 555-1234 | Dodge |
| 6 | Nic | 555-3456 | Obi-d |
| 2 | David | 555—6543 | Witblitz |
| 14 | Mark | 555-1213 | Greeny |
| 25 | Pads | 555-9101 | Pads |
| 7 | Nic | 555-5678 | Nicrot |
Are you done with this table yet? There's no longer any duplicated information, and the table represents only related information (information about the player entities). But what about a unique identifier? Each entity must have (at least) one unique field—in other words, a field in which no entries are repeated. Because of the exclusive nature of this field, each entry in it uniquely identifies (IDs) each record in the table. The field used to ID records is often known as the primary key. Only one primary key is permitted per table.
Remember, by moving the player information to one table and the match log entries to another, you simplify the process of modifying a player's details—you only have to modify one record in the player table.
You could add a unique key, which would assign a unique number to each record, but let's look at the information closely. In any given sport, a player's number must be unique, so you know that all the values in the Player_Id field are unique. But the Nickname field is also unique because each player can only have one nickname at a time. Which one should you choose? You've already seen that nicknames can change, based on performance, so the Player_Id field is the sensible choice for a primary key.
What about the match log table? Unfortunately, the Player_Id field is no longer unique because each player plays on many dates. Obviously, the date field can't be unique either, so you can't make a unique identifier out of either field. In this case, you need to add one yourself (you'll learn how later in the book). Once that's done, both tables are in first normal form. Adding an ID field could have hidden uses, especially if you wanted to join the information in the two fields of the match log table to create a unique key that you could relate to some other data (the number of tries scored per player per match, for example).
The second normal form has two goals, which are applicable only after you've achieved 1NF.
Second Normal Form (2NF)
- Create new tables for each new group of values that apply to many records.
- Relate new tables to existing tables with a foreign key (a key that identifies records in a different table).
Because both example tables contain a common field (Player_Id), whose values correspond between tables, you have a way to join them together—you don't actually need a foreign key to accomplish this, but there are definitely advantages to using foreign keys (For a more in-depth discussion on them, please go to www.mysql.com/doc/en/ANSI_diff_Foreign_Keys.html for a discussion on MySQL foreign keys.) This means that records in one table have a well-defined relationship to records in the other. The database now has all the information it needs to answer complex queries involving both tables, which it couldn't answer by looking at either table in isolation. With tables related like this, the database can truly be termed relational. The specific relationship described here is also called a one-to-many relationship. For each record in the player information table, you may have many records in the match log table. The ability to construct meaningful one-to-many relationships is a good indicator that your database meets the requirements of the second normal form.
Third Normal Form (3NF)
The third normal form's goal is to eliminate fields that do not depend on the primary key.
This is where the real world and the elegance of theory can collide. To some extent, it reverses the premise of a master table relationship by suggesting that anytime you may have data repeat in a record such as ZIP codes in a list of addresses, that data should be stored in a separate table and linked back to the individual records that use them with a new keyed relationship. Make no mistake; the purists will chastise you roundly for not doing it this way. But the truth is, database design is often a series of trade-offs; and you do sometimes have to weigh design idealism against a novel but useful concept—simplicity. The requirements of the third normal form simply may not satisfy your needs; and it's all right for you to decide on a case-by-case basis what will work best for any given situation.
Other Normal Forms
Believe it or not, there are more normal forms, the fourth and fifth among them. The fourth normal form states that you should isolate independent multiple relationships. This rule primarily seeks to deal with the problems of many-to-many relationships, where many records may have relationships to intersecting records in another table that may in turn link back to multiple records in the first table. Confused? Don't feel bad. This is where things get pretty hairy and you don't need to hold yourself accountable to this standard. The concept is placed here because you may very well encounter this situation down the road and it's useful as a strong suggestion for dealing with these issues.
Talking to Databases with SQL
At this point, you probably aren't at all surprised to learn that MySQL itself is a relational database. In fact, you're about to start applying the principles we've just been discussing with your very own database engine. First, though, you need to learn a little about the language used to interact with relational databases—a language called SQL.
SQL, the Structured Query Language, is a standard command set used to communicate with a relational database management system on any given platform. Every task such as creating databases or tables, as well as saving, retrieving, deleting, and updating data from databases, is done via SQL statements. Implementation of SQL features vary among RDBMS vendors, but because the basic concepts are identical, applying skills learned on one platform to another should be no more difficult than porting a computer program written for one platform to another using the same language. In this section you'll examine some basic features of SQL: data types, indexes, keys, and queries.
There are some differences in the way MySQL implements certain things and the ANSI SQL standard. You can check out a list ofthese differences at www.mysql.com/doc/en/Differences_from_ANSI.html.
SQL Data Types
When you create a database table, the type and size of each field must be defined. A field is similar to a PHP variable except that you can store only the specified type and size of data in a given field. Therefore, unlike PHP variables, you can't insert characters into an integer field, for example. The three usual sets of data types are supported in MySQL: numeric, date/time, and characters, which are described in the following tables:
| Numeric Data Types | Description | Range/Format |
|---|---|---|
| INT | Normal-sized integer | (-231 to 231 -1), or (0 to 232 -1) if UNSIGNED |
| TINYINT | Very small integer | (-27 to 27 -1), or (0 to 28 -1) if UNSIGNED |
| SMALLINT | Small integer | ( -215 to 215 -1), or (0 to 28 -1) if UNSIGNED |
| MEDIUMINT | Medium-sized integer | (-223 to 223 -1), or (0 to 224 • -1) if UNSIGNED |
| BIGINT | Large integer | (-263 to 263 -1), or (0 to 264 -1) if UNSIGNED |
| FLOAT | Single-precision floating-point number | Minimum non-zero ±1.176 × 10 - 38; maximum non-zero ±3.403 × 10 + 38 |
| DOUBLE/REAL | Double-precision floating-point number | Minimum non-zero ±2.225 × 10 - 308; maximum non-zero ±1.798 × 10 + 308 |
| DECIMAL | Float stored as string | Maximum range same as DOUBLE |
| Date/Time Data Types | Description | Range/Format |
|---|---|---|
| DATE | Date | YYYY-MM-DD format. Range 1000-01-01 to 9999-12-31 |
| DATETIME | Date and time | YYYY-MM-DD hh:mm:ss format. Range 1000-01-01 00:00:00 to 9999-12-31 23:59:59 |
| TIMESTAMP | Timestamp | YYYYMMDDhhmmss format. Range 19700101000000 to sometime in 2037 |
| TIME | Time | hh:mm:ss format. Range -838:59:59 to 838:59:59 |
| YEAR | Year | YYYY format. Range 1900 to 2155 |
| Character Data Types | Description | Range/Format |
|---|---|---|
| CHAR | Fixed-length string | 0–255 characters |
| VARCHAR | Variable-length string | 0–255 characters |
| BLOB | Binary Large Object | Binary data 0-65535 bytes long |
| TINYBLOB | Small BLOB value | Binary data 0–255 bytes long |
| MEDIUMBLOB | Medium-sized BLOB | Binary data 0-16777215 bytes long |
| LONGBLOB | Large BLOB value | Binary data 0-4294967295 bytes long |
| TEXT | Normal-sized text field | 0-65535 bytes |
| TINYTEXT | Small text field | 0–255 bytes |
| MEDIUMTEXT | Medium-sized text | 0-16777215 bytes |
| LONGTEXT | Large text field | 0-4294967295 bytes |
| ENUM | Enumeration | Column values are assigned one value from a set list |
| SET | Set value(s) | Column values are assigned zero or more values from a set list |
The difference between a CHAR and VARCHAR type field is that the former stores a fixed-length value no matter how short it may be, whereas the latter stores exactly as many bytes as necessary to keep a given value. Suppose you insert the string dodge into the fields defined as char_field CHAR(10) and varchar_field(10), for example. They will store the same string slightly differently:
char_field: 'dodge '// five blank spaces are right padded
varchar_field: 'dodge' // no space
It follows that declaring character fields as VARCHAR type will save you some disk space. Don't be tempted to use VARCHAR type fields for storing every string, though, because that has drawbacks, too. The MySQL server processes CHAR type fields much faster than VARCHAR type, for one thing, because their length is predetermined. If your strings don't vary in length much, you're better off using CHAR type fields. Moreover, when your strings are all the same length, VARCHAR takes up more disk space, because it has to store the length of each string in one additional byte.
A final note on VARCHAR/CHAR type fields: if your table has at least one VARCHAR field, all character fields are converted to VARCHAR type even if you define them otherwise.
Indexes and Keys
Inexperienced database designers sometimes complain about their database engines being slow, a problem that's often explained by the lack of an index. An index is a separate sorted list of a selected field (or fields) in a table. To explain why indexing a table has a dramatic effect on database performance, first consider a table without indexes. Such a table is basically the same as a plain text file because the database engine must search it sequentially. Rows in a relational database are not inserted in any particular order—the server inserts them in an arbitrary manner. To make sure it finds all entries matching the information you want, the engine must perform a full table scan, which is slow and inefficient, particularly if there are only a few matches.
Now consider an indexed table. Instead of moving straight to the table, the engine can scan the index for items that match your requirements. Because the index is a sorted list, this scan can be performed very quickly. The index guides the engine to the relevant matches on the database table, and a full table scan is not necessary.
So why not just sort the table itself? This might be practical if you knew that there was only one field on which you might want to search. However, this is rarely the case. Because it's not possible to sort a database by several fields at once, the best option is to use an index, which is separate from the table.
What about the case of searching multiple tables at once? This is where you really benefit from an index. Searching for a possible match across joined databases without indexes is a terrible idea—the engine would have to check all possible combinations of rows in one table with those in another. For two tables each with 500 rows, this would be 500 times 500, or 250,000 combinations! Indexing speeds searches up dramatically. The engine checks the index of the first table to find the position of matches to the first part of the query, and then it uses the index to the second table to find matches to the second part of the query. In other words, you pull out the relevant records directly.
A primary key is a special index that, as you saw at the beginning of the chapter, is used to ID records and to relate tables to one another, providing the relational database model. Each related table must have one (and only one) primary key.
Indexes and primary keys can be derived from combinations of fields. For a key to be formed in this way, the combination of items from each field must still be unique.
Because an index brings about a significant boost in performance, you could create as many indexes as possible for maximum performance gain, right? Not always. An index is a sure-fire way to increase the speed of searching and retrieving data from a database, but sacrifices performance when saving or updating records, and also increases the size of a table. Why? When you insert a record into an indexed table, the database engine has to record its position in the corresponding index table. Do the math!
What's more, if you have more than one index on a table, multiple write operations have to be performed on the index table, too. So when creating indexes on a table, don't create more than you need. Limit indexed columns to those that will be searched or sorted frequently. If required, you can create additional indexes on a table as you need them to increase performance.
Queries
SQL statements or commands are used to construct queries. Queries are the questions your application asks a database engine, which then returns the records that meet the criteria specified in the query. Queries return an array of records that meet the specified conditions, and contain information from selected fields. PHP, and other languages that support database connectivity for that matter, can treat the returned array as a normal array variable. We'll return to this topic later; for now, just bear it in mind. The returned array of records, called the result set, is the database engine's answer to your query. If you ask the engine to retrieve the records containing John as the first name, for example, it returns all of the records conforming to the query. If none are found, NULL (discussed in the next section) is returned.
Some SQL commands are literally commands that tell the database engine to do something instead of asking for an answer: "Delete those rows of information that contain John as the first name!" SQL commands of this type don't return a result set.
NULL
Consider the following scenarios:
A class takes a spelling test at school, but the teacher has yet to mark the test sheets and insert the results into a database table. It would be unfair of the teacher to insert default values into the results column before the sheets have been marked because a default score is irrelevant in this context. What could you place in the column to signify that you are awaiting data?
You want to construct a database table containing information about endangered bird species. One of the fields provides the fastest recorded flight speed for each bird in the table. You begin to create records for endangered penguins, and then you remember—penguins can't fly! What do you place in the flight speed column to indicate that the field is not applicable to penguins?
In both of these cases data is missing from a table. The only difference is that, in the first case, the situation is temporary because the teacher will soon add the missing test results to the table, but in the second case the values for the flight speed of penguins is impossible to obtain. You need a way to represent missing data in fields.
In a MySQL table, a NULL entry represents a missing value. NULL doesn't belong to any particular data type, but it can replace any value. Because it is not a data type or value, but it can be a field entry, the concept of a NULL is often difficult to grasp for beginners and experienced programmers alike. Programmers often have a mistaken idea of NULL. For example, a common mistake is to think of NULL as zero, which is wrong because zero is a value; NULL is not. Strings filled with one or more blank spaces, and strings of zero length, may also be mistaken for NULL because string is a data type, but NULL isn't. NULL is nothing, no data type, no value.
So what happens if the result set from one of your queries contains a NULL, and that result set is then used in your program in subsequent calculations? The rule of thumb for math with NULL is that it propagates. Any arithmetic operation involving a NULL returns NULL. This makes sense because how could you provide results when all the data needed to perform the calculation is not present? This also applies to dividing a NULL value by zero; NULL is returned.
Query Commands
MySQL queries issued to manipulate data in a table can be constructed using the following main commands:
- SELECT: Retrieves data from a database.
- DELETE: Deletes data from a database.
- INSERT—Inserts data into a database.
- REPLACE: Replaces data in a database. If the same record exists in a table, the command overwrites the record with the new data.
- UPDATE: Updates data in a table.
The rest of the command set involves creating or modifying the database structures, rather than the data stored in the databanse:
- CREATE: Creates a database, table or index.
- ALTER: Modifies the structure of a table.
- DROP: Wipes out a database or table.
You'll see more of these commands as you work through the next few chapters. Just to give you a taste though, let's take a look at the typical form of a MySQL SELECT query, which retrieves records from a table:
mysql> SELECT field1, field2, ... , fieldn FROM table WHERE condition;
The first thing to note is that each query is terminated with a semicolon, just as a PHP statement is. A query statement may expand to multiple lines. The following, slightly more specific query, is essentially the same as this generic case:
mysql> SELECT last_name, first_name
-> FROM user
-> WHERE first_name = 'John'
Take a closer look at the FROM, WHERE, and ORDER BY clauses in the query. The query returns any record from the user table where the value of the first_name field is John. Assuming you have a table called user, here's a sample of the query's output:
Simpleton John
Smith John
Thomas John
Now let's put this into practice and have some fun with MySQL!
