There are two terms you hear over and over when discussing XML: well-formed and valid. A well-formed XML document follows the basic syntax rules (to be discussed in a minute), and a valid document also follows the rules imposed by a DTD or an XML Schema.
Being well-formed is the most basic requirement for XML documents; one that is not well-formed is not really an XML document. It's kind of like a script that someone tried to write in PHP but which contains fatal syntax errors; yes, it looks like PHP, but it really isn't until all the syntax errors are removed. A well-formed XML document may contain any elements, attributes, or other constructs allowed by the XML specification, but there are no rules about what the names of those elements and attributes can be (other than the basic naming rules, which are really not much of a restriction) or about what their content can be. It is in this extensibility that XML really derives a lot of its power and usefulness; so long as you follow the basic rules of the XML spec, there's no limit to what you can add or change.
A well-formed document does not need to be valid, but a valid document must be well formed, otherwise it couldn't be read in the first place. If a document is well formed, and it contains a reference to a DTD or XML Schema, your XML parser has the opportunity to reference the DTD or Schema and determine whether the document is valid. An XML document is valid if the elements, attributes, and so on that it contains follow the rules in the DTD or Schema. By definition then, the DTD or Schema contains rules about what elements or attributes may be contained in the document, what data those elements and attributes are allowed to have, and so on. In fact, the whole purpose of having a DTD or Schema is to define exactly what elements and attributes are allowed, and exactly what data they can contain.
Although referencing a DTD or Schema limits the name/value pairs (elements, attributes, and many of the other XML constructs) you may have in your XML document, the big advantage is that applications that know nothing about each other can still communicate effectively when they share the capability to parse XML because they can both read a well-formed document, and understand its contents if it is valid. Being readable by either humans or machines, and, by virtue of a DTD or XML Schema, knowing specifically what the elements and attributes mean, is another feature that makes XML so powerful.
Major Parts of an XML Document
An XML document may contain an optional prolog, and then the mandatory root element (including any content and other elements, attributes, and so on), with an optional section at the end for other data. The following list identifies the requirements within these major sections.
- XML documents should contain an xml version line, possibly including a character encoding declaration.
- Valid XML documents contain a DTD or an XML Schema, or a reference to one of these if they are stored externally.
- XML documents usually contain one or more elements, each of which may have one or more attributes. Elements can contain other elements or data between their beginning and ending tags, or they may be empty.
- XML documents may contain additional components such as processing instructions (PIs) that provide machine instructions for particular applications; CDATA sections, which may contain special characters, such as those found in scripts that are not allowed in ordinary XML data; notations, comments, entity references (aliases for entities such as special characters), text, and entities.
Here's an example of an XML document that is both well formed and valid:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE clients SYSTEM "http://www.example.com/dtds/client.dtd">
<clients>
<client>Joe</client>
<client>Jim</client>
</clients>
Notice the reference to an external DTD located at the URL specified. This means that the document can be validated by reading the DTD and then checking the document to make sure that it conforms to the DTD. Of course, you could manually read through the document and compare it with the elements, attributes, and other document components specified in the DTD, but there are many applications available that can automatically validate an XML document against a DTD or an XML schema. And because the DTD or Schema is available either directly in the document or online, it's easy for these applications to perform the validation function for you automatically as they parse the document.
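As a minimal sketch of how an application can find the DTD reference, the following Python snippet uses the standard library's xml.dom.minidom to read the DOCTYPE declaration from the sample document. It assumes the DOCTYPE name is written as clients so that it matches the root element, and the helper name describe_doctype is made up for this illustration. Note that minidom only checks well-formedness here; it does not download the DTD or validate against it.

```python
from xml.dom import minidom

# The sample document, with the DOCTYPE name matching the root element.
CLIENTS_XML = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE clients SYSTEM "http://www.example.com/dtds/client.dtd">
<clients>
<client>Joe</client>
<client>Jim</client>
</clients>"""


def describe_doctype(xml_text):
    """Return (DOCTYPE name, system identifier, root element name)."""
    doc = minidom.parseString(xml_text)  # raises an error if not well formed
    doctype = doc.doctype
    return doctype.name, doctype.systemId, doc.documentElement.tagName


name, system_id, root_name = describe_doctype(CLIENTS_XML)
print(name, system_id, root_name)
# A valid document's DOCTYPE name must match its root element name.
print(name == root_name)
```

A validating parser would go one step further and fetch the DTD from the system identifier before checking the document's elements against it.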
Well-Formed XML Documents
A well-formed XML document follows the XML specification syntax. The syntax, of course, follows some basic rules, the most common of which are listed here:
- There is only one parent element containing all the rest of the elements in the document.
- XML documents should (but are not strictly required to) begin with an XML declaration that gives the XML version number being used. For example:
<?xml version="1.0"?>
- Character encoding declarations may be included with the XML version line, and must be included for encodings other than UTF-8 or UTF-16. The code might look like this:
<?xml version="1.0" encoding="UTF-8"?>
- If an XML document contains a DTD, or a reference to an external DTD, it must appear before the first element in the document.
- XML elements can be made from start and end tags (written much like HTML tags) or can be a single tag with a terminator (like the <br/> tag in XHTML). Unlike HTML, there is no allowance for elements that have only a starting tag and are not self-terminated. Elements with start and end tags are considered non-empty (meaning they can contain content), whereas empty tags do not contain content (empty elements sometimes signify something on their own, like the <br/> tag in XHTML). For example, <client>Joe</client> is a non-empty element, and <br/> is an empty one.
- XML attributes are written inside the start tag of an element (or inside an empty-element tag), and must have a name and a value; the value must be enclosed in delimiters, either double or single quotes. No attribute name may appear more than once inside a single element.
- XML elements must be properly nested, meaning any given element's start and end tags must be outside the start and end tags of elements inside it, and inside the start and end tags of elements outside it. Here's an example:
//not good
<parent><child></parent></child>
//good
<parent><child></child></parent>
- CDATA sections (sections of data that make up scripts, for example) must be delimited by <![CDATA[ and ]]>.
- XML element names may not begin with "xml," "XML," or any upper- or lower-case combination of these characters in this sequence, because such names are reserved. Names must start with a letter, an underscore, or the colon, but in practice, you should never use colons. Names are case-sensitive. Numbers, the hyphen, and the period are valid characters to use after the first character.
- Comments are delimited like HTML comments (<!-- and -->).
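The rules above can be checked mechanically. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree parser, which rejects any input that is not well formed. The function name is_well_formed is an assumption made for this illustration, and the check says nothing about validity against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested elements with a single root element.
print(is_well_formed("<parent><child></child></parent>"))   # True
# Improper nesting: parent is closed before child.
print(is_well_formed("<parent><child></parent></child>"))   # False
# The same attribute name appears twice in one element.
print(is_well_formed('<item id="1" id="2"/>'))              # False
# Two top-level elements instead of one root element.
print(is_well_formed("<a/><b/>"))                           # False
```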
Using XML Elements and Attributes
XML elements and their attributes form the hierarchical structure of an XML document, and contain its content (the content of an XML document is its data). Although there can be only one root element, the root element may contain multiple elements of the same name (often referred to as child elements), and child elements can also contain multiple elements of the same name. So you might have a document like this:
<clients>
<client ID="1">Joe</client>
<orders>
<order ID="1">ProductA</order>
<order ID="2">ProductB</order>
</orders>
<client ID="2">Jim</client>
<orders>
<order ID="1">ProductA</order>
<order ID="2">ProductB</order>
</orders>
</clients>
As you can see, part of the content of this document is put into elements (the name of each client is between the beginning and ending client elements) and part of the content is the value of attributes (the ID numbers of the clients and their orders are specified in the ID attributes of the client and order elements).
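To make the distinction concrete, here is a short Python sketch (standard library only) that reads element content and attribute values from a document shaped like the one above. The variable names are chosen for this illustration, and the sample is assumed to be well formed, with each client element followed by a sibling orders element.

```python
import xml.etree.ElementTree as ET

CLIENTS_XML = """<clients>
<client ID="1">Joe</client>
<orders>
<order ID="1">ProductA</order>
<order ID="2">ProductB</order>
</orders>
<client ID="2">Jim</client>
<orders>
<order ID="1">ProductA</order>
<order ID="2">ProductB</order>
</orders>
</clients>"""

root = ET.fromstring(CLIENTS_XML)

# Element content comes from .text; attribute values come from .get().
clients = [(c.get("ID"), c.text) for c in root.findall("client")]
# iter() walks the whole tree, so it finds order elements at any depth.
orders = [(o.get("ID"), o.text) for o in root.iter("order")]

print(clients)
print(len(orders))
```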
There is some controversy about when to use an attribute and when to use an element for containing data. Although there is no hard and fast rule, a good rule of thumb is to use an element when there is the possibility that you might need to specify the same thing more than once (for example, although you may only have one order at present, you can expect there will be more orders for a single client), and when you're sure the data will only occur once (for example, each client may have one, and only one, ID number), use an attribute.
Valid XML Documents: DTDs and XML Schemas
DTDs are written in their own declaration syntax, inherited from SGML (the XML specification describes this syntax using Extended Backus-Naur Form, or EBNF). Because this syntax is not an XML language, a DTD cannot be processed as an ordinary XML document. DTDs specify constraints on XML elements, attributes, content, and more. XML Schemas serve the same purpose, but are written in the XML Schema language, which is itself XML, so they can be parsed and processed using the same application that was used to read the XML document. XML Schemas are also much more capable than DTDs for defining detail in your elements and attributes (such as data type, range of values, and so forth) and are therefore preferred over DTDs by many XML authors. Both can be referenced in the XML document before the first element, and both have other means of being included within an XML document (you'll see how in just a bit).
If a DTD or schema is present or is referenced in an XML document, some or all of the elements and content of the document may be validated against the DTD or schema. The primary added value of a validated XML document is that the processing application "knows" something about the content of the document, such as how many times a given element may appear within another element, what values an attribute may assume, and so on.
As mentioned previously, anyone can author an XML document, and anyone can define a DTD or XML Schema against which to validate an XML document. This being the case, the World Wide Web Consortium has made the next version of HTML into XHTML, using the existing DTD for HTML (yes, HTML has always been based on a formal DTD), with very small modifications, as the definition of all the elements, attributes, and other components allowed in an XHTML document. The main difference between HTML and XHTML is the fact that an XHTML document must conform to the XML specification, whereas HTML documents are not required to do so.
Complicating things further, browsers will display HTML documents even if they are not well-formed HTML, let alone well-formed XHTML. But browsers will display XHTML documents as XML if the file extension is .xml, and as regular Web pages if the file extension is .htm or .html. Of course, to display an XHTML document as a regular Web page, the reference to the XHTML DTD must be valid, and the document must be well formed. In the next few sections you'll examine a portion of the DTD for XHTML, see how the DTD can be referenced in an XHTML document, and see how the document displays in the browser when the file extension is .xml and when it is .htm.
The DTD for XHTML
There are three DTDs for XHTML (strict, transitional, and frameset). They're located at:
These three DTDs complement their HTML counterparts, and are, in fact, quite similar. If you enter these links in your browser, you'll actually see the DTD in plain text.
Here is some code showing how a DTD (the strict version) is written for the XHTML language, but just for the image (IMG) element. The DTD for HTML is shared with XHTML (with very small differences to ensure that XHTML documents conform to the XML spec), although only XHTML actually conforms to the XML specification. What this means is that you'll find all the HTML elements and attributes present in XHTML, but if you use them in an XHTML document you must conform strictly to the rules imposed by XML (such as proper nesting and termination of elements).
<!--
To avoid accessibility problems for people who aren't
able to see the image, you should provide a text
description using the alt and longdesc attributes.
In addition, avoid the use of server-side image maps.
Note that in this DTD there is no name attribute. That
is only available in the transitional and frameset DTD.
-->
<!ELEMENT img EMPTY>
<!ATTLIST img
%attrs;
src %URI; #REQUIRED
alt %Text; #REQUIRED
longdesc %URI; #IMPLIED
height %Length; #IMPLIED
width %Length; #IMPLIED
usemap %URI; #IMPLIED
ismap (ismap) #IMPLIED
>
<!-- usemap points to a map element which may be in this document
or an external document, although the latter is not widely supported -->
In this example (keeping in mind that it is written in DTD declaration syntax rather than in XML) you can see that on the first line following the comment, there is an ELEMENT declaration; the name of the element is img, and it is EMPTY (it contains no content, so it has no separate beginning and ending tags). However, even though it is formally empty, its src attribute does contain data in the form of a URI (for our purposes the same as a URL) that specifies where the image file can be found.
Following the ELEMENT callout is a list of attributes that may be included with the img tag in an XHTML document. Those of you familiar with HTML and XHTML no doubt recognize the src attribute as the URL (or URI) that specifies the location of the image file and is REQUIRED.
So this portion of the DTD for XHTML documents specifies that it is permissible to include the img element in such documents. If this DTD is referenced in an XHTML document (the entire DTD, not just this portion), and the document includes the img element with appropriate src and alt attributes (both are marked #REQUIRED), then the document could be said to be valid (at least as far as the img element is concerned). However, if you tried to include an element named imge or image or images, a validating XML parser would produce an error, because according to the DTD such elements are not defined, and therefore the document is not valid. And note that although the img element does not need to be terminated in an HTML document, it must be properly terminated in an XHTML document.
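To illustrate the kind of rule a validating parser enforces, here is a deliberately tiny Python sketch that mimics only the #REQUIRED constraint from the ATTLIST shown earlier. It is not a DTD validator: the table REQUIRED_ATTRIBUTES is copied by hand from the DTD fragment, and the function name missing_required_attributes is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-copied from the ATTLIST for img: the attributes marked #REQUIRED.
REQUIRED_ATTRIBUTES = {"img": ("src", "alt")}


def missing_required_attributes(element):
    """List the #REQUIRED attributes that the element does not carry."""
    required = REQUIRED_ATTRIBUTES.get(element.tag, ())
    return [name for name in required if name not in element.attrib]


# Well formed, but alt is missing, so a validating parser would reject it.
incomplete = ET.fromstring('<img src="http://www.example.com/images/image.gif" />')
# Carries both required attributes.
complete = ET.fromstring(
    '<img src="http://www.example.com/images/image.gif" alt="An example image" />'
)

print(missing_required_attributes(incomplete))  # ['alt']
print(missing_required_attributes(complete))    # []
```

A real validating parser derives these rules from the DTD itself, and also checks element names, content models, and attribute types.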
Referencing DTDs and XML Schemas
To validate an XML document, there needs to be either a reference to an external file containing the DTD or XML Schema, or the DTD or schema must be included with the XML document. Referencing XML Schemas is slightly more complex, so first take a look at how DTDs are referenced.
To reference an external DTD, a DOCTYPE declaration is used. The DOCTYPE declaration provides some information regarding how to locate the DTD and what its name is. For example, this line shows how a DTD is referenced using a URL:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
The html after the <!DOCTYPE in the first line signifies that the root element is named html, and is required. If the DTD is an external document, it can be located anywhere, and identified by any URI (Uniform Resource Identifier) that the application reading it understands and has access to, not just a URL over the Internet.
A big limitation of DTDs is that only one external DTD can be referenced in a document, and there is no DTD support for adding prefixes to element or attribute names anyway, although you can call out a namespace for a document that references a DTD.
A namespace is a definition of the source of names for elements and attributes (so far as XML is concerned). Designating the source of an element or attribute name means that you can use the same name to represent different things within a single document. A namespace can be identified within an XML document by referencing it via a special reserved XML keyword, the xmlns (XML Namespace) attribute, like this:
xmlns = "http://www.w3.org/1999/xhtml"
This URL is the official namespace of XHTML. The element and attribute names for this namespace are defined within the XHTML DTD, and the xmlns attribute serves only to define the namespace for the root element of the XHTML document (the root element is html). Defining the namespace for the root element in this manner also serves to define the namespace for all the rest of the unprefixed elements in the document (a default namespace does not apply to attributes).
External XML Schemas
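You can reference an XML Schema by referencing the location of the XML Schema document with a URI. Typically, the xmlns attribute is written into the root element of the XML document to declare the namespace that the schema defines. Strictly speaking, xmlns only names the namespace; a parser is not required to retrieve anything from that URI. Where the schema document itself is located is normally indicated with the xsi:schemaLocation attribute (or xsi:noNamespaceSchemaLocation for documents without a namespace), or is supplied to the parser directly by the application.
To associate a document with the namespace of an XML Schema, an xmlns attribute may be added to the root element of the document, as shown here: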
<?xml version="1.0" encoding="UTF-8"?>
<customer xmlns="http://www.example.com/customer.xsd" cust_id="1">
<cust_name>John Doe</cust_name>
</customer>
Of course, this implies that you have already written the XML Schema document that defines the customer (and its cust_id attribute) and cust_name elements, named this document customer.xsd, and placed the document in the root folder of the http://www.example.com Web site. Although this book won't get into the details of writing an XML Schema, suffice it to say that XML Schema is a much richer language for specifying elements, attributes, and other components of an XML conforming language, and because it is written according to the guidelines of the XML specification, it is easier to process as well.
For documents that can be validated against an XML Schema, any number of namespaces can be declared using the xmlns attribute, each associated with an external XML Schema. For example, you might have one XML Schema for which the element farm means an area used for agricultural purposes, and another for which the element farm means a number of server computers all performing the same task. If you want to create an XML document that uses both elements (for example, describing how the farm manages its IT) you need some way to distinguish between the two.
Because both XML Schemas can be referenced in a single document, you can use the xmlns attribute to identify them by URL, and you can create prefixes that can precede any element names from either one. For example, you might use code such as the following to do this:
xmlns:agri = "http://www.example.com/agricultural.xml"
xmlns:serv = "http://www.example.com/server.xml"
Thereafter, any element preceded by agri: would be defined by the agricultural schema, and any element preceded by serv: would be defined by the server schema. This prevents confusion about the meaning of these elements.
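As a small illustration, the following Python sketch (standard library only) parses a document that uses both prefixes and shows that the parser keeps the two farm elements apart by their namespace URIs. The report root element and the text inside each farm element are made up for this example; the two namespace URIs are the ones shown above.

```python
import xml.etree.ElementTree as ET

AGRI_NS = "http://www.example.com/agricultural.xml"
SERV_NS = "http://www.example.com/server.xml"

# Hypothetical document: the report root and the text values are invented.
FARM_XML = f"""<report xmlns:agri="{AGRI_NS}"
        xmlns:serv="{SERV_NS}">
  <agri:farm>North field</agri:farm>
  <serv:farm>Web cluster</serv:farm>
</report>"""

# Map prefixes to namespace URIs so that searches can use prefixed names.
NAMESPACES = {"agri": AGRI_NS, "serv": SERV_NS}

root = ET.fromstring(FARM_XML)
agri_farm = root.find("agri:farm", NAMESPACES)
serv_farm = root.find("serv:farm", NAMESPACES)

# ElementTree stores each tag as {namespace-uri}local-name.
print(agri_farm.tag, agri_farm.text)
print(serv_farm.tag, serv_farm.text)
```

The prefixes themselves are only shorthand; what distinguishes the two elements is the namespace URI each prefix is bound to.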
Writing an XML Document with XHTML
For an XHTML document, there is also a requirement to specify a namespace. Although DTDs don't lend themselves to multiple references, you can still specify one namespace, and the XHTML spec makes this a requirement.
To write an XHTML document, start by indicating the version of XML you're using, provide a DOCTYPE declaration referencing the XHTML DTD, and then insert the xmlns attribute indicating the namespace of the document (inserting the xmlns attribute into the root element makes the root element defined by the DTD, and by default all of its child elements as well). Here's an example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An xhtml example</title>
</head>
<body>
<p>This document is an example of an xhtml document. It can contain images <img src="http://www.example.com/images/image.gif" alt="An example image" /> as well as links <a href="http://example.com/">example.com</a> and any other html elements. </p>
</body>
</html>
Of course, this document looks very much like an ordinary HTML document, and will be displayed just like any Web page written in HTML in most browsers if you save it with the extension .htm or .html, but it conforms to the XML specification, and is not only well formed but also valid. (If you save it with the extension .xml, it will be displayed in XML format by Internet Explorer.)
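Because an XHTML document is XML, any XML parser can read it. The following Python sketch (standard library only) parses a shortened version of such a document and pulls out the title, the image source, and the link target. The shortened body text and the x prefix used for searching are choices made for this illustration. The last lines show that XML's case sensitivity matters: a HEAD start tag closed by a head end tag is rejected as not well formed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

XHTML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>An xhtml example</title></head>
<body>
<p>Images <img src="http://www.example.com/images/image.gif" alt="An example image" />
and links <a href="http://example.com/">example.com</a>.</p>
</body>
</html>"""

# The default namespace applies to every unprefixed element, so searches
# must name it; the prefix x is local to this script.
NS = {"x": XHTML_NS}

root = ET.fromstring(XHTML_DOC)
title = root.find("x:head/x:title", NS).text
image_src = root.find(".//x:img", NS).get("src")
link_href = root.find(".//x:a", NS).get("href")
print(title, image_src, link_href)

# Tag names are case-sensitive, so HEAD and head do not match.
try:
    ET.fromstring("<html><HEAD></head></html>")
    mismatched_case_accepted = True
except ET.ParseError:
    mismatched_case_accepted = False
print(mismatched_case_accepted)
```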
Web Services
Another example of the power of XML is found in the design of Web services. Web Service is the name given to a unit of programmed logic that is available across the Internet, and the name XML Web Service is applied when the Web Service is accessible via XML languages for accessing such services.
So how do Web services work, and why are they valuable? Consider how you define and then call a function in a PHP program. First you write the function, giving it a name and parameter list, and adding all the processing logic required for it to do its job. Then, you can call it and expect it to perform just by using its name and passing the appropriate parameters.
That's great, but suppose you could do the same thing across the Internet, accessing predefined functions (and thereby other data stores, including databases) by simply identifying them by their URL and name, and passing the appropriate parameters. This would mean you could build an application that theoretically is distributed (meaning it doesn't matter where the programming logic is coded or the data is stored) anywhere across the Internet.
And that is exactly what you can do with Web services. Calling a function that someone else coded, using someone else's database, or even multiple functions and multiple databases, anywhere across the Internet is what Web services are for. But you need a little bit of specialized help to access Web services, because they may run from any platform, using any language and any database, and there are some translation issues. That's where SOAP and WSDL come in:
- Simple Object Access Protocol (SOAP) is an XML language that provides for defining an envelope, body, and other parts to send and receive Web Service calls. You insert your Web services calls in a SOAP envelope.
- Web Services Description Language (WSDL) is another XML language that is used to define the name, type, and arguments associated with a call to a Web Service.
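To give a feel for what a SOAP envelope looks like, here is a minimal Python sketch (standard library only) that builds a SOAP 1.1 request as an XML string. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the getOrderCount operation, and the clientId parameter are hypothetical names invented for this illustration, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://www.example.com/orders"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Wrap a call to operation(params) in a SOAP 1.1 envelope and body."""
    ET.register_namespace("soap", SOAP_ENV_NS)  # use a readable prefix on output
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical call: ask the service how many orders client 1 has.
request_xml = build_soap_request("getOrderCount", {"clientId": 1})
print(request_xml)
```

In practice, the operation names and parameter types would come from the service's WSDL document, and a SOAP library would usually build and send the envelope for you.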
Although Web services are one of the most important uses for XML, and there are quite a few Web services-related applications available that make it easy to develop both PHP Web services and the client code that calls them, the subject is beyond the scope of this book. Please see Wrox's Professional PHP Development for a great deal of interesting information on this t
