Checking XML - DTD and Schema
If you invent an XML "standard", even for your own internal consumption, sooner or later you are going to have to check that an XML document conforms to that standard.
For example, if you write a program to process the book document you might well assume within the program that <Title></Title> will always be present. Your program will obviously fail if it is fed an XML document that doesn't conform to your standard. You could spend extra effort making your program robust against incorrect XML documents but a better method is to eliminate incorrect XML documents.
The problem here is that a general XML document only has to satisfy a small number of rules - basically, for every opening tag there is a correctly nested closing tag of the same type. What you need to do is specify exactly what the rules are for your particular type of XML document to be considered valid.
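The gap between "satisfies the general XML rules" and "satisfies your rules" can be sketched in a few lines of Python using the standard library parser. This is only an illustration: the book documents, the <Book> root element and the helper names is_well_formed and has_title are all invented here, and the hand-written title check stands in for what a DTD or Schema would state declaratively.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text obeys the general XML rules, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def has_title(text: str) -> bool:
    """An application-specific rule that the general XML rules know nothing about."""
    return ET.fromstring(text).find("Title") is not None

# Hypothetical book documents, made up for this example.
complete_book = "<Book><Title>XML Basics</Title><Author>A. Writer</Author></Book>"
missing_title = "<Book><Author>A. Writer</Author></Book>"
badly_nested = "<Book><Title>XML Basics</Book></Title>"

for name, text in [("complete", complete_book),
                   ("missing title", missing_title),
                   ("badly nested", badly_nested)]:
    print(name, "well formed:", is_well_formed(text))
```

The document with no title passes the general check even though a program that relies on <Title> would fail on it, which is exactly the case a grammar for your own document type is meant to catch.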
To specify what you consider a valid XML file you need to create a Document Type Definition (DTD) or an XML Schema.
In either case what you are doing is specifying the grammar that the XML document will use. Given an XML document and an appropriate DTD or Schema it is possible to check that the document is valid, i.e. conforms to the grammar, and then go on to process it, secure in the knowledge that it is as described in the DTD/Schema.
The idea of specifying a grammar for XML documents that are intended for a particular purpose is an integral part of XML-based technologies. Any application that makes use of one of these XML customizations can work safe in the knowledge that if it has been designed to work with an XML document that is constructed according to the DTD/Schema then it will be able to process it.
This is another example of the way that XML makes it possible for generators and consumers of XML documents to be constructed in isolation. The grammar of XML ensures that they will work together even though they might have never met.
For example, most web browsers already know how to do a reasonable job of displaying XML. Programming languages provide the basic tools to allow you to write programs that process XML and so on. There is even a standard way of querying an XML document to retrieve the data between specific tags. The whole idea is that once we have picked a way of doing things we don’t have to start from scratch every time we need a new XML dialect to describe something.
To explain how schemas or DTDs work would require an in-depth tutorial, and unless you really are going to invent your own standard you don't need to know this much. In most cases you will be using standards specified by other people and they will, or should, have provided a DTD/schema that you can just use.
If you do decide that you need to implement a DTD or a schema for your own XML standard then you need to know that the schema is the newer technology and it is to be preferred. So you should look up the details of XML schema rather than how to create a DTD.
Related XML Technologies
The basic idea of XML, even if you include syntax checking using DTDs or schemas, is very simple. What makes the XML world seem much more complicated is the number of fairly core technologies that are not only based on XML but are almost an integral part of making use of it.
You are bound to have encountered jargon such as XSLT, XPath, XLink and so on. Even technologies such as SOAP, which don’t have an X in their names, are in fact related to XML in just this simple way. That is, they are special customizations of the basic XML technology to particular purposes. Each one comes with its own specific grammar as defined in a DTD or Schema file.
Although there are far too many XML-based technologies to cover all of them, there are a few that are so important you need to know a little more than just their names.
XML-compatible XHTML is now more or less superseded by HTML5. The idea was that a version of HTML implemented in strict XML, complete with a schema, would mean that a browser could check that a page was correctly formatted before even attempting to render it. However, the big problem with XHTML was that it didn't include HTML as a subset and, if you think about it, how could it? This was too much of an upheaval for most web page constructors and it never really caught on.
The time and effort of the W3C standards committee was being wasted on XHTML and not on pushing forward in the direction everyone seemed to want to go with a new and better HTML. The log jam was broken when a new group - WHATWG - decided to improve HTML to create HTML5 and the rest is history. XHTML is something you need to know about and it is supported by most browsers but it isn't the way of the future no matter how good the idea is.
SOAP and web services
SOAP is an XML-based specification for how one program can make use of another via the web. SOAP defines how data will be packaged and transferred between the two programs and hence allows interoperation. Among other XML technologies related to web services, UDDI allows the automatic discovery of web services, think of it as a web services “yellow pages”, and WSDL describes how to use a particular web service.
SOAP was invented by Microsoft and heavily promoted as part of the .NET framework. However, it was generally considered to be overcomplex and bloated. It provided a very complicated way of getting something that could also be complicated done.
Over time the alternative approach to web services - that is REST - became more popular because it was technology neutral and attacked the problem head on. REST uses custom URLs to organize the way a web service is used. A REST-based web service might well return its data in XML format or in JSON.
XQuery is the XML standard for constructing queries over XML data, in much the way that SQL is used to query a relational database. Of course XML is a natural choice for the output of a database if it is going to be further processed and Microsoft SQL Server, for example, supports XML results. There are also native XML databases that use XML as the basic unit of storage.
XPath, SAX, DOM, XPointer and XSLT
There are a number of XML standards that are of interest only to developers. They are what we use within programs to work with XML. As well as being XML standards, they are usually supported within particular languages by frameworks that allow you to load, parse and generally manipulate an XML document.
XPath can be thought of as a generalization of an HTML style URL. It enables you to specify exact locations within an XML document. So for example if you wanted to specify the tag <C> nested within <A> and <B> you would use /A/B/C. It is essentially a notation for specifying a location within a tree structure.
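As a rough sketch of the /A/B/C idea, Python's standard xml.etree.ElementTree module supports a limited subset of XPath. One assumption to note: ElementTree evaluates paths relative to the element you call it on, so with <A> as the root element the absolute path /A/B/C is written here as ./B/C. The sample document, including the extra <D> branch, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document: <C> appears under /A/B and also under /A/D.
doc = "<A><B><C>first</C><C>second</C></B><D><C>elsewhere</C></D></A>"
root = ET.fromstring(doc)  # root is the <A> element

# The equivalent of /A/B/C, expressed relative to the root element <A>.
matches = [c.text for c in root.findall("./B/C")]

# For comparison, .//C selects every <C> at any depth below the root.
all_c = [c.text for c in root.findall(".//C")]

print(matches)
print(all_c)
```

The first path picks out only the <C> elements that sit at that exact position in the tree, while the second ignores position and finds them all.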
XSLT is also about processing XML documents, but it works by specifying automatic transformations to be applied. So, for example, you can create an XSLT document that converts an XHTML page into a standard HTML page.
XPointer is the XML version of the fragment part of a URL, i.e. it identifies a particular location inside an XML document. It is used together with XLink to link XML documents together.
SAX and DOM are two APIs to allow you to process an XML document.
SAX works by reading in the document and parsing it as it does so. It then calls functions when it finds particular tags. Essentially it walks the tree structure of the XML as it reads, without ever building the tree, and calls functions when it encounters a particular node. Its big advantage is that it doesn't have to represent the entire XML document in memory so it can be fast and efficient. However, it isn't so good if you need to access nodes of the tree in random order.
DOM, on the other hand, reads in the entire XML document and constructs a Document Object Model using it. The XML DOM is very similar to the HTML DOM and if you know how to use one, the other follows. Essentially you have an object for each node of the tree and you can traverse the object hierarchy using supplied functions. The big advantage of the DOM is that it can be manipulated in complex ways and even used to generate a new XML file after you have finished.
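The callback style of SAX might look like the following minimal sketch, which uses Python's standard xml.sax module. The TitleCollector class and the small library document are hypothetical names invented for this example. The handler is told about each start tag, run of text and end tag as the parser meets them, and it keeps only the titles.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <Title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called when an opening tag is read.
        if name == "Title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Called when a closing tag is read.
        if name == "Title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

# A made-up document for illustration.
library = ("<Library><Book><Title>XML Basics</Title></Book>"
           "<Book><Title>More XML</Title></Book></Library>")

handler = TitleCollector()
xml.sax.parseString(library.encode("utf-8"), handler)
print(handler.titles)
```

Nothing except the collected titles is retained once parsing finishes, which is why this approach suits large documents but makes random access awkward.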
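For comparison, here is a small sketch of the DOM approach using Python's standard xml.dom.minidom module. The book document and the added <Author> element are invented for illustration. The whole document is loaded into a tree of node objects, which can then be read, modified and written back out as XML.

```python
from xml.dom import minidom

# Parse the whole (made-up) document into an in-memory tree of node objects.
doc = minidom.parseString("<Book><Title>XML Basics</Title></Book>")
book = doc.documentElement

# Traverse the tree: find the <Title> element and read its text node.
title = book.getElementsByTagName("Title")[0]
title_text = title.firstChild.data

# Manipulate the tree: add a new <Author> element under <Book>.
author = doc.createElement("Author")
author.appendChild(doc.createTextNode("A. Writer"))
book.appendChild(author)

# Generate new XML from the modified tree.
new_xml = book.toxml()
print(title_text)
print(new_xml)
```

Because every node is held in memory, the nodes can be visited in any order and changed freely, at the cost of loading the entire document first.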