Programmer's Introduction to XML
Written by Ian Elliot   
Thursday, 27 January 2022

eXtensible ML?

To make XML fully practical, however, we do need some additional features. The first is that XML tags can be a little more complicated than just a name.

You can include attributes within a tag to “describe” the data that follows. Attributes are also part of HTML, but there the attributes you can use are mostly determined by the HTML standard.

In XML you can invent your own attributes.

For example, you could decide to include a rating for the book in the <Book> tag:

<Book rating="5">

This inclusion of information within the tag as well as data between the tags has lots of advantages. It allows you to make a distinction between the data and additional information or meta-data concerning the real data.

For example, in this case it allows a program to select which <Book> records to process by rating, without altering the data in the body of the record.
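As a minimal sketch of that kind of selection, the following uses Python's standard xml.etree.ElementTree module. The <Library> wrapper, the titles and the function name are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <Library> wrapper and the titles are invented.
LIBRARY_XML = """
<Library>
  <Book rating="5"><Title>Life On Mars</Title></Book>
  <Book rating="2"><Title>Life On Venus</Title></Book>
  <Book rating="4"><Title>Life On Earth</Title></Book>
</Library>
"""

def titles_with_min_rating(xml_text, minimum):
    """Return the titles of <Book> elements whose rating is at least minimum."""
    root = ET.fromstring(xml_text)
    selected = []
    for book in root.iter("Book"):
        # Attribute values are always strings, so convert before comparing.
        rating = int(book.get("rating", "0"))
        if rating >= minimum:
            selected.append(book.findtext("Title"))
    return selected

print(titles_with_min_rating(LIBRARY_XML, 4))  # ['Life On Mars', 'Life On Earth']
```

The body of each record is only read, never changed; the attribute alone decides which records are picked.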

Of course it is a matter of opinion whether or not something should be an attribute or just another pair of tags.

For example, the rating could be supplied as:

 <Rating>5</Rating>
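From a program's point of view the two forms differ only in where you look for the value. A small sketch in Python, assuming a hypothetical <Book> record with a <Title>:

```python
import xml.etree.ElementTree as ET

# The same hypothetical record, once with the rating as an attribute
# and once with the rating as a child element.
as_attribute = ET.fromstring(
    '<Book rating="5"><Title>Life On Mars</Title></Book>')
as_element = ET.fromstring(
    '<Book><Title>Life On Mars</Title><Rating>5</Rating></Book>')

# An attribute is read from the element itself ...
rating_from_attribute = int(as_attribute.get("rating"))
# ... while element content is read from a child node.
rating_from_element = int(as_element.findtext("Rating"))

print(rating_from_attribute, rating_from_element)  # 5 5
```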

It can be difficult to decide when to invent a new tag or when to use an attribute but generally:

An element is used when:

  • the content is long or variable in length

  • order matters (attribute order is ignored)

  • the information really is “content”

An attribute is used when:

  • the information modifies the element in a way that isn’t naturally part of the content

  • you want to restrict the possible values used

  • the information is obviously meta-data.

The ability to invent new tags and attributes is what makes XML extensible.

XML Based Standards

As XML is so extensible you are free to invent your own ways of describing data. However there are big advantages in agreeing to use the same XML tags and attributes to describe the same sort of data.

For example you could sit down tomorrow and invent an XML dialect that describes books in minute detail. You could call your new use for XML XBooks or something similar and announce it to the world.

Of course, unless you happen to be a big company or someone with lots of influence, the chances that anyone would use your dialect as a standard are quite small.

However, there are lots of standard forms of XML that we have managed to agree on and you can find a list of some of the best known at the end of the article. There is also the phenomenon of "microformats": small chunks of data, such as names and addresses, stored in obvious and easy-to-read XML.

The advantage of using a standard form is that we can pool our data. If I enter some data using XBooks and you enter some data using XBooks then, provided we really do use the same format, merging our data should be easy.
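To make the point concrete, here is a rough sketch of such a merge in Python. XBooks is only the made-up dialect from the text, so the <XBooks> root, the <Book> records and the merge_xbooks function are all assumptions.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that both follow the invented XBooks layout.
MY_BOOKS = '<XBooks><Book rating="5"><Title>Life On Mars</Title></Book></XBooks>'
YOUR_BOOKS = '<XBooks><Book rating="3"><Title>Life On Venus</Title></Book></XBooks>'

def merge_xbooks(*documents):
    """Combine the <Book> records of several XBooks documents into one."""
    merged = ET.Element("XBooks")
    for text in documents:
        root = ET.fromstring(text)
        # Merging is only easy because every document uses the same format.
        if root.tag != "XBooks":
            raise ValueError(f"not an XBooks document: <{root.tag}>")
        merged.extend(root.findall("Book"))
    return merged

merged = merge_xbooks(MY_BOOKS, YOUR_BOOKS)
print(ET.tostring(merged, encoding="unicode"))
```

Because both sides agree on the format, the merge is nothing more than copying records from one tree to another.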

However, freedom to invent your own XML means that you can deviate from any standard, either on purpose or by accident, and this spoils the chance to share data.


Checking XML - DTD and Schema

If you invent an XML "standard", even for your own internal consumption, sooner or later you are going to have to consider the need to check that an XML document conforms to that standard.

For example, if you write a program to process the book document you might well make the assumption within the program that <Title></Title> will always be present. Your program will obviously fail if it is fed an XML document that doesn't conform to your standard. You could spend extra effort making your program robust against incorrect XML documents but a better method is to eliminate incorrect XML documents.

The problem here is that a general XML document only has to satisfy a small number of rules - basically for every opening tag there is a correctly nested closing tag of the same type. What you need to do is specify exactly what the rules are for your particular type of XML document to be considered valid.
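The gap between "satisfies the general rules" and "follows my rules" is easy to see with a parser. In this sketch, which uses Python's standard ElementTree parser, the is_well_formed helper and the sample records are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text obeys the general XML rules (nesting etc.)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Correctly nested: accepted.
print(is_well_formed("<Book><Title>Life On Mars</Title></Book>"))   # True
# Closing tags in the wrong order: rejected.
print(is_well_formed("<Book><Title>Life On Mars</Book></Title>"))   # False
# No <Title> at all: still accepted, because the general rules
# know nothing about what a book record ought to contain.
print(is_well_formed("<Book><Author>A. N. Author</Author></Book>")) # True
```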

To specify what you consider a valid XML file you need to create a Document Type Definition (DTD) or an XML Schema.

In either case what you are doing is specifying the grammar that the XML document will use. Given an XML document and an appropriate DTD or Schema it is possible to check that the document is valid, i.e. conforms to the grammar, and then go on to process it, secure in the knowledge that it is as described in the DTD/Schema.
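The XML parsers in Python's standard library are non-validating, so the sketch below is not DTD or Schema validation. It is a hand-written stand-in that checks one invented grammar rule, just to show the "check first, then process" pattern. The rule, the element names and the validate_books function are all assumptions.

```python
import xml.etree.ElementTree as ET

# Invented grammar rule: every <Book> holds exactly these children, in order.
# A real DTD or Schema would state this declaratively instead.
REQUIRED_CHILDREN = ["Title", "Author"]

def validate_books(xml_text):
    """Return a list of problems; an empty list means the rule is satisfied."""
    root = ET.fromstring(xml_text)
    problems = []
    for index, book in enumerate(root.iter("Book"), start=1):
        children = [child.tag for child in book]
        if children != REQUIRED_CHILDREN:
            problems.append(
                f"Book {index}: expected {REQUIRED_CHILDREN}, found {children}")
    return problems

GOOD = "<Library><Book><Title>T</Title><Author>A</Author></Book></Library>"
BAD = "<Library><Book><Author>A</Author></Book></Library>"

print(validate_books(GOOD))  # []
print(validate_books(BAD))
```

Once a document has passed the check, the processing code can rely on <Title> being present instead of testing for it everywhere.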

The idea of specifying a grammar for XML documents that are intended for a particular purpose is an integral part of XML-based technologies. Any application that makes use of one of these XML customizations can work safe in the knowledge that if it has been designed to work with an XML document that is constructed according to the DTD/Schema then it will be able to process it.

This is another example of the way that XML makes it possible for generators and consumers of XML documents to be constructed in isolation. The grammar of XML ensures that they will work together even though they might have never met.

For example, most web browsers already know how to do a reasonable job of displaying XML. Programming languages provide the basic tools to allow you to write programs that process XML, and so on. There is even a standard way of querying an XML document to retrieve the data between specific tags. The whole idea is that once we have picked a way of doing things we don’t have to start from scratch every time we need a new XML dialect to describe something.

To explain how schemas or DTDs work would require an in-depth tutorial and, unless you really are going to invent your own standard, you don't need to know this much. In most cases you will be using standards specified by other people and they will, or should, have provided a DTD/schema that you can just use.

If you do decide that you need to implement a DTD or a schema for your own XML standard then you need to know that the schema is the newer technology and it is to be preferred. So look up the details of XML Schema rather than how to create a DTD.

Related XML Technologies

The basic idea of XML, even if you include syntax checking using DTDs or schema, is very simple. What makes the XML world seem much more complicated is the number of fairly core technologies that are not only based on XML but almost an integral part of making use of it.

You are bound to have encountered jargon such as XSLT, XPath, XLink and so on. Even technologies such as SOAP, which don’t have an X in their names, are in fact related to XML in just this simple way. That is, they are special customizations of the basic XML technology for particular purposes. Each one comes with its own specific grammar as defined in a DTD or Schema file.

Although there are far too many XML-based technologies to cover all of them, there are a few that are so important you need to know a little more than just their names.

XHTML

XML-compatible XHTML is now more or less superseded by HTML5. The idea was that a version of HTML implemented in strict XML, complete with a schema, would mean that a browser could check that a page was correctly formatted before even attempting to render it. However, the big problem with XHTML was that it didn't include HTML as a subset and, if you think about it, how could it? This was too much of an upheaval for most web page constructors and it never really caught on. The standard is now HTML5 and XHTML is almost forgotten.

SOAP and web services

SOAP is an XML-based specification for how one program can make use of another via the web. SOAP defines how data will be packaged and transferred between the two programs and hence allows interoperation. Among other XML technologies related to web services, UDDI allows the automatic discovery of web services, think of it as a web services “yellow pages”, and WSDL describes how to use a particular web service.
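To give a feel for the packaging, this sketch builds a bare SOAP 1.1 envelope in Python. The envelope namespace is the standard SOAP 1.1 one, but the GetBookRating operation, its Title parameter and the build_request helper are invented. A real service would define its operations, normally in its own namespace, in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
ET.register_namespace("soap", SOAP_NS)

def build_request(operation, **params):
    """Wrap a hypothetical operation call in a SOAP Envelope/Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation and its parameters are plain XML inside the Body.
    call = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# GetBookRating is an invented operation name, not part of any real service.
print(build_request("GetBookRating", Title="Life On Mars"))
```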

SOAP was invented by Microsoft and heavily promoted as part of the .NET framework. However, it was generally considered to be overcomplex and bloated. It provided a very complicated way of getting something done, even when the task itself was simple.

Over time the alternative approach to web services - that is REST - became more popular because it was technology neutral and attacked the problem head on. REST uses custom URLs to organize the way a web service is used. A REST-based web service might well return its data in XML format or using JSON.
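From the client's side the choice of format changes very little. In this sketch the two response bodies are made up, as might be returned by a hypothetical URL such as /books/42. No network access is involved and the field names are assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Invented response bodies describing the same book in the two formats.
XML_BODY = '<Book rating="5"><Title>Life On Mars</Title></Book>'
JSON_BODY = '{"title": "Life On Mars", "rating": 5}'

def book_from_xml(body):
    """Decode the hypothetical XML form of a book record."""
    root = ET.fromstring(body)
    return {"title": root.findtext("Title"), "rating": int(root.get("rating"))}

def book_from_json(body):
    """Decode the hypothetical JSON form of the same record."""
    data = json.loads(body)
    return {"title": data["title"], "rating": data["rating"]}

# Either way the program ends up with the same data.
print(book_from_xml(XML_BODY) == book_from_json(JSON_BODY))  # True
```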

XQuery

XQuery is the XML standard for constructing database queries. Of course XML is a natural choice for the output of a database if it is going to be further processed and Microsoft SQL Server for example supports XML results. There are also native XML databases that use XML as the basic unit of storage.

XPath, SAX, DOM, XPointer and XSLT

There are a number of XML standards that are really only of interest to developers. They are what we use within programs to work with XML. In general, as well as being XML standards, they are usually supported within particular languages by frameworks that allow you to load, parse and generally manipulate an XML document.

XPath can be thought of as a generalization of an HTML style URL. It enables you to specify exact locations within an XML document. So for example if you wanted to specify the tag <C> nested within <A> and <B> you would use /A/B/C. It is essentially a notation for specifying a location within a tree structure.
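Python's standard ElementTree module supports only a limited subset of XPath, and its paths are relative to the element you search from rather than absolute. So, as a sketch, the /A/B/C location becomes ./B/C when you start from the <A> root. The sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented document with <C> elements in two different places.
DOC = "<A><B><C>first</C><C>second</C></B><D><C>elsewhere</C></D></A>"
root = ET.fromstring(DOC)  # root is the <A> element

# Full XPath would write /A/B/C; searching from <A> it is ./B/C.
direct = [c.text for c in root.findall("./B/C")]

# .//C finds every <C> anywhere below the starting element.
anywhere = [c.text for c in root.findall(".//C")]

print(direct)    # ['first', 'second']
print(anywhere)  # ['first', 'second', 'elsewhere']
```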

XSLT is also about processing XML documents, but it works by specifying transformations to be applied automatically. So, for example, you can create an XSLT document that converts an XHTML page into a standard HTML page.

XPointer is the XML counterpart of the fragment part of a URL, i.e. it identifies a particular location inside an XML document so that links between XML documents can point at it.

SAX and DOM are two APIs that allow you to process an XML document.

SAX works by reading in the document and parsing it as it does so, calling functions when it finds particular tags. Essentially it walks the tree structure of the XML in document order, without ever building the tree, and calls your functions when it encounters a particular node. Its big advantage is that it doesn't have to represent the entire XML document in memory so it can be fast and efficient. However, it isn't so good if you need to access nodes of the tree in random order.
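A minimal sketch of the SAX style, using Python's standard xml.sax module. The TitleCollector handler and the sample document are assumptions, while the callback names startElement, characters and endElement come from the ContentHandler interface.

```python
import xml.sax

# Invented sample document.
LIBRARY_XML = b"""<Library>
  <Book rating="5"><Title>Life On Mars</Title></Book>
  <Book rating="2"><Title>Life On Venus</Title></Book>
</Library>"""

class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <Title> as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "Title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleCollector()
xml.sax.parseString(LIBRARY_XML, handler)
print(handler.titles)  # ['Life On Mars', 'Life On Venus']
```

Notice that the handler has to remember for itself where it is in the document; nothing is kept once the parser has moved on.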

DOM, on the other hand, reads in the entire XML document and constructs a Document Object Model from it. The XML DOM is very similar to the HTML DOM and if you know how to use one, the other follows. Essentially you have an object for each node of the tree and you can traverse the object hierarchy using supplied functions. The big advantage of the DOM is that it can be manipulated in complex ways and even used to generate a new XML file after you have finished.
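For comparison, here is the same sort of job in the DOM style, sketched with Python's standard xml.dom.minidom module. The sample document and the added <Author> element are invented for illustration.

```python
from xml.dom import minidom

# Invented sample document; the whole thing is loaded into memory.
doc = minidom.parseString(
    '<Library><Book rating="5"><Title>Life On Mars</Title></Book></Library>')

# Traverse: one object per node, reached through the hierarchy.
book = doc.getElementsByTagName("Book")[0]
title = book.getElementsByTagName("Title")[0].firstChild.data

# Manipulate: change an attribute and add a new child element.
book.setAttribute("rating", "4")
author = doc.createElement("Author")
author.appendChild(doc.createTextNode("A. N. Author"))
book.appendChild(author)

# Generate: serialize the modified tree back to XML text.
result = doc.documentElement.toxml()
print(title)
print(result)
```

The method names, getElementsByTagName, createElement, appendChild and so on, will look familiar to anyone who has scripted the HTML DOM in a browser.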




Last Updated ( Friday, 28 January 2022 )