raymond yu web masthead



1st September, 1998

XML, or eXtensible Markup Language, is said to be the next major standard on the web. It promises to transform the usage, delivery, and access of information on the web.

What is XML?

XML is an extensible markup language because it is not a fixed format like HTML.

XML is not just a markup language. It is a metalanguage. It sets up a framework that enables anyone to design his or her own markup language. So it is possible to create a markup language for chemistry, mathematics, engineering, or even one for your collection of cooking recipes.

It brings to the web the strength and the advanced features of Standard Generalized Markup Language (SGML). XML is designed to enable generic SGML to be served, received, and processed on the web, and at the same time to ensure that it is as easily accessible to everyone as HTML has been.

It Ain't Broken, why fix it?

Well, the fact is that HTML has reached its limits. To enable the web to evolve further, something more is required.

The Problems

1. Presentation vs semantics. The original intention of HTML was to mark up the information in a document according to its meaning. HTML tags are supposed to refer to the nature of the information contained within them. So tags like <TITLE></TITLE> and <ADDRESS></ADDRESS> are supposed to refer to the title and address respectively. Instructions about layout or appearance should be external to the HTML document, using techniques like style sheets.

Somehow, in the rush to promote web surfing, to develop HTML browsers, and to gain market share, this original intention has been lost. Browser vendors, notably Netscape and Microsoft, introduced exotic tags in each new version of their browsers. Not only are these tags often incompatible with rival browsers (until they catch up), they focus more on the presentation than on the meaning. For example, tags like <FONT></FONT>, <B></B> and <CENTER></CENTER> say nothing about the meaning.

2. No internal structure. The absence of proper structure in HTML means that a perfectly valid HTML document may not make sense at all when judged by the semantics of its elements. For example, because the elements in the contents of BODY can be put in any order the author wants, an <H3> level heading can be placed before an <H2> level heading.

People use the heading tags as shorthand for a particular font size and style. The heading tags are supposed to denote the logical structure of the content. An <H3> level heading, being a sub-heading of the <H2> level heading, ought to be placed after an <H2> level heading.

3. Inflexibility. HTML is not designed to grow. For example, to mark up information really precisely according to its meaning, lots of elements are needed. These just are not present in HTML, nor is it physically or economically possible to have tags that cover all possibilities. The tags needed by a mathematician to present equations are different from the ones needed by a musician to display musical scores.

The "one size fits all" premise is not adequate. It fails to enable the language to adapt to the needs of different content providers.

At the moment, for example, there is no easy way to mark up mathematical equations under HTML. Equations can only be displayed by first converting them into a graphical format. This is an inefficient and time-consuming process. Importantly, the graphics are static; they cannot be changed easily, let alone be searched.

Further, because the markup for layout, appearance and linking all gets mixed together with the information, it is not easy to make changes. Changing one thing often leads to a change in everything.

The bottom line is that under HTML the ability to handle information, or semantics, is sacrificed for the appearance of the information. HTML realises only a small fraction of what a hypertext system can do.
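The heading problem described under point 2 can be detected mechanically, even though HTML itself does not forbid it. The following is only a minimal sketch in Python using the standard html.parser module. The rule it applies (a heading may not be more than one level deeper than the heading before it) and the names HeadingOrderChecker and find_heading_jumps are assumptions made for illustration, not part of any HTML standard.

```python
from html.parser import HTMLParser


class HeadingOrderChecker(HTMLParser):
    """Records heading levels and notes any heading that skips a level."""

    def __init__(self):
        super().__init__()
        self.levels = []    # heading levels in document order
        self.problems = []  # (previous_level, level) pairs that skip a level

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <H3> arrives here as "h3".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            previous = self.levels[-1] if self.levels else 0
            # Assumed rule: a heading may be at most one level deeper
            # than the heading that precedes it.
            if level > previous + 1:
                self.problems.append((previous, level))
            self.levels.append(level)


def find_heading_jumps(html_text):
    """Return the list of (previous_level, level) jumps found in html_text."""
    checker = HeadingOrderChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    page = "<BODY><H1>Title</H1><H3>Detail</H3><H2>Section</H2></BODY>"
    print(find_heading_jumps(page))
```

A browser will display such a page without complaint, which is exactly the point: the check has to be bolted on from outside because the language does not carry the structure itself.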

What XML Promises

As the emphasis shifts back to the semantics, or the information aspect, structure becomes more important. Structure gives the document a sense of logic.

This provides the basis for meaning and interpretation, not just for the surfer but for the browser. Structure gives the browser the intelligence necessary for more powerful information handling and manipulation.

Currently, the browser is dumb. It does not understand the content, except to the extent that it knows a certain block of text is a paragraph. It does not know what the text or picture is about.

Under XML such understanding is possible. For example, look at the following XML code:

<?xml version="1.0" standalone="yes"?>


<interviewer>What is the outlook for Apple Computer?</interviewer>

<interviewee><stevejobs>As you can see, the profit for this quarter is much better than expected… The <aNewModel>iMac</aNewModel> is also expected to do well.</stevejobs></interviewee>


The basic structure looks just like HTML.

Notice, unlike HTML, XML enables you to define your own markup language. This permits you to encode the information in much more precise ways than is possible with HTML.

In the above example, the whole thing is a transcript of an interview. The question and answer are enclosed by the <interviewer></interviewer> and <interviewee></interviewee> tags respectively.

This is much more meaningful than the HTML tag <P></P>. The markup shows what the information is about. In fact, the XML code together with the text provides more information than the text alone.

There is no limit to the degree of detail that can be provided. The response of the interviewee in the example is also enclosed by the tag <stevejobs></stevejobs>, indicating that Steve Jobs is the interviewee making the statement. Other tags can be added to describe who this person is, whether the comment is positive or negative, etc.

The main advantage is that this provides a means whereby programs processing these documents can "understand" the information. This enables them to process the information in more "intelligent" ways. For instance, from a full interview transcript marked up in the above manner, it is possible to construct a program to extract only the comments made by Steve Jobs in relation to the new computer, the iMac.

Information marked up in such a manner can be searched and processed in ways that are similar to a relational database. Just think of the possibilities that this provides. Surfing the web for information will not be the same again.
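To make the idea concrete, here is a rough sketch of such an extraction program, written in Python with the standard xml.etree.ElementTree module. It is only an illustration under stated assumptions: the enclosing <interview> element, the second question and answer, and the function name comments_about are all invented here, while the other tag names follow the example above.

```python
import xml.etree.ElementTree as ET

# Sample transcript. The <interview> root element and the second exchange
# are assumptions added for illustration; a parser needs a single root.
TRANSCRIPT = """<?xml version="1.0" standalone="yes"?>
<interview>
  <interviewer>What is the outlook for Apple Computer?</interviewer>
  <interviewee><stevejobs>The <aNewModel>iMac</aNewModel> is also expected to do well.</stevejobs></interviewee>
  <interviewer>And the rest of the range?</interviewer>
  <interviewee><stevejobs>We will talk about that later.</stevejobs></interviewee>
</interview>"""


def comments_about(xml_text, speaker, product):
    """Return the statements by `speaker` that mention `product` in an <aNewModel> tag."""
    root = ET.fromstring(xml_text)
    results = []
    for statement in root.iter(speaker):
        # Collect the text of every <aNewModel> element inside this statement.
        models = [model.text for model in statement.iter("aNewModel")]
        if product in models:
            # itertext() joins the statement's own text with that of nested tags.
            results.append("".join(statement.itertext()))
    return results


if __name__ == "__main__":
    for comment in comments_about(TRANSCRIPT, "stevejobs", "iMac"):
        print(comment)
```

The program never inspects the wording of the comments. It relies entirely on the markup to decide who is speaking and what is being discussed, which is what a plain text search cannot do.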


XML brings back the original intention of HTML, where the appearance of the document is defined externally by things like style sheets. These would probably be similar in nature to the Cascading Style Sheets (CSS) in use now.

The main advantage of this approach is ease of maintenance. Firstly, the markup is much "cleaner": details about the document's appearance are no longer mixed in with the substance of the document, so updating the contents is made much easier. Secondly, style sheets provide much better control of the page appearance. Uniform site appearance is more easily achievable.


Linking under HTML is primitive compared to what is possible under XML.

Broadly, the following would be possible:

Searches and Manipulations

Because XML enables information to be precisely described by the markup, it is possible to search it in much better and more powerful ways than the primitive text searches currently available. There are currently SGML query languages that are similar to SQL in power. XML will also acquire some of these capabilities. With more research and development still underway, the whole web can become a huge relational database.

Because of the structure of the document and the much richer details contained in the tags, information can be manipulated much more easily. Even mathematical equations can be searched and rearranged on the web. Indeed, with some programming it is even possible to create a program to produce the answer when the terms are substituted into the equation.
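As a rough illustration of that last point, the sketch below evaluates an equation that has been marked up as XML. Everything about the markup is an assumption made for this example: the tag names number, variable, sum, product and power are invented here and do not come from any published equation language, and the function name evaluate is likewise hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical markup for the expression m * c^2.
EQUATION = (
    "<product>"
    "<variable>m</variable>"
    "<power><variable>c</variable><number>2</number></power>"
    "</product>"
)


def evaluate(node, values):
    """Recursively compute the value of an equation element.

    `values` maps variable names to the numbers substituted for them.
    """
    if node.tag == "number":
        return float(node.text)
    if node.tag == "variable":
        return values[node.text]
    # Any other element is an operator applied to its child elements.
    operands = [evaluate(child, values) for child in node]
    if node.tag == "sum":
        return sum(operands)
    if node.tag == "product":
        return math.prod(operands)
    if node.tag == "power":
        base, exponent = operands
        return base ** exponent
    raise ValueError(f"unknown element: {node.tag}")


if __name__ == "__main__":
    equation = ET.fromstring(EQUATION)
    print(evaluate(equation, {"m": 2.0, "c": 3.0}))
```

Because the equation is stored as structure rather than as a picture, the same document could equally be searched for every use of a given variable, which is not possible with an equation converted to a graphic.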

Is XML Hard to Use?

Not at all. After all, ease of use is one of its design goals.

As seen in the above example, the structure of XML is very similar to HTML.

Because XML permits you to invent your own tags, you need to tell the browser what the tags mean. This is done through the Document Type Definition (DTD). (HTML has a DTD too. The difference is that the DTD for HTML is built into the browser.)

A DTD is technical stuff. For instance, DTD declarations could look something like this:

<!ELEMENT item (#PCDATA)>

<!ELEMENT list (item)+>

The DTD is usually a file (or several files used together) which contains a formal definition of a particular type of document. This sets out the names of the elements or tags, where they may occur, and how they relate to one another. It provides the information the processors need to automatically parse a document and determine how the page should be rendered.
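To show what the two declarations above actually demand of a document, here is a hand-written check in Python that mirrors them: a list must contain one or more item elements and nothing else, and each item holds text only. This is a sketch, not a DTD validator. Python's standard xml.etree.ElementTree module parses XML but does not validate it against a DTD, so the rules are coded by hand, and the function name conforms_to_list_dtd is invented for this example.

```python
import xml.etree.ElementTree as ET


def conforms_to_list_dtd(xml_text):
    """Check a document against the rules expressed by the two declarations:

        <!ELEMENT item (#PCDATA)>
        <!ELEMENT list (item)+>
    """
    root = ET.fromstring(xml_text)
    if root.tag != "list":
        return False
    items = list(root)
    if not items:
        return False  # (item)+ means at least one item is required
    for item in items:
        if item.tag != "item":
            return False  # only item elements may appear inside list
        if len(item) > 0:
            return False  # (#PCDATA) means text only, no child elements
    # Between the items only whitespace is allowed, not loose text.
    stray_text = [root.text] + [item.tail for item in items]
    if any(text and text.strip() for text in stray_text):
        return False
    return True


if __name__ == "__main__":
    print(conforms_to_list_dtd("<list><item>flour</item><item>eggs</item></list>"))
    print(conforms_to_list_dtd("<list></list>"))
```

A real validating processor would read the rules from the DTD itself rather than having them written into the program, which is what makes the DTD approach general.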

Writing a DTD can be a rather complex task. Obviously, not everyone is prepared to do this. As a result, XML has been designed so that it can be used either with or without a DTD. Yes, this means your invented markup would still work even without defining it formally.

Effectively, a DTDless file 'defines' its own markup by the existence and location of elements where you create them. This is only possible provided that the document is "well-formed". This basically means that the structure of the document, including the tags, must conform to a particular standard. This is a much stricter standard than that under HTML.
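Whether a document is well-formed can be tested simply by handing it to an XML parser, which must reject anything that is not. The following minimal sketch uses Python's standard xml.etree.ElementTree module. The function name is_well_formed and the sample recipe markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Every opening tag is closed, and the tags nest properly.
    print(is_well_formed("<recipe><step>Boil water</step></recipe>"))
    # The <step> tag is never closed, which HTML habits might tolerate.
    print(is_well_formed("<recipe><step>Boil water</recipe>"))
```

The second sample shows the stricter standard at work: leaving out a closing tag, which browsers routinely forgive in HTML, is enough for an XML parser to refuse the whole document.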

The Future Looks Bright

XML may not look simple when you see things like the DTD, but given that it possesses the strengths of SGML and yet maintains the ease of use of HTML, it is a remarkable development.

Compared to HTML, it may be much harder to code XML by hand, but the power offered by XML more than justifies the extra effort involved. Anyway, authoring programs will no doubt come out to take much of the technicality away, leaving you with the means to access the power of XML and take the provision and access of information on the web to new heights.

A good thing is that Netscape and Microsoft have already pledged their support for XML. Indeed, it is even rumoured that the next version of both browsers will be XML compliant and will incorporate some XML features.

Full implementation of XML is some way off. XML is still under development, especially the linking and style sheet aspects. However, it is definitely something to watch out for.




