The purpose of this article is to introduce some technical XML principles.
- Understand the technical structure of an XML document
- Understand well-formed documents
- Understand valid documents
- Get some idea of what XML is about
- Tour de XML (optional)
- Next steps
The two most fundamental ideas you should start believing are the following:
- XML does not exist as a language. It is just a formalism for creating languages
- XML can be used to describe almost any kind of structure
Below, we shall introduce the technical bare bones of XML.
2 Structure of an XML document
An XML document usually includes:
- Processing instructions (at least an XML declaration on top!)
- Declarations, in particular a Document Schema like a DTD
- Element markup: content delimited by tags like <my_tag>contents</my_tag>, or tags without contents like <my_tag/>
- Attribute markup like <my_tag style="green">: style would be the attribute and "green" the attribute value.
- Entities (i.e. symbols that are substituted by other contents at runtime)
- comments <!-- .... -->
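The building blocks listed above can be inspected with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample document and its names (note, to, empty_tag) are made up for illustration and do not come from any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: all element and attribute names are made up.
DOC = """<?xml version="1.0"?>
<!-- a comment (dropped by the default parser) -->
<note style="green">
  <to>Pierre &amp; Elisabeth</to>
  <empty_tag/>
</note>"""

root = ET.fromstring(DOC)

element_name = root.tag                 # element markup: the tag name
attributes = root.attrib                # attribute markup: name -> value
recipient = root.find("to").text        # the entity &amp; is replaced by &
empty = root.find("empty_tag")          # a tag without contents
```

Here element_name is "note", attributes maps style to green, and the predefined entity has already been substituted in recipient by the time the program sees the text.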
XML documents are trees
For a computer person, an XML document has a so-called tree structure. We also call it “boxes within boxes”. Inside a browser or most other clients, the document is represented as a tree-based data structure, the so-called Document Object Model (DOM)
Below is an XML fragment for a CALS (Docbook) table example:
<?xml version="1.0"?>
<TABLE>
  <TBODY>
    <TR>
      <TD>Pierre Muller</TD>
      <TD>http://pm.com/</TD>
    </TR>
    <TR>
      <TD>Elisabeth Dupont</TD>
      <TD></TD>
    </TR>
  </TBODY>
</TABLE>
A graphical representation of this tree looks like this:
If we look at this as "boxes within boxes", there is a TABLE box that includes a TBODY box. The TBODY box includes two TR boxes. Each of the latter includes two TD boxes. In the XML markup language, each tree element or box starts with a start tag (e.g. <TABLE>), and ends with an end tag (e.g. </TABLE>).
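The tree can also be made visible with a few lines of code. This is a small sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name outline is hypothetical. It walks the table example and indents each element according to its depth.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """<TABLE>
  <TBODY>
    <TR><TD>Pierre Muller</TD><TD>http://pm.com/</TD></TR>
    <TR><TD>Elisabeth Dupont</TD><TD></TD></TR>
  </TBODY>
</TABLE>"""


def outline(element, depth=0):
    """Return one indented line per element, children listed below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(TABLE_XML))
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one "box" nested inside another.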
The totally different example below (inspired by Wikipedia's RDF entry) shows an RDF fragment, i.e. a language for defining relationships between concepts.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://edutechwiki.unige.ch/en/XML_principles">
    <dc:title>DKS</dc:title>
    <dc:publisher>EduTechWiki</dc:publisher>
  </rdf:Description>
</rdf:RDF>
The rdf:RDF "box" includes an rdf:Description box, which in turn includes a dc:title box and a dc:publisher box.
All XML documents must be well-formed. XML documents can be valid with respect to a grammar (also called schema, document type, language, etc.). See below for details.
3 Well-formed XML documents
Any XML document must be at least well-formed. Well-formed XML documents obey the following rules:
(1) A document should start with an XML declaration, including the XML version number (XML 1.1 requires it, XML 1.0 strongly recommends it)
You may specify an encoding scheme (the default is UTF-8). Of course this means that you'll have to stick to this encoding! Make sure to check your editor's settings.
<?xml version="1.0" encoding="ISO-8859-1"?>
We suggest not using any encoding other than UTF-8. However, you may have to deal with legacy XML documents that do use a restricted encoding scheme like ISO-8859-1.
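Why the declared encoding must match the bytes actually saved can be shown with a short experiment. This is a sketch only, assuming Python's standard xml.etree.ElementTree parser; the element name and the sample surname are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; \u00fc is the letter u with umlaut.
TEXT = '<?xml version="1.0" encoding="ISO-8859-1"?><name>M\u00fcller</name>'

# Case 1: the file really is stored as ISO-8859-1, as the declaration says.
consistent = TEXT.encode("iso-8859-1")

# Case 2: the editor silently saved UTF-8 but the declaration was left alone.
inconsistent = TEXT.encode("utf-8")

good = ET.fromstring(consistent).text
bad = ET.fromstring(inconsistent).text

declaration_respected = good == "M\u00fcller"
text_garbled = bad != good
```

In the second case the document is still well-formed, so no error is raised; the parser simply trusts the declaration and the non-ASCII character comes out wrong.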
(2) The XML structure must be hierarchical
- start-tags and end-tags must match
- no cross-overs like in the following bad example
<i>...<b>...</i> .... </b>
- pay attention to case sensitivity, e.g. "LI" is not "li"
- "EMPTY" tags must be closed too: <br></br> should be written as <br/>; a lonely <br> would be illegal
(3) Attributes must have values and values must be quoted:
- e.g. <a href="http://scholar.google.com"> or <person status="employed">
- e.g. <input type="radio" checked="checked">
(4) A single root element per document
- The root element opens and closes content
- The root element must not appear in the content of any other element
(5) Special characters: <, &, >, ", and '. Use one of the five predefined entities:
&lt; &amp; &gt; &quot; &apos;
instead of <, &, >, ", '
This principle also applies to URLs: an & inside a link must be written as &amp;.
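When XML is generated by a program, the special characters are usually replaced by a library function rather than by hand. The sketch below assumes Python's standard xml.sax.saxutils module; the sample text and the example.org URL are made up.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() always replaces &, < and >; the two quote entities are opt-in.
QUOTES = {'"': "&quot;", "'": "&apos;"}

# Text content containing several special characters.
text = escape('if (x < y && name == "Joe")', QUOTES)

# quoteattr() escapes a value and wraps it in quotes, ready for an attribute.
link = quoteattr("http://example.org/search?q=xml&lang=en")

print(text)
print(link)
```

Note that the & separating the two URL parameters is escaped as well, which is the point made about URLs above.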
Example of a minimal well-formed XML document:
<?xml version="1.0" ?>
<page updated="jan 2007">
  <title>Hello friend</title>
  <content> Here is some content :) </content>
  <comment> Written by DKS/Tecfa </comment>
</page>
- has an XML declaration on top
- has a root element (i.e. page)
- elements are nested and tags are closed
- the updated attribute has quoted value
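Any XML parser can serve as a well-formedness checker, since it must reject documents that break the rules. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the short bad samples are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


GOOD = '<page updated="jan 2007"><title>Hello friend</title></page>'

# One deliberately broken sample per rule discussed above.
BAD = {
    "crossing tags": "<p><i>...<b>...</i>...</b></p>",
    "case mismatch": "<LI>item</li>",
    "lonely empty tag": "<p>line<br></p>",
    "unquoted attribute": "<person status=employed/>",
    "attribute without value": '<input type="radio" checked/>',
    "two root elements": "<a/><b/>",
    "raw ampersand": "<a>fish & chips</a>",
}

good_result = is_well_formed(GOOD)
bad_results = {label: is_well_formed(doc) for label, doc in BAD.items()}
```

The good sample is accepted and every bad sample is rejected. The parser used here does not insist on an XML declaration, so the samples leave it out for brevity.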
4 XML names and CDATA Sections
Names used for elements should start with a letter and only use letters, numbers, the underscore, the hyphen and the period (no other punctuation marks)!
- Good: <driversLicenceNo> <drivers_licence_no>
- Bad: <driver’s_licence_number> <driver’s_licence_#> <drivers licence number>
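The naming advice above can be expressed as a small check. This is a sketch of that simplified rule only, written in Python; the helper name is_simple_name is hypothetical, and the real XML specification allows more names than this pattern does (for example a leading underscore and non-ASCII letters).

```python
import re

# Simplified rule from the text: a letter first, then letters, digits, "_", "-" or ".".
# Assumption: ASCII letters only, which is stricter than the XML specification.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.-]*")


def is_simple_name(name):
    """Return True if the whole name follows the simplified rule."""
    return NAME_RE.fullmatch(name) is not None


good_names = ["driversLicenceNo", "drivers_licence_no"]
bad_names = ["driver's_licence_number", "driver's_licence_#", "drivers licence number"]

good_checks = [is_simple_name(n) for n in good_names]
bad_checks = [is_simple_name(n) for n in bad_names]
```

The apostrophe, the # sign and the space each make a name fail the check.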
When you want to display data that includes "XMLish" things like the < sign that should not be interpreted, then you can use so-called CDATA Sections:
<?xml version="1.0" ?>
<example>
  <![CDATA[ (x < y) is a math expression ]]>
</example>
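A CDATA section and entity escaping are two spellings of the same content. The sketch below, assuming Python's standard xml.etree.ElementTree parser, parses both forms and compares the text an application would receive.

```python
import xml.etree.ElementTree as ET

# Same content written twice: once in a CDATA section, once with an entity.
with_cdata = ET.fromstring(
    "<example><![CDATA[(x < y) is a math expression]]></example>"
)
with_entity = ET.fromstring(
    "<example>(x &lt; y) is a math expression</example>"
)

cdata_text = with_cdata.text
entity_text = with_entity.text
```

Both variables hold the plain string with a literal < sign; the CDATA markers themselves are not part of the data.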
5 Valid XML documents
A valid document must:
- be “well-formed” (see above)
- conform to a grammar (also called "schema"). In particular, a valid XML document only uses elements (tags) and attributes defined by the grammar. It also respects nesting, ordering and other constraints defined by that grammar.
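The difference between well-formed and valid can be illustrated with a toy checker. This is only a sketch in Python: the grammar is a hand-written dictionary, not a real schema language, the helper name violations is hypothetical, and ordering and repetition constraints are ignored.

```python
import xml.etree.ElementTree as ET

# Toy grammar (an assumption for illustration, not a DTD):
# element name -> (allowed child elements, allowed attributes)
GRAMMAR = {
    "page": ({"title", "content", "comment"}, {"updated"}),
    "title": (set(), set()),
    "content": (set(), set()),
    "comment": (set(), set()),
}


def violations(element):
    """Collect rule violations for an element and all of its descendants."""
    if element.tag not in GRAMMAR:
        return [f"unknown element <{element.tag}>"]
    children, attributes = GRAMMAR[element.tag]
    found = [
        f"unknown attribute {name} on <{element.tag}>"
        for name in element.attrib
        if name not in attributes
    ]
    for child in element:
        if child.tag in GRAMMAR and child.tag not in children:
            found.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        found.extend(violations(child))
    return found


valid_doc = ET.fromstring('<page updated="jan 2007"><title>Hello friend</title></page>')
invalid_doc = ET.fromstring("<page><titel>Hello</titel><title><page/></title></page>")

valid_report = violations(valid_doc)
invalid_report = violations(invalid_doc)
```

Both documents are well-formed, since the parser accepts them. Only the first is valid with respect to the toy grammar; the second uses an undefined element and nests page inside title.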
Kinds of XML grammars
There exist several types of XML grammars.
- DTDs (Document Type Definitions) are part of the XML standard
- XML Schema (XSD) is a more recent W3C standard, used to express stronger constraints
- Relax NG (RNG, RNC) is an OASIS standard (made by well-known XML experts who don’t like XML Schema ...). It has functionality comparable to XML Schema.
- Schematron. A complementary standard that is used to define additional constraints that can't be expressed with either XML Schema or Relax NG
6 Name spaces
It is possible to use several vocabularies within a well-formed document. If a markup language is formally defined as a compound of several languages, such documents can also be validated.
- E.g. there is a so-called profile for XHTML + SVG + MathML
Now, imagine that you could just mix tags from different languages together. The client application could not know which tags belong to which XML language. There could also be so-called naming conflicts (e.g. "title" does not mean the same thing in XHTML and SVG). Name spaces were invented to address these problems: one can prefix element and attribute names with a label that represents a name space.
Declaring name spaces for additional vocabularies
The xmlns:name_space attribute introduces a new vocabulary. It declares that all elements or attributes prefixed by "name_space:" belong to a different vocabulary.
SVG within (true) XHTML example
<?xml version="1.0" ?>
<html xmlns:svg="http://www.w3.org/2000/svg">
  <svg:rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
- xmlns:svg = "..." means that svg: prefixed elements are part of SVG
Note: This example only works if the *.xhtml file is served as XML from the server. On your local PC, you can try to rename the file to *.xml.
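How a prefix resolves a naming conflict can be seen in what a parser reports. The following sketch assumes Python's standard xml.etree.ElementTree parser, which replaces each prefix by the name space URL in curly braces; the prefix h for XHTML and the sample titles are made up.

```python
import xml.etree.ElementTree as ET

# Two vocabularies, each with its own "title" element.
# The prefix "h" is an arbitrary choice made for this example.
DOC = """<h:html xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:svg="http://www.w3.org/2000/svg">
  <h:head><h:title>Page title</h:title></h:head>
  <h:body>
    <svg:svg><svg:title>Drawing title</svg:title></svg:svg>
  </h:body>
</h:html>"""

root = ET.fromstring(DOC)

# Collect the full (expanded) names of every element called "title".
titles = [el.tag for el in root.iter() if el.tag.endswith("}title")]
```

The two title elements end up with different full names, so a client application can tell the XHTML one from the SVG one.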
XLink is a language to define links (only works with Firefox-based browsers)
<?xml version="1.0" ?>
<RECIT xmlns:xlink="http://www.w3.org/1999/xlink">
  <INFOS>
    <Date>30 octobre 2003 - </Date>
    <Auteur>DKS - </Auteur>
    <A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
       xlink:type="simple">CSS Validator</A>
  </INFOS>
Namespace declaration for the main vocabulary
The main vocabulary can be introduced by an attribute like: xmlns="http://www.w3.org/2000/svg"
Some specifications (e.g. SVG or XHTML) require a name space declaration in any case (even if you do not use any other vocabulary)!
SVG namespace example
<?xml version="1.0" ?>
<svg xmlns="http://www.w3.org/2000/svg">
  <rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
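Declaring a name space for the main vocabulary and declaring it with a prefix lead to the same element names. This short sketch, assuming Python's standard xml.etree.ElementTree parser, compares the two spellings; the prefix s and the shortened rect element are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Main-vocabulary declaration: unprefixed elements belong to SVG.
default_form = ET.fromstring(
    f'<svg xmlns="{SVG_NS}"><rect width="200" height="100"/></svg>'
)
# Prefixed declaration of the same vocabulary (prefix "s" is arbitrary).
prefixed_form = ET.fromstring(
    f'<s:svg xmlns:s="{SVG_NS}"><s:rect width="200" height="100"/></s:svg>'
)

default_tags = [el.tag for el in default_form.iter()]
prefixed_tags = [el.tag for el in prefixed_form.iter()]
rect_attributes = default_form[0].attrib
```

Both lists contain the same expanded names. Unprefixed attributes such as width are not placed in the element's name space.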
What are Namespace URLs ?
URLs that define namespaces are just names; there doesn’t need to be a real link. E.g. for your own purposes you could very well make up something like:
<?xml version="1.0" ?>
<account xmlns:pin="http://joe.miller.com/pin">
  <pin:name>Joe</pin:name>
</account>
... and the URL http://joe.miller.com/pin doesn’t need to exist for real.
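That the URL is only compared as a string can be checked directly. The sketch below assumes Python's standard xml.etree.ElementTree parser, which makes no network request while parsing this document; the search prefix p is an arbitrary choice for the example.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<account xmlns:pin="http://joe.miller.com/pin">'
    "<pin:name>Joe</pin:name>"
    "</account>"
)
root = ET.fromstring(DOC)

# The prefix used for searching is local to this call;
# only the URL string has to match the one in the document.
NS = {"p": "http://joe.miller.com/pin"}
name = root.find("p:name", NS).text

# Without the name space, the element is not found.
unqualified = root.find("name")
```

The lookup succeeds even though the search uses a different prefix than the document, because only the name space URL is compared.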
7 XML with style
XML per se doesn't say anything about display and style, however:
- Some languages like HTML or SVG or X3D do have built-in rendering mechanisms
- XML documents can be associated with a CSS stylesheet for rendering in a web browser. However, using CSS only makes sense when the XML is text-centric and contents are embedded within tags (as opposed to attributes). Read the CSS for XML tutorial if you want to learn more.
- XSLT allows translating an XML document into something else, e.g. you could translate your own little XML language into HTML or SVG for display.
- Other specialized styling languages exist, like XSL-FO for producing print documents. For example, you could produce a PDF file from your XML using XSLT + XSL-FO
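The idea of translating your own XML language into HTML can be sketched without XSLT. The example below is a hand-written stand-in in Python, not XSLT itself (which is not part of Python's standard library); the function name page_to_html and the choice of h1 and p as target elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<page updated="jan 2007">
  <title>Hello friend</title>
  <content>Here is some content :)</content>
</page>"""


def page_to_html(page):
    """Build an HTML tree from a <page> element: title -> h1, content -> p."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = page.findtext("title")
    ET.SubElement(body, "p").text = page.findtext("content")
    return html


html_text = ET.tostring(page_to_html(ET.fromstring(PAGE)), encoding="unicode")
print(html_text)
```

An XSLT stylesheet would state the same mapping declaratively, as rules that match title and content elements and emit the corresponding HTML.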
Remember: XML per se cannot include media (e.g. pictures), doesn't understand links, doesn't have style. XML does not exist. XML languages do....