XML principles

The educational technology and digital learning wiki

Introduction

The purpose of this article is to introduce some technical XML principles.

Learning goals

  • Understand the technical structure of an XML document
  • Understand well-formed documents
  • Understand valid documents
Prerequisites
  • Some idea what XML is about
Next steps


The two most fundamental ideas you should start believing are the following:

  • XML does not exist as a language. It is just a formalism for creating languages
  • XML can be used to describe almost any kind of structure

Below, we shall introduce the technical barebones of XML.

Structure of an XML document

An XML document usually includes:

  1. Processing instructions (at the very least, an XML declaration at the top)
  2. Declarations, in particular a document schema such as a DTD
  3. Element markup: content delimited by tags like <my_tag>contents</my_tag> or tags without contents like <self_closing_tag/>
  4. Attribute markup like <my_tag style="green">....</my_tag>. Here, style is the attribute and "green" is the attribute value.
  5. Entities (i.e. symbols that are substituted by other contents when the document is parsed)
  6. Comments: <!-- .... -->

XML documents are trees

For a computer person, an XML document has a so-called tree structure, which can also be pictured as “boxes within boxes”. Inside a browser or most other clients, the document is represented as a tree-based data structure, the so-called Document Object Model (DOM).

Below is an XML fragment for a simple table, using HTML-style element names (a true CALS/DocBook table would use tgroup, row and entry elements instead):

<?xml version="1.0"?>
 <TABLE>  
   <TBODY>
     <TR> 
       <TD>Pierre Muller</TD> 
       <TD>http://pm.com/</TD> 
     </TR>
     <TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR> 
   </TBODY> 
 </TABLE>

A graphical representation of this tree looks like this:

Tree representation of a table display structure
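
The tree can also be explored programmatically. The following is a minimal sketch that assumes Python and its standard xml.etree.ElementTree module (neither is part of the original article); the helper name outline is made up for this illustration. It parses the table fragment and lists one element per line, indented by its depth in the tree.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """<?xml version="1.0"?>
<TABLE>
  <TBODY>
    <TR>
      <TD>Pierre Muller</TD>
      <TD>http://pm.com/</TD>
    </TR>
    <TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR>
  </TBODY>
</TABLE>"""


def outline(element, depth=0):
    """Return one line per element, indented by nesting depth (hypothetical helper)."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(TABLE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The printed outline mirrors the "boxes within boxes" picture: one TABLE, one TBODY inside it, two TR elements inside that, and two TD elements inside each TR.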

If we look at this as "boxes within boxes", there is a TABLE box that includes a TBODY box. The TBODY box includes two TR boxes, each of which includes two TD boxes. In XML markup, each tree element or box starts with a <TAG> (e.g. <TR>) and ends with a </TAG> (e.g. </TR>).

The totally different example below (inspired by Wikipedia's RDF entry) shows an RDF fragment. RDF is a language for defining relationships between concepts.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">

        <rdf:Description rdf:about="http://edutechwiki.unige.ch/en/XML_principles">
                <dc:title>DKS</dc:title>
                <dc:publisher>EduTechWiki</dc:publisher>
        </rdf:Description>
</rdf:RDF>

The rdf:RDF "box" includes an rdf:Description box, which in turn includes a dc:title box and a dc:publisher box.


All XML documents must be well-formed. XML documents can be valid with respect to a grammar (also called schema, document type, language, etc.). See below for details.

Well-formed XML documents

Any XML document must be at least well-formed. Well-formed XML documents obey the following rules:

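(1) A document should start with an XML declaration, including the XML version number. (Strictly speaking, the declaration is optional in XML 1.0, but always including it is good practice.)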

<?xml version="1.0"?>

You may specify an encoding scheme (the default is UTF-8). Of course, this means that the file must actually be saved in this encoding! Make sure to check your editor's settings.

<?xml version="1.0" encoding="ISO-8859-1"?> 

We suggest not using any encoding other than UTF-8. However, you may have to deal with legacy XML documents that use a restricted encoding scheme like ISO-8859-1.
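
Why the declared encoding has to match the real one can be shown with a small sketch. It assumes Python's standard xml.etree.ElementTree parser; the name element and its content are invented for the illustration. The same text is saved once as ISO-8859-1 and once as UTF-8, while the declaration claims ISO-8859-1 in both cases.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the declaration promises ISO-8859-1.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Müller</name>'

# Case 1: the bytes really are ISO-8859-1, so the parser decodes them correctly.
good_bytes = text.encode("iso-8859-1")
name = ET.fromstring(good_bytes).text

# Case 2: the editor saved UTF-8, but the declaration still says ISO-8859-1.
# The parser trusts the declaration and decodes each UTF-8 byte separately.
bad_bytes = text.encode("utf-8")
garbled = ET.fromstring(bad_bytes).text

print(name)     # the original name
print(garbled)  # the same name with the umlaut turned into two wrong characters
```

In the second case the document is still well-formed, so no error is raised; the content is silently corrupted, which is why checking the editor's settings matters.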

(2) The XML structure must be hierarchical

  • start-tags and end-tags must match
  • no cross-overs like in the following bad example
  <i>...<b>...</i> .... </b>
  • pay attention to case sensitivity, e.g. "LI" is not "li"
  • empty elements should use the self-closing form: <br></br> is legal but is better written as <br/>, whereas a lonely <br> is illegal

(3) Attributes must have values and values must be quoted:

e.g. <a href="http://scholar.google.com"> or <person status="employed">
e.g. <input type="radio" checked="checked">

(4) A single root element per document

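The root element opens and closes the content: every other element is nested inside it.
The root element should not appear inside any other element (i.e. it should not be part of the content definition of another element).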

(5) Special characters: <, &, >, " and '. Use one of the five predefined entities:

 &lt; &amp; &gt; &quot; &apos;

instead of

 <, &, >, ", '

This principle also applies to URLs!

bad: http://truc.unige.ch/programme?bla&machin
good: http://truc.unige.ch/programme?bla&amp;machin
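
When XML is generated by a program, the escaping is best left to a library. The sketch below assumes Python and its standard xml.sax.saxutils helpers escape and quoteattr (not something the original article uses); the a element and the sample text are only illustrations.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

url = "http://truc.unige.ch/programme?bla&machin"

# escape() replaces &, < and > with the predefined entities.
escaped_url = escape(url)
print(escaped_url)  # http://truc.unige.ch/programme?bla&amp;machin

# quoteattr() escapes the value and also wraps it in quotes,
# so it can be dropped directly into an attribute.
link = f"<a href={quoteattr(url)}>{escape('x < y & z')}</a>"
print(link)

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(link)
href = parsed.get("href")
label = parsed.text
```

The parser undoes the escaping, so the application reading the document sees the plain URL again.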

Example of a minimal well-formed XML document:

<?xml version="1.0" ?>
 <page updated="jan 2007">
  <title>Hello friend</title>
  <content> Here is some content :) </content> 
  <comment> Written by DKS/Tecfa </comment>
 </page>

This example:

  • has an XML declaration on top
  • has a root element (i.e. page)
  • elements are nested and tags are closed
  • the updated attribute has a quoted value
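
Well-formedness can be checked mechanically, because any XML parser must reject a document that breaks these rules. The following sketch assumes Python's standard xml.etree.ElementTree parser; the function name is_well_formed and the three broken snippets are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


GOOD = """<?xml version="1.0" ?>
<page updated="jan 2007">
  <title>Hello friend</title>
  <content> Here is some content :) </content>
  <comment> Written by DKS/Tecfa </comment>
</page>"""

CROSSED = "<p><i>...<b>...</i> .... </b></p>"  # rule 2: tags cross over
UNQUOTED = "<page updated=jan></page>"          # rule 3: value not quoted
TWO_ROOTS = "<a></a><b></b>"                    # rule 4: more than one root

for label, doc in [("good", GOOD), ("crossed", CROSSED),
                   ("unquoted", UNQUOTED), ("two roots", TWO_ROOTS)]:
    print(label, is_well_formed(doc))
```

Note that this parser only checks well-formedness. It does not validate the document against a grammar.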

XML names and CDATA Sections

Names used for elements should start with a letter and only use letters, digits, the underscore, the hyphen and the period (no other punctuation marks and no spaces)!

Good: <driversLicenceNo> <drivers_licence_no>
Bad: <driver’s_licence_number> <driver’s_licence_#> <drivers licence number>
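
The naming guideline above can be expressed as a regular expression. This is a sketch in Python of the simplified rule given in this article only; the real XML specification is more permissive (it also allows, for example, non-ASCII letters and a leading underscore). The names SIMPLE_NAME and is_simple_name are invented for the illustration.

```python
import re

# Simplified rule from this article (an assumption, stricter than the XML spec):
# an ASCII letter first, then letters, digits, underscore, hyphen or period.
SIMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.\-]*")


def is_simple_name(name):
    """Return True if the whole name follows the simplified rule."""
    return SIMPLE_NAME.fullmatch(name) is not None


for candidate in ["driversLicenceNo", "drivers_licence_no",
                  "driver's_licence_#", "drivers licence number"]:
    print(candidate, is_simple_name(candidate))
```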

When you want to display data that includes "XMLish" things like the < sign that should not be interpreted as markup, you can use so-called CDATA sections:

<?xml version="1.0" ?>
 <example> 
  <![CDATA[ 
   (x < y) is a math expression
 ]]>
</example>
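
From the point of view of an application, a CDATA section is just text. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a condensed variant of the example, written with the <![CDATA[ ... ]]> delimiters. It shows that the parser hands over the raw characters, and that writing the element out again yields the equivalent escaped form.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" ?>
<example><![CDATA[(x < y) is a math expression]]></example>"""

element = ET.fromstring(doc)

# The parser delivers the CDATA content as ordinary text.
content = element.text
print(content)  # (x < y) is a math expression

# Serializing the element uses entity escaping instead of CDATA;
# both forms represent the same text.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)  # <example>(x &lt; y) is a math expression</example>
```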

Valid XML documents

A valid document must:

  1. be “well-formed” (see above)
  2. conform to a grammar (also called "schema"). In particular, a valid XML document only uses elements (tags) and attributes defined by the grammar. It also respects nesting, ordering and other constraints defined by that grammar.
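
To make the idea of "only using what the grammar defines" concrete, here is a deliberately tiny sketch in Python. It is not a real validator and does not use DTD, XML Schema or Relax NG: the GRAMMAR dictionary and the violations function are hypothetical, and the check ignores ordering and other constraints. Real projects would rely on a validating parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-grammar for illustration only:
# element name -> (allowed child elements, allowed attributes)
GRAMMAR = {
    "page": ({"title", "content", "comment"}, {"updated"}),
    "title": (set(), set()),
    "content": (set(), set()),
    "comment": (set(), set()),
}


def violations(element):
    """List elements and attributes that the toy grammar does not define."""
    if element.tag not in GRAMMAR:
        return [f"unknown element <{element.tag}>"]
    children, attributes = GRAMMAR[element.tag]
    problems = [f"unknown attribute {name} on <{element.tag}>"
                for name in element.attrib if name not in attributes]
    for child in element:
        if child.tag not in children:
            problems.append(f"<{child.tag}> not allowed in <{element.tag}>")
        else:
            problems.extend(violations(child))
    return problems


valid_doc = ET.fromstring(
    '<page updated="jan 2007"><title>Hello</title><content>Hi</content></page>')
invalid_doc = ET.fromstring(
    '<page><title>Hello</title><author>DKS</author></page>')

print(violations(valid_doc))    # []
print(violations(invalid_doc))  # ['<author> not allowed in <page>']
```

Both documents are well-formed, but only the first one stays within the vocabulary of the toy grammar.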

Kinds of XML grammars

There exist several types of XML grammars:

  • DTDs (Document Type Definitions) are part of the XML standard
  • XML Schema (XSD) is a more recent W3C standard, used to express stronger constraints
  • Relax NG (RNG, RNC) is an OASIS standard (made by well-known XML experts who don’t like XML Schema ...). It has functionality comparable to XML Schema.
  • Schematron is a complementary standard used to define additional constraints that can't be expressed with either XML Schema or Relax NG.

Name spaces

It is possible to use several vocabularies within a well-formed document. If the markup language formally includes compound languages, such documents can also be validated.

E.g. there is a so-called profile for XHTML + SVG + MathML.

Now, imagine that you could just mix tags from different languages together. The problem would be that the client application could not know which tags belong to which XML language. Also, there could be so-called naming conflicts (e.g. "title" does not mean the same thing in XHTML and SVG). To address these problems, so-called namespaces have been invented: one can prefix element and attribute names with a label that represents a namespace.

Declaring name spaces for additional vocabularies

The "xmlns:name_space" attribute introduces a new vocabulary. It declares that all elements or attributes prefixed by "name_space" belong to a different vocabulary.

Syntax:

xmlns:name_space="URL_name_of_name_space"

SVG within (true) XHTML example

<?xml version="1.0" ?>
 <html xmlns:svg="http://www.w3.org/2000/svg">
    <svg:rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....

xmlns:svg="..." means that elements prefixed with svg: are part of SVG.

Note: This example only works if the *.xhtml file is served as XML from the server. On your local PC, you can try to rename the file into *.xml.


Xlink example:

XLink is a language to define links (only works with Firefox-based browsers)

<?xml version="1.0" ?>
 <RECIT xmlns:xlink="http://www.w3.org/1999/xlink">
  <INFOS>
   <Date>30 octobre 2003 - </Date>
   <Auteur>DKS - </Auteur>
   <A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
      xlink:type="simple">CSS Validator</A>
  </INFOS>
 </RECIT>

Namespace declaration for the main vocabulary

The main vocabulary can be introduced by an attribute like:

 xmlns="URL_name_of_name_space"

Some specifications (e.g. SVG or XHTML) require a namespace declaration in any case, even if you do not use any other vocabulary!

SVG namespace example

<?xml version="1.0" ?>
 <svg xmlns="http://www.w3.org/2000/svg">
    <rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....

What are Namespace URLs ?

URLs that define namespaces are just names; they do not need to point to a real resource. E.g. for your own purposes you could very well make up something like:

<?xml version="1.0" ?>
 <account xmlns:pin="http://joe.miller.com/pin">
   <pin:name>Joe</pin:name>
 </account>

... and the URL http://joe.miller.com/pin doesn’t need to exist for real.
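
A parser treats the namespace URL purely as a string. The sketch below assumes Python's standard xml.etree.ElementTree module and a document like the account example, with the declaration spelled xmlns:pin. The prefix p used in the lookup is an arbitrary choice made for this illustration. No network request is made at any point.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" ?>
<account xmlns:pin="http://joe.miller.com/pin">
  <pin:name>Joe</pin:name>
</account>"""

root = ET.fromstring(doc)
child = root[0]

# ElementTree replaces the prefix by the namespace name in curly braces.
print(root.tag)   # account
print(child.tag)  # {http://joe.miller.com/pin}name

# For lookups, any prefix can be mapped to the same namespace name;
# only the URL string has to match, not the prefix used in the document.
ns = {"p": "http://joe.miller.com/pin"}
name = root.find("p:name", ns).text
print(name)       # Joe
```

This also shows that the prefix itself is only a local abbreviation: what identifies the vocabulary is the namespace name.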

XML with style

XML per se doesn't say anything about display and style, however:

  • Some languages like HTML or SVG or X3D do have built-in rendering mechanisms
  • XML documents can be associated with a CSS stylesheet for rendering in a web browser. However, using CSS only makes sense when the XML is text-centric and contents are embedded within tags (as opposed to attributes). Read the CSS for XML tutorial if you want to learn more.
  • XSLT makes it possible to translate an XML document into something else, e.g. you could translate your own little XML language into HTML or SVG for display.
  • Other specialized styling languages exist, like XSL-FO for producing print documents. For example, you could produce a PDF file from your XML using XSLT + XSL-FO

Remember: XML per se cannot include media (e.g. pictures), doesn't understand links, doesn't have style. XML does not exist. XML languages do....