XML principles

The educational technology and digital learning wiki
Revision as of 17:58, 15 March 2013


Learning goals

  • Understand the technical structure of an XML document
  • Understand well-formed documents
  • Understand valid documents
Prerequisites
Next steps



Structure of an XML document

An XML document usually includes:

  1. An XML declaration at the top (it looks like a processing instruction), optionally followed by processing instructions
  2. Declarations, in particular a reference to a document schema like a DTD
  3. Element markup: content delimited by tags like <my_tag>contents</my_tag> or tags without contents like <self_closing_tag/>
  4. Attribute markup like <my_tag style="green">....</my_tag>
  5. Entities (i.e. symbols that the parser replaces with other content)
  6. Comments: <!-- .... -->

XML documents are trees

For a computer person, an XML document is a tree (“boxes within boxes”). Inside a browser or most other clients, the document is represented as a tree-based data structure, the so-called Document Object Model (DOM).

Below is a simple table example (HTML-style markup), shown both as XML markup and as a graphic of its tree structure.

 <TABLE>  
   <TBODY>
     <TR> 
       <TD>Pierre Muller</TD> 
       <TD>http://pm.com/</TD> 
     </TR>
     <TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR> 
   </TBODY> 
 </TABLE>
Figure: Tree representation of a table display structure
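
The tree view can also be explored with a few lines of code. The following is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree module; the helper name outline is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """
<TABLE>
  <TBODY>
    <TR>
      <TD>Pierre Muller</TD>
      <TD>http://pm.com/</TD>
    </TR>
    <TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR>
  </TBODY>
</TABLE>
"""

def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (f": {text}" if text else "")]
    for child in element:  # iterating over an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(TABLE_XML)
print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one "box within a box" of the document.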

All XML documents must be well-formed. XML documents can be valid with respect to a grammar (also called schema, document type, language, etc.). See below for details.

Well-formed XML documents

Any XML document must be at least well-formed. Well-formed XML documents obey the following rules:

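(1) A document should start with an XML declaration that includes the XML version number. (XML 1.0 parsers accept documents without a declaration, but always including one is good practice.)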

<?xml version="1.0"?>

You may specify an encoding scheme (the default is UTF-8). Of course, this means that the file must really be saved in this encoding! Make sure to check your editor's settings.

<?xml version="1.0" encoding="ISO-8859-1"?>

We suggest not using any encoding other than UTF-8. However, you may have to deal with legacy XML documents that use a restricted encoding scheme like ISO-8859-1.
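
Why the declared encoding must match the saved bytes can be seen with a small experiment. This is only a sketch using Python's standard-library parser; the <name> element and its content are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Bytes as an editor set to ISO-8859-1 would save them (u-umlaut becomes the single byte 0xFC).
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>M\u00fcller</name>'
).encode("iso-8859-1")

# The declaration matches the bytes, so the parser decodes the text correctly.
declared_ok = ET.fromstring(latin1_bytes).text

# Same bytes, but the declaration now wrongly claims UTF-8.
mismatched = latin1_bytes.replace(b"ISO-8859-1", b"UTF-8")
try:
    ET.fromstring(mismatched)
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True  # 0xFC is not a valid UTF-8 sequence here

print(declared_ok, mismatch_detected)
```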

(2) The XML structure must be hierarchical

  • start-tags and end-tags must match
  • no cross-overs like in the following bad example
  <i>...<b>...</i> .... </b>
  • pay attention to case sensitivity, e.g. "LI" is not "li"
  • empty elements must be closed too: <br></br> is legal but is usually written in the self-closing form <br/>; a lonely <br> is illegal
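
A quick way to test these rules is to hand small fragments to a parser and see whether it complains. The sketch below assumes Python's standard-library xml.etree.ElementTree; the function name is_well_formed is our own choice.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<p><i>...<b>...</i> .... </b></p>",  # crossing tags
    "<ul><LI>item</li></ul>",             # case mismatch between start-tag and end-tag
    "<p>line<br></p>",                    # lonely <br>
    "<p>line<br/></p>",                   # self-closing form
    "<p>line<br></br></p>",               # explicit start-tag and end-tag
]
for sample in samples:
    print(is_well_formed(sample), sample)
```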

(3) Attributes must have values and values must be quoted:

e.g. <a href="http://scholar.google.com"> or <person status="employed">
e.g. <input type="radio" checked="checked"> (the HTML shorthand <input type="radio" checked> is not allowed in XML)

(4) A single root element per document

The root element opens and closes the content: every other element is nested inside it
The root element must not appear in the content of any other element

(5) Special characters: <, &, >, " and '. Inside text and attribute values, use the five predefined entities:

 &lt; &amp; &gt; &quot; &apos;

instead of

 <, &, >, ", '

This principle also applies to URLs that appear inside an XML document!

bad: http://truc.unige.ch/programme?bla&machin
good: http://truc.unige.ch/programme?bla&amp;machin
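
When XML is generated by a program, the escaping is best left to a library. A minimal sketch, assuming Python's standard-library xml.sax.saxutils helpers; the <link> element is invented for the illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

url = "http://truc.unige.ch/programme?bla&machin"

# escape() replaces &, < and > so that the text is safe as element content.
escaped = escape(url)
print(escaped)

# quoteattr() escapes the value and also wraps it in quotes for use as an attribute.
link_xml = "<link href=" + quoteattr(url) + "/>"
print(link_xml)

# A parser turns the entity back into the original character.
parsed = ET.fromstring(link_xml).get("href")
print(parsed == url)
```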

Example of a minimal well-formed XML document:

 <?xml version="1.0" ?>
 <page updated="jan 2007">
  <title>Hello friend</title>
  <content> Here is some content :) </content> 
  <comment> Written by DKS/Tecfa </comment>
 </page>

This example:

  • has an XML declaration on top
  • has a root element (i.e. page)
  • elements are nested and tags are closed
  • the updated attribute has a quoted value

XML names and CDATA Sections

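Names used for elements should start with a letter (an underscore is also allowed) and only use letters, digits, the underscore, the hyphen and the period. Spaces and other punctuation marks are not allowed, and the colon is reserved for namespace prefixes.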

Good: <driversLicenceNo> <drivers_licence_no>
Bad: <driver’s_licence_number> <driver’s_licence_#> <drivers licence number>
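
The naming rule can be approximated with a regular expression. The pattern below is a deliberately simplified, ASCII-only sketch in Python (real XML also accepts many non-ASCII letters, and a name may start with an underscore); the function name looks_like_valid_name is hypothetical.

```python
import re

# Simplified rule: first a letter or underscore, then letters, digits, "_", "." or "-".
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def looks_like_valid_name(name):
    """Check a candidate element name against the simplified pattern."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["driversLicenceNo", "drivers_licence_no",
                  "driver’s_licence_#", "drivers licence number", "2nd_driver"]:
    print(looks_like_valid_name(candidate), candidate)
```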

When you want to display data that includes "XMLish" things like the < sign that should not be interpreted, you can use so-called CDATA sections:

 <example>
  <![CDATA[
   (x < y) is a math expression
  ]]>
 </example>
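
A CDATA section is opened with <![CDATA[ and closed with ]]>. For the application reading the document, it yields the same text as the escaped form, as this small Python sketch (standard library only) suggests:

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<example><![CDATA[(x < y) is a math expression]]></example>"
)
with_entity = ET.fromstring(
    "<example>(x &lt; y) is a math expression</example>"
)

# Both spellings produce the same character data once parsed.
print(with_cdata.text)
print(with_cdata.text == with_entity.text)
```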

Valid XML documents

A valid document must

  1. be “well-formed” (see above)
  2. conform to a grammar (also called "schema"), i.e. only use tags defined by the grammar and respect the nesting, ordering and other constraints defined by that grammar.

Kinds of XML grammars:

  • DTDs are part of the XML standard
  • XML Schema (XSD) is a more recent W3C standard, used to express stronger constraints
  • Relax NG (RNG, RNC) is an OASIS standard (made by well-known XML experts who don’t like XML Schema ...). It has functionality comparable to XML Schema.
  • Schematron is a complementary standard used to define additional constraints that can't be expressed with either XML Schema or Relax NG
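
To make the difference between "well-formed" and "valid" concrete, here is a toy check in Python. It is not a DTD, XML Schema or Relax NG validator: the dictionary TOY_GRAMMAR and the function violations are invented for this sketch and only test which child elements are allowed, not their order or number.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: for each element name, the child elements it may contain.
TOY_GRAMMAR = {
    "page": {"title", "content", "comment"},
    "title": set(),
    "content": set(),
    "comment": set(),
}

def violations(xml_text, grammar=TOY_GRAMMAR, root_name="page"):
    """Return a list of grammar violations (empty if the toy grammar is respected)."""
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    problems = []
    if root.tag != root_name:
        problems.append(f"root element must be <{root_name}>, found <{root.tag}>")
    for parent in root.iter():
        allowed = grammar.get(parent.tag)
        if allowed is None:
            problems.append(f"<{parent.tag}> is not defined by the grammar")
            continue
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return problems

print(violations("<page><title>Hello friend</title></page>"))
print(violations("<page><author>DKS</author></page>"))
```

The second document is well-formed, yet it breaks the toy grammar because it uses an element the grammar does not define.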

Namespaces

It is possible to use several vocabularies within a well-formed document. If a markup language is formally defined as a compound of several languages, such documents can also be validated.

E.g. there is a so-called profile for XHTML + SVG + MathML

Now, imagine that you could just mix tags from different languages together. The client application could not know which tags belong to which XML language. There could also be naming conflicts (e.g. "title" does not mean the same thing in XHTML and SVG). To address these problems, namespaces were invented: element and attribute names can be prefixed with a label that represents a namespace.

Declaring namespaces for additional vocabularies

The "xmlns:name_space" attribute introduces a new vocabulary. It states that all elements or attributes prefixed by "name_space:" belong to that vocabulary.

Syntax:

xmlns:name_space="URL_name_of_name_space"

SVG within XHTML example

 <html xmlns:svg="http://www.w3.org/2000/svg">
    <svg:rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
xmlns:svg = "..." means that svg: prefixed elements are part of SVG
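
How a parser resolves such a prefix can be sketched in Python with the standard-library xml.etree.ElementTree module. In this illustration the elided fragment above is closed with "/>" and "</html>" so that it parses; that completion is our assumption, not part of the original example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

doc = ET.fromstring(
    '<html xmlns:svg="http://www.w3.org/2000/svg">'
    '<svg:rect x="50" y="50" rx="5" ry="5" width="200" height="100"/>'
    '</html>'
)

rect = doc[0]
# ElementTree replaces the prefix with the namespace name in curly braces.
print(rect.tag)

# Searching works with any prefix-to-namespace mapping passed by the caller.
found = doc.find("svg:rect", {"svg": SVG_NS})
print(found is rect)
```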

Xlink example:

XLink is a language for defining links (at the time of writing, this only works with Firefox-based browsers)

 <RECIT xmlns:xlink="http://www.w3.org/1999/xlink">
  <INFOS>
   <Date>30 octobre 2003 - </Date>
   <Auteur>DKS - </Auteur>
   <A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
      xlink:type="simple">CSS Validator</A>
  </INFOS>
 </RECIT>

Namespace declaration for the main vocabulary

The main vocabulary can be introduced by an attribute like:

 xmlns="URL_name_of_name_space"

Some specifications (e.g. SVG or XHTML) require a namespace declaration in any case (even if you do not use any other vocabulary)!

SVG namespace example

 <svg xmlns="http://www.w3.org/2000/svg">
    <rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
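
A default namespace and a prefixed one lead to the same element names once the document is parsed. The following Python sketch (standard library only) compares the two spellings; the shortened <rect> elements and the prefix "s" are chosen freely for the illustration.

```python
import xml.etree.ElementTree as ET

default_form = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect width="200" height="100"/></svg>'
)
prefixed_form = ET.fromstring(
    '<s:svg xmlns:s="http://www.w3.org/2000/svg">'
    '<s:rect width="200" height="100"/></s:svg>'
)

# Both roots carry the same namespace-qualified name.
print(default_form.tag)
print(default_form[0].tag == prefixed_form[0].tag)

# Attributes without a prefix do not inherit the default namespace.
print(sorted(default_form[0].attrib))
```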

What are namespace URLs?

URLs that define namespaces are just names; there doesn’t need to be a real link. E.g. for your own purposes you could very well make up something like:

 <account xmlns:pin = "http://joe.miller.com/pin">
   <pin:name>Joe</pin:name>
 </account>

... and the URL http://joe.miller.com/pin doesn’t need to exist for real.
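
That the namespace URL is only compared as a string, and never fetched, can be checked with a short Python sketch (standard library only). The second lookup uses a deliberately different spelling of the made-up URL.

```python
import xml.etree.ElementTree as ET

account = ET.fromstring(
    '<account xmlns:pin="http://joe.miller.com/pin">'
    '<pin:name>Joe</pin:name>'
    '</account>'
)

# Parsing needs no network access: the URL is just an identifier.
name = account.find("{http://joe.miller.com/pin}name")
print(name.text)

# A different string is a different namespace, even if only the case changes.
other = account.find("{http://joe.miller.com/PIN}name")
print(other)
```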

XML with style

XML per se doesn't say anything about display and style, however:

  • Some languages like HTML or SVG or X3D do have built-in rendering mechanisms
  • XML documents can be associated with a CSS stylesheet for rendering in a web browser. However, using CSS only makes sense when the XML is text-centric and contents are embedded within tags (as opposed to attributes). Read the CSS for XML tutorial if you want to learn more.
  • XSLT allows you to translate an XML document into something else, e.g. you could translate your own little XML language into HTML or SVG for display.
  • Other specialized styling languages exist, like XSL-FO for producing print documents.
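
A stylesheet is usually associated with an XML document through an xml-stylesheet processing instruction placed before the root element. As a rough sketch, this is how such a line could be added with Python's standard-library xml.dom.minidom; the file name page.css is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString("<page><title>Hello friend</title></page>")

# Create the processing instruction and put it in front of the root element.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'type="text/css" href="page.css"'
)
doc.insertBefore(pi, doc.documentElement)

styled = doc.toxml()
print(styled)
```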