Editing XML tutorial
This article was created by importing slides; some errors remain to be fixed.
Introduction
This is a beginner's tutorial for XML editing, made from slides.
- Objectives
- Be able to read schemas and find other documentation
- Understand the necessity of using an XML editor
- Be able to edit XML without hand-editing tags, profit from DTD and Schema awareness
- Be able to check well-formedness and validate
- Be able to fix errors
- Prerequisites
- Some idea what XML is about
- XML namespaces (some; for more information, have a look at the XML namespace article)
- HTML and CSS (some)
- Next steps
- DTD tutorial
- XSLT Tutorial - Basics
- XPath tutorial - basics
- XQuery tutorial - basics (if you have interest in XML databases)
- PHP - MySQL - XML tutorial - basics (shows how to display an XML result-set retrieved from MySQL with XSLT)
XML Principles
Let's recall some principles that you also may have read in the XML article.
Structure of an XML document
An XML document usually includes:
- Processing instructions (at least an XML declaration on top !)
- Declarations (in particular a DTD)
- Element markup, either with content delimited by tags like <tag>contents</tag> or without content like <self_closing_tag/>
- Attribute markup (optional), i.e. attributes
- Entities (i.e. symbols that are substituted by other content when the document is processed)
- Comments: <!-- .... -->
XML documents are trees
For a computer person, an XML document is a tree (“boxes within boxes”). Inside a browser or most other clients, the document is represented as a tree-based data structure, the so-called Document Object Model (DOM).
Below is the XML markup of a CALS (DocBook) table example; the nesting of its elements shows the tree structure.
<TABLE>
<TBODY>
<TR> <TD>Pierre Muller</TD> <TD>http://pm.com/</TD> </TR>
<TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR>
</TBODY>
</TABLE>
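The tree structure can also be made visible with a few lines of code. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither appears in the original slides); the helper name outline is made up for this illustration. It produces one indented line per element of the table above.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """<TABLE>
<TBODY>
<TR> <TD>Pierre Muller</TD> <TD>http://pm.com/</TD> </TR>
<TR> <TD>Elisabeth Dupont</TD> <TD></TD> </TR>
</TBODY>
</TABLE>"""


def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(TABLE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one "box" inside another: TD inside TR, TR inside TBODY, TBODY inside TABLE.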
All XML documents must be well-formed. XML documents can be valid with respect to a grammar (also called schema, document type, language, etc.). See below for details.
Well-formed and valid XML documents
Well-formed XML documents must obey the following rules:
(1) A document should start with an XML declaration (including the version number). Strictly speaking, XML 1.0 makes this declaration optional, but you should always include it.
<?xml version="1.0"?>
You may specify an encoding (default is utf-8). Of course this means that you'll have to stick to that encoding! Make sure to check your editor's settings.
<?xml version="1.0" encoding="ISO-8859-1"?>
(2) XML structure must be hierarchical
- start-tags and end-tags must match
- no cross-overs as in
<i>...<b>...</i> .... </b>
- pay attention to case sensitivity, e.g. "LI" is not "li"
- "EMPTY" elements should use the self-closing form, e.g. <br></br> should be written as <br/>; a lonely <br> would be illegal
(3) Attributes must have values and values are quoted:
- e.g. <a href="http://scholar.google.com"> or <person status="employed">
- e.g. <input type="radio" checked="checked">
(4) A single root element per document
- Root element opens and closes content
- The root element should not appear in any other element
(5) Special characters: <, &, >, " and ' have a special meaning in XML. Use the five predefined entities:
&lt; &amp; &gt; &quot; &apos;
instead of
<, &, >, ", '
This principle also applies to URLs: an & inside a URL must be written as &amp;.
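Replacing special characters by hand is error-prone, so it is usually left to a tool. The following is a minimal sketch, assuming Python's standard xml.sax.saxutils and xml.etree.ElementTree modules; the sample text and the example.org URL are invented for illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Hypothetical sample data containing special characters
raw_text = 'if (x < y && name == "Joe") ...'
raw_url = "http://example.org/search?q=xml&lang=en"

# escape() replaces &, < and > by &amp;, &lt; and &gt;
escaped_text = escape(raw_text)

# quoteattr() escapes the value and wraps it in quotes for use as an attribute
quoted_url = quoteattr(raw_url)

document = "<note><code>" + escaped_text + "</code><a href=" + quoted_url + "/></note>"
root = ET.fromstring(document)

print(escaped_text)
print(quoted_url)
# The parser turns the entities back into the original characters
print(root.find("code").text == raw_text)
print(root.find("a").get("href") == raw_url)
```

Note that the & inside the URL has to be escaped just like an & in ordinary text.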
Example of a minimal well-formed XML document:
<?xml version="1.0" ?>
<page updated="jan 2007">
<title>Hello friend</title>
<content> Here is some content :) </content>
<comment> Written by DKS/Tecfa </comment>
</page>
This example:
- has an XML declaration on top
- has a root element (i.e. page)
- Elements are nested and tags are closed
- Attribute has quoted value
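Any XML parser can be used as a well-formedness checker, since it must refuse documents that break the rules above. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name check_well_formed and the sample documents are made up for this illustration. Note that this parser does not insist on an XML declaration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_data):
    """Return (True, None) if xml_data parses, else (False, error message)."""
    try:
        ET.fromstring(xml_data)
        return True, None
    except ET.ParseError as error:
        return False, str(error)


samples = {
    "minimal page": '<?xml version="1.0"?><page updated="jan 2007">'
                    "<title>Hello friend</title></page>",
    "cross-over": "<p><i>...<b>...</i>...</b></p>",
    "unquoted attribute": "<person status=employed/>",
    "two root elements": "<title>one</title><title>two</title>",
    "case mismatch": "<LI>item</li>",
    # Bytes saved as ISO-8859-1 without declaring it: the parser assumes UTF-8
    "undeclared encoding": "<hello>chère lectrice</hello>".encode("iso-8859-1"),
    # Same bytes, but this time the declaration names the encoding
    "declared encoding": ('<?xml version="1.0" encoding="ISO-8859-1"?>'
                          "<hello>chère lectrice</hello>").encode("iso-8859-1"),
}

results = {name: check_well_formed(data) for name, data in samples.items()}
for name, (ok, message) in results.items():
    print(name, "->", "well-formed" if ok else "error: " + message)
```

Only the first and the last sample are accepted. The two encoding samples show why the declared encoding and the encoding actually used by your editor have to agree.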
XML names and CDATA Sections
Names used for elements should start with a letter (or an underscore) and only use letters, digits, the underscore, the hyphen and the period (no other punctuation marks)!
- Good: <driversLicenceNo> <drivers_licence_no>
- Bad: <driver’s_licence_number> <driver’s_licence_#> <drivers licence number>
When you want to display data that includes "XMLish" things like the < sign that should not be interpreted, then you can use so-called CDATA sections:
<example>
<![CDATA[
(x < y) is a math expression
]]>
</example>
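A CDATA section (opened with <![CDATA[ and closed with ]]>) and the predefined entities are two ways of writing the same content. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, that parses both spellings and compares the resulting text.

```python
import xml.etree.ElementTree as ET

with_cdata = "<example><![CDATA[(x < y) is a math expression]]></example>"
with_entity = "<example>(x &lt; y) is a math expression</example>"

# After parsing, the application only sees the character data
text_from_cdata = ET.fromstring(with_cdata).text
text_from_entity = ET.fromstring(with_entity).text

print(text_from_cdata)
print(text_from_cdata == text_from_entity)
```

For the application that reads the document, there is no difference between the two forms; CDATA sections are simply more comfortable when a text contains many special characters.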
Valid XML documents
A valid document must
- be “well-formed” (see above)
- conform to a grammar (also called "schema"), e.g. only use tags defined by the grammar and respect the nesting, ordering and other constraints defined by that grammar.
Kinds of XML grammars:
- DTDs are part of the XML standard
- XML Schema (XSD) is a more recent W3C standard, used to express stronger constraints
- RELAX NG (RNG, RNC) is an OASIS standard (made by well-known XML experts who don’t like XML Schema ...)
- Schematron is a complementary standard that is used to define additional constraints that can't be expressed with either XML Schema or RELAX NG
Namespaces
It is possible to use several vocabularies within a well-formed document. If the markup language says so, such documents can also be validated.
- E.g. XHTML + SVG + MathML
However, the problem is then that the client application would not know which tags belong to which XML language. Also, there could be so-called naming conflicts (e.g. "title" does not mean the same thing in XHTML and SVG).
To address these problems, one can prefix element and attribute names with a so-called namespace prefix.
Declaring additional vocabularies
The "xmlns:name_space" attribute introduces a new vocabulary. It declares that all elements or attributes prefixed by "name_space" belong to a different vocabulary.
- xmlns:name_space="URL_name_of_name_space"
SVG within XHTML example
<html xmlns:svg="http://www.w3.org/2000/svg">
<svg:rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
- xmlns:svg = "..." means that svg: prefixed elements are part of SVG
Xlink example:
XLink is a language to define links (only works with Firefox-based browsers)
<RECIT xmlns:xlink="http://www.w3.org/1999/xlink">
<INFOS>
<Date>30 octobre 2003 - </Date>
<Auteur>DKS - </Auteur>
<A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
xlink:type="simple">CSS Validator</A>
</INFOS>
Namespace declaration for the main vocabulary
The main vocabulary can be introduced by an attribute like:
xmlns="URL_name_of_name_space"
Some specifications (e.g. SVG) require a namespace declaration in any case (even if you do not use any other vocabulary)!
SVG namespace example
<svg xmlns="http://www.w3.org/2000/svg">
<rect x="50" y="50" rx="5" ry="5" width="200" height="100" ....
What are Namespace URLs ?
URLs that define namespaces are just names; there doesn’t need to be a real link. E.g. for your own purposes you can very well make up something like:
<account xmlns:pein = "http://joe.miller.com/pein">
<pein:name>Joe</pein:name>
</account>
... and the URL http://joe.miller.com/pein doesn’t need to exist.
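That the namespace URL is only a name, and that the prefix is only a local abbreviation for it, can be seen by looking at what a parser reports. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which shows element names in the form {namespace}localname; the second document with the prefix x is made up for comparison.

```python
import xml.etree.ElementTree as ET

NS = "http://joe.miller.com/pein"  # just a name, nothing is downloaded

# Two documents that use different prefixes for the same namespace
doc_a = '<account xmlns:pein="' + NS + '"><pein:name>Joe</pein:name></account>'
doc_b = '<account xmlns:x="' + NS + '"><x:name>Joe</x:name></account>'

name_a = ET.fromstring(doc_a)[0]
name_b = ET.fromstring(doc_b)[0]

print(name_a.tag)
print(name_a.tag == name_b.tag)

# When searching, any prefix can be mapped to the namespace name
found = ET.fromstring(doc_a).find("p:name", {"p": NS})
print(found.text)
```

Both documents contain the same element as far as the parser is concerned, because only the namespace name and the local name count.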
XML with style
XML per se doesn't say anything about display and style, however:
- Some languages like HTML or SVG or X3D do have built-in rendering mechanisms
- XML documents can be associated with a CSS stylesheet for rendering in a web browser. However, using CSS only makes sense when the XML is text-centric and contents are embedded within tags (as opposed to attributes)
- XSLT allows you to translate an XML document into something else, e.g. your own little language into HTML or SVG for display.
- Other specialized styling languages exist, like XSL-FO for producing print documents.
Using DTDs (Document Type Definitions)
DTD grammars are just a set of rules that define:
- a set of elements (tags) and their attributes that can be used;
- how elements can be embedded;
- different sorts of entities (reusable fragments, special characters).
DTDs can’t define what the character data (element contents) and most attribute values look like.
Specification of a markup language. Is a DTD enough ?
The most important part in a specification (e.g. for XHTML) is usually the DTD, but in addition other constraints can be added ! In particular:
- The DTD does not identify the root element ! You have to tell the users what elements can be root elements
- Since DTDs can’t express data constraints, you may write out additional ones in a specification document
- e.g. "the value of the length attribute is a string composed of a number followed by a unit such as "inch" or "em""
<size length="10cm">
A simple DTD example
<!ELEMENT page (title, content, comment?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
A DTD document contains just rule definitions, nothing else (see later for explanations).
Using a DTD with an XML document
A valid XML document may include a declaration that identifies a DTD to be used. Therefore, the <!DOCTYPE...> declaration is part of the XML file, not of the DTD.
Example of an XML file with a DTD declaration
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE hello SYSTEM "hello.dtd">
There are four ways of using a DTD
(1) No DTD (XML document will just be well-formed)
(2) DTD rules are defined inside the XML document
In that case, we get a "standalone" document (the XML document is self-sufficient)
(3) "Private/System" DTDs: the DTD is located on the system (your own computer or the Internet)
That’s what you are going to use when you write your own DTDs
(4) Public DTDs, i.e. we use an official name for the DTD.
This implies that both your XML editor and the user software know the DTD. It's a strategy used for common Web technology DTDs like XHTML, SVG, MathML, etc.
Where to insert the DTD?
A DTD is always declared on top of the file after the XML declaration.
All XML declarations, DTD declaration etc. are part of the so-called prologue.
Syntax of the DTD declaration in the XML document
Every DTD declaration must start with
<!DOCTYPE .... >
Then, the root element must be specified next. Remember that DTDs don’t know their root element; the root is defined in the XML document! DTDs must define this root element just like any other element. In some cases, DTDs are meant to be used in different ways, i.e. several elements could be used as root elements.
<!DOCTYPE hello .... >
The next elements of the DTD declaration are different according to the DTD type (public or private)
(1) Syntax for internal DTDs (only !). DTD rules are inserted between brackets [ ... ]
<!DOCTYPE hello [
<!ELEMENT hello (#PCDATA)>
]>
(2) Syntax to define "private" external DTDs: The DTD is identified by the URL after the "SYSTEM" keyword
<!DOCTYPE hello SYSTEM "hello.dtd">
(3) Syntax for public DTDs: After the "PUBLIC" keyword you have to specify an official name and a backup URL that a validator could use. For example:
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
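The three forms of the DTD declaration can be told apart by a parser. The following is a minimal sketch, assuming Python's standard xml.parsers.expat module and its StartDoctypeDeclHandler callback; the function name doctype_info and the dictionary keys are made up for this illustration. The parser only reports what the declaration says, it does not load or apply the DTD.

```python
import xml.parsers.expat


def doctype_info(xml_text):
    """Return what the DOCTYPE declaration of xml_text says, or {} if there is none."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name                  # root element named after <!DOCTYPE
        info["system"] = system_id           # URL after SYSTEM (or backup URL after PUBLIC)
        info["public"] = public_id           # official name after PUBLIC
        info["internal_subset"] = bool(has_internal_subset)  # rules between [ ... ]

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return info


internal_doc = "<!DOCTYPE hello [ <!ELEMENT hello (#PCDATA)> ]><hello>Hi</hello>"
system_doc = '<!DOCTYPE hello SYSTEM "hello.dtd"><hello>Hi</hello>'
public_doc = ('<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" '
              '"http://my.netscape.com/publish/formats/rss-0.91.dtd">'
              '<rss version="0.91"><channel/></rss>')

internal_info = doctype_info(internal_doc)
system_info = doctype_info(system_doc)
public_info = doctype_info(public_doc)
for info in (internal_info, system_info, public_info):
    print(info)
```

In all three cases the first item reported is the root element, which is named in the XML document and not in the DTD.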
Some examples
The DTD file itself does not contain any DTD declaration, just rules. Below are some examples of XML documents with DTD declarations:
Hello XML without DTD
<?xml version="1.0" standalone="yes"?>
<hello> Hello XML et hello cher lecteur ! </hello>
Hello XML with an internal DTD
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE hello [
<!ELEMENT hello (#PCDATA)>
]>
<hello> Hello XML et hello chère lectrice ! </hello>
Hello XML with an external DTD
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE hello SYSTEM "hello.dtd">
<hello> Hello XèMèLè et hello cher lectrice ! </hello>
That’s what you should do with your own home-made DTDs
XML with a public external DTD (RSS 0.91)
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel> ...... </channel>
</rss>
Understanding DTDs by example
Below we will present a few DTDs in increasing complexity.
Hello text with XML
Below is a simple XML document of type <page>:
<page>
<title>Hello friend</title>
<content>
Here is some content :)
</content>
<comment>
Written by DKS/Tecfa, adapted from S.M./the Cocoon samples
</comment>
</page>
The following DTD could validate the document:
<!ELEMENT page (title, content, comment?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
Schemas for recipes
Recipes are very popular in XML education. Let's first look at a quite simple example, originally published by Jay Greenspan:
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "simple_recipe.dtd">
<list>
<recipe>
<author>Carol Schmidt</author>
<recipe_name>Chocolate Chip Bars</recipe_name>
<meal>Dinner
<course>Dessert</course>
</meal>
<ingredients>
<item>2/3 C butter</item> <item>2 C brown sugar</item>
<item>1 tsp vanilla</item> <item>1 3/4 C unsifted all-purpose flour</item>
<item>1 1/2 tsp baking powder</item>
<item>1/2 tsp salt</item> <item>3 eggs</item>
<item>1/2 C chopped nuts</item>
<item>2 cups (12-oz pkg.) semi-sweet choc. chips</item>
</ingredients>
<directions>
Preheat oven to 350 degrees. Melt butter; combine with brown sugar and vanilla in large
mixing bowl. Set aside to cool. Combine flour, baking powder, and salt; set aside. Add
eggs to cooled sugar mixture; beat well. Stir in reserved dry ingredients, nuts, and
chips.
Spread in greased 13-by-9-inch pan. Bake for 25 to 30 minutes until golden brown; cool.
Cut into squares.
</directions>
</recipe>
</list>
The DTD would look like this
Below is a half-filled-in example of a slightly more complex recipe list in XML.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE list SYSTEM "recipe-2.dtd">
<?xml-stylesheet href="recipe-2.css" type="text/css"?>
<list>
<recipe>
<meta>
<author>Joe</author>
<date></date>
<version></version>
</meta>
<recipe_name>Vegetable soup</recipe_name>
<meal>dinner</meal>
<ingredients>
<item>4 Carrots</item>
<item>2 Onions</item>
<item>Garlic</item>
<item>1/2 Cabbage</item>
<item>Salt</item>
<item>Pepper</item>
</ingredients>
<directions>
<para>Cut the vegies into little pieces. Then boil with water. Add some salt and pepper</para>
</directions>
</recipe>
</list>
Contents of the DTD (simple_recipe.dtd)
<!-- Simple recipe DTD -->
<!-- This DTD allows you to write simple recipes
list = a list of recipes
recipe = container for a recipe
meta = Metainformation: must include author of this file, date, version in this order
recipe_author = optional name of recipe author
meal = title of meal
ingredients = list of items you need
directions = How to cook, may include either paras or bullets.
-->
<!ELEMENT list (recipe+)>
<!ELEMENT recipe (meta, recipe_author?, recipe_name, meal, ingredients, directions)>
<!ELEMENT meta (author, date, version)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT recipe_author (#PCDATA)>
<!ELEMENT recipe_name (#PCDATA)>
<!ELEMENT meal (#PCDATA)>
<!ELEMENT ingredients (item+)>
<!ELEMENT item (#PCDATA)>
<!ELEMENT directions (para | bullet)* >
<!ELEMENT bullet (#PCDATA|strong)*>
<!ELEMENT para (#PCDATA|strong)*>
<!ELEMENT strong (#PCDATA)>
A simple story grammar
Let's present the grammar first
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- DTD to write simple stories
Made by Daniel K. Schneider / TECFA / University of Geneva
VERSION 1.0
30/10/2003
-->
<!ELEMENT STORY (title, context, problem, goal, THREADS, moral, INFOS)>
<!ATTLIST STORY xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
<!ELEMENT THREADS (EPISODE+)>
<!ELEMENT EPISODE (subgoal, ATTEMPT+, result) >
<!ELEMENT ATTEMPT (action | EPISODE) >
<!ELEMENT INFOS ( ( date | author | a )* ) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT context (#PCDATA) >
<!ELEMENT problem (#PCDATA) >
<!ELEMENT goal (#PCDATA) >
<!ELEMENT subgoal (#PCDATA) >
<!ELEMENT result (#PCDATA) >
<!ELEMENT moral (#PCDATA) >
<!ELEMENT action (#PCDATA) >
<!ELEMENT date (#PCDATA) >
<!ELEMENT author (#PCDATA) >
<!ELEMENT a (#PCDATA)>
<!ATTLIST a
xlink:href CDATA #REQUIRED
xlink:type CDATA #FIXED "simple"
>
Below is a short story
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE STORY SYSTEM "story-grammar.dtd">
<?xml-stylesheet href="story-grammar.css" type="text/css"?>
<STORY xmlns:xlink="http://www.w3.org/1999/xlink">
<title>The little Flexer</title>
<context>Once upon a time, in a dark small office.</context>
<problem>Kaspar was trying to learn Flex but didn't have a real project. He then decided that it would be a good idea to look at Data-Driven Controls. These are most useful in combination with an external data source in XML format.</problem>
<goal>So he decided to learn how to write a mx:Tree application that imports XML data.</goal>
<THREADS>
<EPISODE>
<subgoal>He decided to play with a little example.</subgoal>
<ATTEMPT>
<action>So he went to see the LiveDocs and copied an example.</action>
</ATTEMPT>
<result>The example worked but he didn't understand why since he didn't know about E4X.</result>
</EPISODE>
<EPISODE>
<subgoal>He then decided to learn e4X first.
</subgoal>
<ATTEMPT>
<action>
Reading 2-3 tutorials and creating a simple example only took 2-3 hours.
</action>
</ATTEMPT>
<result>
He now understood how to write e4X code in Flex.
</result>
</EPISODE>
</THREADS>
<moral>Divide a problem into subproblems and you will get there ...</moral>
<INFOS>
<a xlink:href="http://edutechwiki.unige.ch/en/ECMAscript_for_XML"
xlink:type="simple">ECMAscript for XML</a>
</INFOS>
</STORY>
The story grammar is a text-centric DTD. Therefore it can easily be styled with CSS. You can look at the file story-grammar.xml and also consult story-grammar.css.
A simple family DTD
family.dtd
A valid XML file
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE family SYSTEM "family.dtd">
<family>
<person name="Joe Miller" gender="male" type="father" id="123.456.789"/>
<person name="Josette Miller" gender="female" type="girl" id="123.456.987"/>
</family>
RSS
RSS is a news syndication format. There are several RSS variants. RSS 0.91 is Netscape’s original format, which is still being used.
<!ELEMENT rss (channel)>
<!ATTLIST rss version CDATA #REQUIRED> <!-- must be "0.91"> -->
<!ELEMENT channel (title | description | link | language | item+ | rating? | image? | textinput? | copyright? | pubDate? | lastBuildDate? | docs? | managingEditor? | webMaster? | skipHours? | skipDays?)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ELEMENT image (title | url | link | width? | height? | description?)*>
<!ELEMENT url (#PCDATA)>
<!ELEMENT item (title | link | description)*>
<!ELEMENT textinput (title | description | name | link)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT rating (#PCDATA)>
<!ELEMENT language (#PCDATA)>
<!ELEMENT width (#PCDATA)>
<!ELEMENT height (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT pubDate (#PCDATA)>
<!ELEMENT lastBuildDate (#PCDATA)>
<!ELEMENT docs (#PCDATA)>
<!ELEMENT managingEditor (#PCDATA)>
<!ELEMENT webMaster (#PCDATA)>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT skipHours (hour+)>
<!ELEMENT skipDays (day+)>
Possible XML document for RSS
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE rss SYSTEM "rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>Webster University</title>
<description>Home Page of Webster University</description>
<link>http://www.webster.edu</link>
<item>
<title>Webster Univ. Geneva</title>
<description>Home page of Webster University Geneva</description>
<link>http://www.webster.ch</link>
</item>
<item>
<title>http://www.course.com/</title>
<description>You can find Thomson text-books materials (exercise data) on this web site</description>
<link>http://www.course.com/</link>
</item>
</channel>
</rss>
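Because an RSS document is data-centric, a program can pick out its items quite easily. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened copy of the document above; this parser reads the DOCTYPE line but does not fetch the DTD or validate against it.

```python
import xml.etree.ElementTree as ET

RSS_XML = """<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>Webster University</title>
<description>Home Page of Webster University</description>
<link>http://www.webster.edu</link>
<item>
<title>Webster Univ. Geneva</title>
<description>Home page of Webster University Geneva</description>
<link>http://www.webster.ch</link>
</item>
<item>
<title>http://www.course.com/</title>
<link>http://www.course.com/</link>
</item>
</channel>
</rss>"""

rss = ET.fromstring(RSS_XML)
channel = rss.find("channel")

# Collect (title, link) pairs of all item elements inside the channel
items = [(item.findtext("title"), item.findtext("link"))
         for item in channel.findall("item")]

print(rss.get("version"), channel.findtext("title"))
for title, link in items:
    print(title, "->", link)
```

Note that channel.findtext("title") returns the title of the channel itself, since only direct children are searched; the titles of the items live one level deeper in the tree.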
Summary syntax of DTD element definitions
We will come back to this when we learn how to write our own DTDs in the DTD tutorial (don’t worry too much about unexplained details).
- order of elements: <!ELEMENT Name (First, Middle, Last)>
- optional element: MiddleName?
- at least one element: movie+
- zero or more elements: item*
- pick one (or operator): economics|law
- grouping construct: (A,B,C)
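The operators summarized above behave very much like the operators of regular expressions, applied to the sequence of child elements. The following is a rough sketch of that idea, assuming Python's standard re and xml.etree.ElementTree modules; the function names are made up, and the sketch only handles content models made of element names (no #PCDATA, EMPTY, ANY or attributes). It is not how a real validator is implemented.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")


def content_model_to_regex(model):
    """Translate a content model such as "(title, content, comment?)" into a regex."""
    # Every element name becomes a token such as "(?:title;)"
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # In a regex a sequence is plain concatenation, so commas and blanks go away;
    # ?, +, *, | and parentheses already mean the same thing in both notations
    pattern = re.sub(r"[\s,]", "", pattern)
    return re.compile(pattern)


def children_match(element, model):
    """Check the names of the direct children of element against the content model."""
    child_names = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(child_names) is not None


PAGE_MODEL = "(title, content, comment?)"
EPISODE_MODEL = "(subgoal, ATTEMPT+, result)"
INFOS_MODEL = "( ( date | author | a )* )"

good_page = ET.fromstring("<page><title>t</title><content>c</content></page>")
bad_page = ET.fromstring("<page><content>c</content><title>t</title></page>")
episode = ET.fromstring("<EPISODE><subgoal/><ATTEMPT/><ATTEMPT/><result/></EPISODE>")
no_attempt = ET.fromstring("<EPISODE><subgoal/><result/></EPISODE>")
infos = ET.fromstring("<INFOS><a/><date/><a/></INFOS>")

print(children_match(good_page, PAGE_MODEL))
print(children_match(bad_page, PAGE_MODEL))
print(children_match(episode, EPISODE_MODEL))
print(children_match(no_attempt, EPISODE_MODEL))
print(children_match(infos, INFOS_MODEL))
```

The page with content before title and the episode without any ATTEMPT are rejected, which is the kind of error message you will meet when validating your own documents.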
Entities
Most professional DTDs use entities. Entities are just symbols that stand for some content, which is substituted wherever the symbol is used.
There exist two kinds of entities: XML entities and DTD entities
DTD entities
Some more complex DTDs use the same structures all over. Instead of typing these several times, one can use an ENTITY construction like this:
<!ENTITY % Content "(Para | List | Listing)*">
Later in the DTD we then can have Element definitions like this:
<!ELEMENT Intro (Title, %Content; ) >
<!ELEMENT Goal (Title, %Content; ) >
The computer will then simply translate these into:
<!ELEMENT Intro (Title, (Para | List | Listing)*) >
<!ELEMENT Goal (Title, (Para | List | Listing)* ) >
... think of these entities as shortcuts.
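The shortcut idea can be imitated with simple text substitution. The following is a minimal sketch, assuming Python's standard re module; the function name expand_parameter_entities is made up, and the sketch only handles entities declared with a double-quoted value in the same text, in a single pass. A real DTD parser does considerably more.

```python
import re

DTD_TEXT = """<!ENTITY % Content "(Para | List | Listing)*">
<!ELEMENT Intro (Title, %Content; ) >
<!ELEMENT Goal (Title, %Content; ) >"""


def expand_parameter_entities(dtd_text):
    """Replace each %Name; reference by the value declared for Name."""
    # Collect declarations of the form <!ENTITY % Name "value">
    declared = dict(re.findall(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>', dtd_text))

    def substitute(match):
        # Unknown references are left untouched
        return declared.get(match.group(1), match.group(0))

    return re.sub(r"%([\w.-]+);", substitute, dtd_text)


expanded = expand_parameter_entities(DTD_TEXT)
print(expanded)
```

After the expansion, the two element declarations no longer contain any %Content; reference, only the content model it stood for.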
Choosing and using an XML Editor
There are lots of XML editors and there is no easy choice! Depending on your needs you may choose a different editor:
- To edit strongly structured data (i.e. data-centric XML) a sort of "tree" or "boxed" view is practical
- To edit text-centric data (e.g. an article) you want either a word-processor-like tool or a structure editor.
- Really good XML editors cost a lot ...
Here is my own little advice with respect to XML editors (also read the XML editor article)
Minimal things your XML editor should be able to do
- Check for XML well-formedness
- Check for validity against several kinds of XML grammars (DTD, Relax NG, XML Schema)
- Highlight errors (of all sorts)
- Suggest available XML tags (in a given context). Also clearly show which ones are mandatory and which ones are optional, and display them in the right order.
- Allow the user to move/split/join elements in a more or less ergonomic way (although it is admitted that these operations need some training)
- Include support for XSLT and XQuery (however, if you have installation skills you can easily compensate for a lack of support by installing a processor like Saxon)
We then suggest some additional criteria depending on the kind of XML:
For data-centric XML:
- Allow viewing and editing of XML documents in a tree view or boxed view (or both together)
- Provide a context-dependent choice of XML tags and attributes (DTD/XSD awareness)
For text-centric XML:
- Allow editing of XML documents in a structure view
- Allow editing of XML documents in a somewhat WYSIWYG view. Such a view can be based on an associated CSS (most common solution) or XSL-FO (I am dreaming here) or use some proprietary format (which is not very practical for casual users!). Also allow users to switch on/off tags or element boundary markers.
- Provide a context-dependent choice of XML tags and attributes (DTD/XSD awareness). The user should be able to right-click within the XML text and not in some distant tree representation.
- Automatically insert all mandatory sub-elements when an element is created.
- Automatically complete XML Tags when working without a DTD or other schema.
- Indent properly (and assist users to indent single lines as well as the whole document)
Suggested free editor: Exchanger XML Lite V3.3
- http://www.freexmleditor.com/
- I suggest trying this editor first. Try the other one if you are unhappy with it or if you plan to edit "data-centric" XML documents.
Hints for editing with Exchanger
To insert an element or attribute:
- In the contents window press Ctrl-T to insert an element.
- Pressing "<" in the editing window gives more options and you can do it in any place.
- To insert an attribute, position the cursor after the element name and press the space bar
- Alternatively (and better if you don't know your DTD): Select the Helper pane to the left. Then (in the editing window) click on the element tag you wish to edit or put your cursor in a location between child elements. The helper pane will then display the structure of the current parent element and list available elements on which you can click to insert.
XMLmind Standard Edition is another free editor
Hints for editing with XMLmind
- Element manipulation is through the "tree view". After selecting an element, you can insert elements either by clicking the (tiny) before/after/within buttons in the top right elements pane
- or by using shortcuts (ctrl-h = insert before, ctrl-i = insert within, ctrl-j = insert after). The same principle applies to the attributes pane.
Other Alternatives
- Firstly, any XML editor is difficult to learn (because XML editing is not so easy). So make an effort to learn the interface, e.g. read the help!
- Programmers also may consider using a programmer’s editor. However make sure that there is an XML plugin, that the editor is "DTD aware" (can show elements to insert in a given context) and that it can validate. Otherwise forget it !!
About Java
- Most XML editors are written in Java and rely on the Java Runtime Environment. Both websites of the recommended editors above give you a choice: download an editor with or without Java. If you don't have Java installed on your own PC, I suggest installing it first from http://www.java.com/ ... and then always downloading the "no java vm" versions of the editor software.
- To test if you have Java, open a command terminal and type "java -version". To open a command terminal on Windows: Start Menu -> Run and then type "cmd".