HTML and XHTML validation and repair

The educational technology and digital learning wiki
Revision as of 17:18, 6 September 2009 by Daniel K. Schneider (talk | contribs)

This article or section is currently under construction

In principle, someone is working on it and there should be a better version in the not-so-distant future.
If you want to modify this page, please discuss it with the person working on it (see the "history").

Introduction

Learning goals
  • Learn why standards are important and why web pages should comply with standards
  • Be able to validate HTML, find broken links and validate CSS
  • Be able to fix broken pages (i.e. understand error messages and use a repair tool)
Prerequisites
Moving on
Level and target population
  • Beginners
Remarks
  • This is a first version ...

The executive summary

There are several types of web page validation tools:

  • HTML/XHTML code validation
  • CSS (style sheet) validation
  • Link checking (are any links broken?)
  • Accessibility checking (can people with impairments use the web site?)

Validation tools are available in different ways:

Why should you care about valid code?

W3C's "My Web site is standard! And yours?" propaganda page delivers the main arguments (the quoted fragments below are citations from this piece, written by Karl Dubost).

  • Designing with standards will simplify Web site code maintenance because you will not have multiple versions for different browsers. Your pages will have a longer life and will not be dependent upon vaporous technologies.
  • “Technical constraints exist with any artistic medium, whether you are drawing, sculpting, or designing Web pages. Watercolors or oil paintings have their own constraints, but these techniques do not block creativity, rather they provide structure for creative expression.” Have a look at CSS Zen Garden, which shows off 210 different cool designs that all work with exactly the same XHTML page.
  • “People with disabilities represents 8% to 10% of the total population. It's easier to maintain a Web site that follows accessibility guidelines (and therefore Web standards). Your Web site traffic will increase, and a wider variety of browsers will have access to site content”. Just an example: You may not care about blind people who use speech synthesizers to listen to web contents, but you may want to use such a tool yourself in your car sometime in the near future. Also, you may have a cell phone and want to be able to look at the same web contents in a more linear way.
  • “Standards have been designed to keep in mind all potential audiences and technologies. You will not be attached to any company or proprietary technology. You can use technologies that are independent of platforms requirements.” E.g. valid HTML runs in the 2007/9 batch of new browsers like Safari and Chrome.

The same article also provides some extra advice:

  • “Unfortunately, many books do not teach good Web programming. When you are creating a Web site, you should check the correctness of your markup. If you are a Web developer, be careful using books to develop your application and read the particular specifications which you are trying to implement.”
  • “Many authoring tools do not generate valid markup. Some have syntax checkers embedded into them, others do the right thing, and many do not generate valid markup. As an intermediate solution, you have to check your Web page with an HTML validator.”
  • Many CMSs (e.g. the templates they use or their generators) produce bad code. There isn't much you can do about this, except complain to the people who produce them.

Basically, what Karl Dubost is saying is that you must care about validity and you must not trust your favorite tools or even published books. You do have to learn how to validate code with independent validation tools.

The toolbox

SGML and XML validation

Since HTML code is SGML and XHTML is XML, standard SGML and XML parsers can validate the syntax of (X)HTML pages, e.g. find misspelled and illegal tags, find misplaced tags (e.g. a "p" within a "ul"), identify missing end tags and missing quotes around attribute values, etc.

Such tools can't find mistakes that relate to informally spelled out specifications, as opposed to rules formalized in DTDs (most HTML dialects) and XML Schemas (some more recent standards).

Typically, the complex text editors that programmers use have built-in tools (often via extensions) that validate SGML and XML languages. XML editors can validate XHTML (and other XML languages like SVG), but not HTML.

IMHO, that kind of validation is good enough for most educational sites - Daniel K. Schneider 18:16, 4 September 2009 (UTC).
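As a rough illustration of what an XML parser catches, here is a minimal Python sketch. It only tests well-formedness (balanced tags, quoted attributes); it does not validate against a DTD, so it will not notice a "p" inside a "ul". The function name and the two sample pages are made up for this example, and the approach only applies to XHTML, not to HTML.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xhtml_text):
    """Return None if the text parses as XML, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# Hypothetical sample page, used only for this illustration.
GOOD_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test</title></head>
<body><p>Hello</p></body>
</html>"""

# Same page, but the "p" element is never closed.
BAD_PAGE = GOOD_PAGE.replace("<p>Hello</p>", "<p>Hello")

print(check_well_formed(GOOD_PAGE))  # no problem found, prints None
print(check_well_formed(BAD_PAGE))   # prints the position and the parser message
```

Note that pages using named entities such as &nbsp; will probably be reported as errors by this sketch, since the parser does not load the XHTML DTD that defines them.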

The tidy program

The tidy program is the best-known validation and repair program. Since tidy is an open source library, it is also embedded in many authoring environments as well as in browser extensions.

(1) The easiest way to use tidy is to install a Firefox extension (see the related Firefox (en_US) page). If this extension is installed and activated (it is by default), you will see a warning or error icon at the bottom right for each page you read. You can then look at the warnings by clicking on the icon or by selecting View->Source.

(2) Alternatively, you can download Tidy from SourceForge. The website itself does not provide binaries, but points to others who do. E.g. for Windows, as of Sept 2009, the procedure is the following:

  • Go to HTML Tidy for Windows
  • Download the EXE version. You will get a zip file called tidy.zip
  • Extract the single tidy.exe to some directory, e.g. c:\bin\
  • You may then (optionally) set the Path environment variable so that you can type the tidy command in any command window you open.

(3) There are graphical user interfaces to tidy which you may use in addition (or alternatively), e.g. GUI for Tidy by Dirk Paehl. It is distributed as a zip file. Extract all files to a single directory. It will run from any device (e.g. a memory stick) and does not interfere with the operating system, e.g. the MS Windows registry.

(4) Many open source HTML/text editors have an interface to tidy, but you may need to tell the program where to find this software. Some editors directly repair your code (see below), therefore you should make a copy before using tidy! E.g. in NoteTab Light, use the menu Tools->Tidy HTML code (or hit CTRL-F7). It will automatically fix errors and then display the errors and warnings found prior to the repair.

Fixing code with Tidy

Tidy is a fairly complex program that can not only validate various HTML dialects, but also translate from one to another and repair bad code.

Dave Raggett, the original author of Tidy, summarizes the program's capabilities: “Tidy is able to fix up a wide range of problems and to bring to your attention things that you need to work on yourself. Each item found is listed with the line number and column so that you can see where the problem lies in your markup. Tidy won't generate a cleaned up version when there are problems that it can't be sure of how to handle. These are logged as "errors" rather than "warnings".” (Clean up your Web pages with HTML TIDY, retrieved 18:16, 4 September 2009 (UTC)).

By default, tidy will try to repair broken code by guessing the HTML version. For advanced users, there are many options, which you can look up in the reference, the man page or the output of the help switch.

If you are not familiar with the documentation of command line tools, these will be difficult reading; you should therefore first read Clean up your Web pages with HTML TIDY (http://tidy.sourceforge.net/docs/Overview.html).
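For readers who prefer scripting to typing commands, the two typical uses of the command line program (report problems only, or repair a file) can be sketched in Python as follows. This is only an illustration under some assumptions: a tidy executable must be on the Path, the function names and the file name index.html are made up for the example, and the exit codes mentioned in the comments (0 for success, 1 for warnings, 2 for errors) are those described in the tidy man page. The switches used (-quiet, -errors, -modify, -asxhtml) are standard tidy options.

```python
import shutil
import subprocess


def build_tidy_command(html_path, repair=False):
    """Build the argument list for a tidy run on one file."""
    if repair:
        # -modify rewrites the file in place; -asxhtml converts it to XHTML.
        return ["tidy", "-quiet", "-modify", "-asxhtml", html_path]
    # -errors only reports errors and warnings, without printing fixed markup.
    return ["tidy", "-quiet", "-errors", html_path]


def run_tidy(html_path, repair=False):
    """Run tidy and return (exit_code, report); exit_code is None if tidy is missing."""
    if shutil.which("tidy") is None:
        return (None, "tidy was not found on the Path")
    if repair:
        # Keep a backup copy, since tidy will overwrite the original file.
        shutil.copy2(html_path, html_path + ".bak")
    result = subprocess.run(build_tidy_command(html_path, repair),
                            capture_output=True, text=True)
    # Tidy writes its diagnostics to standard error.
    # Per the man page: 0 = success, 1 = warnings, 2 = errors.
    return (result.returncode, result.stderr)


if __name__ == "__main__":
    # "index.html" is a placeholder file name for this example.
    code, report = run_tidy("index.html")
    print(code)
    print(report)
```

Calling run_tidy("index.html", repair=True) would first copy the page to index.html.bak, in line with the advice above to make a copy before letting tidy modify a file.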

The W3C series of online quality assurance tools

  • MarkUp Validator - Also known as the HTML validator, it helps check Web documents in formats like HTML and XHTML, SVG or MathML.
  • Link Checker - Checks anchors (hyperlinks) in an HTML/XHTML document. Useful to find broken links, etc.
  • CSS Validator - validates CSS stylesheets or documents using CSS stylesheets.

In addition, server administrators could install the Log Validator, i.e. a local crawling engine that will analyse the quality of a website with the help of various processing modules, e.g. the three tools above.
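To give an idea of what a link checker has to do before it can test anything, here is a small Python sketch of the first step: collecting the addresses a page points to and resolving relative ones. It is not how the W3C Link Checker is implemented; the class name, function name, sample markup and the example.org address are all made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of href and src attributes in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page address.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html_text, base_url):
    """Return all link targets found in html_text as absolute addresses."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Hypothetical page fragment and address, used only for this illustration.
PAGE = '<p><a href="intro.html">Intro</a> <img src="/img/logo.png" alt="logo"></p>'
for link in collect_links(PAGE, "http://example.org/course/index.html"):
    print(link)
```

A real checker would then request each collected address and report those that cannot be retrieved.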

Other tools

Some high-end web authoring systems may include other good tools, but we shall not discuss these here for now ...

Past and future of web standards

Standard vs. implementation

Development of web standards can be described as being fairly chaotic at times. If we interpret "standard" like many (misguided) decision makers and web developers do, "standard" becomes a synonym of "implemented" and (worse) "practice".

A very typical example was the summer 2009 discussion about IE 8, which fixed bad CSS implementation mistakes of IE 7. Some said that IE 8 wasn't compliant with the "IE 7 standard". Of course it's compliant with standards that IE 7 did render. IE 8 just renders code as intended, i.e. web developers no longer have to write deliberately wrong, IE-specific CSS ...

On a side note, Microsoft originally wanted IE 8 to behave like IE 7 by default (i.e. simulate IE 7's bugs), but then changed its mind. The release version now allows developers to insert a meta tag in their web pages that will trigger IE 7 behavior as opposed to the default IE 8 "standards mode" (IE8 Compatibility mode, Wikipedia).

<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />

A related issue concerns introduction of proprietary extensions for which both Microsoft and Netscape were famous. The list of wild extensions is fairly long:

  • Early tags like the infamous blink
  • DHTML (both Netscape and Microsoft had their own). DHTML is now standardized as a combination of CSS, DOM and JavaScript
  • New languages that pushed too hard before being standardized, e.g. Microsoft's

So please, when you talk about standards, make a distinction between standard and implementation. Never use browser-specific extensions or hacks (unless you have the intention and the money to repair your code once a new browser version or product hits the market).

On the other hand, it is perfectly rational and ethical to produce "niche contents" that are standardized, or at least use "industry-accepted" formats, and that only work with a given set of browsers, for example:

  • SVG, a standard for vector graphics (Firefox or Opera)
  • SMIL, a standard for multimedia animation and synchronisation (known as HTML+TIME in IE)
  • Flash, a proprietary vector graphics format and associated scripting language for multimedia animation and interaction, and a Rich internet application (RIA) platform (most browsers)
  • Java, a proprietary (but open source) computer language specification for RIAs (i.e. applets)
  • XHTML served as XHTML with the purpose of including other XML vocabularies (Firefox, Opera, Safari, etc. but not IE8)
  • etc...

But think before you do this! Could you provide similar functionality using formats that display in all sorts of web browsers, and at what cost?

Evolution of HTML

In the history of HTML, at least two proposed standards "did not make it":

  • HTML 3.0 (1995)
  • XHTML 2 (2009)

Also, specifications sometimes change between the draft and the final version. E.g. Microsoft implemented XSLT very early and then had to make a few modifications later, i.e. code that worked in IE 5.5 (earlier versions) was broken in later versions of IE 5.5 and in IE6. Therefore, you should be careful about adopting a standard before it is finalized and before it seems to be accepted by a larger community (in particular the browser "vendors").