HTML and XHTML validation and repair

From EduTech Wiki

1 Introduction

Learning goals
  • Learn why standards are important and why web pages should comply with standards
  • Be able to validate HTML, find broken links and validate CSS
  • Be able to fix broken pages (i.e. understand error messages and use a repair tool)
Moving on
Level and target population
  • Beginners
  • This is a first version ...

1.1 The executive summary

There are several types of web page validation tools:

  • HTML/XHTML code validation
  • CSS (style sheet) validation
  • Link checking (are any links broken?)
  • Accessibility checking (can people with disabilities use the website?)

Validation tools are available in different ways:

1.2 Why should you care about valid code?

W3C's "My Web site is standard! And yours?" propaganda page delivers the main arguments (the quoted fragments below are taken from this piece, written by Karl Dubost):

  • Designing with standards will simplify Web site code maintenance because you will not have multiple versions for different browsers. Your pages will have a longer life and will not be dependent upon vaporous technologies.
  • “Technical constraints exist with any artistic medium, whether you are drawing, sculpting, or designing Web pages. Watercolors or oil paintings have their own constraints, but these techniques do not block creativity, rather they provide structure for creative expression.” Have a look at CSS Zen Garden, which shows off 210 different designs that all work with exactly the same XHTML page.
  • “People with disabilities represents 8% to 10% of the total population. It's easier to maintain a Web site that follows accessibility guidelines (and therefore Web standards). Your Web site traffic will increase, and a wider variety of browsers will have access to site content”. Just one example: you may not care about blind people who use speech synthesizers to listen to web content, but you may want to use such a tool yourself in your car sometime in the near future. You may also have a cell phone and want to be able to look at the same web content in a more linear way.
  • “Standards have been designed to keep in mind all potential audiences and technologies. You will not be attached to any company or proprietary technology. You can use technologies that are independent of platforms requirements.” For example, valid HTML also runs in the 2007/9 bunch of new browsers like Safari and Chrome.

The same article also provides some extra advice:

  • “Unfortunately, many books do not teach good Web programming. When you are creating a Web site, you should check the correctness of your markup. If you are a Web developer, be careful using books to develop your application and read the particular specifications which you are trying to implement.”
  • “Many authoring tools do not generate valid markup. Some have syntax checkers embedded into them, others do the right thing, and many do not generate valid markup. As an intermediate solution, you have to check your Web page with an HTML validator.”
  • Many CMSs (e.g. the templates they use or their generators) produce bad code. There isn't much you can do about this, except complain to the people who produce them.

Basically, what Karl Dubost is saying is that you must care about validity and you must not trust your favorite tools or even published books. You do have to learn how to validate code with independent validating tools.

2 The toolbox

2.1 SGML and XML validation

Since HTML is defined as an SGML application and XHTML as an XML application, standard SGML and XML parsers can validate the syntax of (X)HTML pages by comparing them to a DTD. For example, a validating parser can find misspelled and illegal tags, find misplaced tags (e.g. a "p" within a "ul"), and identify missing end tags, missing quotes around attribute values, etc.

Such tools can't find mistakes that violate rules the specification only spells out informally, i.e. in prose rather than in the DTD (more recent standards use more powerful schema languages, such as XML Schema, that can express more of these rules formally).
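As a rough illustration of what an XML parser can catch in an XHTML page, here is a minimal sketch using Python's standard library. It is an assumption-laden example, not a full validator: the standard library parser only checks well-formedness (missing end tags, unquoted attributes) and does not compare the page to a DTD, so illegal or misplaced tags are not detected, and named entities such as &nbsp; may be reported as errors because the DTD is not loaded. The function name check_well_formed and the sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xhtml_text):
    """Return None if the text parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return (line, column, str(err))
    return None


# Hypothetical sample pages, reduced to a few tags.
GOOD = "<html><body><ul><li>one</li></ul></body></html>"
MISSING_END_TAG = "<html><body><p>no end tag</body></html>"
UNQUOTED_ATTRIBUTE = "<html><body><p class=intro>text</p></body></html>"

if __name__ == "__main__":
    for sample in (GOOD, MISSING_END_TAG, UNQUOTED_ATTRIBUTE):
        print(check_well_formed(sample))
```

A page that passes this check is only well-formed. Checking it against the XHTML DTD still requires a validating parser or one of the tools described below.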

Typically, the more powerful text editors that programmers use have built-in tools (often via extensions) that validate SGML and XML languages. XML editors can validate XHTML (and other XML languages like SVG), but not HTML.

IMHO, that kind of validation is good enough for most educational sites - Daniel K. Schneider 18:16, 4 September 2009 (UTC).

2.2 The tidy program

The tidy program is the best-known validation and repair program. Since tidy is an open source library, it is also embedded in many authoring environments as well as browser extensions.

(1) The easiest way to use tidy is to install a Firefox extension called related Firefox (en_US) page. If this extension is installed and activated (it is by default), you will see a warning or error icon at the bottom right for each page you read. You can then look at the warnings by clicking on the icon or by selecting View->Source. For beginners, this may be enough. Carefully look at the warnings in the Firefox source window, repair the page in your editor, save the file, then reload it in the Web browser.

(2) Next, you could download the Tidy program from SourceForge. The website itself does not provide binaries, but will point to others who do. E.g. for Windows, as of Sept 2009, the procedure is the following:

  • Go to HTML Tidy for Windows
  • Download the EXE version. You will get a *.zip file called
  • Extract the single tidy.exe to some directory, e.g. c:\bin\
  • You may then (optionally) set the PATH environment variable so that you can type the tidy command in any command window you open. If this sounds too complicated for you, move to the next point.

(3) There are graphical user interfaces to tidy which you may use in addition (or alternatively), e.g. GUI for Tidy by Dirk Paehl. It is distributed as a zip file. Extract all files to a single directory. It will run from any device (e.g. a memory stick) and does not interfere with the operating system, e.g. the MS Windows registry. As an alternative, consider the next point.

(4) Many open source HTML/text editors have an interface to tidy, but you may need to tell the program where to find this software. Some editors directly repair your code (see below), therefore you should make a copy before using tidy! E.g. in NoteTab Light, use the menu Tools->Tidy HTML code (or hit CTRL-F7). It will automatically fix errors and then display the errors and warnings it found prior to the repair.

2.3 Fixing code with Tidy

Tidy is a fairly complex program that can not only validate various HTML dialects, but also translate from one to another and repair bad code.

Dave Raggett, the original author of Tidy, summarizes the program's capabilities: “Tidy is able to fix up a wide range of problems and to bring to your attention things that you need to work on yourself. Each item found is listed with the line number and column so that you can see where the problem lies in your markup. Tidy won't generate a cleaned up version when there are problems that it can't be sure of how to handle. These are logged as "errors" rather than "warnings".” (Clean up your Web pages with HTML TIDY, retrieved 18:16, 4 September 2009 (UTC)).

By default, tidy will try to repair broken code by guessing the HTML version. For advanced users, there are many options, which you can look up in the reference, in the man page, or in the output of the help switch.

If you are not familiar with the documentation of command line tools, these will be difficult reading. In that case, you should read Clean up your Web pages with HTML TIDY first.
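To make the difference between "validate only" and "repair" more concrete, here is a minimal sketch that drives the tidy command line program from Python. It assumes that a tidy executable is on your PATH. The file name page.html and the function names build_tidy_command and run_tidy are made up for this example. It relies on three commonly documented tidy switches: -quiet (suppress nonessential output), -errors (only list errors and warnings) and -modify (rewrite the input file in place). Check the help output of your own tidy version before relying on them.

```python
import shutil
import subprocess


def build_tidy_command(path, repair=False):
    """Build a tidy command line: report only by default, repair on request."""
    command = ["tidy", "-quiet"]
    if repair:
        command.append("-modify")  # rewrites the file in place: keep a backup!
    else:
        command.append("-errors")  # only list errors and warnings
    command.append(path)
    return command


def run_tidy(path, repair=False):
    """Run tidy if it is installed; return (exit_code, messages) or None."""
    if shutil.which("tidy") is None:
        return None  # tidy is not on the PATH
    result = subprocess.run(
        build_tidy_command(path, repair), capture_output=True, text=True
    )
    # tidy writes its diagnostics to standard error
    return result.returncode, result.stderr


if __name__ == "__main__":
    print(build_tidy_command("page.html"))
    print(run_tidy("page.html"))
```

Tidy's exit code is usually 0 when the page is fine, 1 when there were warnings and 2 when there were errors, so a script can decide whether a page needs attention without reading the messages.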

2.4 The W3C series of online quality assurance tools

  • MarkUp Validator - Also known as the HTML validator, it helps check Web documents in formats like HTML and XHTML, SVG or MathML.
  • Link Checker - Checks anchors (hyperlinks) in an HTML/XHTML document. Useful to find broken links, etc.
  • CSS Validator - validates CSS stylesheets or documents using CSS stylesheets.

In addition, server administrators could install the Log Validator, i.e. a local crawling engine that will analyze the quality of a website with the help of various processing modules, e.g. the three tools above.
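To give an idea of what a link checker does, here is a minimal sketch in Python that collects the links of a page and reports in-page links (such as #top) that point to an anchor that does not exist. It is not the W3C tool and works very differently from it. The class and function names and the example URL are made up for this example, and checking links to other pages would additionally require sending an HTTP request for each link, which is left out here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets and the anchor names defined in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # resolve relative links against the address of the page
            self.links.append(urljoin(self.base_url, attributes["href"]))
        if attributes.get("id"):
            self.anchors.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.anchors.add(attributes["name"])


def broken_fragments(html_text, base_url):
    """Return in-page links whose target anchor is not defined in the page."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    prefix = base_url + "#"
    return [
        link
        for link in collector.links
        if link.startswith(prefix) and link[len(prefix):] not in collector.anchors
    ]


# Hypothetical page with one good and one broken in-page link.
PAGE = (
    '<h1 id="top">Title</h1>'
    '<a href="#top">up</a> <a href="#missing">oops</a> <a href="other.html">next</a>'
)

if __name__ == "__main__":
    print(broken_fragments(PAGE, "http://example.org/page.html"))
```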

2.5 Other tools

Some high-end web authoring systems may include other good tools, but we shall not discuss these here for now ...

3 Past and future of web standards

3.1 Standard vs. implementation

The development of web standards can be described as fairly chaotic at times. If we interpret "standard" the way many (misguided) decision makers and web developers do, "standard" becomes a synonym of "implemented" and (worse) of "practice".

A very typical example was the summer 2009 discussion about IE 8, which fixed bad CSS implementation mistakes of IE 7. Some said that IE 8 wasn't compliant with the "IE 7 standard". Of course, IE 8 is compliant with the standards that IE 7 was supposed to render. IE 8 just renders code as intended, i.e. web developers no longer have to write deliberately wrong, IE-specific CSS.

On a side note, Microsoft originally wanted IE 8 to behave like IE 7 by default (i.e. simulate IE 7's bugs), but then changed its mind. The release version now allows developers to insert a meta tag in their web pages that will trigger IE 7 behavior instead of the default IE 8 "standards mode" (IE8 Compatibility mode, Wikipedia).

<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />

A related issue concerns the introduction of proprietary extensions, for which both Microsoft and Netscape were famous. The list of wild extensions is fairly long. Here are just a few examples:

  • Early tags like the infamous blink
  • DHTML: both Netscape and Microsoft had their own version. DHTML is now standardized as a combination of CSS, DOM and JavaScript
  • New languages that were pushed too hard before they had a chance of being standardized, e.g. Microsoft's WMF.

HTML5 is also a fairly chaotic process. However, instead of fighting each other, stakeholders (i.e. a few selected large companies) agree at least on what will go into the standard. In the meantime, using the more advanced HTML5 features means using a longer feature detection script (how ugly can it get?).

When referring to standards, make a distinction between standard and implementation. Never use browser-specific extensions or hacks (unless you have the intention and the money to repair your code once a new browser version or product hits the market).

On the other hand, it is perfectly rational and ethical to produce "niche contents" that are standardized, or that at least use "industry-accepted" formats, and that only work with a given set of browsers, for example:

  • SVG, a standard for vector graphics (Any modern browser, only Firefox or Opera for older versions)
  • SMIL, a standard for multimedia animation and synchronization (known as HTML+TIME in IE)
  • Flash, a proprietary vector graphics format and associated scripting language for multimedia animation and interaction and Rich internet application (RIA) platform (most browsers)
  • Java, a proprietary (but open source) computer language specification for RIAs (i.e. applets). Must be installed manually.
  • XHTML served as XHTML with the purpose to include other XML vocabularies. (Firefox, Opera, Safari, etc. but not IE8/9).
  • etc...

But think before you do this! Could you provide similar functionality using formats that display in all sorts of web browsers, and at what cost?

3.2 Evolution of HTML

In the history of HTML, at least two proposed standards "did not make it":

  • HTML 3.0 (1995)
  • XHTML 2 (2009)

Also, specifications sometimes change between the draft and the final version. E.g. Microsoft implemented XSLT very early and then had to make a few modifications later, i.e. code that worked in earlier versions of IE 5.5 was broken in later versions of IE 5.5 and in IE 6. Therefore, you should be careful about adopting a standard before it is finalized and before it seems to be accepted by a larger community (in particular the browser "vendors").