Sunday, January 20, 2008

Cleaning xHTML markup with PHP Tidy

Everyone makes mistakes. Even the best xHTML coders will sometimes write invalid xHTML. Not to worry: PHP can automatically clean up xHTML before display using the PHP Tidy extension.

PHP Tidy uses the Tidy parser. Tidy has been ported to many programming languages and lets them clean up HTML and XML documents. It works well for xHTML.

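In PHP5, the tidy extension is bundled with PHP, although it still has to be enabled when PHP is built (the --with-tidy configure option) or loaded as an extension. In PHP4 you will need to download the Tidy extension for PHP4 from PECL and compile PHP with Tidy support.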

How to use Tidy in PHP is documented here. Here are some examples of what Tidy can do.

Example use of Tidy in PHP

For code portability and distribution, it's necessary to first check whether the tidy extension is available in your PHP installation. You can do this by querying the existence of the tidy functions or classes (among other methods). So first you check if Tidy support is available:

if (function_exists('tidy_parse_string')) {
 // do your tidy stuff
}
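
The "other methods" mentioned above include asking PHP whether the extension itself is loaded, or whether the tidy class exists. Here is a minimal sketch combining them. The helper name tidy_is_available() is my own invention for illustration; it is not part of PHP or Tidy.

```php
<?php
// Hypothetical helper: true when any common sign of Tidy support is present.
function tidy_is_available() {
    return extension_loaded('tidy')               // the extension is loaded
        || function_exists('tidy_repair_string')  // procedural API
        || class_exists('tidy');                  // OO API (PHP5)
}

if (tidy_is_available()) {
    // do your tidy stuff
}
```

Any one of the three checks is normally enough; which one you pick mostly depends on whether your code calls the functions or the class.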

Then comes the tidying. For simplicity, I'll use a single PHP Tidy function, tidy_repair_string().

// Markup to clean (sample input with unclosed tags)
$html = '<p>Hello <b>world';
// Specify configuration
$config = array(
 'indent'         => true,
 'output-xhtml'   => true,
 'wrap'           => 200);
// Specify encoding
$encoding = 'utf8';
// Repair the HTML
$html = tidy_repair_string($html, $config, $encoding);

This works for both PHP4 and PHP5. PHP5 also supports an OO syntax.
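
For comparison, here is a rough sketch of the same repair using the PHP5 object-oriented syntax. The sample input string is made up for the example; the configuration is the same as in the procedural version.

```php
<?php
// Sample input with an unclosed <p> and <b>; any markup string works here.
$html = '<p>Hello <b>world';

$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'wrap'         => 200,
);

if (class_exists('tidy')) {
    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8'); // parse with the given options
    $tidy->cleanRepair();                       // repair the parsed markup
    $html = tidy_get_output($tidy);             // fetch the repaired document
}

echo $html;
```

The object form is handy when you want to do more than one thing with the parsed document, such as repairing it and then reading Tidy's diagnostics.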

Example Implementation: PHP Tidy Plugin for Joomla

Here is how I implemented a PHP Tidy plugin for Joomla.

Joomla is a Content Management System, so you cannot directly control the xHTML that goes into your articles. Some of your users may not be very xHTML savvy. The main reason I implemented Tidy is to clean content inserted automatically from feeds, which you have absolutely no control over.

Joomla plugins are built on a basic Observer pattern. Functions are registered as observers and are triggered when certain events occur. One such event is the preparation of content for display. The tidy plugin registers as a handler for content preparation, passes all content through the Tidy parser, and returns the clean xHTML to Joomla.
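
To make the pattern concrete, here is a small stand-alone sketch of the register-and-trigger idea. SketchDispatcher and sketch_trim are hypothetical names made up for this example. This is not Joomla's actual implementation, only an illustration of how observers registered for an event get called with the content.

```php
<?php
// Hypothetical stand-in for a CMS event dispatcher, for illustration only.
class SketchDispatcher {
    private $observers = array();

    // Register a function name as an observer of the named event.
    public function registerFunction($event, $function) {
        $this->observers[$event][] = $function;
    }

    // Call every observer registered for the event with the same arguments.
    public function trigger($event, array $args) {
        if (!isset($this->observers[$event])) {
            return;
        }
        foreach ($this->observers[$event] as $function) {
            call_user_func_array($function, $args);
        }
    }
}

// Example observer: modifies the content object in place.
function sketch_trim($published, $row) {
    if ($published) {
        $row->text = trim($row->text);
    }
    return true;
}

$dispatcher = new SketchDispatcher();
$dispatcher->registerFunction('onPrepareContent', 'sketch_trim');

$row = new stdClass();
$row->text = "  <p>hello</p>  ";
$dispatcher->trigger('onPrepareContent', array(true, $row));

echo $row->text;
```

The tidy plugin plays the role of sketch_trim here: it receives the content row when the event fires and rewrites its text.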

The Joomla Tidy Plugin Code


<?php
/**
* @copyright Copyright (C) 2007 Fiji Web Design. All rights reserved.
* @license http://www.gnu.org/copyleft/gpl.html GNU/GPL
* @author gabe@fijiwebdesign.com
*/

// no direct access
defined( '_VALID_MOS' ) or die( 'Restricted access' );

// register content event handlers
$_MAMBOTS->registerFunction( 'onPrepareContent', 'bot_tidy' );

/**
*  Tidy up the xHTML of your content
*/
function bot_tidy( $published, &$row, &$params, $page=0 ) {
 
 if ($published) {
  // get the plugin parameters
  //$botParams = bot_tidy_getParams('bot_tidy');

  if (isset($row->text) && $row->text) {
   $row->text = bot_tidy_parse($row->text);
  }

 }
 return true;
}

/**
* Parses a string with tidy, taking the Joomla encoding into consideration
* @param string $html xHTML to clean
* @return string cleaned xHTML (or the original markup if tidy is unavailable)
*/
function bot_tidy_parse($html) {
 if (function_exists('tidy_parse_string')) {
  
  // Specify configuration
  $config = array(
       'indent'         => true,
       'output-xhtml'   => true,
       'wrap'           => 200);
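  // get Joomla content encoding; _ISO typically looks like 'charset=utf-8'
  // (explode replaces split, which is deprecated)
  $iso = explode( '=', _ISO );
  $encoding = '';
  // tidy encoding names are lowercase and have no hyphens
  $jos_enc = isset($iso[1]) ? strtolower(str_replace('-', '', $iso[1])) : '';
  if (in_array($jos_enc, array('ascii', 'latin0', 'latin1', 'raw', 'utf8', 'iso2022', 'mac', 'win1252', 'ibm858', 'utf16', 'utf16le', 'utf16be', 'big5', 'shiftjis'))) {
   $encoding = $jos_enc;
  }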
  
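  // Tidy; keep the original markup if the repair fails
  $tidied = tidy_repair_string($html, $config, $encoding);
  if ($tidied !== false) {
   $html = $tidied;
  }
 }

 // return the (possibly cleaned) xHTML with a trailing line break
 return $html . "\r\n";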
}

Here is the tidy plugin for Joomla.
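
One thing worth checking when tidying article fragments: by default Tidy outputs a complete document (DOCTYPE, html, head and body), which may not be what you want inside a CMS template. Tidy's show-body-only option restricts the output to the body contents. The snippet below is a sketch of that variation with a made-up sample fragment; it is not part of the released plugin.

```php
<?php
// Sample fragment, as might arrive from a feed.
$fragment = '<p>Feed item with an <em>unclosed tag';

$config = array(
    'output-xhtml'   => true,
    'show-body-only' => true, // emit only what would go inside <body>
    'wrap'           => 200,
);

if (function_exists('tidy_repair_string')) {
    $clean = tidy_repair_string($fragment, $config, 'utf8');
    if ($clean !== false) {
        $fragment = $clean; // keep the original if the repair failed
    }
}

echo $fragment;
```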

Tidy is great for Content Management Systems where content is contributed by users with differing levels of xHTML knowledge. It is also necessary if you want content from RSS feeds to pass W3C validation (if they contain xHTML, like the Google News feeds). I've noticed, however, that PHP Tidy does not always create valid xHTML content. It does, however, create valid XML every time. This is yet to be explored further, as I have just released the Joomla Tidy Plugin for alpha testing.
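
When exploring cases where the output does not validate, it may help to look at what Tidy itself reports. Below is a sketch, using the PHP5 OO syntax, that reads Tidy's diagnostics for a made-up sample input. The exact messages depend on the Tidy version, so none are shown here.

```php
<?php
// Sample input with unclosed tags.
$html = '<p>Hello <b>world';

$warnings = 0;
$report   = '';

if (class_exists('tidy')) {
    $tidy = new tidy();
    $tidy->parseString($html, array('output-xhtml' => true), 'utf8');
    $tidy->cleanRepair();
    $tidy->diagnose();                     // run Tidy's diagnostics on the document

    $warnings = tidy_warning_count($tidy); // number of warnings Tidy raised
    $report   = (string) $tidy->errorBuffer; // human-readable warnings and errors
}

echo "Warnings: " . $warnings . "\n";
echo $report . "\n";
```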
