Validating HTML with tidy

Author Michael Tougeron on January 15, 2009

Posted under PHP, Security, Tips & Tricks, Web Development

If you ever have to do HTML validation or parsing in PHP, the tidy extension is the way to do it! This extension lets you use the abilities of tidy in some pretty powerful ways. The extension, written by John Coggeshall, has been around for several years now. I can see how someone taking a quick glance at it could think it was nice, but not really something they need. How wrong they would be! If you take a few minutes and look under the hood, tidy is an extremely powerful tool. Not only can it format HTML to standards (what most people use it for), it can also serve as a parser and validation tool.

When I’m dealing with user-submitted data where I want to allow HTML, I have two concerns. First, I don’t want to allow XSS (some XML parsers think <p kkk="></p>" closes the <p> tag). Second, users frequently enter invalid HTML (e.g., they don’t close the <a> tag). Fortunately tidy can deal with both. The second issue is the easier one to solve: run $tidy->cleanRepair() on the HTML. The first is taken care of by looping through the tidy nodes and rebuilding the HTML using a whitelist. More about how to do this below.

To start, you’ll need to set up your tidy options. I tend to use the following. The word-2000 option is great because it strips out all of the annoying HTML, CSS, etc. that Microsoft Word inserts for its formatting.

    $config = array(
      'indent'                      => false,
      'output-xhtml'                => true,
      'wrap'                        => 0,
      'fix-uri'                     => true,
      'word-2000'                   => true,
      'show-body-only'              => true,
      'drop-proprietary-attributes' => true,
      'ncr'                         => false,
      'drop-empty-paras'            => false,
      'hide-endtags'                => true,
      'lower-literals'              => true,
      'markup'                      => true,
      'quote-ampersand'             => true,
      'force-output'                => true,
    );

Add or remove options as works best for you. I’ve found that these work best for cleaning and repairing accidentally invalid HTML. Read the tidy docs for what each option means and how it affects what tidy does. Note that I include “force-output” so that I can show users the result of trying to fix their HTML, which lets them correct their input.
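To get a feel for what the repair step does, here is a minimal sketch that feeds tidy a small document with an unclosed <a> tag. The sample markup and the shortened option list are my own illustration, not part of the setup above, and the snippet assumes the tidy extension is loaded (it just prints a notice otherwise).

```php
<?php
// Illustrative input: the <a> tag inside the paragraph is never closed.
$dirty = '<html><head><title>Demo</title></head>'
       . '<body><p>Read the <a href="/docs">docs</p></body></html>';

// A shortened option list, enough to get the repaired body back.
$config = array(
    'output-xhtml'   => true,
    'show-body-only' => true,
    'wrap'           => 0,
);

$clean = null;
if (extension_loaded('tidy')) {
    // tidy_repair_string() parses and repairs in a single call.
    $clean = tidy_repair_string($dirty, $config, 'utf8');
    echo $clean, "\n";
} else {
    echo "tidy extension not available\n";
}
```

The repaired markup should come back with the missing closing tag added, which is the same fix cleanRepair() applies when you work with a tidy object.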

If you’re parsing user-generated content, users are usually entering just the tags and HTML for the specific content they want posted. This means you’ll need to wrap that input in the <!DOCTYPE>, <html> and <body> tags before processing it through tidy.
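The wrapping step can be as small as the sketch below. The function name wrapHtmlFragment(), the XHTML Transitional doctype and the placeholder title are assumptions made for illustration; use whatever skeleton matches your pages.

```php
<?php
/**
 * Wrap a user-supplied HTML fragment in a minimal document skeleton
 * so tidy sees a complete document. Name, doctype and title are
 * illustrative choices, not something the tidy extension requires.
 */
function wrapHtmlFragment($fragment, $title = 'User content')
{
    return '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
         . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' . "\n"
         . '<html><head><title>' . htmlspecialchars($title) . '</title></head>' . "\n"
         . '<body>' . $fragment . '</body></html>';
}

// The fragment is passed through untouched; cleaning it is tidy's job.
$html = wrapHtmlFragment('<p>Hello <b>world</p>');
echo $html, "\n";
```

The result is what you would then hand to tidy_parse_string() together with your options.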

    // This snippet is assumed to run inside a function that receives
    // $html, $config and $allowed_tags (hence the return at the end).
    $errors   = array();
    $new_html = '';

    // Let's initialize tidy with our HTML and config options.
    $tidy = tidy_parse_string($html, $config);
    // Now let's clean & repair to try and "fix" user errors
    $tidy->cleanRepair();
    // Start tracking any errors at this point so we can give good feedback to the user
    if ( tidy_error_count($tidy) && $tidy->errorBuffer ) {
      foreach ( explode("\n", $tidy->errorBuffer) as $error ) {
        $errors[] = htmlentities($error);
      }
    }

    // Once tidy has cleaned & repaired the initial user input
    // we need to loop through the object and validate each
    // HTML tag block.  We will only need to do that though if
    // the $body actually has child tags
    $body = $tidy->body();
    if ( $body && $body->child ) {
      // we need to validate each HTML tag in the <body>
      // validateTidyNode() will be called recursively for
      // any sub-tag blocks
      foreach ( $body->child as $child ) {
        $error = validateTidyNode($child, $new_html, $allowed_tags);
        // And of course track the errors as they
        // appear to give good feedback to the user
        if ( $error ) {
          $errors = array_merge($errors, (array)$error);
        }
      }
    }
    return array('html' => $new_html, 'errors' => $errors);

    function validateTidyNode($tidy_node, &$html, $allowed_tags) {
      $errors = array();
      $bad_tag = false;
      // If the node is text, just add the text to the HTML
      if ( $tidy_node->type == TIDY_NODETYPE_TEXT ) {
        $html .= $tidy_node->value;
        return $errors;
      }
      // tidy nodes are read-only, so we never modify the node itself;
      // we only decide what gets copied into the rebuilt HTML.
      $tag_name = strtolower((string)$tidy_node->name);
      $rules = isset($allowed_tags[$tag_name]) ? $allowed_tags[$tag_name] : array();
      $attribs = isset($rules['attribs']) ? $rules['attribs'] : array();
      $require_close = !empty($rules['settings']['require_close']);
      // If the node does not exist in your array of acceptable HTML tags
      // then track an error message and set the $bad_tag flag.
      if ( !array_key_exists($tag_name, $allowed_tags) ) {
        $errors[] = 'Tag: ' . $tag_name . ' is not allowed and has been removed.';
        $bad_tag = true;
      }
      // Build the HTML of the child nodes first; it is appended after the opening tag
      $html2 = '';
      if ( $tidy_node->child ) {
        foreach ( $tidy_node->child as $child ) {
          $errors = array_merge($errors, validateTidyNode($child, $html2, $allowed_tags));
        }
      }
      if ( $bad_tag ) {
        // I prefer to "display" the bad HTML tags so that the
        // user can see that it was not accepted.  But you
        // may prefer to just strip the tag.
        $html .= '&lt;' . $tag_name;
      }
      else {
        $html .= '<' . $tag_name;
        $found_attribs = array();
        if ( $tidy_node->attribute ) {
          foreach ( $tidy_node->attribute as $attrib_name => $attrib_value ) {
            $attrib_name = strtolower($attrib_name);
            // verify that the tag is allowed to have the specified attribute.
            if ( !array_key_exists($attrib_name, $attribs) ) {
              $errors[] = 'Tag ' . $tag_name . ' is not allowed to have the attribute ' . $attrib_name . ' and has been removed.';
              continue;
            }
            // validate the attribute's value.  We don't want invalid values.
            $res = validateHTMLTagAttribute($tag_name, $attrib_name, $attrib_value, $rules);
            if ( empty($res['remove']) ) {
              // escape the value so a quote inside it cannot break out of the attribute
              $html .= ' ' . $res['attrib_name'] . '="' . htmlspecialchars((string)$res['attrib_value'], ENT_QUOTES) . '"';
              $found_attribs[] = $attrib_name;
            }
            if ( !empty($res['errors']) ) {
              $errors[] = $res['errors'];
            }
          }
        }
        // check to make sure that required attributes are set.  e.g. <img> needs to have the src attribute
        foreach ( $attribs as $attrib_name => $attrib_settings ) {
          if ( !empty($attrib_settings['required']) && !in_array($attrib_name, $found_attribs) ) {
            $errors[] = 'Tag ' . $tag_name . ' is required to have the attribute ' . $attrib_name . '.';
          }
        }
        // some tags require at least one of a set of attributes.  e.g. <a> needs to have either the href or name attribute.
        if ( !empty($rules['required_attribs']) && !array_intersect($rules['required_attribs'], $found_attribs) ) {
          $errors[] = 'Tag ' . $tag_name . ' is required to have one of the following attributes: ' . implode(', ', $rules['required_attribs']) . '.';
        }
        // tags that do not need a closing tag (e.g. <img>) are written as self-closing
        if ( isset($rules['settings']) && !$require_close ) {
          $html .= ' /';
        }
      }
      $html .= $bad_tag ? '&gt;' : '>';
      $html .= $html2;
      if ( $bad_tag ) {
        $html .= '&lt;/' . $tag_name . '&gt;';
      }
      elseif ( $require_close ) {
        $html .= '</' . $tag_name . '>';
      }
      return $errors;
    }

    function validateHTMLTagAttribute( $tag_name, $attrib_name, $attrib_value, $tag_data ) {
      // set to the unmodified version first.  if modified, it will change at the end of the function.
      // 'remove' starts as true so that every early return drops the attribute.
      $res = array(
        'attrib_name'  => $attrib_name,
        'attrib_value' => $attrib_value,
        'valid'        => false,
        'remove'       => true,
        'errors'       => null,
      );

      if ( !$tag_data ) {
        $res['errors'] = "$tag_name could not be found.  Invalid tag/attribute.";
        return $res;
      }

      if ( !empty($tag_data['attribs']['unlimited_attribs']) ) {
        $res['valid'] = true;
        $res['remove'] = false;
        return $res;
      }

      $attrib = isset($tag_data['attribs'][$attrib_name]) ? $tag_data['attribs'][$attrib_name] : array();
      $type = isset($attrib['type']) ? $attrib['type'] : '';
      switch ( $type ) {
        // no restrictions on the value of the attributes (not normally recommended)
        case 'unrestricted':
          break;
        // the attribute can only have one of the defined values
        case 'fixed':
          if ( !in_array(strtolower($attrib_value), $attrib['values']) ) {
            $res['errors'] = "$attrib_name is not set to an accepted value for $tag_name tag.";
            return $res;
          }
          break;
        // some basic numeric checks
        case 'numeric':
        case 'px':
        case 'percent':
          $attrib_value = intval($attrib_value);
          if ( isset($attrib['max']) && $attrib_value > $attrib['max'] ) {
            $res['errors'] = "$attrib_name is greater than the max value (" . $attrib['max'] . ") allowed for $tag_name tag.";
            return $res;
          }
          elseif ( isset($attrib['min']) && $attrib_value < $attrib['min'] ) {
            $res['errors'] = "$attrib_name is less than the min value (" . $attrib['min'] . ") allowed for $tag_name tag.";
            return $res;
          }
          if ( $type == 'percent' ) {
            $attrib_value = $attrib_value . "%";
          }
          elseif ( $type == 'px' ) {
            $attrib_value = $attrib_value . "px";
          }
          break;
        case 'hex':
          // 6 characters of A-F and 0-9
          // optionally allow a # in front for a total
          // of 7 characters
          break;
        case 'url':
          // validate according to your valid URL criteria
          // e.g., only allow for current domain or if the
          // tag is "img" make sure the extension is .jpg or .gif
          break;
        case 'string':
          // validate for a string.  I usually consider _-. and similar
          // characters acceptable for a "string" even though they are
          // not truly a string
          break;
        default:
          // unknown rule type: we cannot validate the value, so the attribute is dropped
          $res['errors'] = "ERROR!  Do not know how to process the attribute $attrib_name of the $tag_name tag!";
          return $res;
      }
      $res['attrib_value'] = $attrib_value;
      $res['valid'] = true;
      $res['remove'] = false;
      return $res;
    }
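The 'hex', 'url' and 'string' cases are left as comments in the listing. One way to fill them in is sketched below. The three helper names are hypothetical, and the rules they enforce (six hex digits, absolute http/https URLs only, a small set of safe characters) are assumptions you should tighten or loosen to match your own criteria.

```php
<?php
// Hypothetical helpers for the 'hex', 'url' and 'string' attribute types.

// 'hex': exactly six hex digits, with an optional leading '#'.
// \A and \z anchor the whole value, so a trailing newline is rejected too.
function isValidHexColor($value)
{
    return preg_match('/\A#?[0-9a-fA-F]{6}\z/', $value) === 1;
}

// 'url': accept only absolute http/https URLs. Anything without a host
// (javascript:, data:, relative paths) is rejected.
function isValidUrl($value)
{
    $parts = parse_url(trim($value));
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return false;
    }
    return in_array(strtolower($parts['scheme']), array('http', 'https'), true);
}

// 'string': letters, digits, spaces and the characters _ - . only.
function isValidSimpleString($value)
{
    return preg_match('/\A[A-Za-z0-9 _.\-]*\z/', $value) === 1;
}

var_dump(isValidHexColor('#ff00aa'));
var_dump(isValidUrl('javascript:alert(1)'));
var_dump(isValidSimpleString('photo_01-final.v2'));
```

Each case in the switch would call the matching helper and, on failure, set an error message and return early, the same way the 'fixed' case does.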

How you set up the array for tag validation is pretty much up to you; it is what the functions above receive as $allowed_tags. The format I used looks a lot like this:

    $tags = array(
      'a' => array(
        'settings' => array('require_close' => true),
        'attribs'  => array(
          'href'        => array('type' => 'url'),
          'name'        => array('type' => 'string'),
          'id'          => array('type' => 'string', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'class'       => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'onclick'     => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'onmouseover' => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
        ),
        'required_attribs' => array('href', 'name'),
      ),
      'b' => array(
        'settings' => array('require_close' => true),
      ),
      'img' => array(
        'settings' => array('require_close' => false),
        'attribs'  => array(
          'src'         => array('type' => 'url', 'required' => true),
          'alt'         => array('type' => 'string'),
          'border'      => array('type' => 'numeric', 'min' => 0, 'max' => 5),
          'align'       => array('type' => 'fixed', 'values' => array('center', 'left', 'right')),
          'height'      => array('type' => 'numeric', 'min' => 1, 'max' => 2048),
          'width'       => array('type' => 'numeric', 'min' => 1, 'max' => 2048),
          'title'       => array('type' => 'string'),
          'hspace'      => array('type' => 'numeric', 'min' => 0, 'max' => 10),
          'vspace'      => array('type' => 'numeric', 'min' => 0, 'max' => 10),
          'id'          => array('type' => 'string', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'class'       => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'onclick'     => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
          'onmouseover' => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
        ),
      ),
    );

The auth_level is a filter I have in place so that only users with a certain level of access, such as internal staff, can use that attribute (the check itself is not part of the code shown here). As you probably already know, attributes like onmouseover are an easy gateway to XSS exploits. As you add new elements to the tag rule definition, you just need to update the validate function so that it checks them.
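One simple way to apply auth_level is to strip the restricted attributes from the rules before validation starts, so the whitelist check rejects them like any other unknown attribute. The sketch below assumes numeric levels where a higher number means more access; the constant values and the function name filterTagsByAuthLevel() are made up for the example.

```php
<?php
// Hypothetical numeric levels; how these are really defined is up to your app.
define('AUTH_LEVEL_USER', 1);
define('AUTH_LEVEL_TRUSTED_USER', 5);

/**
 * Return a copy of the tag rules without the attributes whose
 * 'auth_level' is higher than the current user's level.
 */
function filterTagsByAuthLevel(array $tags, $user_level)
{
    foreach ($tags as $tag_name => $rules) {
        if (empty($rules['attribs'])) {
            continue;
        }
        foreach ($rules['attribs'] as $attrib_name => $settings) {
            if (isset($settings['auth_level']) && $settings['auth_level'] > $user_level) {
                unset($tags[$tag_name]['attribs'][$attrib_name]);
            }
        }
    }
    return $tags;
}

// A cut-down rule set in the same format as the array above.
$tags = array(
    'a' => array(
        'settings' => array('require_close' => true),
        'attribs'  => array(
            'href'    => array('type' => 'url'),
            'onclick' => array('type' => 'unrestricted', 'auth_level' => AUTH_LEVEL_TRUSTED_USER),
        ),
    ),
);

$for_user  = filterTagsByAuthLevel($tags, AUTH_LEVEL_USER);
$for_staff = filterTagsByAuthLevel($tags, AUTH_LEVEL_TRUSTED_USER);

print_r(array_keys($for_user['a']['attribs']));
print_r(array_keys($for_staff['a']['attribs']));
```

Filtering up front keeps the validation functions free of any knowledge about users: an ordinary user's onclick is simply reported as an attribute that is not allowed.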

You may have noticed that I’m using the deprecated HTML attributes instead of putting all of this in a “style” attribute. I’ve found that most users don’t know CSS and are familiar with the old-style HTML attributes. If you force your users to use styles, you’ll need a whole other validation routine to ensure exploits are not hiding in the CSS.

Note: Please don’t just copy/paste this code into your code and use it. I’m quite positive there are bugs and typos all throughout. Oh, and no, this is not the code used by the sites I work on. It is similar but definitely not the same. 😛

1 Comment so far

  1. Michael Tougeron February 25, 2009 7:38 am

    I gave a talk about this topic to the GBA LAMP meetup: http://www.grepmymind.com/talks/
