Validating HTML with tidy

If you ever have to do HTML validation or parsing in PHP, the tidy extension is the way to do it! The extension, written by John Coggeshall, has been around for several years now. At a quick glance it can look like something that is nice to have but not something you really need. How wrong that would be! Take a few minutes to look under the hood and you will find that tidy is an extremely powerful tool. Not only can it reformat HTML to standards (what most people use it for), it can also serve as a parser and a validation tool.

When I’m dealing with user-submitted data where I want to allow HTML, I have two concerns. First, I don’t want to allow XSS (some XML parsers think the </p> inside <p kkk="></p>" closes the <p> tag). Second, users frequently enter invalid HTML (e.g., they don’t close an <a> tag). Fortunately, tidy can easily deal with both. The second issue is the easier one to solve: run tidy->cleanRepair() on the HTML. The first is taken care of by looping through the tidy nodes and rebuilding the HTML using a whitelist. More about how to do this after the break.
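To make the two steps concrete, here is a minimal sketch of how they might fit together. It is not the code from the full post: the whitelist, the helper names (`escapeText`, `rebuildNode`, `sanitizeHtml`), and the choices to drop disallowed elements together with their contents and to accept only http/https links are all assumptions made for illustration.

```php
<?php
// Hypothetical whitelist for illustration: tag name => allowed attributes.
const ALLOWED_TAGS = [
    'p'      => [],
    'b'      => [],
    'i'      => [],
    'em'     => [],
    'strong' => [],
    'br'     => [],
    'a'      => ['href'],
];

// Tags that are written without a closing tag.
const VOID_TAGS = ['br'];

// Decode any entities first, then escape, so text is escaped exactly once.
function escapeText(string $raw): string
{
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

// Rebuild one tidy node (and its children) using only whitelisted markup.
function rebuildNode(tidyNode $node): string
{
    if ($node->isText()) {
        return escapeText((string) $node->value);
    }

    $name = strtolower((string) $node->name);
    if (!array_key_exists($name, ALLOWED_TAGS)) {
        // Assumption: disallowed elements (script, style, comments, ...)
        // are dropped together with everything inside them.
        return '';
    }

    $attrs = '';
    foreach ($node->attribute ?? [] as $attrName => $attrValue) {
        $attrName = strtolower((string) $attrName);
        if (!in_array($attrName, ALLOWED_TAGS[$name], true)) {
            continue; // e.g. onclick, style
        }
        $value = html_entity_decode((string) $attrValue, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Assumption: only absolute http/https links are accepted.
        if ($attrName === 'href' && !preg_match('#^https?://#i', $value)) {
            continue;
        }
        $attrs .= ' ' . $attrName . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    }

    if (in_array($name, VOID_TAGS, true)) {
        return '<' . $name . $attrs . '>';
    }

    $inner = '';
    foreach ($node->child ?? [] as $child) {
        $inner .= rebuildNode($child);
    }
    return '<' . $name . $attrs . '>' . $inner . '</' . $name . '>';
}

// Repair the markup with tidy, then rebuild the body from the whitelist.
function sanitizeHtml(string $html): string
{
    $tidy = new tidy();
    $tidy->parseString($html, [], 'utf8');
    $tidy->cleanRepair(); // closes unclosed tags, fixes nesting

    $body = $tidy->body();
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->child ?? [] as $child) {
        $out .= rebuildNode($child);
    }
    return $out;
}

// Usage: the unclosed <a>, the event handlers and the script are all handled.
echo sanitizeHtml('<p onclick="evil()">Hi <a href="http://example.com" onmouseover="x()">link<script>alert(1)</script></p>');
```

Because the output is assembled from scratch rather than by stripping things out of the input, anything the whitelist does not name never reaches the page. Whether to drop a disallowed element entirely or keep its text is a judgment call; this sketch takes the stricter option.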

Posted under PHP, Security, Tips & Tricks, Web Development

This post was written by Michael Tougeron on January 15, 2009
