HTML Purifier Drupal Module

Note: I transferred ownership of this module to Edward Z. Yang, the author of HTMLPurifier. This module is now maintained on the drupal.org infrastructure. For more information see the HTMLPurifier project page.

Description

This Drupal module allows HTML Purifier to be used as a filter in an input format. HTML Purifier removes malicious HTML code and ensures that output is standards compliant.

Most WYSIWYG editors require either the "full HTML" input format or a format with a large set of allowed tags, including some dangerous ones. Because the default HTML filter has only limited knowledge of HTML, such formats can leave a site open to XSS attacks.

HTML Purifier has two main advantages:

  • Security: Because of its thorough knowledge of HTML, it can filter out a lot more malicious code.
  • Standards Compliant: It ensures that output is valid (X)HTML, so user input can no longer break the validation of your page.

Compatibility

This module has been tested with the following other modules:

  • BBCode: Code generated by BBCode will pass through HTML Purifier unmodified, with one exception: email address encoding needs to be disabled. Email address encoding uses JavaScript to display email addresses, and HTML Purifier blocks JavaScript for obvious reasons.
  • ...

More modules need to be tested, especially WYSIWYG editors (HTMLArea, FCKeditor, TinyMCE). If you have done so, please tell me about your experiences.

Download

You can download the current development version from the SVN repository. This code does not include the HTML Purifier library itself; you will need to download that separately. See the included INSTALL.txt for details.

The latest version of this module can be downloaded from the HTMLPurifier project page.

License

GPL

Comments

As I told you by email, I tried your module with the DRUPAL-4-7 branch and TinyMCE. I haven't tested it much, but I sometimes get this kind of PHP error:

DOMDocument::loadHTML() [function.loadHTML]: Opening and ending tag mismatch: div and b in Entity, line: 1 in
/www/modules/htmlpurifier/library/HTMLPurifier/Lexer/DOMLex.php on line 57.

Another thing is that this filter only covers new nodes and comments. What about the search box, contact field, ...? I'll think about enabling htmlpurifier globally.


As I told you by email, I tried your module with the DRUPAL-4-7 branch and TinyMCE. I haven't tested it much, but I sometimes get this kind of PHP error:
DOMDocument::loadHTML() [function.loadHTML]: Opening and ending tag mismatch: div and b in Entity, line: 1 in
/www/modules/htmlpurifier/library/HTMLPurifier/Lexer/DOMLex.php on line 57.

This is a warning message from PHP. HTML Purifier tries to suppress it, but there is a bug in Drupal 4.7 which causes it to show the message anyway.
I think you can fix it by editing the error_handler function in includes/common.inc. This change should fix it: http://cvs.drupal.org/viewcvs/drupal/drupal/includes/common.inc?r1=1.548&r2=1.549
This fix will be included in the 4.7.6 release as well.

Once the page is cached, HTML Purifier isn't invoked, so the error doesn't occur.

Another thing is that this filter only covers new nodes and comments. What about the search box, contact field, ...? I'll think about enabling htmlpurifier globally.

You can use it for old nodes as well. Once you add htmlpurifier to an input format, all existing nodes with that input format will use htmlpurifier. Input formats are applied when a node is displayed, not when it is saved.

Search and contact only allow plain text. They convert all HTML to plain text, so there is no need for HTML Purifier there.

The module currently covers most of the user contributed content, except for the places where no input formats are used. There are indeed modules that don't use input formats but implement their own filtering mechanism. The aggregator module comes to mind as one such module.
I don't know if there is a way to pass the output from these modules through HTML Purifier as well, aside from changing the theme functions, but that's just ugly.

I didn't find your module in the module list on drupal.org:
http://drupal.org/project/Modules

could you please add it there?

Will do once it's really usable.

Currently this filter acts on output, but HTML Purifier is way too slow to run that way on any decent-sized site.
It works fine for blogs like this one, where pretty much all posts and comments are in the filter cache, but if you have a larger site that is constantly being indexed by crawlers, this will kill your server. Believe me, I tried ;)

I'm still looking for the best/easiest way to do on-input filtering. I guess I can use form validation, but I'm not sure how I can detect which fields need to be filtered and which don't. Filtering node->body is a start, but with modules like CCK, that probably isn't enough. I'll ask the Drupal gods once I have some more time to work on this. Suggestions are welcome, of course.

Hi Bart, I don't know if you're still reading comments on this post or not. If you are, and are still working on this project, you might want to take a look at http://drupal.org/project/safehtml and see if you can use that module as a model of how to get HTML Purifier working on content before it's put into the database.

If not, well I may give it a go. I'd like to try Drupal for a project I'm working on, and have already decided to use the YUI Editor module for it. My next step would be to get html purifier working to filter input before it's put into the database.

Thanks for the suggestion. I actually implemented the feature in the exact same way a few weeks ago (code that is currently in SVN). So you may want to check that out. Basically what it does is create an input filter just like one would do with output filters. When a new node is submitted, the module checks if the input format contains the html purifier filter and if so, filters the input before it is inserted into the database.

I had some difficulties implementing the same for comments (the comment hook isn't as powerful as nodeapi), so currently this only works for the body and teaser fields of nodes. I plan to add support for CCK textareas in the future as well.
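To make the on-input idea above concrete, here is a rough sketch in Python. It is only an illustration of the flow, not the module's PHP code: `FORMATS`, `sanitize` and `save_node` are hypothetical names, and the stand-in sanitizer simply escapes markup instead of doing the cleaning HTML Purifier does.

```python
import html

# Hypothetical registry: input format name -> names of the filters it contains.
FORMATS = {
    "filtered": ["htmlpurifier", "linebreaks"],
    "plain": ["linebreaks"],
}


def sanitize(text):
    """Stand-in for the purifier: escapes all markup rather than cleaning it."""
    return html.escape(text)


def save_node(node, database):
    """Filter body and teaser once, at submission time, if the format asks for it."""
    if "htmlpurifier" in FORMATS.get(node["format"], []):
        for field in ("body", "teaser"):
            node[field] = sanitize(node[field])
    # Assumption: the "database" is just a list here.
    database.append(node)
    return node
```

The point of the design is that the expensive step runs once per submission instead of once per uncached page view. The price is that the stored text is already filtered, so changing the filter settings later does not affect existing content.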

I'm interested in this module because of the input filter difficulties I've been having getting various editors to work. However, I am not a programmer. Is this module's current state such that someone such as myself would be able to set up and use this effectively or is it still in the testing stages? If it wouldn't be appropriate for me, what suggestions might you have in terms of other modules?

It works for plain old nodes, but not for comments or any other content types. So for now I'd say it's still in testing.
Progress is slow because the project for which I needed a module like this was canceled, so there isn't that much motivation anymore.

Mind you, I've got a 9-week-old baby and have several huge projects going on at work, so I don't have a lot of time. However, I'm working on my family website and want to use Drupal and HTML Purifier for everything. So I'll grab what you've got in SVN and see if I can get it to work for comments and such as well. I'm trying to use either the YUI or ExtJS RTEs for blogs, comments, and other features, and really want to have HTML Purifier in place for it. Since it's a personal project, I'll have the drive. I have a few other things I want to get done first on the site though, so I'll add it to my list.

I installed it on my machine for Drupal, enabled the module, added it to an input format and tried it out by writing a new story, but this is the error message I got.

  1. user warning: Value not supported, valid values are: Serializer in D:\Web\xampp\htdocs\drupal-5.5\modules\htmlpurifier\library\HTMLPurifier\Config.php on line 232.

Any idea as to what's wrong?

FYI, others looking for a solution to this issue should have a look at http://drupal.org/node/203642

Which code did you use? The one from drupal.org or the one from my SVN repository?

You should use the one from drupal.org. It has the latest updates, some of which I hadn't committed to SVN yet. If the problem still occurs with that code, create an issue on drupal.org. Edward has far better knowledge of the htmlpurifier internals than I do (he's the one who wrote it, after all).

Is anyone developing a module using htmLawed instead of HTMLPurifier? Someone should! htmLawed is only one file and is just a tenth in size and memory consumption.

drupal htmlawed module -- uses nodetype-specific values. great!

I was wondering if the module you developed works with the GeSHi filter module. I use this on my Visual Basic Source site in order to display the VB6 code for my samples and tutorials. Just curious if anyone has tried this or not.

Also I was wondering why this doesn't show up on the drupal.org page? Have you just not got around to it or is it not officially supported?

There are official releases on the project page, but the module is now maintained by ezyang instead of me.

I never tried it in combination with GeSHi, but I think it should work, since all it does is insert some style tags. It's typically modules that inject JavaScript code that cause problems.

Is there any comparison of HTMLPurifier and htmLawed?
They both look like good and useful modules, but I don't know which one I should use.

I'm setting up my new Drupal-based site now and thinking about safety-related add-ons, but the amount of information I have is definitely not enough.

Best regards,
Dave

It really depends on what you need. HTMLPurifier does a whole lot more: it not only filters out malicious code but also ensures standards compliance.

There is a comparison (albeit pretty biased) here: http://htmlpurifier.org/comparison.html

The main problem with HTMLPurifier is that all the validation it does comes at a pretty high performance cost. It relies heavily on caching, which is why I'm not using it on a large site. I'm afraid the server will die when the cache gets flushed :)

Hey Bart,
I know you don't maintain this module anymore but I also know you have responded to comments in the past so I'm hoping maybe if you get a chance you will respond to this one :-)

You mention above that HTMLPurifier comes at a pretty high performance cost - especially on a large site. What do you mean by a large site if you don't mind me asking. For instance:

I have a Visual Basic 6.0 site that I post programming tutorials and source code samples on it. It is an older programming language but I am still amazed at how many people make use of it. I started the site just to keep some old articles I had written around and over the years it has grown pretty substantially. Especially once Microsoft quit supporting the VB6 programming language. I have developers from all over the world coming to it. I also allow others to post their VB related tutorials and source - however I would love to make sure that any HTML they post is not only safe but is also well formed so I don't have to spend time reformatting it. My only concern is that since the site has grown quite a bit things will get bogged down and I will either have too slow of pages or be required to pay for a bigger hosting plan.

Thus the reason I am asking this question... what do you think? How large is too large for me to use this great module?

Thanks for taking the time to read this if you get a chance. Or if anyone else has insight I would love to hear.

Hi Matt,
The main performance impact is that this module runs as part of the input format filters. This means that all database content using this filter format needs to be processed when viewed by a user. Most filters are fairly simple so this isn't really noticeable, but last I tried, this filter doubled the time required to generate the page.
Of course, the filter output is cached, so the next time someone views the page (in the same day), there is no performance impact. If you have only a simple brochure-like website, that's not really an issue because everything will be in cache (and you could extend the cache lifetime).

If you have hundreds or thousands of nodes, there will always be visitors hitting pages that are not in the filter cache. Sometimes because they are directed there by search engines, or because the "visitor" is really just a crawler. This will consume a lot of CPU power as well.
A better solution for such a website would be to run the node content through the filter when the node is added.
I used BBCode formatting instead for the same purpose. Many people already know some BBCode tags from forums, and it is a lot easier to filter than HTML.
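The caching behaviour described above can be sketched roughly like this in Python. This is not Drupal's actual filter cache: `FilterCache` and its methods are hypothetical, and the one-day lifetime is only assumed from the "in the same day" remark.

```python
import hashlib
import time

CACHE_LIFETIME = 24 * 60 * 60  # one day in seconds (assumed lifetime)


class FilterCache:
    """Cache filter output so the slow filter only runs on a miss."""

    def __init__(self, expensive_filter, clock=time.time):
        self.expensive_filter = expensive_filter
        self.clock = clock
        self.entries = {}  # key -> (expiry timestamp, filtered output)
        self.misses = 0

    def render(self, format_name, text):
        # Key on both the format and the text, so either changing is a miss.
        raw = (format_name + ":" + text).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no filtering cost
        self.misses += 1  # cache miss: pay the full filtering cost
        output = self.expensive_filter(text)
        self.entries[key] = (now + CACHE_LIFETIME, output)
        return output
```

With only a handful of pages, nearly every view is a hit. With thousands of nodes and crawlers walking all of them, the miss counter keeps climbing, and every miss runs the slow filter.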

Wow that is a great explanation. Thanks for explaining it in such detail. I don't think I will be able to make use of it on my page as it gets 100s of thousands of page views per day. I am still excited to try it out however. I have created a site for my wife where she posts her good craft ideas. This site gets far fewer people so it should be safe.

Thanks again for all your help.
