by Robert Douglass on April 9, 2007

Drupal Input Formats and Filters

This article applies to Drupal 5.x.

Processing textual content for output in a browser is one of Drupal's most critical tasks. Without such processing we would all have to become masters at typing in HTML text! In this article I will explain what filters and input formats are, why they are important, how they are used, and why they impact the security of your site.

Filters and Input Formats

The pillars of Drupal's text handling are filters and input formats. A filter is a set of rules that can be applied to transform text in some way. Some filters strip certain HTML tags or security hazards from text. Other filters look for special patterns and expand the text in a meaningful way. Other fun-oriented filters, such as the Pirate Filter, rewrite the text altogether (in this case, to make it "talk like a pirate"). Filters know how to do one thing, and do it well; text in, filtered text out.

Some filters have extra configuration options. The HTML filter, for example, strips all but an allowed set of HTML tags from text. The set of allowed tags can be determined by the administrator.

An input format is an ordered collection of filters. Any text that is displayed in the browser should first be run through the filters in an input format. The input format applies all of its filters in order, so that one filter feeds its output to the next, forming a chain. This chaining can be the source of great flexibility as well as great confusion. The flexibility comes from the fact that filters can be made to work together; the confusion comes from cases where filters inadvertently work against each other, one filter undoing the work of the previous one. I'll show examples of both.
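To make the idea of a chain concrete, here is a minimal sketch in plain PHP. It is a simplified model, not Drupal's own code: the function names (sketch_apply_format, sketch_strip_html, sketch_line_breaks) are hypothetical, and PHP's strip_tags() stands in for the real HTML Filter.

```php
<?php
// Simplified model of an input format: an ordered list of filter callbacks.
// All names here are hypothetical stand-ins, not Drupal's implementation.

// Run the text through each filter in order; each output feeds the next filter.
function sketch_apply_format(array $filters, $text) {
  foreach ($filters as $filter) {
    $text = call_user_func($filter, $text);
  }
  return $text;
}

// Stand-in for an HTML filter: keep <em> and <strong>, strip everything else.
function sketch_strip_html($text) {
  return strip_tags($text, '<em><strong>');
}

// Stand-in for a line break converter: add a <br /> before each newline.
function sketch_line_breaks($text) {
  return str_replace("\n", "<br />\n", $text);
}

$format = array('sketch_strip_html', 'sketch_line_breaks');
$output = sketch_apply_format($format, "<h1>Title</h1>\nSome <em>text</em>");
print $output;
```

The order of the array is the order of execution, which is why rearranging filters can change the result.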

Input versus Output

Drupal captures input in its raw form, saving whatever gets submitted straight to the database without alteration. Then, before displaying any such content in the browser, Drupal processes the text by choosing an input format to apply. Why doesn't Drupal apply the filters in an input format before saving input into the database? The answer is simple: flexibility. If you were to change the text that a user has input before saving it in the database, you could never get back to the original state. You could never change your mind about the configuration of the filters. By filtering on output, not on input, Drupal gives the site administrator the option of changing how content is displayed at any time.

As an example, imagine that you notice the users on your site using character patterns to represent smiley faces. I know, that stuff is so 1998 :P But just for fun, let's say they're doing it ;-) You look around and find the Smiley Filter on Drupal.org, and install it. Now all of the keystroke patterns that your users had been using can be displayed as images. This ability to change is only available if the input is saved verbatim and filtering is done on output.
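The smiley scenario can be sketched in a few lines of plain PHP. This is an illustration under assumptions: a variable stands in for the database, and sketch_render, sketch_escape and sketch_smileys (along with the wink.png image name) are hypothetical, not part of Drupal or the Smiley Filter.

```php
<?php
// The raw submission is stored untouched; a variable stands in for the database.
$stored = 'Nice post ;-)';

// Hypothetical renderer: applies whatever filters are configured right now.
function sketch_render($raw, array $filters) {
  foreach ($filters as $filter) {
    $raw = call_user_func($filter, $raw);
  }
  return $raw;
}

// Stand-in filter: escape HTML special characters.
function sketch_escape($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Stand-in smiley filter: swap one keystroke pattern for an image tag.
function sketch_smileys($text) {
  return str_replace(';-)', '<img src="wink.png" alt=";-)" />', $text);
}

// Before the smiley filter is installed.
$before = sketch_render($stored, array('sketch_escape'));

// After the administrator adds it: same stored text, new output.
$after = sketch_render($stored, array('sketch_escape', 'sketch_smileys'));
```

Because $stored never changes, the administrator could remove the smiley filter again and get the original output back.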

Meet Drupal's Core Filters

Here is a rundown of the filters that Drupal ships with:

  • HTML Filter: The HTML filter is primarily responsible for removing HTML tags from text. It can be configured to allow any number of tags (a whitelist), and it will remove the rest. It removes them either by stripping them or by escaping them into entities, like this: &lt;div&gt;. If tags are escaped, they show up in the output as visible tags: <div>Some text</div>. The default set of allowed tags is: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>

    The final task of the HTML filter is to add a spam link deterrent to anchor tags. The deterrent, proposed by Google, tells search engines which links they should not follow when crawling the web. If this option is enabled, rel="nofollow" will be added as an attribute of all anchor tags.
  • Line Break Converter: This filter converts line breaks into <br> or <p> tags depending on whether a single or double line break is found. This preserves the paragraph formatting in the text that is input.
  • URL Filter: Any web or email addresses that are found in the text will be converted to clickable links, thus saving the user the hassle of having to type <a href="http://www.lullabot.com/....">
  • PHP Evaluator: The PHP Evaluator is the most radical of all Drupal's core filters. It looks for text enclosed in <?php ... ?> and evaluates it as PHP code. This effectively allows you to program and extend Drupal just by submitting content to the site! In 99% of cases, this is a bad idea, and the initial attraction of harnessing such power should be tempered by a healthy sense of fear. If you really need to write PHP code to accomplish what you're trying to do, writing a module is usually a better idea (and not that hard in most cases). Furthermore, in the wrong hands, the PHP Evaluator is an enormous security risk. A malicious attacker, with the PHP Evaluator at their disposal, could wipe out your database and take control of your web server.
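The difference between stripping and escaping in the HTML Filter can be illustrated with two built-in PHP functions. This is only a rough sketch: strip_tags() and htmlspecialchars() are stand-ins chosen for brevity, not the functions the HTML Filter itself calls, and here escaping is applied to the whole string.

```php
<?php
// The default whitelist from the article, in the form strip_tags() expects.
$allowed = '<a><em><strong><cite><code><ul><ol><li><dl><dt><dd>';

$text = '<div>Some <em>text</em></div>';

// Stripping: disallowed tags disappear, their contents stay.
$stripped = strip_tags($text, $allowed);

// Escaping: tags are turned into entities, so the browser shows them as text.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

print $stripped . "\n";
print $escaped . "\n";
```

Stripping keeps the page tidy, while escaping makes it obvious to the author that a tag was not accepted.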

Drupal's Core Input Formats

Drupal also comes with three input formats pre-defined.

  • Filtered HTML: This is the workhorse input format that is used most of the time for displaying posts such as blogs, pages, forum topics and so forth. It combines the URL Filter, the HTML Filter and the Line Break Converter in a way that allows users a small set of HTML tags for formatting while taking care of paragraphs and URLs behind the scenes. This is also the default input format for new Drupal installations. More on default input formats later.
  • PHP Code: This input format consists of only one filter, the PHP Evaluator filter. This input format is to be used when the goal is embedding PHP code in a post.
  • Full HTML: The Full HTML input format applies only the Line Break Converter filter. No HTML tags are stripped and no weblinks are converted to anchor tags.

Order Matters

When an input format consists of more than one filter, the ordering of the filters has a huge impact on what the final output is. The Filtered HTML input format has three filters, the URL Filter, the HTML Filter, and the Line Break Converter. Here is the order in which they are executed in a new Drupal installation:

[Figure: Drupal Filtered HTML filters, default order: URL Filter, HTML Filter, Line Break Converter]

Assuming the HTML Filter allows the default set of tags (see the list above), let's examine what happens to some HTML text as it gets processed by the Filtered HTML input format. Here's the text:

<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.

http://drupal.org

The first filter is the URL filter. It will find the URL which we have in this text and make it into a proper anchor tag:

Before
http://drupal.org
After
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>

The second filter is the HTML Filter. The text contains two HTML tags that are not on the whitelist, namely <h1> and <br>. Thus, they will be stripped.

Before
<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.
After
The quick brown fox jumps over the lazy dog.
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.

Finally, the Line Break Converter gets its chance. It looks for line break characters (\n, \n\n, etc.) and replaces them either with <br /> or encloses blocks of text in <p>...</p> tags. The function responsible for this is quite cunning, and was inspired by code from WordPress (our debt of gratitude).

Before (as received from the HTML Filter)
The quick brown fox jumps over the lazy dog.
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.

<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
After
<p>The quick brown fox jumps over the lazy dog.<br />
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.</p>
<p><a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a></p>

So what went wrong here? Well, nothing, technically. But the output is unlikely to be what the user expected. First of all, the <h1> tag was stripped, which is a good thing because it is what the Drupal administrator wanted. Second, the place where the user tried to make a line break using <br><br> does not come out the way the user would expect. After all, the final rendered HTML contains a <br /> tag, so why were the <br> tags from the user stripped out? And why weren't they replaced by something more intelligent by the Line Break Converter? The answer can be seen by looking at the text as it gets passed from the HTML Filter to the Line Break Converter. Because <br> isn't on the whitelist of allowed tags, the HTML Filter takes them out, leaving the Line Break Converter no clues to follow concerning the user's wish for a line break between "swimming." and "Every good boy".

Changing the Order

Let's look at the example above and see what would happen if we change the order of the filters. This is done by navigating to Administer -> Site configuration -> Input formats -> (Filtered HTML) configure -> Rearrange. Here I've changed the order so that the HTML Filter comes after the Line Break Converter.

[Figure: Input filters order changed: URL Filter, Line Break Converter, HTML Filter]

As in the first example, the first filter is the URL Filter, so the output from that will be the same. The second filter, though, is now the Line Break Converter. Here is what happens to our text coming from the URL Filter and going into the Line Break Converter:

Before (as received from the URL Filter)
<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.

<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
After
<h1>The quick brown fox jumps over the lazy dog.</h1>
<p>King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.</p>

<p><a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a></p>

Interesting to note is that no paragraph tag was placed on "The quick brown fox". This is because <h1> elements are block level elements (which get their own line breaks in rendered HTML), so the Line Break Converter ignores them. Also interesting is that we have many instances of <p> and <br> tags, even though we're about to go into the HTML Filter which is configured to strip those tags out. Here is the final output with the new ordering of the filters (line breaks added for readability):

The quick brown fox jumps over the lazy dog. King Phillip came over
from <em>Germany</em> swimming.Every good boy deserves fudge.
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>

What a mess =)
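The effect of the two orderings can be reproduced with a pair of simplified stand-ins. This is a sketch under assumptions: sketch_autop and sketch_html_filter are hypothetical and far cruder than Drupal's Line Break Converter and HTML Filter, but they show the same pattern.

```php
<?php
// Crude stand-in for the Line Break Converter: blank lines separate paragraphs,
// single newlines become <br />.
function sketch_autop($text) {
  $paragraphs = preg_split('/\n\s*\n/', trim($text));
  $out = array();
  foreach ($paragraphs as $p) {
    $out[] = '<p>' . str_replace("\n", "<br />\n", $p) . '</p>';
  }
  return implode("\n", $out);
}

// Crude stand-in for the HTML Filter with <p> and <br> NOT on the whitelist.
function sketch_html_filter($text) {
  return strip_tags($text, '<a><em><strong>');
}

$input = "First line\nsecond line<br><br>third\n\nNew paragraph";

// Default order: HTML Filter first, then Line Break Converter.
$default_order = sketch_autop(sketch_html_filter($input));

// Changed order: Line Break Converter first, then HTML Filter.
$changed_order = sketch_html_filter(sketch_autop($input));

print $default_order . "\n---\n" . $changed_order . "\n";
```

In the changed order, the second filter throws away the <p> and <br /> tags the first one just added, which is the "working against each other" case.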

What's the real solution in this case? Well, the original order of filters worked better, so consider leaving it URL Filter -> HTML Filter -> Line Break Converter. One way to fix the problem would be to configure the HTML Filter to allow <br> and <p> tags. This gives HTML savvy users control over paragraph formatting. The other solution would be to submit a patch against the HTML filter (see filter.module) to have it replace <br> tags with \n line break characters so that the Line Break Converter will pick up on them. Guess which solution is easier :P
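For the curious, the idea behind the second solution might look something like this. It is only a sketch of the idea, not a patch against filter.module: sketch_br_to_newline is a hypothetical helper, and strip_tags() again stands in for the HTML Filter.

```php
<?php
// Hypothetical helper: turn <br>, <br/> and <br /> into newline characters
// so that a later line break converter can still see the user's intent.
function sketch_br_to_newline($text) {
  return preg_replace('@<br\s*/?>@i', "\n", $text);
}

$text = 'King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.';

// Convert first, then strip disallowed tags (only <em> is allowed here).
$converted = sketch_br_to_newline($text);
$stripped = strip_tags($converted, '<em>');

print $stripped;
```

After this step the text contains a double line break between the two sentences, which a line break converter would treat as a paragraph boundary.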

Input Formats and User Roles

Not all Drupal users are created equal. Some are anonymous users, some are authenticated users, and some have other user roles that allow them to have greater privileges than normal authenticated users. Furthermore, one Drupal user on every site is the super-user (user #1) who can do anything. The privilege of using an input format can be assigned to users on a per-role basis. This is an important mechanism that exists for allowing some trusted users to have access to some filters while denying this access to less trusted users.

Look at the screen Administer -> Site configuration -> Input formats. It lists all of the input formats that have been established. The default Drupal installation comes with three, as noted above. On any Drupal site, one input format has to be designated as the default input format. This is indicated by the radio button in the Default column. To guarantee the presence of at least one input format, the default format cannot be deleted. The others can be, however, and you might consider deleting any input formats (such as the PHP code format) that you don't plan on using.

[Figure: Drupal default input format cannot be deleted]

On the configuration screen for an input format, you will see a listing of all the roles for users on your site. For any input format besides the default you can specify which user roles are privileged to use that input format. The default input format is automatically available to all users in all roles on your site and this cannot be changed.

Filters and Security

Filters and security go hand-in-hand. Without filters, there would be no security for your site, as malicious attackers would have free rein in using scripts to deface your site, subject your users to phishing scams, and steal important data such as passwords.

The heart of the security offered by filters comes from the HTML Filter and the calls it makes to filter_xss and check_plain. These are the functions that Drupal uses to prevent attacks based on user input. For this reason, all of your user submitted output should be run through the HTML Filter. It is tempting to ignore this advice, especially if you are having trouble getting the configuration settings just right for your purposes. Don't ignore this advice. You may end up sorry.
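To see why escaping matters, here is a small sketch of what check_plain does to a hostile string. The fallback definition is an assumption so the snippet runs outside Drupal: it is a rough stand-in built on htmlspecialchars(), not Drupal's own code. filter_xss is not modelled here.

```php
<?php
// Outside Drupal, define a rough stand-in so the sketch runs on its own.
// Inside Drupal, the real check_plain() is used and this block is skipped.
if (!function_exists('check_plain')) {
  function check_plain($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
  }
}

// A classic script injection attempt submitted as a title.
$title = '<script>alert("xss")</script>';

// After escaping, the browser displays the markup as text instead of running it.
$safe = check_plain($title);

print $safe;
```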

Also worth reiterating is the fact that the PHP Evaluator filter poses an extreme risk if it can be used by anyone but highly trusted, PHP-competent site administrators. Most sites will be better off deleting the PHP code input format and not extending use of the PHP Evaluator filter to anyone.

Finally, it should be obvious that the Full HTML input format, which does not use the HTML Filter, is insecure and should be offered only to those users who can be trusted not to ruin your site. Most sites will be better off deleting this input format.

Many More Filters Available

The fun with filters is that modules can offer their own filters. The number of filters available is large, and I can't possibly cover them all, but you can get a feel for the possibilities by looking at the Filters and Editors category of Drupal modules on Drupal.org. Here are some interesting modules that offer filters:

  • Amazon Filter: Provides a text filter to insert amazon book title/links, cover images, and themable formatted information using a simple [amazon {title|cover|info} ] tag.
  • BBCode: Allows users to specify markup using BBCode.
  • Code Filter: Renders syntax-highlighted PHP code. This module is used on Drupal.org.
  • DruTex: A LaTeX renderer that can, among other things, render mathematical formulas and generate PDFs of nodes.
  • HTML Corrector: Corrects corrupt HTML in nodes and comments. This is useful for cases where users forget to close tags, or for where the teaser view breaks the HTML.
  • Inline Filter: Uses a [inline:filename.jpg] syntax to allow for inline images or file links.
  • Markdown with SmartyPants: One of my favorites, this allows simple ASCII formatting to be turned into HTML. For example, ##This would be an h2, and *this would be emphasized*. This module is in use on http://groups.drupal.org.
  • Paging Filter: Break long pages into smaller ones by means of a "page" tag.
  • Pirate Filter: Turns English into Pirate speak.
  • Smileys: Parses smiley character combinations and replaces them with inline smiley images.
  • Word Filter: Filters a list of restricted words.
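If you are wondering what a filter provided by a module looks like, here is a minimal sketch. It assumes the Drupal 5 hook_filter signature as I recall it ($op, $delta, $format, $text), so check it against the API documentation before relying on it. The "ahoy" module and its behavior are hypothetical, and the t() fallback exists only so the snippet can run outside Drupal.

```php
<?php
// Fallback so the sketch runs outside Drupal; inside Drupal the real t() is used.
if (!function_exists('t')) {
  function t($string) {
    return $string;
  }
}

// Hypothetical module "ahoy": implementation of hook_filter().
function ahoy_filter($op, $delta = 0, $format = -1, $text = '') {
  switch ($op) {
    case 'list':
      // The filters this module offers, keyed by delta.
      return array(0 => t('Ahoy filter'));

    case 'description':
      return t('Replaces "Hello" with "Ahoy".');

    case 'process':
      // The actual transformation: text in, filtered text out.
      return str_replace('Hello', 'Ahoy', $text);

    default:
      // Other operations (such as 'prepare') pass the text through unchanged.
      return $text;
  }
}

$result = ahoy_filter('process', 0, -1, 'Hello, world');
print $result;
```

Once a module like this is enabled, its filter can be added to any input format and ordered alongside the core filters.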

Conclusion

The filtering of output is an essential part of web publishing and one of Drupal's great strengths. Understanding the difference between input formats and filters, and how to configure each, is an essential step in becoming a great Drupal site administrator. Drupal modules can implement filters to make your site powerful and fun.

Robert Douglass

Comments

NikLP

Textile

Textile seems to be often overlooked, as it forms part of the core of a different CMS/Framework altogether: TextPattern (TXP).

TXP is a great little CMS, useful for situations where Drupal would be overkill. It's also pretty flexible, although it has a similarly steep initial learning curve! :)

Textile is the input text filter that is used on that CMS, along with a similar "convert linebreaks" one.

I don't think this little filter should be overlooked - it's really very powerful and not too hard to get to grips with - especially for sites where just basic formatting is required, i.e. where good use of CSS has already been applied.

It also produces valid XHTML 1.0 Strict HTML code, which is great - I would use this over a WYSIWYG editor any day, if I thought the users would cope with it. It seems that TinyMCE (also not mentioned...?) is going to stay for now, at least in client sites.

Anonymous

Can I print this? (Can, not may.. :-))

Would you have a theme Ninja who could write a media="print" stylesheet for your articles so that us old fogies who read things on paper can print a copy which doesn't get all mangled?

Thanks.

jeff

Stay tuned

An entirely new version of this site is in the works, including a sexy print style sheet for the old fogies! :-)

Zach Harkey

Caveat

you might consider deleting any input formats (such as the PHP code format) that you don't plan on using.

I've found it can be very dangerous to delete any of the default input formats (something I used to do routinely with every site).

To give just one example, the Views module wouldn't let me add/edit/enable etc. any of the module-provided default views (e.g. taxonomy_term, frontpage, etc.). It would just give me some vague error message about my having made an "illegal choice" error.

I finally tracked it down to my having deleted the default input formats. Once restored, everything worked fine, and the whole learning experience only cost me a few days' wages, so beware.

Jonahan

Really cool.

But any clues as to how to programmatically apply a filter to a chunk of text? Like from inside of a custom module?

Anonymous

When I use this code I get "<p><strong>Test</strong></p>" out, I'm trying to write a disclaimer on a page and have the first sentence in bold text. Seems like a simple task, but I've spent hours surfing for an answer :o( Any suggestions appreciated, thanks.

  if ($disclaimer) {
    $form['signup_form_data']['disclaimer'] = array(
      '#type' => 'textarea',
      '#title' => check_plain($type->body_label),
      '#attributes' => array('readonly' => 'readonly'),
      '#rows' => 20,
      '#default_value' => check_markup('<strong>Test</strong>', 'Full HTML'),
      '#required' => false
    );
  }

robert

You need to use the integer id for Full HTML, i.e. the format column from the filter_formats table:

mysql> select * from filter_formats;
+--------+---------------+-------+-------+
| format | name          | roles | cache |
+--------+---------------+-------+-------+
|      1 | Filtered HTML | ,1,2, |     1 |
|      2 | PHP code      |       |     0 |
|      3 | Full HTML     |       |     1 |
+--------+---------------+-------+-------+

So use this:

<?php
'#default_value' => check_markup('<strong>Test</strong>', 3),   
?>

mukilan

Strip-Disallowed-tags Failure

Thanks for this great article.

I use TinyMCE and I am not sure whether I have made a mistake somewhere. But in all my input formats I am using "Strip disallowed tags", even though the final HTML shows "escaped HTML tags".

For example.,

Input:

- Continue on this road till you get to Tiruvallur in about 1 hour.

- This road ends in Tiruvallur where you need to take a left into SH57..

Output:

- Continue on this road till you get to Tiruvallur in about 1 hour. < br / > < br / >- This road ends in Tiruvallur where you need to take a left into SH57.. < br / > < br / >

Following are the settings, I have used.

I am using FilteredHTML format and have the following options switched on.

HTMLFilter, InlineImages, Line Break Converter & URL Filter.

In HTML Filter section, I have selected "Strip disallowed tags" and allows the following tags " a em strong u cite code ul ol li dl dt dd img p sub sup strike blockquote hr br".

The filter order is URL Filter --> Line Break Filter --> HTML Filter and Inline Images.

Actually, I want the output to be the same as inputted. Am I required to do some more setup or am I missing something.

Thanks in Advance

moeed

How to have text not go thru a filter

For example, when you create a content type, you fill out the description text. I want to put a link in the description but it gets automatically stripped out. Any way around that?

Doug

There doesn't seem to be a way to have anonymous users only be able to use plain text input while registered users default to Filtered HTML, because the default input format is the default for everyone, registered or anonymous.

Drupal needs to have an option to set which filter is the default for each role.

Andy Chase

Killer module

Thanks for clueing me in to the existence of Better Formats, dragonwize - not being able to set the default input format per role has been a longtime peeve of mine!

Anonymous

drupal input format settings gone

Hi, I have racked my brain to work this out.
When editing a blog or page, the wysiwyg and the (usually click drop down) form to set which filter (Full HTML, PHP, etc.) is not clickable.

I have upgraded - everything works fine - now at 6.6 - no other issues except this.

Could you imagine why?

Thanks in advance.

ME

Andrew Mellenger

I changed my filters, help!

I have a custom input filter on my site for creating a slideshow, eg: http://www.vanmag.com/Restaurants/101_Things_to_Taste_Before_You_Die (click on any of the bulleted items)

I had to modify it because the size of the popup changed, which is in the link and uses thickbox. I changed the custom filter, but the only way to make the change stick is to edit every page and re-save it. Is there a script I could run that would do this? Or is there a way to clear out the input filter and have it be processed again on page load?

Any suggestions would be greatly appreciated.

thanks,

Andrew

Mike Gifford

Performance Implications

I was asked recently about the performance implications of adding a module like HTML Purifier to clean up user input. It's great to have a valid XHTML Strict theme, but not if it's trivial for a user to bust it by not closing a div tag.

In any case, since this is a great space to talk about filters, it would be nice to know what the resource implications are of layering yet another filter onto the user input.

Khaled

Hi, I'm facing a big problem with inline style. I use Filtered HTML, but this filter doesn't show inline style (e.g. for <p style="color: red"> it just outputs <p>). How can I make it possible to show inline style?

Ashford

re: to Khaled on How to add indent to Filtered HTML

Go to the settings: admin/settings/filters

Click on the link, 'configure', next to Filtered HTML.
Next screen. Click the tab for Configure.

Scroll down to the text box for 'Allowed HTML Tags'.
Following the examples already there, add <blockquote>.

Anonymous

Try htmLawed module

The htmLawed module is a good alternative to Drupal's HTML filter.

"...enables the use of the htmLawed (X)HTML filter/purifier PHP script as an input filter with input format-, content (node) type- and body/comment/teaser-specific configurations ... The module also provides an option to filter submitted content before it is stored in the database..."

Anonymous

Hi.

Sorry for my English. I have a question about the input formats. I installed the flashviewer module, which is a gallery. When I go to create the categories, the input format option isn't shown, and I can't put borders on images or set other attributes of the labels.

Thank you for your attention.

Regards.

Keith Lee

Great Article

I've recently been diving into the world of Drupal and learning through books and online articles. I just want to say thanks for this article. It was thorough in its explanation and just what I needed to complete my module.
