Regex match for spaces outside of HTML tags

The Scenario

Let’s say you are truncating blog posts to produce excerpts for an index view. In Ruby on Rails, we can use the truncate() helper like this:

truncate(post.content, length: 300, separator: ' ', escape: false)

You might already notice a pitfall with this implementation. By setting escape to false, we are allowing HTML to be rendered (yes, this has other implications you need to be aware of). We’ve told the truncate() method to break at the last space that fits within the 300-character limit. But what happens if that space is inside an HTML tag? You could end up with unbalanced tags that mess up your page layout.
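To make the pitfall concrete, here is a minimal plain-Ruby sketch. The naive_cut helper and the sample string are hypothetical, and the helper is only a simplified stand-in for what truncate() does with a separator (it ignores the omission text and HTML escaping), not Rails' actual implementation.

```ruby
# Hypothetical sample content, not taken from a real post.
html = 'Read <a href="#">the full story</a> here'

# Simplified stand-in for truncation with a separator:
# cut at the last separator found at or before `limit`.
def naive_cut(text, limit)
  return text if text.length <= limit

  stop = text.rindex(' ', limit) || limit
  text[0, stop]
end

excerpt = naive_cut(html, 10)
puts excerpt
# prints: Read <a
```

The cut lands on the space between a and href, so the excerpt ends in the middle of an opening tag.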

Let’s fix that.

The Solution

We need to conjure up a regular expression that won’t match on a space inside of tags. That covers spaces inside the tags themselves as well as spaces in the text between an opening and closing tag, like: <a href="#">Don't break in here</a>. Below is an expression that does just that.

/ (?![^<]*>|[^<>]*<\/)/

Let’s walk through it.

The / at the beginning and end delimit a regular expression literal in Ruby. That is also why the slash inside the pattern is escaped as \/.
We start by matching a single space with a literal space character.
(?! ... ) is a negative lookahead. It rejects the space match if the pattern inside it matches the text that follows.
[^<]*> rejects spaces inside a tag: only non-< characters sit between the space and the next >.
| means "or".
[^<>]*<\/ rejects spaces in the text of an element: only non-tag characters sit between the space and the next closing tag.
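One way to check the walkthrough is to list which spaces the pattern accepts. The sketch below is plain Ruby with hypothetical sample strings; SAFE_SPACE and safe_space_positions are names made up for this example.

```ruby
# The slash is escaped because / delimits the regex literal.
SAFE_SPACE = / (?![^<]*>|[^<>]*<\/)/

# Return the index of every space the pattern accepts.
def safe_space_positions(text)
  positions = []
  text.scan(SAFE_SPACE) { positions << Regexp.last_match.begin(0) }
  positions
end

link = 'Read <a href="#">the full story</a> here'
p safe_space_positions(link)
# prints: [4, 35]

nested = '<p>some <b>bold</b></p>'
p safe_space_positions(nested)
# prints: [7]
```

In the first string, only the spaces after "Read" and before "here" are accepted; the space inside the tag and the two inside the link text are rejected. The second string shows a limit worth knowing about: a space that is followed by another opening tag before the closing one is still accepted, so cutting there would leave the <p> unclosed.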

We can now tell the truncate() method to use our pattern to safely truncate on a space, while avoiding some unclosed-tag mayhem.

truncate(post.content, length: 300, separator: / (?![^<]*>|[^<>]*<\/)/, escape: false)
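To see the difference outside of Rails, here is a small plain-Ruby comparison. The cut_at helper is hypothetical and only approximates truncation with a separator (no omission text, no escaping); the sample string is made up as well.

```ruby
SAFE_SPACE = / (?![^<]*>|[^<>]*<\/)/

# Hypothetical helper: cut at the last separator match that
# starts at or before `limit`, or at `limit` if none is found.
def cut_at(text, limit, separator)
  return text if text.length <= limit

  stop = text.rindex(separator, limit) || limit
  text[0, stop]
end

html = 'Read <a href="#">the full story</a> here'

naive = cut_at(html, 30, ' ')
safe  = cut_at(html, 30, SAFE_SPACE)

puts naive
# prints: Read <a href="#">the full
puts safe
# prints: Read
```

The plain-space version leaves the <a> tag unclosed. The regex version backs up to the last safe space, which keeps the markup balanced but can make the excerpt noticeably shorter than the limit.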

I hope you find that useful!

Written by Matt Haliski

The First of His Name, Consumer of Tacos, Operator of Computers, Mower of Grass, Father of the Unsleeper, King of Bad Function Names, Feeder of AI Overlords.
